4. Select and tune algorithm
I test two different algorithms: Gaussian Naive Bayes and Support Vector Classification, the latter both with and without GridSearchCV.
#------------------------------------------------------------------
# Pick and Tune an Algorithm
#------------------------------------------------------------------
data = featureFormat(my_dataset, features_list, sort_keys = True)
data = preprocessing.MinMaxScaler().fit_transform(data)
labels, features = targetFeatureSplit(data)
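Note that the script above fits the MinMaxScaler on the full dataset before splitting, which is simple but lets test-set statistics leak into training. A minimal sketch of a leak-free variant (with made-up toy data) fits the scaler on the training portion only:

```python
# Sketch: fit MinMaxScaler on the training portion only, then apply it to the
# test portion, so test-set statistics cannot influence the learned scaling.
from sklearn.preprocessing import MinMaxScaler

train = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]  # toy data, not the Enron set
test = [[4.0, 800.0]]

scaler = MinMaxScaler().fit(train)      # learns min/max from train only
train_scaled = scaler.transform(train)  # values in [0, 1]
test_scaled = scaler.transform(test)    # may fall outside [0, 1]
```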
# Split features and labels into train and test sets. If n_splits == 0, train_test_split is used; otherwise StratifiedShuffleSplit.
# Warning: features_train, features_test, labels_train, labels_test are lists of splits. With train_test_split each list holds a single element.
features_train, features_test, labels_train, labels_test = myTools.getTrainTestDataList(features, labels, test_size=0.3, random_state=42, n_splits=100)
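myTools.getTrainTestDataList is a project-specific helper whose source is not shown here. A minimal sketch of how such a helper could behave, assuming the n_splits convention described in the comments above (lists of splits, one element for train_test_split, n_splits elements for StratifiedShuffleSplit):

```python
# Hypothetical sketch of a getTrainTestDataList-style helper; the real
# implementation lives in the author's myTools module and may differ.
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split


def get_train_test_data_list(features, labels, test_size=0.3,
                             random_state=42, n_splits=100):
    features_train, features_test = [], []
    labels_train, labels_test = [], []
    if n_splits == 0:
        # Single split: each returned list contains exactly one element.
        f_tr, f_te, l_tr, l_te = train_test_split(
            features, labels, test_size=test_size, random_state=random_state)
        features_train.append(f_tr)
        features_test.append(f_te)
        labels_train.append(l_tr)
        labels_test.append(l_te)
    else:
        # StratifiedShuffleSplit keeps the POI/non-POI ratio in every fold.
        sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size,
                                     random_state=random_state)
        for train_idx, test_idx in sss.split(features, labels):
            features_train.append([features[i] for i in train_idx])
            features_test.append([features[i] for i in test_idx])
            labels_train.append([labels[i] for i in train_idx])
            labels_test.append([labels[i] for i in test_idx])
    return features_train, features_test, labels_train, labels_test
```

Returning lists in both cases lets the training loops below stay identical regardless of the splitting strategy.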
### GaussianNB
clfList=list()
name="GaussianNB"
print(str("{}:").format(name))
labels_predict=list()
myClassifier = myTools.classifier(name=name,clf=GaussianNB())
for i in range(len(features_train)):
    myClassifier.clf.fit(features_train[i], labels_train[i])
    labels_predict.append(myClassifier.clf.predict(features_test[i]))
myClassifier=myTools.score(myClassifier=myClassifier,labels_test_list=labels_test,labels_predict_list=labels_predict)
print(myClassifier)
clfList.append(myClassifier)
### Support Vector Classification without GridSearchCV
c=10
gamma='auto'
kernel='rbf'
name=str("SVC Support Vector Classification: (c:{},gamma:{},kernel:{})").format(c,gamma,kernel)
print(str("{}:").format(name))
labels_predict=list()
myClassifier = myTools.classifier(name=name,clf=SVC(C=c,gamma=gamma, kernel=kernel))
for i in range(len(features_train)):
    myClassifier.clf.fit(features_train[i], labels_train[i])
    labels_predict.append(myClassifier.clf.predict(features_test[i]))
myClassifier=myTools.score(myClassifier=myClassifier,labels_test_list=labels_test,labels_predict_list=labels_predict)
print(myClassifier)
clfList.append(myClassifier)
### Support Vector Classification with GridSearchCV
name=str("SVC Support Vector Classification Tuning with GridSearchCV")
param_grid = {
'C': [1,10,100,1000],
'gamma': ['auto','scale'],
'kernel': ['rbf','linear','poly','sigmoid']
}
print(str("{}:").format(name))
labels_predict=list()
myClassifier = myTools.classifier(name=name,clf = GridSearchCV(SVC(),param_grid,cv=5,iid=False))  # iid was deprecated and removed in scikit-learn 0.24; drop it on newer versions
for i in range(len(features_train)):
    myClassifier.clf.fit(features_train[i], labels_train[i])
    labels_predict.append(myClassifier.clf.predict(features_test[i]))
# best_estimator_ and best_score_ refer to the grid search on the last split
print(str('Best estimator found by grid search: {}').format(myClassifier.clf.best_estimator_))
print(str('Best estimator score found by grid search: {}').format(myClassifier.clf.best_score_))
myClassifier=myTools.score(myClassifier=myClassifier,labels_test_list=labels_test,labels_predict_list=labels_predict)
print(myClassifier)
clfList.append(myClassifier)
print("Select the most powerful classifier:")
clfList.sort(reverse=True)
if len(clfList): print(clfList[0].toCSV(header=True))
for classif in clfList:
    print(classif.toCSV())
clf=clfList[0].clf
Independently of the algorithm, the dataset size (145 samples) with 30% held out for testing yields only 101 samples for training and 44 for testing. It would be wise to use StratifiedShuffleSplit to increase the amount of data used for training and testing. By arbitrary choice, the number of splits is set to 100.
Warning
In this script the number of samples is 131 (14 samples removed). With 30% held out for testing, there are 91 samples for training and 40 for testing. This reduction is due to hidden defaults of the featureFormat function: remove_NaN=True and remove_all_zeroes=True.
Note
I ran the algorithms without StratifiedShuffleSplit: the accuracy score is high but the f1 score is low (e.g. name: GaussianNB, accuracy score: 0.82, precision score: 0.33, recall score: 0.4, f1 score: 0.36).
Note
To sort the algorithms by efficiency, I compute a point total for each one as the sum of its scores. I also apply a heavy malus when a score is lower than 0.3.
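The exact ranking formula is not shown above; a minimal sketch of such a scoring rule, assuming the malus is a fixed penalty subtracted for each metric below 0.3 (the real myTools version may differ), could look like:

```python
# Hypothetical ranking score: sum the four metrics, then subtract a large
# malus for every metric below the threshold, so a classifier with one very
# weak score sinks to the bottom of the sorted list.
MALUS = 10.0       # assumed penalty value, not taken from the original script
THRESHOLD = 0.3


def ranking_points(accuracy, precision, recall, f1):
    scores = (accuracy, precision, recall, f1)
    points = sum(scores)
    points -= MALUS * sum(1 for s in scores if s < THRESHOLD)
    return points
```

With the scores from the table below, GaussianNB takes no malus while both SVC variants are penalized on recall and f1, which matches the final ordering.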
| name | accuracy score | precision score | recall score | f1 score |
|---|---|---|---|---|
| GaussianNB | 0.86 | 0.49 | 0.32 | 0.37 |
| SVC Support Vector Classification: (c:10,gamma:auto,kernel:rbf) | 0.89 | 0.45 | 0.11 | 0.17 |
| SVC Support Vector Classification Tuning with GridSearchCV | 0.88 | 0.34 | 0.1 | 0.15 |
Warning
Warning raised during execution:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
Depending on the training data, the algorithm may predict no POI at all; the classification module then raises these warnings to inform users.
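One way to make these degenerate folds explicit instead of warned about is the zero_division parameter of the scoring functions (a sketch; zero_division is available in scikit-learn 0.22 and later):

```python
# When a classifier predicts no POI at all, precision has a 0/0 denominator.
# zero_division sets the result explicitly instead of emitting
# UndefinedMetricWarning.
from sklearn.metrics import precision_score, f1_score

labels_test = [0, 0, 1, 0, 1]
labels_predict = [0, 0, 0, 0, 0]  # degenerate fold: no POI predicted

precision = precision_score(labels_test, labels_predict, zero_division=0)
f1 = f1_score(labels_test, labels_predict, zero_division=0)
print(precision, f1)
```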