3. Feature selection and optimization

3.1. New features

The objective of this part is to create additional features to improve the algorithm's performance.

The added features are:

\[ratio\_salary\_total\_payements = \frac{salary}{total\_payments}\]
\[ratio\_bonus\_total\_payements = \frac{bonus}{total\_payments}\]
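These ratios are computed with the myTools.ratio helper, which is not reproduced in this report; a minimal sketch of the behavior assumed here (a zero or missing denominator yields 0, so the feature stays defined for every person) could be:

import math

def ratio(numerator, denominator):
    """Element-wise ratio for pandas Series.combine: returns 0.0 when the
    denominator is zero or NaN so the result is always a valid number."""
    if denominator is None or denominator == 0 or math.isnan(denominator):
        return 0.0
    if numerator is None or math.isnan(numerator):
        return 0.0
    return numerator / denominator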

#------------------------------------------------------------------
# Optimize Feature Selection/Engineering
#------------------------------------------------------------------

### Create new feature(s)
EnronDataFrame[myTools.CONST_RATIO_SALARY_TOTAL_PAYEMENTS_LABEL] = EnronDataFrame[myTools.CONST_SALARY_LABEL].combine(
    EnronDataFrame[myTools.CONST_TOTAL_PAYMENTS_LABEL], func=myTools.ratio)
EnronDataFrame[myTools.CONST_RATIO_BONUS_TOTAL_PAYEMENTS_LABEL] = EnronDataFrame[myTools.CONST_BONUS_LABEL].combine(
    EnronDataFrame[myTools.CONST_TOTAL_PAYMENTS_LABEL], func=myTools.ratio)

print("New feature {}:\n{}".format(
    myTools.CONST_RATIO_SALARY_TOTAL_PAYEMENTS_LABEL,
    EnronDataFrame.sort_values([myTools.CONST_RATIO_SALARY_TOTAL_PAYEMENTS_LABEL], ascending=False)[
        [myTools.CONST_RATIO_SALARY_TOTAL_PAYEMENTS_LABEL,
         myTools.CONST_SALARY_LABEL,
         myTools.CONST_TOTAL_PAYMENTS_LABEL]].head(10)))
print()
print("New feature {}:\n{}".format(
    myTools.CONST_RATIO_BONUS_TOTAL_PAYEMENTS_LABEL,
    EnronDataFrame.sort_values([myTools.CONST_RATIO_BONUS_TOTAL_PAYEMENTS_LABEL], ascending=False)[
        [myTools.CONST_RATIO_BONUS_TOTAL_PAYEMENTS_LABEL,
         myTools.CONST_BONUS_LABEL,
         myTools.CONST_TOTAL_PAYMENTS_LABEL]].head(10)))

myTools.pyplot_scatter(data=EnronDataFrame,
                       label_x=myTools.CONST_RATIO_SALARY_TOTAL_PAYEMENTS_LABEL,
                       label_y=myTools.CONST_RATIO_BONUS_TOTAL_PAYEMENTS_LABEL,
                       fileName="ratioRepartition.png", show=True,
                       title="DATA RATIO REPARTITION")

print("Selected Features statistics: ")
print(EnronDataFrame.describe().transpose())

# Add the new features to the selected feature list
features_list += [myTools.CONST_RATIO_SALARY_TOTAL_PAYEMENTS_LABEL,
                  myTools.CONST_RATIO_BONUS_TOTAL_PAYEMENTS_LABEL]

[Figure ratioRepartition.png — DATA RATIO REPARTITION: scatter plot of ratio_salary_total_payements (x) vs ratio_bonus_total_payements (y)]

Note

Thanks to these new features, we can observe a potential outlier with a very high bonus ratio: HANNON KEVIN P, whose bonus (1,500,000.0) is more than five times his total_payments (288,682.0). Exploring the database shows that HANNON KEVIN P is a POI, which confirms the interest of the new features.

3.1.1. ratio_salary_total_payements new feature:

===================  ============================  ========  ==============
Name                 ratio_salary_total_payements  salary    total_payments
===================  ============================  ========  ==============
BERBERIAN DAVID      0.947950                      216582.0  228474.0
DERRICK JR. JAMES V  0.893633                      492375.0  550981.0
REDMOND BRIAN L      0.868294                      96840.0   111529.0
HANNON KEVIN P       0.842772                      243293.0  288682.0
RICE KENNETH D       0.832860                      420636.0  505050.0
ELLIOTT STEVEN       0.807373                      170941.0  211725.0
DODSON KEITH         0.690762                      221003.0  319941.0
TILNEY ELIZABETH A   0.619285                      247338.0  399393.0
CARTER REBECCA C     0.548226                      261809.0  477557.0
JACKSON CHARLENE R   0.523533                      288558.0  551174.0
===================  ============================  ========  ==============

3.1.2. ratio_bonus_total_payements new feature:

====================  ===========================  =========  ==============
Name                  ratio_bonus_total_payements  bonus      total_payments
====================  ===========================  =========  ==============
HANNON KEVIN P        5.196029                     1500000.0  288682.0
RICE KENNETH D        3.465003                     1750000.0  505050.0
ELLIOTT STEVEN        1.653088                     350000.0   211725.0
DERRICK JR. JAMES V   1.451956                     800000.0   550981.0
BELDEN TIMOTHY N      0.954262                     5249999.0  5501630.0
ALLEN PHILLIP K       0.930997                     4175000.0  4484442.0
DONAHUE JR JEFFREY M  0.913492                     800000.0   875760.0
KITCHEN LOUISE        0.893078                     3100000.0  3471141.0
HICKERSON GARY J      0.816603                     1700000.0  2081796.0
COLWELL WESLEY        0.805183                     1200000.0  1490344.0
====================  ===========================  =========  ==============

3.1.3. Enron database statistics

============================  =====  ============  ============  ==========  ========  =============  ============  ============
feature                       count  mean          std           min         25%       50%            75%           max
============================  =====  ============  ============  ==========  ========  =============  ============  ============
salary                        145.0  1.841671e+05  1.969598e+05  0.0         0.0       210500.000000  2.690760e+05  1.111258e+06
bonus                         145.0  6.713353e+05  1.230148e+06  0.0         0.0       300000.000000  8.000000e+05  8.000000e+06
total_payments                145.0  2.243477e+06  8.817819e+06  0.0         91093.0   916197.000000  1.934359e+06  1.035598e+08
total_stock_value             145.0  2.889718e+06  6.172223e+06  -44093.0    221141.0  955873.000000  2.282768e+06  4.911008e+07
exercised_stock_options       145.0  2.061486e+06  4.781941e+06  0.0         0.0       607837.000000  1.668260e+06  3.434838e+07
restricted_stock              145.0  8.625464e+05  2.010852e+06  -2604490.0  0.0       360528.000000  6.989200e+05  1.476169e+07
ratio_salary_total_payements  145.0  1.497954e-01  2.023639e-01  0.0         0.0       0.096947       2.153626e-01  9.479503e-01
ratio_bonus_total_payements   145.0  3.386830e-01  5.809623e-01  0.0         0.0       0.228339       5.234695e-01  5.196029e+00
============================  =====  ============  ============  ==========  ========  =============  ============  ============

3.2. Feature scaling

I have selected Naive Bayes and Support Vector Machines (SVM) as algorithms. SVM computes Euclidean distances between points: if one of the features has a large range, the distance will be dominated by that feature.

Based on the Enron database statistics, we can identify several features with a large range, such as total_payments and total_stock_value.

So, we have to use feature scaling to improve algorithm performance.
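As an illustration, MinMaxScaler rescales each column independently to [0, 1]; a minimal sketch with made-up values (not the project's data):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features with very different ranges: a payment in dollars and a ratio
X = np.array([[96840.0, 0.87],
              [243293.0, 0.84],
              [492375.0, 0.89]])

# After scaling, both columns lie in [0, 1] and contribute comparably
# to the Euclidean distances that SVM relies on
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)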

3.3. Feature selection

The objective of this part is to reduce, if possible, the number of features in order to improve the estimators' accuracy scores or to boost their performance on very high-dimensional datasets. To select features, I will use two sorts of algorithms:

  • Decision tree, based on feature importances

  • SelectKBest, based on feature scores


### Features selections

### Extract features and labels from dataset for local testing
my_dataset = EnronDataFrame.to_dict(orient='index')
data = featureFormat(my_dataset, features_list, sort_keys=True)
# Scale every column to [0, 1]; the poi label is already 0/1 so it is unchanged
data = preprocessing.MinMaxScaler().fit_transform(data)
labels, features = targetFeatureSplit(data)

# Rank features by decision tree importance and keep those above the threshold
print("Decision Tree:")
treeClf = tree.DecisionTreeClassifier()
treeClf.fit(features, labels)
importances = treeClf.feature_importances_
ordered_features_list = myTools.getOrderedFeature(importances=importances,
                                                  features_list=features_list[1:],
                                                  filterlower=myTools.CONST_MINIMUM_FEATURE_IMPORTANCE,
                                                  show=True)
best_features_list = [myTools.CONST_POI_LABEL] + ordered_features_list
print("Kept features with Decision Tree: {}".format(best_features_list))

# Rank features by SelectKBest score and keep the k best
k = 6
print("SelectKBest: (k={})".format(k))
kbestClf = SelectKBest(k=k)
kbestClf.fit(features, labels)
ordered_features_list = myTools.getOrderedFeature(importances=kbestClf.scores_,
                                                  features_list=features_list[1:],
                                                  filterlower=None,
                                                  show=True)
best_features_list = [myTools.CONST_POI_LABEL] + ordered_features_list[:k]
print("Kept features with K Best: {}".format(best_features_list))
features_list = best_features_list
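The getOrderedFeature helper belongs to myTools and is not reproduced in this report; a minimal sketch of the behavior assumed above (print each feature with its value when show is set, then return the names sorted by decreasing value, dropping those at or below filterlower when it is given):

def getOrderedFeature(importances, features_list, filterlower=None, show=False):
    """Sort features by decreasing importance/score and return their names.
    When filterlower is given, features whose value does not exceed it are dropped."""
    pairs = sorted(zip(features_list, importances),
                   key=lambda pair: pair[1], reverse=True)
    if show:
        for name, value in pairs:
            print("{}:{}".format(name, value))
    if filterlower is not None:
        pairs = [(name, value) for name, value in pairs if value > filterlower]
    return [name for name, _ in pairs]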


3.3.1. Decision tree

Based on the Decision Tree algorithm, the features are ordered by importance. As an arbitrary rule, I will keep only the features with an importance higher than 0.10.

  • exercised_stock_options:0.2762581173156713

  • total_payments:0.14123456790123456

  • ratio_salary_total_payements:0.1321781305114638

  • restricted_stock:0.12452633105725908

  • total_stock_value:0.11033950617283952

  • bonus:0.1091318137409264

  • ratio_bonus_total_payements:0.06396116293023506

  • salary:0.042370370370370364

The features kept with the Decision Tree (plus the poi label) are:

  • poi

  • exercised_stock_options

  • total_payments

  • ratio_salary_total_payements

  • restricted_stock

  • total_stock_value

  • bonus

3.3.2. SelectKBest

Based on the SelectKBest algorithm, the features are ordered by score. As an arbitrary rule, I will keep only the first six features.

  • exercised_stock_options:24.815079733218194

  • total_stock_value:24.182898678566882

  • bonus:20.792252047181538

  • ratio_bonus_total_payements:20.715596247559944

  • salary:18.289684043404513

  • restricted_stock:9.212810621977086

  • total_payments:8.76567569005008

  • ratio_salary_total_payements:2.6874175908440368

The six features kept with SelectKBest (plus the poi label) are:

  • poi

  • exercised_stock_options

  • total_stock_value

  • bonus

  • ratio_bonus_total_payements

  • salary

  • restricted_stock

3.3.3. Conclusion

With both algorithms, one of the new ratio features ranks among the kept features, so the ratios appear to be worthwhile additions. I keep the feature list based on scores (SelectKBest).

Note

Testing shows that the scores obtained with the SelectKBest feature list are higher than those obtained with the Decision Tree one:

  • SelectKBest: GaussianNB, accuracy score: 0.86, precision score: 0.49, recall score: 0.32, f1 score: 0.37

  • Decision tree: GaussianNB, accuracy score: 0.85, precision score: 0.35, recall score: 0.26, f1 score: 0.28
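
For reference, a minimal sketch of how such scores can be computed from the scaled features and labels built in section 3.3 (this assumes a single stratified train/test split; the project's actual tester may evaluate differently, e.g. with StratifiedShuffleSplit):

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# features and labels come from featureFormat/targetFeatureSplit above
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels)

clf = GaussianNB()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("GaussianNB, accuracy score: {:.2f}, precision score: {:.2f}, "
      "recall score: {:.2f}, f1 score: {:.2f}".format(
          accuracy_score(labels_test, pred),
          precision_score(labels_test, pred),
          recall_score(labels_test, pred),
          f1_score(labels_test, pred)))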