Database with the following variables:
Atencion:{Casa:1, Fallecido:2, Hospital:3, Hospital UCI:4, Recuperado:5}
Tipo:{En estudio:1, Importado:2, Relacionado:3}
Tipo_recuperacion:{PCR:1, Tiempo:2}
Sexo:{M:0, F:1}
Estado:{Leve:1, Asintomatico:2, Grave:3, Fallecido:4, Moderado:5}
Predictors: 'Atencion', 'Sexo', 'Tipo', 'Estado'
Target: 'Tipo_recuperacion'
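The coding scheme listed above can be written down as plain dictionaries. The following is a minimal sketch, assuming the raw data held the text labels before they were converted to integers; the `CODES` table, the `encode_record` helper and the example record are hypothetical names introduced only for illustration.

```python
# Integer codes copied from the variable list above.
CODES = {
    "Atencion": {"Casa": 1, "Fallecido": 2, "Hospital": 3,
                 "Hospital UCI": 4, "Recuperado": 5},
    "Tipo": {"En estudio": 1, "Importado": 2, "Relacionado": 3},
    "Tipo_recuperacion": {"PCR": 1, "Tiempo": 2},
    "Sexo": {"M": 0, "F": 1},
    "Estado": {"Leve": 1, "Asintomatico": 2, "Grave": 3,
               "Fallecido": 4, "Moderado": 5},
}


def encode_record(record):
    """Replace each text label in a record with its integer code.

    Raises KeyError if a column or label is not in CODES, so unexpected
    values are caught instead of being silently dropped.
    """
    return {column: CODES[column][label] for column, label in record.items()}


# Hypothetical raw record (assumption: not taken from the real file).
example = {"Atencion": "Casa", "Sexo": "F", "Tipo": "Importado",
           "Estado": "Leve", "Tipo_recuperacion": "PCR"}
encoded = encode_record(example)
print(encoded)
```

With pandas, the same dictionaries could be passed column by column to `Series.map` to encode a whole DataFrame.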
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.metrics
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Full database
arbol = pd.read_csv('COVID/COVID_coursera_v3.csv', encoding='latin1', delimiter=';')
arbol.head()
arbol.describe()

# Predictors and target
predictors = arbol[['Atencion', 'Sexo', 'Tipo', 'Estado']]
targets = arbol.Tipo_recuperacion

# Hold out 40% of the rows for testing
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

# Fit a random forest with 20 trees and predict on the test set
classifier = RandomForestClassifier(n_estimators=20)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)

sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)

# Fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
# Display the relative importance of each attribute
print(model.feature_importances_)

# Run forests with 1 to 20 trees and see the effect on prediction accuracy
trees = range(1, 21)
accuracy = np.zeros(20)
for idx, n_trees in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n_trees)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
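Because neither the split nor the forests above use a fixed seed, the accuracy curve changes from run to run. The following is a hedged sketch of the same sweep made repeatable with `random_state`; it runs on synthetic data from `make_classification` as a stand-in for the COVID file (an assumption), and `accuracy_by_forest_size` is a hypothetical helper, not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def accuracy_by_forest_size(X, y, max_trees=20, seed=0):
    """Return test accuracy for forests of 1..max_trees trees on one fixed split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=seed, stratify=y
    )
    scores = np.zeros(max_trees)
    for n_trees in range(1, max_trees + 1):
        forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        forest.fit(X_train, y_train)
        # score() returns mean accuracy on the held-out rows
        scores[n_trees - 1] = forest.score(X_test, y_test)
    return scores


# Synthetic stand-in data (assumption): 4 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
scores_demo = accuracy_by_forest_size(X_demo, y_demo, max_trees=10, seed=0)
print(scores_demo)
```

Fixing the seed makes two runs directly comparable, which helps when judging whether a small change in accuracy between forest sizes is real or just noise from the split.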
Results:
Confusion Matrix
array([[6270, 2225], [6141, 2170]])
Accuracy model score: 0.5022015946685707
Model feature importances ('Atencion', 'Sexo', 'Tipo', 'Estado'): [0.21062097 0.09510638 0.40366751 0.29060514]
Plot: number of trees (X) vs. model accuracy score (Y)
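The accuracy score can be checked by hand from the confusion matrix reported above. This is a small sketch, assuming scikit-learn's convention that rows are the true classes and columns the predicted classes, in sorted label order (1 = PCR, 2 = Tiempo); the variable names are illustrative.

```python
# Confusion matrix from the results above: rows = true class, columns = predicted class.
confusion = [[6270, 2225],
             [6141, 2170]]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Recall per class: correct predictions divided by the true count of that class.
recall = [confusion[i][i] / sum(confusion[i]) for i in range(len(confusion))]

# Accuracy of always predicting the most frequent true class, for comparison.
majority_baseline = max(sum(row) for row in confusion) / total

print(f"accuracy          = {accuracy:.4f}")
print(f"recall per class  = {[round(r, 4) for r in recall]}")
print(f"majority baseline = {majority_baseline:.4f}")
```

On these numbers the larger true class makes up roughly 50.5% of the test set, so the reported accuracy is about what always predicting a single class would give.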
Interpretation:
Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable, Tipo_recuperacion. The following explanatory variables were included as possible contributors to the random forest: Atencion, Sexo, Tipo and Estado. According to the Extra Trees model, the explanatory variables with the highest relative importance scores were Tipo (0.40), Estado (0.29) and Atencion (0.21), and the lowest was Sexo (0.10). The accuracy of the random forest was 50.22%. Growing multiple trees rather than a single tree added little to the overall accuracy of the model, which suggests that interpretation of a single decision tree may be appropriate. Between 2 and 5 trees, the accuracy of the model improves slightly.
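The importance ranking can be read off more easily by pairing each score with its predictor name. This is a minimal sketch using the values printed in the results; the variable names are illustrative, and the scores are assumed to be in the same order as the predictor columns.

```python
# Predictor names in the order they were passed to the model.
feature_names = ["Atencion", "Sexo", "Tipo", "Estado"]
# Importance scores copied from the results section.
importances = [0.21062097, 0.09510638, 0.40366751, 0.29060514]

# Sort (name, score) pairs from most to least important.
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name:<10}{score:.3f}")
```

The scores sum to 1, so each one can be read as that predictor's share of the total importance.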