Monday, June 22, 2020

COVID Random Forest

Variables used for the COVID Random Forest for Colombia

Database with the following variables:
Atencion:{Casa:1, Fallecido:2, Hospital:3, Hospital UCI:4, Recuperado:5}
Tipo:{En estudio:1, Importado:2, Relacionado:3}
Tipo_recuperacion:{PCR:1, Tiempo:2}
Sexo:{M:0, F:1}
Estado:{Leve:1, Asintomatico:2, Grave:3, Fallecido:4, Moderado:5}
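
The integer codes above could be applied to raw text columns with pandas before modelling. This is a minimal sketch, assuming the raw data stores the labels as strings in columns with the names listed above; the `CODES` dictionary, the `encode` helper and the three sample rows are made up for illustration and are not taken from the real database.

```python
import pandas as pd

# Code books copied from the variable list above.
CODES = {
    "Atencion": {"Casa": 1, "Fallecido": 2, "Hospital": 3, "Hospital UCI": 4, "Recuperado": 5},
    "Tipo": {"En estudio": 1, "Importado": 2, "Relacionado": 3},
    "Tipo_recuperacion": {"PCR": 1, "Tiempo": 2},
    "Sexo": {"M": 0, "F": 1},
    "Estado": {"Leve": 1, "Asintomatico": 2, "Grave": 3, "Fallecido": 4, "Moderado": 5},
}

# Hypothetical raw rows, only to show the mapping.
raw = pd.DataFrame({
    "Atencion": ["Casa", "Hospital UCI", "Recuperado"],
    "Tipo": ["Importado", "Relacionado", "En estudio"],
    "Tipo_recuperacion": ["PCR", "Tiempo", "PCR"],
    "Sexo": ["F", "M", "F"],
    "Estado": ["Leve", "Grave", "Asintomatico"],
})


def encode(df, codes=CODES):
    """Return a copy of df with each coded column replaced by its integer code."""
    out = df.copy()
    for column, mapping in codes.items():
        # Labels missing from the code book become NaN, so they are easy to spot.
        out[column] = out[column].map(mapping)
    return out


encoded = encode(raw)
print(encoded)
```

Keeping the code books in one dictionary means the same mapping can be reused to decode predictions back to labels.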

Predictors: 'Atencion', 'Sexo', 'Tipo', 'Estado'
Target: 'Tipo_recuperacion'

Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.metrics
from sklearn.model_selection import train_test_split
# Feature importance
from sklearn.ensemble import ExtraTreesClassifier

# Full database
arbol = pd.read_csv('COVID/COVID_coursera_v3.csv', encoding='latin1', delimiter=';')
print(arbol.head())
print(arbol.describe())

predictors = arbol[['Atencion', 'Sexo', 'Tipo', 'Estado']]
targets = arbol.Tipo_recuperacion
# Hold out 40% of the rows for testing
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

from sklearn.ensemble import RandomForestClassifier
# Random forest with 20 trees
classifier = RandomForestClassifier(n_estimators=20)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)

# Rows of the confusion matrix are true classes, columns are predicted classes
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))

# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train,tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)

# Run the forest with 1 to 20 trees and see the effect on the accuracy of the prediction
trees = range(1, 21)
accuracy = np.zeros(len(trees))
for idx, n_trees in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n_trees)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)
plt.cla()
plt.plot(trees, accuracy)
plt.xlabel('Number of trees')
plt.ylabel('Accuracy')

Results:

Confusion Matrix
array([[6270, 2225],
       [6141, 2170]])

Model accuracy score: 0.5022015946685707
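
The accuracy score can be recomputed directly from the confusion matrix, which also gives a baseline to compare it against. A short sketch, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above (rows: true class, columns: predicted class).
cm = np.array([[6270, 2225],
               [6141, 2170]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Accuracy of always predicting the most common true class in the test set.
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print(f"accuracy: {accuracy:.4f}")
print(f"majority-class baseline: {majority_baseline:.4f}")
```

Comparing the two numbers shows whether the model does better than a classifier that ignores the predictors entirely.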

Model feature importances, in the order 'Atencion', 'Sexo', 'Tipo', 'Estado':
[0.21062097 0.09510638 0.40366751 0.29060514]
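
To make the ranking easier to read, the importances can be paired with their column names and sorted. A small sketch using the values printed above, which in the code come from the Extra Trees model; the variable names `features` and `ranked` are illustrative.

```python
import numpy as np

# Values and column order as printed above.
importances = np.array([0.21062097, 0.09510638, 0.40366751, 0.29060514])
features = ["Atencion", "Sexo", "Tipo", "Estado"]

# Sort from most to least important.
ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:<10}{score:.4f}")

# scikit-learn normalises tree-ensemble importances so that they sum to 1.
total = float(importances.sum())
print(f"sum: {total:.6f}")
```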

[Plot: number of trees (X) vs. model accuracy score (Y)]


Interpretation:

Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable, Tipo_recuperacion. The following explanatory variables were included as possible contributors: Atencion, Sexo, Tipo and Estado.
The explanatory variables with the highest relative importance scores were Tipo, Estado and Atencion; the lowest was Sexo. The accuracy of the random forest was 50.22%, which is close to what guessing would give for a binary target. Growing multiple trees rather than a single tree added little to the overall accuracy of the model, suggesting that interpretation of a single decision tree may be appropriate. Between 2 and 5 trees the accuracy of the model improves slightly.
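
One way to check whether a forest really adds little over a single tree is to compare cross-validated accuracy for both models. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the COVID database, fixed `random_state` values for repeatability, and 5 folds; none of these choices come from the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 features, binary target.
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "forest (20 trees)": RandomForestClassifier(n_estimators=20, random_state=0),
}

mean_accuracy = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy, averaged over the folds.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {mean_accuracy[name]:.3f}")
```

Averaging over folds reduces the influence of any single random train/test split, which matters when the differences between models are as small as those reported here.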
