Tuesday, June 23, 2020

COVID K-means


VARIABLES:

'ATENCION', 'SEXO', 'TIPO', 'ESTADO', 'TIPO_RECUPERACION', 'TID'

DATA: INS (Instituto Nacional de Salud, the National Institute of Health), Colombia.

CODE:

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
data=pd.read_csv('COVID/COVID_coursera_v4.csv',encoding='latin1', delimiter=';')
data.head()
data.columns = map(str.upper, data.columns)
list(data.columns)
cluster=data[['ATENCION', 'SEXO', 'TIPO', 'ESTADO', 'TIPO_RECUPERACION', 'TID']]
cluster.describe()
clustervar=cluster.copy()
clustervar['ATENCION']=preprocessing.scale(clustervar['ATENCION'].astype('float64'))
clustervar['SEXO']=preprocessing.scale(clustervar['SEXO'].astype('float64'))
clustervar['TIPO']=preprocessing.scale(clustervar['TIPO'].astype('float64'))
clustervar['ESTADO']=preprocessing.scale(clustervar['ESTADO'].astype('float64'))
clustervar['TIPO_RECUPERACION']=preprocessing.scale(clustervar['TIPO_RECUPERACION'].astype('float64'))
clustervar['TID']=preprocessing.scale(clustervar['TID'].astype('float64'))
clustervar.head()
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)
from scipy.spatial.distance import cdist
clusters=range(1,10)
meandist=[]
for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign=model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
    / clus_train.shape[0])
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
from sklearn.decomposition import PCA
model3=KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign=model3.predict(clus_train)
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
clus_train.reset_index(level=0, inplace=True)
cluslist=list(clus_train['index'])
labels=list(model3.labels_)
newlist=dict(zip(cluslist, labels))
newlist
newclus=DataFrame.from_dict(newlist, orient='index')
newclus
newclus.columns = ['cluster']
newclus.reset_index(level=0, inplace=True)
merged_train=pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
merged_train.cluster.value_counts()
clustergrp = merged_train.groupby('cluster').mean()
print ("Clustering variable means by cluster")
print(clustergrp)
edad_data=data['EDAD']
edad_train, edad_test = train_test_split(edad_data, test_size=.3, random_state=123)
edad_train1=pd.DataFrame(edad_train)
edad_train1.reset_index(level=0, inplace=True)
merged_train_all=pd.merge(edad_train1, merged_train, on='index')
sub1 = merged_train_all[['EDAD', 'cluster']].dropna()
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
edadmod = smf.ols(formula='EDAD ~ C(cluster)', data=sub1).fit()
print (edadmod.summary())
print ('means for EDAD by cluster')
m1= sub1.groupby('cluster').mean()
print (m1)
print ('standard deviations for EDAD by cluster')
m2= sub1.groupby('cluster').std()
print (m2)
mc1 = multi.MultiComparison(sub1['EDAD'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())


RESULTS:

A k-means cluster analysis was conducted to identify underlying subgroups of people based on their similarity on 6 variables representing characteristics that could be related to EDAD (age). Clustering variables included three binary variables measuring whether the person was male or female (SEXO), whether the person had recovered or died, and whether recovery was established by PCR or by time alone (TIPO_RECUPERACION), as well as a quantitative variable measuring the time from symptom onset to diagnosis (TID), a scale measuring ESTADO (status), and a scale measuring TIPO. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
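As a minimal sketch of the standardization step, the following computes z-scores with NumPy. The sample values and the helper name `standardize` are illustrative assumptions, not part of the original data. Dividing by the population standard deviation (ddof=0) matches what `sklearn.preprocessing.scale` does.

```python
import numpy as np

def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std (ddof=0)."""
    x = np.asarray(values, dtype="float64")
    return (x - x.mean()) / x.std()

# Hypothetical TID-like values (days), for illustration only.
tid_example = [2, 5, 7, 10, 16]
z = standardize(tid_example)

# After standardizing, the mean is ~0 and the standard deviation is ~1.
print(z.mean(), z.std())
```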
Data were randomly split into a training set that included 70% of the observations (N=29409) and a test set that included 30% of the observations (N=12604). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The average distance of the observations from their nearest cluster centroid was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
Figure 1. Elbow curve of average distance values for the nine cluster solutions

The elbow curve was inconclusive, suggesting that the 3, 5 and 8-cluster solutions might be interpreted. The results below are for an interpretation of the 3-cluster solution.
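The quantity that the code above appends to `meandist` is the mean Euclidean distance from each observation to its nearest cluster centroid. A small NumPy sketch of that calculation follows; the points, the centers and the function name are made up for illustration, and the centers are given by hand rather than fitted by k-means.

```python
import numpy as np

def mean_distance_to_nearest_center(points, centers):
    """Average Euclidean distance from each point to its closest center."""
    points = np.asarray(points, dtype="float64")
    centers = np.asarray(centers, dtype="float64")
    # Pairwise differences, shape (n_points, n_centers, n_features).
    diffs = points[:, None, :] - centers[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1).mean()

# Four toy points forming two tight pairs.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
d1 = mean_distance_to_nearest_center(points, [[5, 1]])            # one center
d2 = mean_distance_to_nearest_center(points, [[0, 1], [10, 1]])   # two centers
print(d1, d2)
```

In this toy case the average distance falls sharply when going from one center to two, which is the kind of bend an elbow curve is meant to reveal.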
A principal component analysis (`PCA(2)` in the code above) was used to reduce the 6 clustering variables down to two variables, labeled canonical variables in the plot, that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster (Figure 2 shown below) indicated that the observations in clusters 1 and 2 were densely packed with relatively low within-cluster variance, and did not overlap very much with the other clusters. Cluster 3 was generally distinct, but its observations were spread out more than those of the other clusters, suggesting higher within-cluster variance. The results of this plot suggest that the best cluster solution may have more than 3 clusters, so it will be especially important to also evaluate the cluster solutions with fewer than 5 clusters.
Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.
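The two plotted variables come from `PCA(2).fit_transform` in the code above. As a rough sketch of what that projection does, the following centers the data and projects it onto the first two principal directions using NumPy's SVD. The random input and the function name are assumptions for illustration; the result should agree with scikit-learn's output only up to the sign of each component.

```python
import numpy as np

def project_to_two_components(data):
    """Center the columns and project onto the first two principal directions."""
    x = np.asarray(data, dtype="float64")
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Synthetic stand-in for the 6 standardized clustering variables.
rng = np.random.default_rng(123)
fake = rng.normal(size=(50, 6))
scores = project_to_two_components(fake)
print(scores.shape)
```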
The means on the clustering variables showed that, compared to the other clusters, people in cluster 0 had moderate levels on the clustering variables. They had relatively high values of ATENCION and TID, but moderate levels of SEXO and TIPO. They also appeared to have fairly low levels of ESTADO. Cluster 1 had higher levels of ESTADO than cluster 0. Cluster 2, on the other hand, had the highest values of TIPO and the lowest levels of SEXO and TID.

In order to externally validate the clusters, an analysis of variance (ANOVA) was conducted to test for significant differences between the clusters on age (EDAD). A Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on EDAD (F=175.5, p<.0001). The Tukey post hoc comparisons showed significant differences between clusters on EDAD. People in cluster 1 had the highest mean age (mean=40.37, sd=19.41), and people in cluster 2 had the lowest (mean=35.10, sd=17.08).
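The F statistic from the `EDAD ~ C(cluster)` model compares between-cluster variation with within-cluster variation. A minimal sketch of that ratio, computed by hand with NumPy on made-up ages, is shown here. The function name and the toy groups are assumptions, and this sketch does not reproduce the statsmodels output or the Tukey comparisons.

```python
import numpy as np

def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    groups = [np.asarray(g, dtype="float64") for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)          # number of groups (clusters)
    n = all_values.size      # total number of observations
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical ages in three clusters, for illustration only.
ages_by_cluster = [[30, 35, 40], [40, 45, 50], [20, 25, 30]]
f_stat = one_way_anova_f(ages_by_cluster)
print(f_stat)
```

A large F value means the cluster means differ by more than the spread inside each cluster would suggest.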





COVID Lasso regression

VARIABLES: 

['ATENCION', 'SEXO', 'TIPO', 'ESTADO', 'TIPO_RECUPERACION', 'EDAD', 'TID']

Predictors:['ATENCION','SEXO','TIPO','ESTADO','TIPO_RECUPERACION','EDAD']

Target: TID (Time from symptoms to diagnosis)

CODE:

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
data=pd.read_csv('COVID/COVID_coursera_v4.csv',encoding='latin1', delimiter=';')
data.head()
data.columns = map(str.upper, data.columns)
list(data.columns)
data_st=data.copy()
from sklearn import preprocessing
data_st['ATENCION']=preprocessing.scale(data_st['ATENCION'].astype('float64'))
data_st['SEXO']=preprocessing.scale(data_st['SEXO'].astype('float64'))
data_st['TIPO']=preprocessing.scale(data_st['TIPO'].astype('float64'))
data_st['ESTADO']=preprocessing.scale(data_st['ESTADO'].astype('float64'))
data_st['TIPO_RECUPERACION']=preprocessing.scale(data_st['TIPO_RECUPERACION'].astype('float64'))
data_st['EDAD']=preprocessing.scale(data_st['EDAD'].astype('float64'))
data_st['TID']=preprocessing.scale(data_st['TID'].astype('float64'))
data_st.head()
predictors= data_st[['ATENCION', 'SEXO', 'TIPO', 'ESTADO', 'TIPO_RECUPERACION', 'EDAD']]
target=data_st.TID
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)
dict(zip(predictors.columns, model.coef_))
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1),'k',label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print ('training data MSE')
print(train_error)
print ('test data MSE')
print(test_error)
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
print ('training data R-square')
print(rsquared_train)
print ('test data R-square')
print(rsquared_test)

RESULTS:
A lasso regression analysis was conducted to identify a subset of variables from a pool of 6 categorical and quantitative predictor variables that best predicted a quantitative response variable measuring TID (time from symptoms to diagnosis). Categorical predictors included sex (SEXO) and four further categorical variables: ATENCION, TIPO, ESTADO and TIPO_RECUPERACION. The only quantitative predictor was EDAD (age). All predictor variables were standardized to have a mean of zero and a standard deviation of one.
Data were randomly split into a training set that included 70% of the observations (N=29409) and a test set that included 30% of the observations (N=12604). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
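The code above reports the mean squared error and, through `model.score`, the R-square on the training and test sets. A small sketch of both formulas with NumPy follows. The arrays and function names are invented for illustration and are not the study's data.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype="float64")
    y_pred = np.asarray(y_pred, dtype="float64")
    return ((y_true - y_pred) ** 2).mean()

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype="float64")
    y_pred = np.asarray(y_pred, dtype="float64")
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Toy observed and predicted values.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
toy_mse = mse(y_true, y_pred)
toy_r2 = r_squared(y_true, y_pred)
print(toy_mse, toy_r2)
```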
Of the 6 predictor variables, all were retained in the selected model. During the estimation process, TIPO and ATENCION were most strongly associated with TID, followed by EDAD. TIPO and ESTADO were negatively associated with TID. ATENCION and EDAD were positively associated with TID. These 6 variables accounted for 7.97% of the variance in the TID response variable.
{'ATENCION': 0.1849622532820641,
 'SEXO': 0.01907117926177939,
 'TIPO': -0.2158063669582188,
 'ESTADO': -0.08533458994192221,
 'TIPO_RECUPERACION': 0.011996060873158066,
 'EDAD': 0.10939438598178146}
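To make the ordering of the predictors easier to read, the reported coefficients can be ranked by absolute size. The sketch below uses the values printed above; the variable names `coefficients`, `ranked` and `negative` are illustrative.

```python
# Coefficients copied from the model output above.
coefficients = {
    'ATENCION': 0.1849622532820641,
    'SEXO': 0.01907117926177939,
    'TIPO': -0.2158063669582188,
    'ESTADO': -0.08533458994192221,
    'TIPO_RECUPERACION': 0.011996060873158066,
    'EDAD': 0.10939438598178146,
}

# Sort by absolute value, largest first, since the predictors are standardized.
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
negative = [name for name, value in coefficients.items() if value < 0]

for name, value in ranked:
    print(f"{name:18s} {value:+.4f}")
print("negatively associated:", negative)
```

Because every predictor and the target were standardized, the coefficient sizes are directly comparable with each other.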


Monday, June 22, 2020

COVID Random Forest

Variables used for the COVID Random Forest for Colombia

Database with the following variables:
Atencion:{Casa:1, Fallecido:2, Hospital:3, Hospital UCI:4, Recuperado:5}
Tipo:{En estudio:1, Importado:2, Relacionado:3}
Tipo_recuperacion:{PCR:1, Tiempo:2}
Sexo:{M:0, F:1}
Estado:{Leve:1, Asintomatico:2, Grave:3, Fallecido:4, Moderado:5}

Predictors: 'Atencion', 'Sexo', 'Tipo', 'Estado'
Target: 'Tipo_recuperacion'

Code:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

# Full database
arbol=pd.read_csv('COVID/COVID_coursera_v3.csv',encoding='latin1', delimiter=';')
arbol.head()
arbol.describe()

predictors = arbol[['Atencion', 'Sexo', 'Tipo', 'Estado']]
targets = arbol.Tipo_recuperacion
pred_train, pred_test, tar_train, tar_test  =   train_test_split(predictors, targets, test_size=.4)

from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=20)
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)

sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)

# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train,tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)

# Run a different number of trees (1 to 20) and see the effect on prediction accuracy
trees=range(1, 21)
accuracy=np.zeros(len(trees))
for idx, n_trees in enumerate(trees):
   classifier=RandomForestClassifier(n_estimators=n_trees)
   classifier=classifier.fit(pred_train,tar_train)
   predictions=classifier.predict(pred_test)
   accuracy[idx]=sklearn.metrics.accuracy_score(tar_test, predictions)
plt.cla()
plt.plot(trees, accuracy)

Results:

Confusion Matrix
array([[6270, 2225],
       [6141, 2170]])

Accuracy model score: 0.5022015946685707

Model feature importances ('Atencion', 'Sexo', 'Tipo', 'Estado'): [0.21062097 0.09510638 0.40366751 0.29060514]

X:Number of trees vs Y:Accuracy model score


Interpretation:

Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable, Tipo_recuperacion. The following explanatory variables were included as possible contributors to the random forest: Atencion, Sexo, Tipo and Estado.
The explanatory variables with the highest relative importance scores were Tipo, Estado and Atencion; the lowest was Sexo. The accuracy of the random forest was 50.22%. Growing multiple trees rather than a single tree added little to the overall accuracy of the model, suggesting that interpretation of a single decision tree may be appropriate. Between 2 and 5 trees, the accuracy of the model improved slightly.
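The reported accuracy can be recomputed from the confusion matrix shown in the results. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, and it assumes the class order is PCR then Tiempo. It also compares the accuracy with the share of the larger class, which is what always predicting that class would score.

```python
# Confusion matrix copied from the results above (rows: true, columns: predicted).
confusion = [[6270, 2225],
             [6141, 2170]]

total = sum(sum(row) for row in confusion)
correct = confusion[0][0] + confusion[1][1]
accuracy = correct / total

# Recall per class: correctly predicted cases divided by all true cases of that class.
recalls = [confusion[i][i] / sum(confusion[i]) for i in range(2)]

# Baseline: accuracy of always predicting the more frequent true class.
majority_share = max(sum(row) for row in confusion) / total

print(round(accuracy, 4), [round(r, 4) for r in recalls], round(majority_share, 4))
```

Under these assumptions the model's accuracy does not exceed the majority-class baseline, which is consistent with the observation that adding trees changes little.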

COVID Decision tree Classification

MACHINE LEARNING FOR DATA ANALYSIS

VARIABLES:
SEXO (SEX): {F:1, M:2}
ESTADO (STATUS): {LEVE:1, ASINTOMATICO:2, GRAVE:3, FALLECIDO:4, MODERADO:5}
ATENCIÓN (ATTENTION): {RECUPERADO:1, FALLECIDO:2}

The data has been retrieved from INS (Instituto Nacional de Salud) in Colombia.

  • predictors = arbol[['Sexo','Atencion']]
  • targets = arbol.Estado
  • sklearn.metrics.accuracy_score(tar_test, predictions): 0.8841027080117861
Code: 

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

arbol=pd.read_csv('COVID/COVID_coursera_v2.csv',encoding='latin1', delimiter=';')
arbol.head()

predictors = arbol[['Sexo_cat','Atencion']]
targets = arbol.Estado_cat
pred_train, pred_test, tar_train, tar_test  =   train_test_split(predictors, targets, test_size=.4)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape

classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)

from sklearn import tree
from io import StringIO
from IPython.display import Image
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
import pydotplus
graph=pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())

Decision Tree






Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a categorical response variable with five levels. All possible separations (categorical) or cut points (quantitative) are tested. For the present analyses, the gini criterion was used to grow the tree.
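As a brief illustration of the gini criterion, the following sketch computes the gini impurity of a node from its class counts. The function name is an assumption. The last example uses the status counts reported further down for recovered females.

```python
def gini_impurity(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

even_split = gini_impurity([5, 5])      # two equally likely classes
pure_node = gini_impurity([10, 0])      # a single class only
# Counts for Leve, Asintomatico, Grave, Fallecido, Moderado in one terminal node.
leaf = gini_impurity([4172, 327, 0, 0, 17])
print(even_split, pure_node, round(leaf, 4))
```

Lower values indicate purer nodes, so the tree prefers splits that reduce this quantity the most.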
The explanatory variables SEXO (SEX) and ATENCION (ATTENTION) were included as possible contributors to a classification tree model evaluating ESTADO (STATUS).
The resulting tree had 7 nodes: 3 internal nodes and 4 terminal nodes.
The first variable to separate the sample into two subgroups is ATTENTION. For ATTENTION values below the split point of 1.5 (recovered), the second split is on SEX: recovered females have the following status (Leve: 4172, Asintomatico: 327, Grave: 0, Fallecido: 0, Moderado: 17), and recovered males have the following status (Leve: 4578, Asintomatico: 767, Grave: 1, Fallecido: 0, Moderado: 24).
For ATTENTION values above 1.5 (deceased), the second split is again on SEX: deceased females have the following status (Leve: 0, Asintomatico: 0, Grave: 0, Fallecido: 308, Moderado: 0), and deceased males have the following status (Leve: 0, Asintomatico: 3, Grave: 0, Fallecido: 493, Moderado: 0).
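The tree described above can be written out as a pair of threshold checks. This is a hand-written sketch, not the fitted scikit-learn model: it assumes the coding from the VARIABLES list (RECUPERADO:1, FALLECIDO:2, F:1, M:2), a split point of 1.5 on both variables, and that each terminal node predicts its most frequent status. The function and variable names are hypothetical.

```python
# Terminal-node counts copied from the interpretation above; keys are (attention, sex).
LEAF_COUNTS = {
    ("recovered", "female"): {"Leve": 4172, "Asintomatico": 327, "Grave": 0, "Fallecido": 0, "Moderado": 17},
    ("recovered", "male"): {"Leve": 4578, "Asintomatico": 767, "Grave": 1, "Fallecido": 0, "Moderado": 24},
    ("deceased", "female"): {"Leve": 0, "Asintomatico": 0, "Grave": 0, "Fallecido": 308, "Moderado": 0},
    ("deceased", "male"): {"Leve": 0, "Asintomatico": 3, "Grave": 0, "Fallecido": 493, "Moderado": 0},
}

def predict_estado(atencion, sexo):
    """Follow the two splits and return the majority status of the terminal node."""
    attention = "recovered" if atencion <= 1.5 else "deceased"
    sex = "female" if sexo <= 1.5 else "male"
    counts = LEAF_COUNTS[(attention, sex)]
    return max(counts, key=counts.get)

# Share of these counted cases that the majority rule classifies correctly.
correct = sum(max(c.values()) for c in LEAF_COUNTS.values())
total = sum(sum(c.values()) for c in LEAF_COUNTS.values())
leaf_accuracy = correct / total

print(predict_estado(1, 1), predict_estado(2, 2), round(leaf_accuracy, 4))
```

Both recovered nodes predict Leve and both deceased nodes predict Fallecido, so under these assumptions the split on SEX changes the class proportions within a node but not the predicted class.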




Covid 19 R Markdown Practice

covid covid Julian Uribe 2023-12-05 ## ── Attaching core tidyverse...