Tuesday, June 23, 2020

COVID Lasso regression

VARIABLES: 

['ATENCION', 'SEXO', 'TIPO', 'ESTADO', 'TIPO_RECUPERACION', 'EDAD', 'TID']

Predictors: ['ATENCION', 'SEXO', 'TIPO', 'ESTADO', 'TIPO_RECUPERACION', 'EDAD']

Target: TID (Time from symptoms to diagnosis)

CODE:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
data=pd.read_csv('COVID/COVID_coursera_v4.csv',encoding='latin1', delimiter=';')
data.head()
data.columns = map(str.upper, data.columns)
list(data.columns)
data_st=data.copy()
from sklearn import preprocessing
# Standardize every variable (predictors and target) to mean 0 and standard deviation 1
for col in ['ATENCION', 'SEXO', 'TIPO', 'ESTADO', 'TIPO_RECUPERACION', 'EDAD', 'TID']:
    data_st[col] = preprocessing.scale(data_st[col].astype('float64'))
data_st.head()
predictors = data_st[['ATENCION', 'SEXO', 'TIPO', 'ESTADO', 'TIPO_RECUPERACION', 'EDAD']]
target=data_st.TID
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)
print(pred_train.shape, pred_test.shape)
print(tar_train.shape, tar_test.shape)
model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)
dict(zip(predictors.columns, model.coef_))
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1),'k',label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print ('training data MSE')
print(train_error)
print ('test data MSE')
print(test_error)
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
print ('training data R-square')
print(rsquared_train)
print ('test data R-square')
print(rsquared_test)
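
The R-square values printed by the code above come from `model.score`, which for scikit-learn regressors returns the coefficient of determination. The short sketch below shows how that quantity relates to the mean squared error. The two arrays are made-up numbers chosen only for illustration; they are not taken from the COVID data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed and predicted values (illustration only).
y_true = np.array([0.2, -1.1, 0.7, 1.5, -0.4, -0.9])
y_pred = np.array([0.1, -0.6, 0.3, 0.9, -0.1, -0.5])

mse = mean_squared_error(y_true, y_pred)

# R-square = 1 - MSE / variance of the observed values
# (population variance, ddof=0, computed on the same set).
r2_from_mse = 1.0 - mse / np.var(y_true)
r2 = r2_score(y_true, y_pred)

print(r2, r2_from_mse)
```

When the target has a variance close to one, as a standardized target roughly does, the MSE and 1 minus R-square should come out close to each other.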

RESULTS:
A lasso regression analysis was conducted to identify the subset of variables, from a pool of 6 categorical and quantitative predictors, that best predicted a quantitative response variable, TID (time from symptoms to diagnosis). Lasso was chosen because it can shrink some coefficients to zero, which may give a more interpretable model with fewer predictors. The categorical predictors were gender (SEXO), ATENCION, TIPO, ESTADO, and TIPO_RECUPERACION; the only quantitative predictor was age (EDAD). All predictor variables, and the TID response, were standardized to have a mean of zero and a standard deviation of one.
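
As a minimal sketch of what the standardization step does, the snippet below applies `preprocessing.scale` to a small array and compares it with the same calculation done by hand. The age values are hypothetical and serve only to illustrate the transformation.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical ages, used only to illustrate the transformation.
edad = np.array([23.0, 35.0, 47.0, 59.0, 71.0])

# scale() subtracts the mean and divides by the population
# standard deviation (ddof=0).
edad_scaled = preprocessing.scale(edad)

# The same result computed by hand.
edad_manual = (edad - edad.mean()) / edad.std()

print(edad_scaled.mean(), edad_scaled.std())
```

Putting every predictor on the same scale matters for lasso because the penalty acts on coefficient size, so predictors measured in larger units would otherwise be penalized differently.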
Data were randomly split into a training set that included 70% of the observations (N=29409) and a test set that included 30% of the observations (N=12604). The least angle regression algorithm with 10-fold cross-validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross-validated mean squared error at each step was used to identify the best subset of predictor variables.
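
The split-and-fit procedure described here can be sketched on synthetic data. Everything in the snippet below is an assumption made for illustration: the column names `x1`, `x2`, `x3`, the sample size, and the simulated relationship are invented, and only the 70/30 split, the random seed, and the 10-fold `LassoLarsCV` call mirror the analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split

# Synthetic data (illustration only): x3 has no true effect on y.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=['x1', 'x2', 'x3'])
y = 0.5 * X['x1'] - 0.3 * X['x2'] + rng.normal(scale=1.0, size=n)

# 70% training, 30% test, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

# Least angle regression path; the penalty is chosen by 10-fold CV
# on the training set only.
model = LassoLarsCV(cv=10).fit(X_train, y_train)

coefs = dict(zip(X.columns, model.coef_))
retained = [name for name, c in coefs.items() if c != 0]

print(X_train.shape, X_test.shape)
print(retained)
print(model.score(X_test, y_test))  # R-square on the held-out test set
```

A predictor counts as retained when its coefficient is non-zero at the penalty value selected by cross-validation.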
All 6 predictor variables were retained in the selected model. TIPO and ATENCION were most strongly associated with TID, followed by EDAD. TIPO and ESTADO were negatively associated with TID, while ATENCION and EDAD were positively associated with it; SEXO and TIPO_RECUPERACION had small positive coefficients. Together, the 6 variables accounted for 7.97% of the variance in the TID response variable. The estimated coefficients on the standardized variables were:
{'ATENCION': 0.1849622532820641,
 'SEXO': 0.01907117926177939,
 'TIPO': -0.2158063669582188,
 'ESTADO': -0.08533458994192221,
 'TIPO_RECUPERACION': 0.011996060873158066,
 'EDAD': 0.10939438598178146}
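
To make the ordering of the predictors easier to read, the reported coefficients can be sorted by absolute size. The sketch below copies the values from the dictionary above; the names `ranked` and `sign` are introduced here only for illustration.

```python
# Coefficients as reported in the results.
coefs = {'ATENCION': 0.1849622532820641,
         'SEXO': 0.01907117926177939,
         'TIPO': -0.2158063669582188,
         'ESTADO': -0.08533458994192221,
         'TIPO_RECUPERACION': 0.011996060873158066,
         'EDAD': 0.10939438598178146}

# Sort from the largest to the smallest absolute coefficient.
ranked = sorted(coefs.items(), key=lambda item: abs(item[1]), reverse=True)

for name, value in ranked:
    sign = 'positive' if value > 0 else 'negative'
    print(f'{name:<18} {value:+.4f} ({sign})')
```

Because all variables were standardized, the absolute sizes are comparable across predictors, which is what makes this ranking meaningful.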


