python - Classification: PCA and logistic regression using sklearn -


Step 0: problem description

I have a classification problem, i.e. I want to predict a binary target based on a collection of numerical features, using logistic regression, after running a principal component analysis (PCA).

I have 2 datasets: df_train and df_valid (training set and validation set respectively) as pandas DataFrames, containing the features and the target. As a first step, I have used the get_dummies pandas function to transform the categorical variables into booleans (a minimal sketch of that step follows the example below). For example, I have:

import numpy as np
import pandas as pd

n_train = 10
np.random.seed(0)
df_train = pd.DataFrame({"f1": np.random.random(n_train),
                         "f2": np.random.random(n_train),
                         "f3": np.random.randint(0, 2, n_train).astype(bool),
                         "target": np.random.randint(0, 2, n_train).astype(bool)})

In [36]: df_train
Out[36]:
         f1        f2     f3 target
0  0.548814  0.791725  False  False
1  0.715189  0.528895   True   True
2  0.602763  0.568045  False   True
3  0.544883  0.925597   True   True
4  0.423655  0.071036   True   True
5  0.645894  0.087129   True  False
6  0.437587  0.020218   True   True
7  0.891773  0.832620   True  False
8  0.963663  0.778157  False  False
9  0.383442  0.870012   True   True

n_valid = 3
np.random.seed(1)
df_valid = pd.DataFrame({"f1": np.random.random(n_valid),
                         "f2": np.random.random(n_valid),
                         "f3": np.random.randint(0, 2, n_valid).astype(bool),
                         "target": np.random.randint(0, 2, n_valid).astype(bool)})

In [44]: df_valid
Out[44]:
         f1        f2     f3 target
0  0.417022  0.302333  False  False
1  0.720324  0.146756   True  False
2  0.000114  0.092339   True   True
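As mentioned above, the categorical variables were encoded with get_dummies beforehand. Here is a minimal sketch of that step, using a hypothetical categorical column "color" that is not part of the example data: encoding the two frames together (and splitting afterwards) keeps the dummy columns consistent between the training and validation sets.

import pandas as pd

# hypothetical categorical column, not part of the example data above
df_train_cat = pd.DataFrame({"color": ["red", "blue", "red"]})
df_valid_cat = pd.DataFrame({"color": ["blue", "green", "blue"]})

# encoding each frame separately could produce different dummy columns,
# so encode them together and split afterwards
combined = pd.concat([df_train_cat, df_valid_cat], keys=["train", "valid"])
dummies = pd.get_dummies(combined, columns=["color"])
train_dummies = dummies.xs("train")
valid_dummies = dummies.xs("valid")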

I would like to apply a PCA to reduce the dimensionality of the problem, then use LogisticRegression from sklearn to train and make predictions on the validation set, but I'm not sure the procedure I follow is correct. Here is what I do:

Step 1: PCA

The idea is that I need to transform both the training and the validation set in the same way with PCA. In other words, I can not perform the PCA separately on each set; otherwise, they would be projected on different eigenvectors.

from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # assume we keep 2 components, it doesn't matter
newdf_train = pca.fit_transform(df_train.drop("target", axis=1))
newdf_valid = pca.transform(df_valid.drop("target", axis=1))  # not sure here if this is right
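As a side note not raised in the question, a quick sanity check after fitting the PCA is to look at how much variance the retained components explain; explained_variance_ratio_ is an attribute of a fitted sklearn PCA object:

# fraction of the variance explained by each of the 2 retained components
print(pca.explained_variance_ratio_)
# total fraction of the variance kept after the reduction
print(pca.explained_variance_ratio_.sum())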

Step 2: logistic regression

It's not necessary, but I prefer to keep things as DataFrames:

features_train = pd.DataFrame(newdf_train)
features_valid = pd.DataFrame(newdf_valid)

and then perform the logistic regression:

from sklearn.linear_model import LogisticRegression

cls = LogisticRegression()
cls.fit(features_train, df_train["target"])
predictions = cls.predict(features_valid)
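Since df_valid also contains the true target, the predictions can be checked directly; a minimal sketch using sklearn's accuracy_score (not something the question asks for):

from sklearn.metrics import accuracy_score

# compare the predictions with the true labels of the validation set
print(accuracy_score(df_valid["target"], predictions))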

I think step 2 is correct, but I have more doubts about step 1: is this the way I'm supposed to chain PCA, then a classifier?

There's a Pipeline in sklearn for exactly this purpose.

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pca = PCA(n_components=2)
cls = LogisticRegression()

pipe = Pipeline([('pca', pca), ('logistic', cls)])
# fit on the raw features: the PCA step is applied inside the pipeline,
# so there is no need to transform the data beforehand
pipe.fit(df_train.drop("target", axis=1), df_train["target"])
predictions = pipe.predict(df_valid.drop("target", axis=1))
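A convenient side effect of the Pipeline is that both steps can be tuned together, addressing their parameters with the step name followed by a double underscore. This is only a sketch on top of the answer above, with arbitrary grid values, assuming a recent sklearn where GridSearchCV lives in sklearn.model_selection:

from sklearn.model_selection import GridSearchCV

# tune the number of PCA components and the regularisation strength jointly;
# the grid values below are just illustrative choices
param_grid = {"pca__n_components": [1, 2],
              "logistic__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(df_train.drop("target", axis=1), df_train["target"])
print(search.best_params_)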
