Python - Classification: PCA and logistic regression using sklearn
Step 0: problem description
I have a classification problem, i.e. I want to predict a binary target based on a collection of numerical features, using logistic regression, and after running a principal components analysis (PCA).
I have 2 datasets: df_train and df_valid (training set and validation set respectively), both pandas data frames containing the features and the target. As a first step, I have used the get_dummies pandas function to transform all the categorical variables into booleans. For example, I would have:
    import numpy as np
    import pandas as pd

    n_train = 10
    np.random.seed(0)
    df_train = pd.DataFrame({"f1": np.random.random(n_train),
                             "f2": np.random.random(n_train),
                             "f3": np.random.randint(0, 2, n_train).astype(bool),
                             "target": np.random.randint(0, 2, n_train).astype(bool)})

    In [36]: df_train
    Out[36]:
             f1        f2     f3  target
    0  0.548814  0.791725  False   False
    1  0.715189  0.528895   True    True
    2  0.602763  0.568045  False    True
    3  0.544883  0.925597   True    True
    4  0.423655  0.071036   True    True
    5  0.645894  0.087129   True   False
    6  0.437587  0.020218   True    True
    7  0.891773  0.832620   True   False
    8  0.963663  0.778157  False   False
    9  0.383442  0.870012   True    True

    n_valid = 3
    np.random.seed(1)
    df_valid = pd.DataFrame({"f1": np.random.random(n_valid),
                             "f2": np.random.random(n_valid),
                             "f3": np.random.randint(0, 2, n_valid).astype(bool),
                             "target": np.random.randint(0, 2, n_valid).astype(bool)})

    In [44]: df_valid
    Out[44]:
             f1        f2     f3  target
    0  0.417022  0.302333  False   False
    1  0.720324  0.146756   True   False
    2  0.000114  0.092339   True    True
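Since the get_dummies step is applied to two separate data frames, the encoded columns can end up differing when a category is missing from one of them. A minimal sketch of one way to keep them aligned follows; the `color` column and the `raw_*` / `enc_*` names are made up for illustration and do not come from the question.

```python
import pandas as pd

# Hypothetical raw data with one categorical column ("color" is an illustrative name).
raw_train = pd.DataFrame({"f1": [0.1, 0.5, 0.9], "color": ["red", "blue", "green"]})
raw_valid = pd.DataFrame({"f1": [0.3, 0.7], "color": ["blue", "red"]})

# One-hot encode each set separately.
enc_train = pd.get_dummies(raw_train, columns=["color"])
enc_valid = pd.get_dummies(raw_valid, columns=["color"])

# Align the validation columns to the training columns; a category that is
# absent from the validation set ("green" here) becomes an all-False column.
enc_valid = enc_valid.reindex(columns=enc_train.columns, fill_value=False)
```

With the columns aligned, any transformer fitted on the training frame sees the same feature layout when applied to the validation frame.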
I apply PCA to reduce the dimensionality of my problem, then use LogisticRegression from sklearn to train and get predictions on my validation set, but I'm not sure the procedure I follow is correct. Here is what I do:
Step 1: PCA
The idea is that I need to transform both my training and validation sets the same way with PCA. In other words, I cannot perform PCA separately on each set. Otherwise, they would be projected onto different eigenvectors.
    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)  # assume we keep 2 components, but it doesn't matter
    newdf_train = pca.fit_transform(df_train.drop("target", axis=1))
    newdf_valid = pca.transform(df_valid.drop("target", axis=1))  # not sure here if this is right
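To make the "same eigenvectors" point concrete, here is a small sketch on random stand-in data (not the question's frames). It checks that `transform` on the validation set centers with the mean learned from the training set and projects onto the training components, assuming the default `whiten=False`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.random_sample((10, 3))  # stand-in for the training features
X_valid = rng.random_sample((3, 3))   # stand-in for the validation features

pca = PCA(n_components=2)
Z_train = pca.fit_transform(X_train)  # learns mean_ and components_ from train only
Z_valid = pca.transform(X_valid)      # reuses them, nothing is re-fitted

# Reproduce transform() by hand: center with the training mean, then
# project onto the training components.
manual = (X_valid - pca.mean_) @ pca.components_.T
same_projection = np.allclose(Z_valid, manual)
```

So calling `fit_transform` on the training set and then only `transform` on the validation set does place both sets in the same component space.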
Step 2: logistic regression
It's not necessary, but I prefer to keep things as dataframes:
    features_train = pd.DataFrame(newdf_train)
    features_valid = pd.DataFrame(newdf_valid)
And then I perform the logistic regression:
    from sklearn.linear_model import LogisticRegression

    cls = LogisticRegression()
    cls.fit(features_train, df_train["target"])
    predictions = cls.predict(features_valid)
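For checking the predictions against the validation target, something along these lines may help. It is only a sketch on random stand-in features with hand-written targets (the arrays below are illustrative, not the question's frames), using `accuracy_score` and `predict_proba` from sklearn.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)
X_train, X_valid = rng.random_sample((10, 3)), rng.random_sample((3, 3))
# Illustrative boolean targets; both classes must appear in the training set.
y_train = np.array([0, 1, 1, 1, 1, 0, 1, 0, 0, 1], dtype=bool)
y_valid = np.array([0, 0, 1], dtype=bool)

# Fit PCA and the classifier on the training set only.
pca = PCA(n_components=2).fit(X_train)
cls = LogisticRegression().fit(pca.transform(X_train), y_train)

# Apply the fitted PCA to the validation set, then predict and score.
Z_valid = pca.transform(X_valid)
predictions = cls.predict(Z_valid)
probabilities = cls.predict_proba(Z_valid)[:, 1]  # estimated P(target is True)
accuracy = accuracy_score(y_valid, predictions)
```

With only three validation rows the accuracy is very coarse, so this is about the mechanics rather than a meaningful score.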
I think step 2 is correct, but I have more doubts about step 1: is this the way I'm supposed to chain PCA, then a classifier?
There's a pipeline in sklearn for this purpose.
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    pca = PCA(n_components=2)
    cls = LogisticRegression()

    # chain the two steps: PCA first, then the classifier
    pipe = Pipeline([('pca', pca), ('logistic', cls)])
    pipe.fit(features_train, df_train["target"])
    predictions = pipe.predict(features_valid)
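One caveat: in the question, features_train and features_valid already hold PCA output, so with a pipeline you would probably want to pass the raw feature columns instead and let the pipeline do the projection. A self-contained sketch on random stand-in data (the arrays are illustrative, not the question's frames):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X_train, X_valid = rng.random_sample((10, 3)), rng.random_sample((3, 3))
y_train = np.array([0, 1, 1, 1, 1, 0, 1, 0, 0, 1], dtype=bool)  # illustrative target

pipe = Pipeline([("pca", PCA(n_components=2)),
                 ("logistic", LogisticRegression())])

# fit() calls fit_transform on the PCA step, then fits the classifier on its output.
pipe.fit(X_train, y_train)

# predict() only calls transform on the PCA step before predicting.
predictions = pipe.predict(X_valid)

# Same result as chaining the fitted steps by hand.
fitted_pca = pipe.named_steps["pca"]
fitted_cls = pipe.named_steps["logistic"]
manual = fitted_cls.predict(fitted_pca.transform(X_valid))
```

This is the same fit-on-train, transform-on-validation procedure as in step 1 of the question; the pipeline just packages it so the two steps cannot be fitted on different data by accident.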