
How To Find Most Relevant Dimensions/Columns To Separate Known Classes

I have data acquired from thousands of cancer cells. 60 measurements per cell stored in a pandas dataframe. The cells are classified into 3 populations using another method. I wou

Solution 1:

You want to use Linear Discriminant Analysis (LDA) instead of PCA.

PCA only finds the components that capture the most variance in the complete data set, ignoring which class each sample belongs to. In contrast, you want the components that best distinguish between the different classes, which is what LDA is for.

Have a look at this example:

http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_lda.html

If you have trouble adapting this to your data, feel free to provide sample data and some LDA code and let us know where you are stuck.

[EDIT: sample code is here: http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_lda.html ]
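As a rough illustration of how LDA can point back to the original columns, here is a minimal sketch. The data is synthetic (generated with `make_classification`), and the column names `m00`..`m59` are made up to stand in for the 60 measurements; only the first five columns are built to carry class information. Reading `scalings_` on standardized data is one reasonable way to rank columns, not the only one.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 3000 "cells", 60 measurements, 3 classes.
X, y = make_classification(
    n_samples=3000,
    n_features=60,
    n_informative=5,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    shuffle=False,  # keeps the 5 informative features in the first 5 columns
    random_state=0,
)
columns = [f"m{i:02d}" for i in range(60)]  # hypothetical column names
df = pd.DataFrame(X, columns=columns)

# Standardize so the LDA weights are comparable across measurements.
X_scaled = StandardScaler().fit_transform(df)

# With 3 classes, LDA yields at most n_classes - 1 = 2 discriminant axes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)

# scalings_ holds the weight of each original column on each discriminant axis.
loadings = pd.DataFrame(lda.scalings_[:, :2], index=columns, columns=["LD1", "LD2"])

# Rank columns by their largest absolute weight on either axis.
top_columns = loadings.abs().max(axis=1).sort_values(ascending=False).head(10)
print(top_columns)
```

`X_lda` can be scatter-plotted (colored by `y`) to check how well the two axes separate the populations, while `top_columns` suggests which measurements drive that separation.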

Solution 2:

You can determine feature importances via a Random Forest as well. Rather than finding components that best distinguish between classes, this will tell you the relative importance of your original features (which sounds like what you were asking for). Here's a link:

http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
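A minimal sketch of the random forest approach, again on synthetic data with made-up column names (`m00`..`m59`) in place of the real measurements. It uses the impurity-based `feature_importances_` attribute; the number of trees is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 1500 "cells", 60 measurements, 3 known populations.
X, y = make_classification(
    n_samples=1500,
    n_features=60,
    n_informative=5,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    shuffle=False,  # keeps the 5 informative features in the first 5 columns
    random_state=0,
)
columns = [f"m{i:02d}" for i in range(60)]  # hypothetical column names
df = pd.DataFrame(X, columns=columns)

# Fit a forest on the original columns, using the known class labels.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(df, y)

# One importance value per original column; the values sum to 1.
importances = pd.Series(forest.feature_importances_, index=df.columns)
ranked = importances.sort_values(ascending=False)
print(ranked.head(10))
```

Because the importances refer to the original columns rather than to derived components, `ranked` can be read directly as a list of candidate measurements to keep.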

Solution 3:

Just to add to the above discussion, it is important to understand the difference between the following two points:

  • Feature Selection in very high-dimensional datasets: see Feature Selection in sklearn. Using a Random Forest is another way to select features, based on feature importances.
  • Dimensionality Reduction: This is a technique where the dataset is transformed into a new feature space of lower dimensionality than the original one. PCA, LDA and kernel PCA are such techniques. PCA is an unsupervised technique, whereas LDA is a supervised technique. PCA and LDA work well if the dataset is linearly separable. If the dataset is not linearly separable, kernel PCA can be used to transform the data into a new lower-dimensional subspace that is suitable for a linear classifier.
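The distinction between the two points can be sketched as follows. This is only an illustration on synthetic data with made-up column names; `SelectKBest` with `f_classif` is one of several selectors in sklearn, and `k=5` and the RBF kernel are arbitrary choices.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: 600 samples, 60 measurements, 3 classes.
X, y = make_classification(
    n_samples=600,
    n_features=60,
    n_informative=5,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)
df = pd.DataFrame(X, columns=[f"m{i:02d}" for i in range(60)])  # hypothetical names

# Feature selection: keeps a subset of the ORIGINAL columns (uses the labels).
selector = SelectKBest(score_func=f_classif, k=5).fit(df, y)
selected_columns = list(df.columns[selector.get_support()])
print("selected columns:", selected_columns)

# Dimensionality reduction: builds NEW axes that combine all columns.
X_pca = PCA(n_components=2).fit_transform(df)  # unsupervised, linear
X_kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(df)  # unsupervised, nonlinear
print(X_pca.shape, X_kpca.shape)
```

The selector returns names of existing measurements, while PCA and kernel PCA return new coordinates that no longer correspond to any single measurement.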
