Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method:
- SelectKBest removes all but the k highest scoring features
- SelectPercentile removes all but a user-specified highest scoring percentage of features
- using common univariate statistical tests for each feature: false positive rate SelectFpr, false discovery rate SelectFdr, or family wise error SelectFwe (i.e. the FPR, FDR and FWE criteria)
GenericUnivariateSelect allows performing univariate feature selection with a configurable strategy. This makes it possible to select the best univariate selection strategy with a hyper-parameter search estimator.
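As a hedged sketch of the idea above: the pipeline step names, the choice of LogisticRegression as the downstream estimator, and the grid values are illustrative assumptions, not part of the original text.

```python
# Sketch (assumed setup): tuning GenericUnivariateSelect's strategy with a
# hyper-parameter search; the estimator and grid values are illustrative.
from sklearn.datasets import load_iris
from sklearn.feature_selection import GenericUnivariateSelect, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("select", GenericUnivariateSelect(chi2)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 'mode' chooses the selection strategy and 'param' its parameter
# (k for "k_best", the percentile for "percentile"), so each mode
# gets its own sensible parameter range.
param_grid = [
    {"select__mode": ["k_best"], "select__param": [1, 2, 3]},
    {"select__mode": ["percentile"], "select__param": [25, 50, 75]},
]
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
```

The selection strategy and its parameter are then tuned jointly with the classifier via cross-validation.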
For instance, we can perform a chi-squared test on the samples to retrieve only the two best features as follows:
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)
These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile):
A p-value represents a probability. For a pattern-analysis tool, the p-value is the probability that the observed spatial pattern was created by some random process. When p is very small, the observed pattern is unlikely to have been produced by a random process (a low-probability event), so you can reject the null hypothesis.
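To make the score/p-value contract concrete, here is a small sketch; reusing the iris data from the earlier example is an assumption for illustration.

```python
# Sketch: a scoring function such as chi2 returns one score and one
# p-value per feature; selectors rank features by these values.
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)
scores, pvalues = chi2(X, y)

# One entry per feature; p-values lie in [0, 1], and a small p-value
# means the feature's association with y is unlikely under chance.
```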
- For regression: f_regression, mutual_info_regression
- For classification: chi2, f_classif, mutual_info_classif
The methods based on F-test estimate the degree of linear dependency between two random variables. On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.
【An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis. It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled. Exact “F-tests” mainly arise when the models have been fitted to the data using least squares. The name was coined by George W. Snedecor, in honour of Sir Ronald A. Fisher. Fisher initially developed the statistic as the variance ratio in the 1920s.】
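The contrast above can be sketched on synthetic data; the data-generating process below is an illustrative assumption. The F-test scores the linear feature highly, while mutual information also detects the nonlinear one.

```python
# Sketch: F-test vs. mutual information on synthetic regression data.
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 3))
# y depends linearly on feature 0, nonlinearly (a sine) on feature 1,
# and not at all on feature 2.
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * rng.standard_normal(1000)

f_stat, _ = f_regression(X, y)                     # linear dependence only
mi = mutual_info_regression(X, y, random_state=0)  # any dependence
```

The F-statistic is near zero for the sine feature because its linear correlation with y vanishes, whereas its mutual information is high.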
Feature selection with sparse data
If you use sparse data (i.e. data represented as sparse matrices), chi2, mutual_info_regression and mutual_info_classif will deal with the data without making it dense.
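A quick sketch of the sparse case; converting the iris data to a SciPy CSR matrix is an assumption made purely for illustration.

```python
# Sketch: chi2-based selection on a sparse matrix; the output stays sparse.
from scipy.sparse import csr_matrix, issparse
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
X_sparse = csr_matrix(X)  # sparse representation of the dense data

X_new = SelectKBest(chi2, k=2).fit_transform(X_sparse, y)
```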
Beware not to use a regression scoring function with a classification problem; you will get useless results.