
Feature Selection (2): Univariate Feature Selection

3 years ago (2017-04-23) · 1635 views

Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method:

【Univariate feature selection screens out the best features based on per-feature statistical tests; it can be seen as a preprocessing step before training an estimator.】

• SelectKBest removes all but the k highest scoring features
• （select the k highest-scoring features）
• SelectPercentile removes all but a user-specified highest scoring percentage of features
• （the user specifies what percentage of the features to keep; for example, at 10% the top-scoring 10% of features are selected）
• using common univariate statistical tests for each feature: false positive rate SelectFpr, false discovery rate SelectFdr, or family wise error SelectFwe.
• （select features using the FPR, FDR, or FWE criteria）
• GenericUnivariateSelect performs univariate feature selection with a configurable strategy. This allows selecting the best univariate selection strategy with a hyper-parameter search estimator.

For instance, we can perform a chi-squared (χ²) test on the samples to retrieve only the two best features as follows:

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)
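The same idea can be expressed as a percentage instead of a fixed k. As a minimal sketch (assuming the same iris data as above and that scikit-learn is installed), SelectPercentile keeps the top-scoring fraction of features:

```python
# Sketch: keep the top 50% of features by chi-squared score.
# With 4 iris features, 50% means 2 features are retained.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, chi2

X, y = load_iris(return_X_y=True)
X_new = SelectPercentile(chi2, percentile=50).fit_transform(X, y)
print(X_new.shape)  # (150, 2)
```

Because the threshold is relative, the same percentile setting adapts automatically to datasets with different numbers of features.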


These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile):

【Selection is decided from the returned scores and p-values; the p-value indicates how likely the observed score for a feature would be under the null hypothesis of no association.

A p-value represents a probability. For pattern-analysis tools, the p-value is the probability that the observed spatial pattern was produced by some random process. When p is very small, the observed pattern is very unlikely to have arisen by chance (a small-probability event), so you can reject the null hypothesis.】
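After fitting, these selector objects expose the per-feature scores and p-values directly. A minimal sketch on the iris data (attribute names are scikit-learn's own):

```python
# Inspect the per-feature chi-squared scores and p-values behind a
# SelectKBest decision (iris data assumed).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.scores_)       # one chi-squared score per feature
print(selector.pvalues_)      # one p-value per feature; smaller = stronger evidence
print(selector.get_support())  # boolean mask marking the 2 selected features
```

Inspecting `scores_` and `pvalues_` is a quick sanity check that the chosen k or percentile actually separates informative features from noise.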

The methods based on F-test estimate the degree of linear dependency between two random variables. On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.

Wikipedia's explanation of the F-test:

【An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis. It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled. Exact "F-tests" mainly arise when the models have been fitted to the data using least squares. The name was coined by George W. Snedecor, in honour of Sir Ronald A. Fisher. Fisher initially developed the statistic as the variance ratio in the 1920s.[1]】
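The difference between the F-test and mutual information can be seen on synthetic data with a purely nonlinear dependency. This is a sketch under an assumed toy setup (a quadratic target, hypothetical variable names):

```python
# Contrast the F-test (detects linear dependence) with mutual
# information (detects any dependence) on a nonlinear target.
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(1000, 2))
# y depends on feature 0 only, and only quadratically.
y = X[:, 0] ** 2 + 0.05 * rng.randn(1000)

f_scores, _ = f_regression(X, y)
mi_scores = mutual_info_regression(X, y, random_state=0)
# The F-test scores stay near zero for both features (no linear
# correlation), while mutual information clearly ranks feature 0 first.
print("F-test scores:", f_scores)
print("MI scores:", mi_scores)
```

As the quoted paragraph notes, the price of this flexibility is that mutual information, being nonparametric, needs more samples for an accurate estimate.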

Feature selection with sparse data

If you use sparse data (i.e. data represented as sparse matrices), chi2, mutual_info_regression, and mutual_info_classif will deal with the data without making it dense.

Warning

Beware not to use a regression scoring function with a classification problem; you will get useless results.

Deeplearn, all rights reserved. Unless otherwise noted, posts are original and licensed under CC BY-NC-SA; when reposting, please credit "Feature Selection (2): Univariate Feature Selection".
