Feature Selection (1): Based on Variance

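The posts in this feature selection series are based on the documentation of the scikit-learn (sklearn) machine learning library and are, for the most part, direct translations of it. The quality of a trained model depends in part on the features it is trained on, so choosing those features is an important step.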

VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

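[In short: VarianceThreshold is one of the simplest criteria for feature selection. It drops every feature whose variance fails to reach the specified threshold. With the default setting it drops only zero-variance features, that is, features that take the same value in every sample.]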
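The default behaviour described above can be illustrated with a minimal sketch. The toy matrix below is made up for this illustration and is not from the scikit-learn documentation; its middle column is constant, so it is the only one expected to be dropped.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (hypothetical, for illustration only): the middle column is constant.
X_toy = np.array([[1, 7, 0],
                  [2, 7, 1],
                  [3, 7, 0]])

selector = VarianceThreshold()           # default threshold is 0.0
X_reduced = selector.fit_transform(X_toy)

mask = selector.get_support()            # boolean mask of the columns that were kept
print(mask)                              # the constant column is marked False
print(X_reduced)                         # the remaining two columns
```

`get_support()` is useful when the column names are needed afterwards, since `fit_transform` returns only the values of the surviving columns.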

As an example, suppose that we have a dataset with boolean features, and we want to remove all features that are either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables, and the variance of such variables is given by

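[For example, suppose we have a dataset of boolean features and want to remove every feature that is 0, or that is 1, in more than 80% of the samples. Each such feature follows a Bernoulli (0-1) distribution, whose variance is given by the following formula.]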

$$\mathrm{Var}[X] = p(1 - p)$$

so we can select using the threshold .8 * (1 - .8):

>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
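
As a sanity check on the result, the sketch below recomputes the per-column variance with p(1 - p) and compares it with the threshold. The variable names are chosen for this illustration. The first column contains a one in only 1 of 6 samples, so it is zero in 5/6 of them (more than 80%), and its variance of 5/36 (about 0.139) falls below the threshold of 0.16. That is why it is the column that gets removed.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]])

p = X.mean(axis=0)             # fraction of ones in each column
bernoulli_var = p * (1 - p)    # Var[X] = p(1 - p) for a Bernoulli feature
threshold = 0.8 * (1 - 0.8)    # 0.16

sel = VarianceThreshold(threshold=threshold).fit(X)
kept = sel.get_support()       # True for columns whose variance exceeds the threshold

print(bernoulli_var)           # matches the variances the selector computed
print(sel.variances_)
print(kept)                    # first column dropped, the other two kept
```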