Feature Selection (1): Based on Variance

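The posts in this feature selection series are all drawn from the scikit-learn (sklearn) machine learning library and are, for the most part, direct translations of its documentation. How well a trained model performs depends in part on the features it is trained on, so preparing those features is an important step.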

Removing features with low variance

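[Remove features with low variance. Put plainly, such a feature takes nearly the same value for every sample, so it has no power to tell samples apart.]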

VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

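[In other words, VarianceThreshold is a very simple criterion for feature selection: it drops every feature whose variance does not reach the specified threshold. By default it drops only zero-variance features, that is, features with the same value in every sample.]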
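To make the idea concrete, here is a minimal pure-Python sketch of variance thresholding. It is an illustration only, not scikit-learn's implementation: the helper name `drop_low_variance` and the sample data are hypothetical, and the sketch assumes the population variance is used and that a column is kept only when its variance is strictly greater than the threshold.

```python
from statistics import pvariance


def drop_low_variance(rows, threshold=0.0):
    """Keep only the columns whose population variance exceeds `threshold`.

    Hypothetical helper for illustration; `rows` is a list of equal-length lists.
    """
    columns = list(zip(*rows))  # transpose: one tuple per feature
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    return [[row[i] for i in keep] for row in rows]


# Made-up data: the first feature is constant, so its variance is 0.
samples = [
    [1, 10, 0.5],
    [1, 20, 0.7],
    [1, 30, 0.9],
]

filtered = drop_low_variance(samples)  # default threshold of 0
print(filtered)
```

With the default threshold of 0, only the constant first column is dropped, which mirrors the default behaviour described above.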

As an example, suppose that we have a dataset with boolean features, and we want to remove all features that are either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables, and the variance of such variables is given by

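[For example, suppose we have a dataset of boolean features and want to drop every feature that is 0, or that is 1, in more than 80% of the samples. Each such feature follows a Bernoulli (0-1) distribution, whose variance is p(1 - p), where p is the probability that the feature equals 1.]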

$$\mathrm{Var}[X] = p(1 - p)$$

so we can select using the threshold .8 * (1 - .8):

>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

As expected, VarianceThreshold has removed the first column, which has a probability $p = 5/6 > 0.8$ of containing a zero.
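The result can also be checked by hand. The sketch below recomputes the Bernoulli variance p(1 - p) of each column of the same X with exact fractions and compares it with the threshold 0.8 * (1 - 0.8). It uses only the standard library and assumes, as an illustration rather than a statement about the library's internals, that a column is kept when its variance is strictly greater than the threshold.

```python
from fractions import Fraction

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

# 0.8 * (1 - 0.8), written with exact fractions: 4/25 = 0.16
threshold = Fraction(8, 10) * (1 - Fraction(8, 10))

variances = []
for column in zip(*X):
    p = Fraction(sum(column), len(column))  # share of ones in the column
    variances.append(p * (1 - p))           # Bernoulli variance p(1 - p)

# Indices of the columns whose variance is above the threshold
kept = [i for i, v in enumerate(variances) if v > threshold]

print(variances)  # [Fraction(5, 36), Fraction(2, 9), Fraction(1, 4)]
print(kept)       # [1, 2]
```

The first column has variance 5/36 (about 0.139), which falls below 0.16, while the other two columns (2/9 and 1/4) stay above it. That matches the two-column output shown above.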

admin
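Copyright notice: this is an original article from this site, published by admin on 2017-04-23, 1,105 characters in total.
Reposting notice: unless otherwise stated, all articles on this site are published under the CC-4.0 license. Please credit the source when reposting.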