Feature Engineering (4): Binarization in Data Preprocessing


The previous article covered interval scaling of data; this one covers binarization.

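The idea is simple: as the name suggests, each value is mapped to either 0 or 1, much like binarization in image processing, which turns a picture into pure black and white.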


Let's get to the point.

First, here is the binarization formula:

$$ y = \begin{cases} 0 & \text{if } x \le \theta \\ 1 & \text{if } x > \theta \end{cases} $$

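In the formula above, \(\theta\) is the chosen threshold. Each feature value is compared with it: values greater than the threshold become 1, and values less than or equal to the threshold become 0.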
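To make the rule concrete, here is a minimal pure-Python sketch of the formula. The function name `binarize` and the sample values are made up for illustration and are not part of any library.

```python
def binarize(values, theta=0.0):
    """Map each value to 1 if it is strictly greater than theta, else 0."""
    return [1 if x > theta else 0 for x in values]


# Hypothetical sample values, chosen so that one of them equals the threshold.
sample = [-1.0, 0.0, 0.5, 2.0]
result = binarize(sample, theta=0.5)
print(result)  # [0, 0, 0, 1]
```

Note that 0.5 maps to 0 here: a value equal to the threshold falls on the "less than or equal" side of the formula.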

The sklearn function

 

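from sklearn.datasets import load_iris
from sklearn.preprocessing import Binarizer
irisdata = load_iris()  # assumption: irisdata is the iris dataset loaded via load_iris()
tmp = Binarizer().fit_transform(irisdata.data)  # the default threshold is 0.0
print(tmp[0:5])  # first five rows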

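Output (every iris feature value is positive and the default threshold is 0.0, so every entry becomes 1):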

[[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]
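Because the default threshold leaves every value at 1, a non-default threshold gives a more informative picture. The following is a small sketch using the `threshold` parameter of sklearn's `Binarizer`; the two sample rows and the threshold of 3.0 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Two illustrative rows of four features each (sample values for this sketch).
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2]])

# Values strictly greater than 3.0 become 1; all others become 0.
out = Binarizer(threshold=3.0).fit_transform(X)
print(out.tolist())  # [[1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
```

The 3.0 in the second row becomes 0, which matches the formula: only values strictly above the threshold map to 1.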

The Spark function

>>> from pyspark.ml.feature import Binarizer
>>> df = sqlContext.createDataFrame([(0.5,)], ["values"])
>>> binarizer = Binarizer(threshold=1.0, inputCol="values", outputCol="features")
>>> binarizer.transform(df).head().features
0.0
>>> # setParams sets the binarizer's parameters
>>> binarizer.setParams(outputCol="freqs").transform(df).head().freqs
0.0
>>> params = {binarizer.threshold: -0.5, binarizer.outputCol: "vector"}
>>> binarizer.transform(df, params).head().vector
1.0
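transform(dataset, params=None): params can be a dictionary whose keys are Param members of the instance (such as binarizer.threshold); its values override the instance's own settings for that call.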
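Copyright notice: this is an original article of this site, published by admin on 2017-08-17, 857 characters in total.
Reposting: unless otherwise noted, articles on this site are released under the CC-4.0 license; please credit the source when reposting.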