Feature Engineering (3): Data Preprocessing, Normalization


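The previous article covered interval scaling for data preprocessing; this article covers data normalization.

Since the topic is normalization, it is worth comparing it with the standardization discussed earlier. The first difference is the dimension each one operates on. Suppose the matrix to process is $m \times n$: $m$ samples, each with $n$ features.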

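Standardization works column by column, so the object it processes is a single feature across all samples, that is, the $k$-th column:

$$X_{:,k} \in \mathbb{R}^{m}, \quad k \in \{1, \dots, n\}$$

Normalization, by contrast, works row by row, so the object it processes is a single sample across all features, that is, the $k$-th row:

$$X_{k,:} \in \mathbb{R}^{n}, \quad k \in \{1, \dots, m\}$$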
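To make the column-versus-row distinction concrete, here is a minimal sketch in plain Python. The toy matrix `X` and the helper names `standardize_columns` and `normalize_rows` are made up for illustration, and the standardization shown uses the population standard deviation and assumes no column is constant.

```python
import math

# Hypothetical toy matrix: m = 3 samples (rows), n = 2 features (columns).
X = [[1.0, 10.0],
     [2.0, 20.0],
     [3.0, 30.0]]

def standardize_columns(X):
    """Column-wise: shift and scale each feature to zero mean, unit variance."""
    m, n = len(X), len(X[0])
    out = [[0.0] * n for _ in range(m)]
    for k in range(n):
        col = [row[k] for row in X]
        mean = sum(col) / m
        # Population standard deviation; assumes the column is not constant.
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / m)
        for i in range(m):
            out[i][k] = (X[i][k] - mean) / std
    return out

def normalize_rows(X):
    """Row-wise: divide each sample by its own L2 norm (assumes non-zero rows)."""
    out = []
    for row in X:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row])
    return out

Z = standardize_columns(X)  # statistics are computed per column
U = normalize_rows(X)       # statistics are computed per row
print(Z)
print(U)
```

Note that the three rows of this `X` all point in the same direction, so row normalization maps them to the same unit vector, while column standardization keeps them distinct.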

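Normalization rescales each sample's feature vector into a unit vector. Unit vectors are convenient for dot products and for computing similarity between samples: the dot product of two L2-normalized vectors is their cosine similarity.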
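As a small illustration of why unit vectors help with similarity, the sketch below checks that the dot product of two L2-normalized vectors equals the cosine similarity of the original vectors. The vectors `a` and `b` and the helper names are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def l2_normalize(x):
    """Scale x to unit L2 length (assumes x is not the zero vector)."""
    norm = math.sqrt(dot(x, x))
    return [v / norm for v in x]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b, computed on the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical example vectors.
a = [3.0, 4.0]
b = [4.0, 3.0]

cos_raw = cosine_similarity(a, b)
cos_unit = dot(l2_normalize(a), l2_normalize(b))
print(cos_raw, cos_unit)  # both are approximately 0.96
```

Once the samples are normalized, the similarity computation no longer needs the two norm terms in the denominator.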

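The formula for L2 normalization is as follows, where $x_j$ is the $j$-th component of a sample's feature vector and $y_j$ is the normalized component:

$$ y_j=\frac{x_j}{\sqrt{\sum_{i=1}^{n}x_i^2}} $$

The formula for L1 normalization is as follows:

$$ y_j=\frac{x_j}{\sum_{i=1}^{n}|x_i|} $$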
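The two formulas can be sketched directly in plain Python. The function names `l1_normalize` and `l2_normalize` and the sample vector are illustrative, and returning an all-zero vector unchanged is a choice made here to avoid dividing by zero, not something the formulas themselves specify.

```python
import math

def l2_normalize(x):
    """Divide each component by the square root of the sum of squares."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return list(x)  # assumption: leave an all-zero vector as-is
    return [v / norm for v in x]

def l1_normalize(x):
    """Divide each component by the sum of absolute values."""
    norm = sum(abs(v) for v in x)
    if norm == 0.0:
        return list(x)  # assumption: leave an all-zero vector as-is
    return [v / norm for v in x]

# Hypothetical sample with a negative component, to show why the
# L1 denominator uses absolute values rather than a plain sum.
x = [3.0, -4.0]

y2 = l2_normalize(x)  # denominator is sqrt(9 + 16) = 5
y1 = l1_normalize(x)  # denominator is |3| + |-4| = 7
print(y2)
print(y1)
```

After L2 normalization the squared components sum to 1; after L1 normalization the absolute values sum to 1. Signs are preserved in both cases.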

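sklearn code

from sklearn.datasets import load_iris
from sklearn.preprocessing import Normalizer
irisdata = load_iris()
# Normalizer defaults to norm="l2": each row is scaled to unit L2 length
tmp = Normalizer().fit_transform(irisdata.data)
print(tmp[0:5])

The results are as follows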

[[ 0.80377277  0.55160877  0.22064351  0.0315205 ]
 [ 0.82813287  0.50702013  0.23660939  0.03380134]
 [ 0.80533308  0.54831188  0.2227517   0.03426949]
 [ 0.80003025  0.53915082  0.26087943  0.03478392]
 [ 0.790965    0.5694948   0.2214702   0.0316386 ]]
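The first output row can be checked by hand. The sketch below assumes the first sample of the iris dataset is `[5.1, 3.5, 1.4, 0.2]` (sepal length, sepal width, petal length, petal width) and applies the L2 formula to it in plain Python, without sklearn.

```python
import math

# Assumed first iris sample; values hard-coded here rather than loaded.
first_sample = [5.1, 3.5, 1.4, 0.2]

# L2 norm of the row: square root of the sum of squared components.
norm = math.sqrt(sum(v * v for v in first_sample))

# Divide every component by the row's norm.
normalized = [v / norm for v in first_sample]

# First row of the sklearn output shown above, for comparison.
expected = [0.80377277, 0.55160877, 0.22064351, 0.0315205]

print(normalized)
print(all(abs(p - q) < 1e-7 for p, q in zip(normalized, expected)))
```

The agreement with the printed row confirms that `Normalizer()` with its default settings is doing row-wise L2 normalization.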

The Spark code is as follows

from pyspark.ml.feature import Normalizer

dataFrame = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# p=1.0 selects L1 normalization; p=2.0 would select L2 normalization
normalizer = Normalizer(inputCol="features", outputCol="normFeatures", p=1.0)
l1NormData = normalizer.transform(dataFrame)
l1NormData.show()
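Copyright notice: original article from this site, published by admin on 2017-08-17, 2349 characters in total.
Repost notice: unless otherwise stated, articles on this site are released under the CC-4.0 license; please credit the source when reposting.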