有如下 Pandas DataFrame:
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd inp = [{<span class="hljs-string">'c1'</span>:<span class="hljs-number">10</span>, <span class="hljs-string">'c2'</span>:<span class="hljs-number">100</span>}, {<span class="hljs-string">'c1'</span>:<span class="hljs-number">11</span>,<span class="hljs-string">'c2'</span>:<span class="hljs-number">110</span>}, {<span class="hljs-string">'c1'</span>:<span class="hljs-number">12</span>,<span class="hljs-string">'c2'</span>:<span class="hljs-number">120</span>}] df = pd.DataFrame(inp) <span class="hljs-keyword">print</span> df
上面代码输出:
<span class="hljs-attribute">c1</span> c2 <span class="hljs-number">0</span> <span class="hljs-number">10</span> <span class="hljs-number">100</span> <span class="hljs-number">1</span> <span class="hljs-number">11</span> <span class="hljs-number">110</span> <span class="hljs-number">2</span> <span class="hljs-number">12</span> <span class="hljs-number">120</span>
现在需要遍历上面DataFrame的行。对于每一行,都希望能够通过列名访问对应的元素(单元格中的值)。也就是说,需要类似如下的功能:
<span class="hljs-keyword">for</span> row <span class="hljs-keyword">in</span> df.rows: <span class="hljs-keyword">print</span> row[<span class="hljs-string">'c1'</span>], row[<span class="hljs-string">'c2'</span>]
Pandas 可以这样做吗?
我找到了similar question。但这并不能给我需要的答案,里面提到:
<span class="hljs-keyword">for</span> date, row <span class="hljs-keyword">in</span> df.T.iteritems():
要么
<span class="hljs-keyword">for</span> row <span class="hljs-keyword">in</span> df.iterrows():
但是我不明白row
对象是什么,以及我如何使用它。
最佳解决方案
要以 Pandas 的方式迭代遍历DataFrame的行,可以使用:
- DataFrame.iterrows()
<span class="hljs-keyword">for</span> index, row <span class="hljs-keyword">in</span> df.iterrows(): <span class="hljs-keyword">print</span> row[<span class="hljs-string">"c1"</span>], row[<span class="hljs-string">"c2"</span>]
- DataFrame.itertuples()
<span class="hljs-keyword">for</span> row <span class="hljs-keyword">in</span> df.itertuples(index=<span class="hljs-keyword">True</span>, name=<span class="hljs-string">'Pandas'</span>): <span class="hljs-keyword">print</span> getattr(row, <span class="hljs-string">"c1"</span>), getattr(row, <span class="hljs-string">"c2"</span>)
itertuples()
应该比iterrows()
快
但请注意,根据文档(目前 Pandas 0.19.1):
- iterrows:数据的
dtype
可能不是按行匹配的,因为iterrows返回一个系列的每一行,它不会保留行的dtypes(dtypes跨DataFrames列保留)* - iterrows:不要修改行
你不应该修改你正在迭代的东西。这不能保证在所有情况下都能正常工作。根据数据类型的不同,迭代器返回一个副本而不是一个视图,写入它将不起作用。
<span class="hljs-attr">new_df</span> = df.apply(lambda x: x * <span class="hljs-number">2</span>)
- itertuples:列名称将被重命名为位置名称,如果它们是无效的Python标识符,重复或以下划线开头。对于大量的列(> 255),返回常规元组。
第二种方案: apply
您也可以使用df.apply()
遍历行并访问函数的多个列。
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">valuation_formula</span><span class="hljs-params">(x, y)</span>:</span> <span class="hljs-keyword">return</span> x * y * <span class="hljs-number">0.5</span> df[<span class="hljs-string">'price'</span>] = df.apply(<span class="hljs-keyword">lambda</span> row: valuation_formula(row[<span class="hljs-string">'x'</span>], row[<span class="hljs-string">'y'</span>]), axis=<span class="hljs-number">1</span>)
第三种方案:iloc
您可以使用df.iloc函数,如下所示:
<span class="hljs-selector-tag">for</span> <span class="hljs-selector-tag">i</span> <span class="hljs-selector-tag">in</span> <span class="hljs-selector-tag">range</span>(0, <span class="hljs-selector-tag">len</span>(<span class="hljs-selector-tag">df</span>)): <span class="hljs-selector-tag">print</span> <span class="hljs-selector-tag">df</span><span class="hljs-selector-class">.iloc</span><span class="hljs-selector-attr">[i]</span><span class="hljs-selector-attr">['c1']</span>, <span class="hljs-selector-tag">df</span><span class="hljs-selector-class">.iloc</span><span class="hljs-selector-attr">[i]</span><span class="hljs-selector-attr">['c2']</span>
第四种方案:略麻烦,但是更高效,将DataFrame转为List
您可以编写自己的实现namedtuple
的迭代器
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> namedtuple <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">myiter</span><span class="hljs-params">(d, cols=None)</span>:</span> <span class="hljs-keyword">if</span> cols <span class="hljs-keyword">is</span> <span class="hljs-keyword">None</span>: v = d.values.tolist() cols = d.columns.values.tolist() <span class="hljs-keyword">else</span>: j = [d.columns.get_loc(c) <span class="hljs-keyword">for</span> c <span class="hljs-keyword">in</span> cols] v = d.values[:, j].tolist() n = namedtuple(<span class="hljs-string">'MyTuple'</span>, cols) <span class="hljs-keyword">for</span> line <span class="hljs-keyword">in</span> iter(v): <span class="hljs-keyword">yield</span> n(*line)
这相当于pd.DataFrame.itertuples
,但是效率更高。
将自定义函数用于给定的DataFrame:
<span class="hljs-keyword">list</span>(myiter(df)) [MyTuple(c1=<span class="hljs-number">10</span>, c2=<span class="hljs-number">100</span>), MyTuple(c1=<span class="hljs-number">11</span>, c2=<span class="hljs-number">110</span>), MyTuple(c1=<span class="hljs-number">12</span>, c2=<span class="hljs-number">120</span>)]
或与pd.DataFrame.itertuples
:
<span class="hljs-keyword">list</span>(df.itertuples(index=<span class="hljs-keyword">False</span>)) [Pandas(c1=<span class="hljs-number">10</span>, c2=<span class="hljs-number">100</span>), Pandas(c1=<span class="hljs-number">11</span>, c2=<span class="hljs-number">110</span>), Pandas(c1=<span class="hljs-number">12</span>, c2=<span class="hljs-number">120</span>)]
全面的测试
我们测试了所有可用列:
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">iterfullA</span><span class="hljs-params">(d)</span>:</span> <span class="hljs-keyword">return</span> list(myiter(d)) <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">iterfullB</span><span class="hljs-params">(d)</span>:</span> <span class="hljs-keyword">return</span> list(d.itertuples(index=<span class="hljs-keyword">False</span>)) <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">itersubA</span><span class="hljs-params">(d)</span>:</span> <span class="hljs-keyword">return</span> list(myiter(d, [<span class="hljs-string">'col3'</span>, <span class="hljs-string">'col4'</span>, <span class="hljs-string">'col5'</span>, <span class="hljs-string">'col6'</span>, <span class="hljs-string">'col7'</span>])) <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">itersubB</span><span class="hljs-params">(d)</span>:</span> <span class="hljs-keyword">return</span> list(d[[<span class="hljs-string">'col3'</span>, <span class="hljs-string">'col4'</span>, <span class="hljs-string">'col5'</span>, <span class="hljs-string">'col6'</span>, <span class="hljs-string">'col7'</span>]].itertuples(index=<span class="hljs-keyword">False</span>)) res = pd.DataFrame( index=[<span class="hljs-number">10</span>, <span class="hljs-number">30</span>, <span class="hljs-number">100</span>, <span class="hljs-number">300</span>, <span class="hljs-number">1000</span>, <span class="hljs-number">3000</span>, <span class="hljs-number">10000</span>, <span class="hljs-number">30000</span>], columns=<span class="hljs-string">'iterfullA iterfullB itersubA itersubB'</span>.split(), dtype=float ) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> res.index: d = pd.DataFrame(np.random.randint(<span class="hljs-number">10</span>, size=(i, <span class="hljs-number">10</span>))).add_prefix(<span class="hljs-string">'col'</span>) <span class="hljs-keyword">for</span> j <span class="hljs-keyword">in</span> res.columns: stmt = <span class="hljs-string">'{}(d)'</span>.format(j) setp = <span class="hljs-string">'from __main__ import d, {}'</span>.format(j) res.at[i, j] = timeit(stmt, setp, number=<span class="hljs-number">100</span>) res.groupby(res.columns.str[<span class="hljs-number">4</span>:<span class="hljs-number">-1</span>], axis=<span class="hljs-number">1</span>).plot(loglog=<span class="hljs-keyword">True</span>);