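Given a pandas.DataFrame with a column that holds mixed data types, for example
import numpy as np
import pandas as pd
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})
how can I get the data type of each individual object in the column (Series)? Suppose I want to modify all entries of the Series that are of a certain type, for instance multiply all integers by some factor.
I can derive a mask iteratively and use it in loc, like this: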
m = np.array([isinstance(v, int) for v in df['mixed']])
df.loc[m, 'mixed'] *= 10
# df
# mixed
# 0 2020-10-04 00:00:00
# 1 9990
# 2 a string
This does the trick, but I wonder whether there is a more idiomatic pandas way to do it?
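As a small illustration of the first half of the question (seeing which Python type each element has), the following sketch maps the built-in type over the column. The variable names elem_types and type_counts are only illustrative, and this is one possible way to inspect the column, not necessarily the fastest.

```python
import pandas as pd

df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})

# Map every element to its Python type; the result is a Series of type objects.
elem_types = df['mixed'].map(type)

# Count how many entries of each type the column holds, keyed by type name.
type_counts = df['mixed'].map(lambda v: type(v).__name__).value_counts()

print(elem_types.tolist())
print(type_counts.to_dict())
```

Comparing elem_types against a type object (for example elem_types == int) then gives a boolean mask that can be passed to loc.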
Answered 2020-10-13 06:51:56
One approach is to test for numeric values with to_numeric and errors='coerce', then keep the entries that are not missing:
m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print(df)
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
Unfortunately, this is slow. Some other ideas, timed on a larger frame:
N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})
In [29]: %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.26 s ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [30]: %timeit np.array([isinstance(v, int) for v in df['mixed']])
1.12 s ± 77.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [31]: %timeit pd.to_numeric(df['mixed'], errors='coerce').notna()
3.07 s ± 55.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [34]: %timeit ([isinstance(v, int) for v in df['mixed']])
909 ms ± 8.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [35]: %timeit df.mixed.map(lambda x : type(x))=='int'
877 ms ± 8.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [36]: %timeit df.mixed.map(lambda x : type(x) =='int')
842 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [37]: %timeit df.mixed.map(lambda x : isinstance(x, int))
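807 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pandas cannot use vectorization efficiently here by default because the values are mixed, so an element-wise approach is necessary.
Note that In [35] and In [36] compare a type object with the string 'int', which is always False, so those two lines serve only as timing references and do not produce a usable mask.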
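Since an element-wise test is needed anyway, it can be wrapped in a small helper. The sketch below is an assumption-laden variant of the isinstance mask above: the helper name int_mask is hypothetical, it uses numbers.Integral so that NumPy integer scalars are matched as well, and it explicitly excludes booleans because isinstance(True, int) is True in Python. The extra rows in the sample frame are added here only to show those two cases.

```python
import numbers

import numpy as np
import pandas as pd


def int_mask(series):
    """Return a boolean mask of entries that are integers, excluding bools.

    Hypothetical helper for illustration; not part of pandas.
    """
    return series.map(
        lambda v: isinstance(v, numbers.Integral)
        and not isinstance(v, (bool, np.bool_))
    ).astype(bool)


# Sample data extended with a bool, a NumPy integer, and a float.
df = pd.DataFrame(
    {'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string', True, np.int64(5), 100.3]}
)

m = int_mask(df['mixed'])
df.loc[m, 'mixed'] *= 10  # only the integer entries are scaled
print(df)
```

Whether booleans and NumPy integers should count as "integers" depends on the data, so the two isinstance checks are the part to adjust.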
Answered 2020-10-04 14:24:28
You can also call type on each element:
m = df.mixed.map(lambda x : type(x).__name__)=='int'
df.loc[m, 'mixed']*=10
df
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
Answered 2020-10-14 14:25:10
If you want to multiply all the "numbers", you can use the following.
Let's use pd.to_numeric with the parameter errors='coerce' together with fillna:
df['mixed'] = (pd.to_numeric(df['mixed'], errors='coerce') * 10).fillna(df['mixed'])
df
Output:
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
Let's add a float to the column:
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string', 100.3]})
Using @BenYo's approach:
m = df.mixed.map(lambda x : type(x).__name__)=='int'
df.loc[m, 'mixed']*=10
df
Output (note that only the integer 999 is multiplied by 10):
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
3 100.3
Using @jezrael's approach, or similarly the solution from this answer:
m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print(df)
# Or this solution
# df['mixed'] = (pd.to_numeric(df['mixed'], errors='coerce') * 10).fillna(df['mixed'])
Output (note that all numbers are multiplied by 10):
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
3 1003
https://stackoverflow.com/questions/64195782