I have a DataFrame with no header, like the one below.
Current df:
Col 0 Col 1 Col 2 Col3
2345 abcd mobile oneplus
4567 abbb internet explorer
mozilla 2345 cccc dddd
eeee bbbb 1234 hello
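I want the numeric value (it is the ID) to be the first column (column index 0). If the numeric value has shifted to Col 1 in any row, combine the Col 1 and Col 2 values and put the result in Col 1, then move the Col3 value into Col 2, and then use the Col 0 value of the next row as the Col3 value of the previous row.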
The expected output is as follows:
Col 0 Col 1 Col 2 Col3
2345 abcd mobile oneplus
4567 abbbinternet explorer mozilla
2345 ccccdddd eeee bbbb
1234 hello
Any help would be greatly appreciated. Thanks!
Posted on 2018-07-31 06:02:28
You can do this with stack, set_index, and unstack:
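from io import StringIO
import pandas as pd
txt = StringIO("""2345 abcd mobile oneplus
4567 abbb internet explorer
mozilla 2345 cccc dddd
eeee bbbb 1234 hello""")
df = pd.read_csv(txt, header=None, sep=r'\s+')
# flatten the frame row by row into a single Series of tokens
df = df.stack().reset_index(drop=True)
# index each token by a running count of the numeric IDs seen so far
df1 = df.to_frame().set_index(df.str.isnumeric().cumsum())
# number the tokens within each ID group, then pivot them back into columns
df_out = df1.set_index(df1.groupby(level=0).cumcount(), append=True)[0].unstack()
df_out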
Output:
0 1 2 3 4
1 2345 abcd mobile oneplus NaN
2 4567 abbb internet explorer mozilla
3 2345 cccc dddd eeee bbbb
4 1234 hello NaN NaN NaN
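The output above regroups the rows correctly but still has five columns, whereas the question asks for the surplus value to be joined into Col 1. A minimal sketch of that last step follows, assuming that every group starts with a numeric ID and that the target width is four columns. The helper name `fit_row` and its `width` parameter are made up for illustration and are not part of pandas.

```python
from io import StringIO

import pandas as pd

raw = StringIO("""2345 abcd mobile oneplus
4567 abbb internet explorer
mozilla 2345 cccc dddd
eeee bbbb 1234 hello""")

# Same idea as the answer above: flatten, then start a new group at each numeric token.
tokens = pd.read_csv(raw, header=None, sep=r"\s+", dtype=str).stack().reset_index(drop=True)
groups = tokens.groupby(tokens.str.isnumeric().cumsum())


def fit_row(values, width=4):
    """Hypothetical helper: squeeze one group into `width` cells.

    Surplus tokens are concatenated into the cell right after the ID,
    and short groups are padded with None.
    """
    values = list(values)
    extra = max(len(values) - width, 0)
    merged = [values[0], "".join(values[1:2 + extra])] + values[2 + extra:]
    return merged + [None] * (width - len(merged))


result = pd.DataFrame(
    [fit_row(group) for _, group in groups],
    columns=["Col 0", "Col 1", "Col 2", "Col3"],
)
print(result)
```

Joining the surplus tokens in plain Python keeps the rule in one place. If a group could ever start with a non-numeric token (for example, stray text before the first ID), that case would need separate handling.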
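Posted on 2018-07-31 08:51:43
It may be easier to do some cleanup before reading this data into pandas. Assuming your data is a CSV, the code below is not the prettiest, but it should do the job: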
import numpy as np
import pandas as pd
import re
filename = "<path to file>.csv"
new_file = "<path to where fixed csv should go>.csv"
with open(filename, "r") as infile:
    text = infile.read()
# get rid of existing new line characters
text = text.replace("\n", ",")
# put a new line before every number
out = re.sub("([0-9]+)", "\n\\1", text)
# write out
with open(new_file, "w+") as outfile:
    outfile.write(out)
# read in the fixed csv -- need to provide a number of columns
# greater than you'll need (using 50 here), and then cut the excess
df = pd.read_csv(new_file, header=None, names=range(50)).dropna(how="all", axis=1)
# jam as many columns into column1 as necessary to get just 3 after ID
df["cols_to_jam"] = df[df.columns[1:]].notnull().sum(axis=1) - 3
def jam(row):
    if row["cols_to_jam"] > 0:
        new = ""
        for col in range(1, row["cols_to_jam"] + 2):
            new += str(row[col])
    else:
        new = row[1]
    return new
idx = df[0]
col1 = df.apply(jam, axis=1)
# blank out jammed values (.loc replaces .ix, which was removed in pandas 1.0)
for i, row in df.iterrows():
    if row["cols_to_jam"] > 0:
        for col in range(1, row["cols_to_jam"] + 2):
            df.loc[i, col] = np.nan
    else:
        df.loc[i, 1] = np.nan
del df["cols_to_jam"], df[0]
remaining_cols = df.apply(lambda x: list(x.dropna().tail(2).values), axis=1).apply(pd.Series)
remaining_cols.columns = ["col2", "col3"]
# put it all together
output = idx.to_frame("id").join(col1.to_frame("col1")).join(remaining_cols)
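The pre-cleaning step above can also be sketched in memory, without the intermediate file. This is only an illustration under some assumptions: the function name `regroup` is made up here, the sample text stands in for the file contents, and fields are assumed to be separated by commas or whitespace. Unlike the regex substitution above, it starts a new row only when a whole field is numeric, so a value that merely contains digits is not split.

```python
import re


def regroup(text):
    """Hypothetical helper: return rows that each start at a fully numeric field."""
    # Split on commas and/or whitespace and drop empty fields.
    fields = [field for field in re.split(r"[\s,]+", text) if field]
    rows = []
    for field in fields:
        if field.isdigit() or not rows:
            # A numeric field is treated as an ID and opens a new row.
            rows.append([field])
        else:
            rows[-1].append(field)
    return rows


# Stand-in for the file contents (assumed to be comma-separated).
sample = (
    "2345,abcd,mobile,oneplus\n"
    "4567,abbb,internet,explorer\n"
    "mozilla,2345,cccc,dddd\n"
    "eeee,bbbb,1234,hello\n"
)
rows = regroup(sample)
for row in rows:
    print(row)
```

The resulting lists have uneven lengths, so the step that joins the surplus fields into the first column after the ID is still needed afterwards.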
https://stackoverflow.com/questions/51602750