I have this data:
Metric ProcId TimeStamp Value
CPU proce_123 Mar-11-2022 11:00:00 1.4453125
CPU proce_126 Mar-11-2022 11:00:00 0.058320373
CPU proce_123 Mar-11-2022 11:00:00 0.095274389
CPU proce_000 Mar-11-2022 11:00:00 0.019654088
CPU proce_144 Mar-11-2022 11:00:00 0.019841269
CPU proce_1 Mar-11-2022 11:00:00 0.234741792
CPU proce_100 Mar-11-2022 11:00:00 5.32945776
CPU proce_57777 Mar-11-2022 11:00:00 0.25390625
CPU proce_0000 Mar-11-2022 11:00:00 0.019349845
CPU proce_123 Mar-11-2022 11:00:00 0.019500781
CPU proce_123 Mar-11-2022 11:00:00 2.32421875
CPU proce_123 Mar-11-2022 11:00:00 68.3903656
CPU proce_123 Mar-11-2022 11:00:00 0.057781201
CPU proce_123 Mar-11-2022 11:00:00 0.416666627
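This is just a sample dataframe; the actual dataframe has thousands of rows. I need to go through the ProcId column of this dataframe in chunks, and for each iteration I need to build a string that joins the ProcIds of that chunk together.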
For example, given a chunk size of 3, the strings need to look like this:
proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
proce_123%22%2C%2proce_126%22%2C%2proce_111%22%29
Note that after the third item of a chunk we need to add %22%29. After the first and second items we need to add %22%2C%2.
I can do something like this to get the chunks:
n = 3 #size of chunks
chunks = [] #list of chunks
for i in range(0, len(id), n):
    chunks.append(id[i:i + n])
I don't know how to combine these 3 items into one string and add the other strings at the end.
Posted on 2022-03-16 10:22:29
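Avoid iterating over a dataframe in a for loop. Your performance will almost certainly be better if you use a combination of groupby, merge, shift and other array-oriented numpy or pandas operations.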
Derive the chunk ids from the data by integer division of the index (assuming an incrementing integer index):
chunk_size = 3
df['ChunkId'] = df.index // chunk_size
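The integer division above relies on the index being 0, 1, 2, and so on. If the frame has been filtered or re-sorted, that may no longer hold. A minimal sketch of a positional alternative, assuming a made-up frame named `demo` with a non-sequential index (neither the name nor the data comes from the question):

```python
import numpy as np
import pandas as pd

chunk_size = 3

# Hypothetical frame whose index labels are not 0..n-1 (e.g. after filtering)
demo = pd.DataFrame(
    {"ProcId": ["proce_123", "proce_126", "proce_000", "proce_144", "proce_1"]},
    index=[10, 11, 42, 43, 99],
)

# Chunk ids derived from row position rather than from the index labels
demo["ChunkId"] = np.arange(len(demo)) // chunk_size
print(demo["ChunkId"].tolist())  # [0, 0, 0, 1, 1]
```

Calling `df.reset_index(drop=True)` first would be another way to get the same effect, at the cost of discarding the original labels.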
Add a suffix to each ProcId to create a new column, ProcEnds, then concatenate the values of that column within each chunk:
df['ProcEnds'] = (df.ProcId + '%22%2C%2').where(
    df.index % chunk_size != chunk_size - 1,
    df.ProcId + '%22%29')
# note DataFrame.where replaces values with other when cond is False
df['ChunkString'] = df.groupby('ChunkId').ProcEnds.transform(lambda x: x.str.cat())
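To see what `where` and `transform` are doing, a small self-contained run may help. This is only a sketch: the frame `sample` and its five short ids are invented for illustration, and the helper name `is_last` is not part of the answer.

```python
import pandas as pd

chunk_size = 3
sample = pd.DataFrame({"ProcId": ["a", "b", "c", "d", "e"]})  # made-up ids

sample["ChunkId"] = sample.index // chunk_size
# True for the last row of every full chunk (positions 2, 5, 8, ...)
is_last = sample.index % chunk_size == chunk_size - 1

# Series.where keeps the value where the condition is True, otherwise uses `other`
sample["ProcEnds"] = (sample.ProcId + "%22%2C%2").where(~is_last, sample.ProcId + "%22%29")

# Concatenate the suffixed ids per chunk and broadcast the result to each row
sample["ChunkString"] = sample.groupby("ChunkId").ProcEnds.transform(lambda x: x.str.cat())
print(sample["ChunkString"].tolist())
```

The first three rows share `a%22%2C%2b%22%2C%2c%22%29`, while the incomplete last chunk ends with the separator, `d%22%2C%2e%22%2C%2`, because none of its rows sits at the final position of a chunk.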
Optionally, drop the ChunkId and ProcEnds columns to get output with only the additional ChunkString column:
df = df.drop(columns=['ChunkId', 'ProcEnds'])
df
Output now:
Metric ProcId TimeStamp Value ChunkString
0 CPU proce_123 2022-03-11 11:00:00 1.445312 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
1 CPU proce_126 2022-03-11 11:00:00 0.058320 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
2 CPU proce_123 2022-03-11 11:00:00 0.095274 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
3 CPU proce_000 2022-03-11 11:00:00 0.019654 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
4 CPU proce_144 2022-03-11 11:00:00 0.019841 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
5 CPU proce_1 2022-03-11 11:00:00 0.234742 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
6 CPU proce_100 2022-03-11 11:00:00 5.329458 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
7 CPU proce_57777 2022-03-11 11:00:00 0.253906 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
8 CPU proce_0000 2022-03-11 11:00:00 0.019350 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
9 CPU proce_123 2022-03-11 11:00:00 0.019501 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
10 CPU proce_123 2022-03-11 11:00:00 2.324219 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
11 CPU proce_123 2022-03-11 11:00:00 68.390366 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
12 CPU proce_123 2022-03-11 11:00:00 0.057781 proce_123%22%2C%2proce_123%22%2C%2
13 CPU proce_123 2022-03-11 11:00:00 0.416667 proce_123%22%2C%2proce_123%22%2C%2
Update
A Google Colab notebook showing the output with the sample data: https://colab.research.google.com/drive/1f9ZHXE2ATZXD2qWsoATxEWABIBt0tMRN?usp=sharing
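Update 2
The OP asked:
Quick question. Can we group by the df 'Metric' column? For example, it would be CPU, Memory. I need the ChunkString based on CPU or Memory.
To apply this transformation within each Metric group, the simplest way is to wrap the transformation logic in a function and apply it to the grouped data.
Special care is needed to preserve the original index.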
def transform(frame):
    _df = frame.reset_index(drop=True)
    _df['ChunkId'] = _df.index // chunk_size
    _df['ProcEnds'] = (_df.ProcId + '%22%2C%2').where(
        _df.index % chunk_size != chunk_size - 1,
        _df.ProcId + '%22%29')
    _df['ChunkString'] = _df.groupby('ChunkId').ProcEnds.transform(lambda x: x.str.cat())
    return _df.drop(columns=['ChunkId', 'ProcEnds'])

# set_index(idx) restores the labels by position, so this assumes df is already sorted by Metric
idx = df.index
df.groupby('Metric').apply(transform).set_index(idx)
This produces the same output as before, omitted for brevity.
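The function above puts the original index back by position, which lines up only when the rows are already ordered by Metric. As a hedged alternative that avoids `apply` and keeps every row in place, the position within each Metric can be computed with `cumcount`. The frame `data` below, with interleaved CPU and Memory rows, is invented for illustration:

```python
import pandas as pd

chunk_size = 3

# Hypothetical data with interleaved metrics (not from the question)
data = pd.DataFrame({
    "Metric": ["CPU", "Memory", "CPU", "CPU", "Memory", "CPU"],
    "ProcId": ["p1", "m1", "p2", "p3", "m2", "p4"],
})

pos = data.groupby("Metric").cumcount()       # row position within each Metric
chunk_id = pos // chunk_size                  # chunk number within each Metric
is_last = pos % chunk_size == chunk_size - 1  # last slot of a full chunk

ends = (data.ProcId + "%22%2C%2").where(~is_last, data.ProcId + "%22%29")
# Group by Metric and chunk number together; the row order is left untouched
data["ChunkString"] = ends.groupby([data["Metric"], chunk_id]).transform(lambda x: x.str.cat())
print(data)
```

Because `transform` returns a result aligned to the original index, no `reset_index` / `set_index` round trip is needed here.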
Posted on 2022-03-17 00:49:13
You can use Python integer division (//) on the index to form groups of N:
N = 3
df['ChunkString'] = df.groupby(df.index//N)['ProcId'].transform(lambda x: '%22%2C%2'.join(x.tolist() + ['']*(N-len(x))) + ('%22%29' if len(x) == N else ''))
Note: x.tolist() + ['']*(N-len(x)) simply converts x to a list and pads it with empty items until it reaches length N.
Output:
>>> df
Metric ProcId TimeStamp Value ChunkString
0 CPU proce_123 2022-03-11 11:00:00 1.445312 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
1 CPU proce_126 2022-03-11 11:00:00 0.058320 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
2 CPU proce_123 2022-03-11 11:00:00 0.095274 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
3 CPU proce_000 2022-03-11 11:00:00 0.019654 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
4 CPU proce_144 2022-03-11 11:00:00 0.019841 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
5 CPU proce_1 2022-03-11 11:00:00 0.234742 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
6 CPU proce_100 2022-03-11 11:00:00 5.329458 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
7 CPU proce_57777 2022-03-11 11:00:00 0.253906 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
8 CPU proce_0000 2022-03-11 11:00:00 0.019350 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
9 CPU proce_123 2022-03-11 11:00:00 0.019501 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
10 CPU proce_123 2022-03-11 11:00:00 2.324219 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
11 CPU proce_123 2022-03-11 11:00:00 68.390366 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
12 CPU proce_123 2022-03-11 11:00:00 0.057781 proce_123%22%2C%2proce_123%22%2C%2
13 CPU proce_123 2022-03-11 11:00:00 0.416667 proce_123%22%2C%2proce_123%22%2C%2
With N = 5:
>>> df
Metric ProcId TimeStamp Value ChunkString
0 CPU proce_123 2022-03-11 11:00:00 1.445312 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
1 CPU proce_126 2022-03-11 11:00:00 0.058320 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
2 CPU proce_123 2022-03-11 11:00:00 0.095274 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
3 CPU proce_000 2022-03-11 11:00:00 0.019654 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
4 CPU proce_144 2022-03-11 11:00:00 0.019841 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
5 CPU proce_1 2022-03-11 11:00:00 0.234742 proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
6 CPU proce_100 2022-03-11 11:00:00 5.329458 proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
7 CPU proce_57777 2022-03-11 11:00:00 0.253906 proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
8 CPU proce_0000 2022-03-11 11:00:00 0.019350 proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
9 CPU proce_123 2022-03-11 11:00:00 0.019501 proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
10 CPU proce_123 2022-03-11 11:00:00 2.324219 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
11 CPU proce_123 2022-03-11 11:00:00 68.390366 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
12 CPU proce_123 2022-03-11 11:00:00 0.057781 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
13 CPU proce_123 2022-03-11 11:00:00 0.416667 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
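The padding trick in this answer can be checked on plain lists, without pandas. A minimal sketch, where the helper name `chunk_string` and the constants `SEP` and `END` are assumptions introduced here for readability:

```python
N = 3
SEP, END = "%22%2C%2", "%22%29"

def chunk_string(ids, n=N):
    """Join ids with SEP; a short chunk is padded so it ends with a trailing SEP."""
    padded = list(ids) + [""] * (n - len(ids))
    return SEP.join(padded) + (END if len(ids) == n else "")

print(chunk_string(["proce_000", "proce_144", "proce_1"]))
# proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
print(chunk_string(["proce_123", "proce_123"]))
# proce_123%22%2C%2proce_123%22%2C%2
```

Each empty padding item contributes one extra separator, which is why an incomplete chunk ends in `%22%2C%2` rather than `%22%29`.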
Posted on 2022-03-13 06:11:18
chunk_size = 3
# First, generate a list of the ProcIds
# (assumption: they are read straight from the ProcId column of the dataframe df)
list_of_proc_ids = df['ProcId'].tolist()
final_str = ''
# Then enumerate through that list, adding a different ending after every third item
for index, obj in enumerate(list_of_proc_ids):
    final_str += str(obj)
    if (index + 1) % chunk_size == 0:  # Checks if divisible by 3, accounting for 0 index
        final_str += '%22%29'
    else:
        final_str += '%22%2C%2'
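The loop above accumulates everything into one long string, while the question asks for one string per chunk. A possible variation, sketched under the assumption that the ids are available as a plain list (the function name `chunk_strings` is hypothetical), starts a new string whenever a chunk is complete:

```python
def chunk_strings(proc_ids, chunk_size=3):
    """Return one joined string per chunk of `chunk_size` ids."""
    strings, current = [], ""
    for index, proc_id in enumerate(proc_ids):
        current += str(proc_id)
        if (index + 1) % chunk_size == 0:
            # Chunk is complete: close it and start a fresh one
            strings.append(current + "%22%29")
            current = ""
        else:
            current += "%22%2C%2"
    if current:
        # Leftover ids that did not fill a whole chunk
        strings.append(current)
    return strings

print(chunk_strings(["proce_000", "proce_144", "proce_1", "proce_100"]))
```

For the four ids shown, this yields one closed chunk followed by a partial chunk that still ends in the separator.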
https://stackoverflow.com/questions/71454195