
Plotting a grouped Pandas DataFrame containing datetimes on a single timeline

Stack Overflow user
Asked 2018-06-17 00:58:00
2 answers · 1.1K views · 0 followers · 0 votes

I am trying to parse a log file (specifically, one from a Gradle build) that looks like this:

21:51:38.991 [DEBUG] [TestEventLogger] cha.LoginTest4 STARTED
21:51:39.054 [DEBUG] [TestEventLogger] cha.LoginTest2 STARTED
21:51:40.068 [DEBUG] [TestEventLogger] cha.LoginTest4 PASSED
21:51:40.101 [DEBUG] [TestEventLogger] cha.LoginTest2 PASSED
21:51:40.366 [DEBUG] [TestEventLogger] cha.LoginTest1 STARTED
21:51:40.413 [DEBUG] [TestEventLogger] cha.LoginTest3 STARTED
21:51:50.435 [DEBUG] [TestEventLogger] cha.LoginTest1 PASSED
21:51:50.463 [DEBUG] [TestEventLogger] cha.LoginTest3 PASSED
21:51:50.484 [DEBUG] [TestEventLogger] Gradle Test Run :test PASSED
21:51:38.622 [DEBUG] [TestEventLogger] Gradle Test Run :test STARTED

and turn it into a chart that shows a timeline of the events. Something like this:

n |  ======= 
a |   === 
m |       == 
e |    ======= 
  |______________
     time

So far I have parsed the log and put the relevant "events" into a Pandas DataFrame (sorted by timestamp).

log events parsed, sorted and ungrouped:
                 timestamp            name
0 1900-01-01 21:51:38.622            test
0 1900-01-01 21:51:38.991  cha.LoginTest4
0 1900-01-01 21:51:39.054  cha.LoginTest2
0 1900-01-01 21:51:40.068  cha.LoginTest4
0 1900-01-01 21:51:40.101  cha.LoginTest2
0 1900-01-01 21:51:40.366  cha.LoginTest1
0 1900-01-01 21:51:40.413  cha.LoginTest3
0 1900-01-01 21:51:50.435  cha.LoginTest1
0 1900-01-01 21:51:50.463  cha.LoginTest3
0 1900-01-01 21:51:50.484            test
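
For readers who want to reproduce a frame like the one above, here is a minimal sketch of the parsing step. It assumes a single combined pattern (`EVENT_RE`) and a helper named `parse_events`, both hypothetical and not taken from the question, which uses two separate patterns. The sample text is a subset of the log excerpt shown earlier.

```python
import re

import pandas as pd

# A few lines copied from the log excerpt in the question (deliberately out of order).
LOG_TEXT = """\
21:51:38.991 [DEBUG] [TestEventLogger] cha.LoginTest4 STARTED
21:51:40.068 [DEBUG] [TestEventLogger] cha.LoginTest4 PASSED
21:51:50.484 [DEBUG] [TestEventLogger] Gradle Test Run :test PASSED
21:51:38.622 [DEBUG] [TestEventLogger] Gradle Test Run :test STARTED
"""

# Hypothetical combined pattern: the optional prefix matches task lines.
EVENT_RE = re.compile(
    r"^(\S+) \[DEBUG\] \[TestEventLogger\] "
    r"(?:Gradle Test Run :)?(\S+) (STARTED|PASSED|FAILED)$"
)


def parse_events(text):
    """Return a DataFrame of (timestamp, name) rows sorted by timestamp."""
    rows = []
    for line in text.splitlines():
        match = EVENT_RE.match(line)
        if match:
            timestamp, name, _status = match.groups()
            rows.append({"timestamp": timestamp, "name": name})
    events = pd.DataFrame(rows, columns=["timestamp", "name"])
    # No date is present in the log, so pandas fills in 1900-01-01.
    events["timestamp"] = pd.to_datetime(events["timestamp"], format="%H:%M:%S.%f")
    return events.sort_values("timestamp").reset_index(drop=True)


events = parse_events(LOG_TEXT)
print(events)
```

Collecting the rows in a list and building the frame once avoids growing a DataFrame line by line.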

Because I need the start and end time for each "name", I did a groupby. The groups I get look like this:

group                 timestamp            name
0       1900-01-01 21:51:38.991  cha.LoginTest4
0       1900-01-01 21:51:40.068  cha.LoginTest4
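
As an aside, the start and end of every group can also be collected in one step. The sketch below is one possible way, not the approach used in the question: it applies a named aggregation with `min` and `max`, and the frame `events` and the column names `start` and `end` are made up for illustration.

```python
import pandas as pd

# Small stand-in for the parsed log events (values taken from the question).
events = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["21:51:38.622", "21:51:38.991", "21:51:40.068", "21:51:50.484"],
        format="%H:%M:%S.%f",
    ),
    "name": ["test", "cha.LoginTest4", "cha.LoginTest4", "test"],
})

# min/max do not depend on row order, unlike iloc[0] / iloc[1].
spans = events.groupby("name", sort=False)["timestamp"].agg(start="min", end="max")
print(spans)
```

The result has one row per name, which is convenient input for a single plotting call later on.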

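There are always two rows: the first is the start time and the last is the end time. I was able to use hlines to show a timeline for each group. However, I would like to put all groups into the same plot to see when each one starts and ends. I would still like to use groupby, since it gives me the start/end times and the "name" in just a few lines of code.

I can only display one plot per group, not a single plot for all groups, without getting an error. Here is what I did to show each plot: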

for name, group in df.groupby('name', sort=False):

    start = group['timestamp'].iloc[0] # assume sorted order
    end = group['timestamp'].iloc[1]

    fig = plt.figure() # a new figure on every iteration, hence one plot per group
    ax = fig.add_subplot(111)
    ax.xaxis_date() # returns None, so do not assign it back to ax
    ax.hlines(group.index, dt.date2num(start), dt.date2num(end))

    plt.show()

Full source code of the solved version:

import re
import warnings
from random import random

import matplotlib.dates as dt
import matplotlib.pyplot as plt
import pandas as pd

warnings.simplefilter(action='ignore', category=FutureWarning) # https://stackoverflow.com/a/46721064

'''
The log contents are not guaranteed to be in order. Multiple processes are dumping contents into a single file.
Contents from a single process will be in order.
'''

def main():

    log_file_path = "gradle-4.2.test.debug.log"

    # regex to get test and task log events
    test_re = re.compile(r'^(\S+) \[DEBUG\] \[TestEventLogger\] (\S+[^:>]) (STARTED|PASSED|FAILED)$')
    task_re = re.compile(r'^(\S+) \[DEBUG\] \[TestEventLogger\] Gradle Test Run [:](\S+) (STARTED|PASSED|FAILED)$')

    rows = []  # collect matches first; DataFrame.append was removed in pandas 2.0
    with open(log_file_path, "r") as file:
        for line in file:
            # findall returns a list of (timestamp, name, type) tuples
            match = test_re.findall(line) or task_re.findall(line)
            if match:
                rows.extend(match)

    # the with-block closes the file, so no explicit close() is needed
    df = pd.DataFrame(rows)

    df.columns = ['timestamp','name','type']
    df.drop('type', axis=1, inplace=True) # don't need this col
    df['timestamp'] = pd.to_datetime(df.timestamp, format="%H:%M:%S.%f") # pandas datetime
    df =  df.sort_values('timestamp')  # sort by  pandas datetime

    print ("log events parsed, sorted and ungrouped:\n", df)

    fig, ax = plt.subplots()
    ax.xaxis_date()

    # Customize the major grid
    ax.minorticks_on()
    ax.grid(which='major', linestyle='-', linewidth=0.2, color='gray')

    i = 0 # y-coord will be loop iteration

    # Groupby name. Because the df was previously sorted, the tuple will be sorted order (first event, second event)
    # Give each group an hline.
    for name, group in df.groupby('name', sort=False):
        i += 1

        assert group['timestamp'].size == 2 # make sure we have a start & end time for each test/task
        start = group['timestamp'].iloc[0] # assume sorted order
        end = group['timestamp'].iloc[1]
        assert start < end # make sure start/end times are in order

        if '.' in name: # assume '.' indicates a JUnit test, not a task
            color = [(random(),random(),random())]
            linestyle = 'solid'
            ax.text(start, (i + 0.05), name, color='blue') # label just above the hline
        else: # a task.
            color = 'black'
            linestyle = 'dashed'
            ax.text(start, (i + 0.05), name + ' (Task)', color='red') # label just above the hline

        ax.hlines(i, dt.date2num(start), dt.date2num(end), linewidth = 6, color=color, linestyle=linestyle)

    # Turn off y ticks. These are just execution order (numbers won't make sense).
    plt.setp(ax.get_yticklabels(), visible=False)
    ax.yaxis.set_tick_params(size=0)
    ax.yaxis.tick_left()

    plt.title('Timeline of Gradle Task and Test Execution')
    plt.xlabel('Time')
    plt.ylabel('Execution Order')
    plt.show()
#    plt.savefig('myfig')


if __name__ == '__main__':
    main()

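So, how can I put this grouped DataFrame full of timestamps into a single chart that shows the start/end timelines?

It seems I ran into one problem or another with regex, DataFrames, datetime and so on, but I think I ended up with a nice, clean solution...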


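2 Answers

Stack Overflow user

Accepted answer

Posted 2018-06-17 01:13:05

I can't test this right now, sorry, but this (or something like it) should help: create a single figure before the plotting loop, then plot each group's data onto that one axes.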

fig, ax = plt.subplots()
ax.xaxis_date()
for name, group in df.groupby('name', sort=False):

    start = group['timestamp'].iloc[0] # assume sorted order
    end = group['timestamp'].iloc[1]

    ax.hlines(group.index, dt.date2num(start), dt.date2num(end))

plt.show()
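
A self-contained variation on this idea is sketched below. It makes a few assumptions not found in the answer: a small hand-built `events` frame stands in for the parsed log, the loop counter from `enumerate` is used as the y-coordinate so the lines cannot overlap, and the non-interactive `Agg` backend is selected so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend so no display is needed

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the parsed log events (values taken from the question).
events = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["21:51:38.622", "21:51:38.991", "21:51:39.054",
         "21:51:40.068", "21:51:40.101", "21:51:50.484"],
        format="%H:%M:%S.%f",
    ),
    "name": ["test", "cha.LoginTest4", "cha.LoginTest2",
             "cha.LoginTest4", "cha.LoginTest2", "test"],
})

# One figure and one axes, created once before the loop.
fig, ax = plt.subplots()
ax.xaxis_date()

labels = []
for row, (name, group) in enumerate(events.groupby("name", sort=False)):
    start, end = group["timestamp"].min(), group["timestamp"].max()
    # Each group gets its own y position on the shared axes.
    ax.hlines(row, mdates.date2num(start), mdates.date2num(end), linewidth=6)
    labels.append(name)

ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
ax.set_xlabel("Time")
fig.tight_layout()
```

Call `plt.show()` with an interactive backend, or `fig.savefig(...)`, to view the result.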
Votes: 0

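Stack Overflow user

Posted 2018-06-19 04:03:43

My first association with this question was to use plt.barh, but I have to admit I struggled with the datetime/time handling for a while before getting the result I wanted...

Anyway, here is the result of that idea:

Assume the following DataFrame as a starting point: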

df
Out: 
      timestamp            name
0  21:51:38.622            test
1  21:51:38.991  cha.LoginTest4
2  21:51:39.054  cha.LoginTest2
3  21:51:40.068  cha.LoginTest4
4  21:51:40.101  cha.LoginTest2
5  21:51:40.366  cha.LoginTest1
6  21:51:40.413  cha.LoginTest3
7  21:51:50.435  cha.LoginTest1
8  21:51:50.463  cha.LoginTest3
9  21:51:50.484            test

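First, I would group by name and create a new DataFrame holding the start and stop times as matplotlib.dates numbers (the duration is derived from them in the next step):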

grpd = df.groupby('name')
plot_data = pd.DataFrame(
    {'start': dt.date2num(pd.to_datetime(grpd.min().timestamp)),
     'stop': dt.date2num(pd.to_datetime(grpd.max().timestamp))},
    index=grpd.min().index)

Subtract the first start time so the data begins at zero (still adding 1, because that is where matplotlib.dates starts counting):

plot_data -= plot_data.start.min() - 1
plot_data['duration'] = plot_data.stop - plot_data.start
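
To make the units concrete: `date2num` returns floating-point days, so the `duration` column above is also in days. The short check below is only an illustration with the two cha.LoginTest4 timestamps from the question. The variable names are made up, and the result should come out at roughly 1.077 seconds.

```python
import matplotlib.dates as mdates
import pandas as pd

start = pd.Timestamp("1900-01-01 21:51:38.991")
stop = pd.Timestamp("1900-01-01 21:51:40.068")

# date2num returns days (as floats) relative to Matplotlib's epoch,
# so the difference of two date numbers is a duration in days.
duration_days = mdates.date2num(stop) - mdates.date2num(start)
duration_seconds = duration_days * 24 * 60 * 60

print(round(duration_seconds, 3))
```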

From this DataFrame it is easy to plot a horizontal bar chart over time:

fig, ax = plt.subplots(figsize=(8,4))
ax.xaxis_date()
ax.barh(plot_data.index, plot_data.duration, left=plot_data.start, height=.4)
plt.tight_layout()
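
If the date-number bookkeeping feels fragile, one alternative (a different technique from the answer, shown only as a sketch) is to plot plain seconds since the first event. The frame `events` and the names `spans`, `left` and `width` are assumptions for illustration, and the `Agg` backend is chosen so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend so no display is needed

import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the parsed log events (values taken from the question).
events = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["21:51:38.622", "21:51:38.991", "21:51:39.054",
         "21:51:40.068", "21:51:40.101", "21:51:50.484"],
        format="%H:%M:%S.%f",
    ),
    "name": ["test", "cha.LoginTest4", "cha.LoginTest2",
             "cha.LoginTest4", "cha.LoginTest2", "test"],
})

spans = events.groupby("name")["timestamp"].agg(start="min", stop="max")

# Express everything as seconds relative to the earliest start.
t0 = spans["start"].min()
left = (spans["start"] - t0).dt.total_seconds()
width = (spans["stop"] - spans["start"]).dt.total_seconds()

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(spans.index, width, left=left, height=0.4)
ax.set_xlabel("Seconds since first event")
fig.tight_layout()
```

The trade-off is that the x-axis no longer shows wall-clock times, only offsets.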

Votes: 1
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/50889901
