I am writing a script that parses a large (400 MB) Apache log file into a pandas DataFrame.
On my old laptop the script parses the log file in about 2 minutes. Now I am wondering: can it be made any faster?
The Apache log file is structured as follows: ip - - [timestamp] "GET … method" http-status-code bytes "address" "useragent", for example:
93.185.11.11 - - [13/Aug/2016:05:34:12 +0200] "GET /v1/con?from=…" 200 575 "http://google.com" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0"
My code uses re.findall. I also tested the match and search methods, but they seemed to be slower.
import re
import pandas as pd

reg_dic = {
    "ip"         : r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b',
    "timestamp"  : r'\[\d+\/\w+\/\d+\:\d+\:\d+\:\d+\s\+\d+\]',
    "method"     : r'"(.*?)"',
    "httpstatus" : r'\s\d{1,3}\s',
    "bytes_"     : r'\s\d+\s\"',
    "adress"     : r'\d\s\"(.*?)"',
    "useragent"  : r'\"\s\"(.*?)"'
}

df = pd.DataFrame()                        # `file` holds the path to the Apache log

for name, reg in reg_dic.items():
    item_list = []
    with open(file) as f_obj:              # the whole file is read once per field
        for line in f_obj:
            item = re.findall(reg, line)
            item = item[0]
            if name == "bytes_":
                item = item.replace("\"", "")
            item = item.strip()
            item_list.append(item)
    df[name] = item_list
    del item_list
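One way to restructure the loop above is to fold the seven per-field patterns into a single compiled expression, so the 400 MB file is scanned once instead of once per field. A minimal single-pass sketch (keeping the question's field names; `file` is again assumed to hold the log path):

import re
import pandas as pd

# One pattern with a named group per field, matched once per line.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[^"]*)" (?P<httpstatus>\d{3}) (?P<bytes_>\S+) '
    r'"(?P<adress>[^"]*)" "(?P<useragent>[^"]*)"'
)

rows = []
with open(file) as f_obj:                  # single pass over the file
    for line in f_obj:
        m = LOG_RE.match(line)
        if m:
            rows.append(m.groupdict())

df = pd.DataFrame(rows)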
I don't think we need that much regex for this simple task:
import pandas as pd

fn = r'D:\temp\.data\46620093.log'
cols = ['ip','l','userid','timestamp','tz','request','status','bytes','referer','useragent']
df = pd.read_csv(fn, delim_whitespace=True, names=cols).drop('l', axis=1)

This gives us:
In [179]: df
Out[179]:
             ip userid              timestamp      tz             request  \
0  93.185.11.11      -  [13/Aug/2016:05:34:12  +0200]  GET /v1/con?from=…
1  93.185.11.11      -  [13/Aug/2016:05:34:12  +0200]  GET /v1/con?from=…
2  93.185.11.11      -  [13/Aug/2016:05:34:12  +0200]  GET /v1/con?from=…

  status  bytes            referer  \
0    200    575  http://google.com
1    200    575  http://google.com
2    200    575  http://google.com

                                           useragent
0  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...
1  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...
2  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...

Now we just need to join timestamp and tz into a single column and strip the [ and ]:
df['timestamp'] = df['timestamp'].str.replace(r'\[(\d+/\w+/\d+):(\d+:\d+:\d+)', r'\1 \2', regex=True) \
                  + ' ' + df.pop('tz').str.strip(r'[\]]')   # regex=True is needed on pandas >= 2.0

Result:
In [181]: df
Out[181]:
             ip userid                   timestamp             request  \
0  93.185.11.11      -  13/Aug/2016 05:34:12 +0200  GET /v1/con?from=…
1  93.185.11.11      -  13/Aug/2016 05:34:12 +0200  GET /v1/con?from=…
2  93.185.11.11      -  13/Aug/2016 05:34:12 +0200  GET /v1/con?from=…

  status  bytes            referer  \
0    200    575  http://google.com
1    200    575  http://google.com
2    200    575  http://google.com

                                           useragent
0  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...
1  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...
2  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...

Note: we can easily convert the timestamp column to a datetime dtype (as UTC time without a timezone):
In [182]: pd.to_datetime(df['timestamp'])
Out[182]:
0   2016-08-13 03:34:12
1   2016-08-13 03:34:12
2   2016-08-13 03:34:12
Name: timestamp, dtype: datetime64[ns]

https://stackoverflow.com/questions/46620093
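If the conversion itself becomes a bottleneck on millions of rows, one option is to give pd.to_datetime an explicit format instead of letting it infer one per value. A small sketch (the format string '%d/%b/%Y %H:%M:%S %z' is an assumption matching the joined timestamp layout above):

import pandas as pd

# Example values from the joined "timestamp" column built earlier.
ts = pd.Series(["13/Aug/2016 05:34:12 +0200"])

# Parse with an explicit format, normalise to UTC, then drop the timezone
# info to get naive UTC timestamps (e.g. 2016-08-13 03:34:12).
out = pd.to_datetime(ts, format="%d/%b/%Y %H:%M:%S %z", utc=True).dt.tz_localize(None)
print(out)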