I have some text containing messages:
text = """From: 'Mark Twain' <mark.twain@gmail.com>
To: 'Edgar Allen Poe' <eap@gmail.com>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <eap@gmail.com>
To: 'Mark Twain' <mark.twain@gmail.com>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""
It looks like this:
"From: 'Mark Twain' <mark.twain@gmail.com>\nTo: 'Edgar Allen Poe' <eap@gmail.com>\nSubject: RE:Hello!\n\nEd,\n\nI just read the Tell Tale Heart. You've got problems man.\n\nSincerely,\nMarky Mark\n\nFrom: 'Edgar Allen Poe' <eap@gmail.com>\nTo: 'Mark Twain' <mark.twain@gmail.com>\nSubject: RE: Hello!\n\nMark,\n\nThe world is crushing my soul, and so are you.\n\nRegards,\nEdgar"
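I am trying to parse out the messages inside it. Ultimately, I would like a list or dictionary that holds the From and To fields, plus the message body to run some analysis on.
I tried to parse it by lowercasing everything and then splitting the string.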
text = text.lower()
text = text.translate(string.punctuation)
text_list = text.split('+')
text_list = [x for x in text_list if len(x) != 0]
Is there a better way to do this?
Posted on 2018-08-03 23:48:25
You can use re to split the emails (explanation of this regexp on external site). The result is a list of dictionaries with the keys 'from', 'to', 'subject' and 'message':
text = """From: 'Mark Twain' <mark.twain@gmail.com>
To: 'Edgar Allen Poe' <eap@gmail.com>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <eap@gmail.com>
To: 'Mark Twain' <mark.twain@gmail.com>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""
import re
from pprint import pprint

# Each match yields four groups: sender, recipient, subject, body.
groups = re.findall(r'^From:(.*?)To:(.*?)Subject:(.*?)$(.*?)(?=^From:|\Z)', text, flags=re.DOTALL|re.M)
emails = []
for g in groups:
    d = {}
    d['from'] = g[0].strip()
    d['to'] = g[1].strip()
    d['subject'] = g[2].strip()
    d['message'] = g[3].strip()
    emails.append(d)
pprint(emails)
Prints:
[{'from': "'Mark Twain' <mark.twain@gmail.com>",
  'message': 'Ed,\n'
             '\n'
             "I just read the Tell Tale Heart. You've got problems man.\n"
             '\n'
             'Sincerely,\n'
             'Marky Mark',
  'subject': 'RE:Hello!',
  'to': "'Edgar Allen Poe' <eap@gmail.com>"},
 {'from': "'Edgar Allen Poe' <eap@gmail.com>",
  'message': 'Mark,\n'
             '\n'
             'The world is crushing my soul, and so are you.\n'
             '\n'
             'Regards,\n'
             'Edgar',
  'subject': 'RE: Hello!',
  'to': "'Mark Twain' <mark.twain@gmail.com>"}]
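For anyone who cannot follow the external explanation, here is a minimal sketch of the same pattern written with re.VERBOSE so that each piece carries a comment. The sample text, the example.com addresses and the names EMAIL_RE and KEYS are made up for illustration; only the pattern comes from the answer. It assumes that no body line starts with "From:" and that "To:" and "Subject:" do not occur inside an earlier header value.

```python
import re

# Hypothetical sample laid out like the text in the question.
sample = (
    "From: 'A' <a@example.com>\n"
    "To: 'B' <b@example.com>\n"
    "Subject: Hi\n"
    "\n"
    "First body.\n"
    "\n"
    "From: 'B' <b@example.com>\n"
    "To: 'A' <a@example.com>\n"
    "Subject: RE: Hi\n"
    "\n"
    "Second body.\n"
)

EMAIL_RE = re.compile(
    r"""
    ^From:(.*?)       # group 1: sender, lazily up to the next To:
    To:(.*?)          # group 2: recipient, lazily up to Subject:
    Subject:(.*?)$    # group 3: subject, up to the end of that line
    (.*?)             # group 4: body, everything after the subject line
    (?=^From:|\Z)     # stop before the next From: line or at end of text
    """,
    flags=re.DOTALL | re.MULTILINE | re.VERBOSE,
)

KEYS = ('from', 'to', 'subject', 'message')

# Pair each captured group with its key and trim surrounding whitespace.
emails = [
    dict(zip(KEYS, (part.strip() for part in match.groups())))
    for match in EMAIL_RE.finditer(sample)
]
```

re.DOTALL lets the lazy groups run across line breaks, and re.MULTILINE makes ^ and $ match at every line boundary, which is what allows the lookahead to find the next "From:" line.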
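Posted on 2018-08-04 23:48:10
If all you want to achieve is to parse a string containing emails in the standard format, you can use the email.parser module; it is part of the standard library.
You still need to separate the emails within the larger text, but the From: ... header can help with that, using a regular expression: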
import re
from email import parser, policy

email_start = re.compile(r'(?<=\n)\n(?=From:\s+)')
parser = parser.Parser(policy=policy.default)
for email_text in email_start.split(text):
    message = parser.parsestr(email_text)
    to, from_ = message['to'], message['from']
    body = message.get_payload()
    # do something with the email details
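The regular expression matches any newline that is immediately preceded by another newline (so there is an empty line) and followed by the text From: and at least one whitespace character (so the next line looks like an email From: header).
Trying to get at these same parts by removing or replacing punctuation is not a very effective way to obtain the same information, even if you had used those tools correctly.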
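On the "used correctly" point, a short sketch of what str.translate() does when handed string.punctuation directly, and of the usual way to strip punctuation with str.maketrans(). The sample strings and variable names are made up for illustration.

```python
import string

# str.translate() looks each character up by its code point. Given a plain
# string as the table, '\n' (code point 10) becomes string.punctuation[10],
# which is '+'; code points past the end of the string are left unchanged.
mangled = "Ed,\nhello".translate(string.punctuation)

# To actually delete punctuation, build the table with str.maketrans();
# its third argument lists the characters to remove.
remove_punct = str.maketrans('', '', string.punctuation)
cleaned = "You've got problems, man!".translate(remove_punct)
```

That lookup behaviour is why splitting on '+' appeared to work in the question: the newlines had been turned into plus signs, while the punctuation itself was left in place.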
Demo:
>>> import re
>>> from email import parser, policy
>>> email_start = re.compile(r'(?<=\n)\n(?=From:\s+)')
>>> parser = parser.Parser(policy=policy.default)
>>> for email_text in email_start.split(text):
...     message = parser.parsestr(email_text)
...     to, from_ = message['to'], message['from']
...     body = message.get_payload()
...     print('Email from:', from_)
...     print('Email to:', to)
...     print('Third line:', body.splitlines(True)[2])
...
Email from: 'Mark Twain' <mark.twain@gmail.com>
Email to: 'Edgar Allen Poe' <eap@gmail.com>
Third line: I just read the Tell Tale Heart. You've got problems man.
Email from: 'Edgar Allen Poe' <eap@gmail.com>
Email to: 'Mark Twain' <mark.twain@gmail.com>
Third line: The world is crushing my soul, and so are you.
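Since the question asks for a list or dictionary, here is one way the loop above might be wrapped up to collect the parsed fields. This is only a sketch: the function name parse_emails, the dictionary keys, and the sample text with its example.com addresses are assumptions for illustration. email.utils.parseaddr is used to split each header into a display name and a bare address.

```python
import re
from email import policy
from email.parser import Parser
from email.utils import parseaddr

# Hypothetical sample laid out like the text in the question.
sample = (
    "From: Alice Example <alice@example.com>\n"
    "To: Bob Example <bob@example.com>\n"
    "Subject: Hi\n"
    "\n"
    "First body.\n"
    "\n"
    "From: Bob Example <bob@example.com>\n"
    "To: Alice Example <alice@example.com>\n"
    "Subject: RE: Hi\n"
    "\n"
    "Second body.\n"
)


def parse_emails(text):
    """Split text on blank lines followed by a From: header and parse each part."""
    email_start = re.compile(r'(?<=\n)\n(?=From:\s+)')
    email_parser = Parser(policy=policy.default)
    records = []
    for chunk in email_start.split(text):
        message = email_parser.parsestr(chunk)
        # parseaddr returns a (display name, address) pair.
        from_name, from_addr = parseaddr(str(message['from']))
        to_name, to_addr = parseaddr(str(message['to']))
        records.append({
            'from_name': from_name,
            'from_addr': from_addr,
            'to_name': to_name,
            'to_addr': to_addr,
            'subject': str(message['subject']),
            'body': message.get_payload().strip(),
        })
    return records


records = parse_emails(sample)
```

Keeping the name and the address in separate fields makes later analysis, such as grouping messages by sender address, a plain dictionary lookup.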
https://stackoverflow.com/questions/51676027