Cleaning email chains for text analysis in Python

Stack Overflow user
Asked 2018-08-03 23:39:28
Answers: 2 | Views: 3.2K | Followers: 0 | Votes: 5

I have some text:

text = """From: 'Mark Twain' <mark.twain@gmail.com>
To: 'Edgar Allen Poe' <eap@gmail.com>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <eap@gmail.com>
To: 'Mark Twain' <mark.twain@gmail.com>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""

As a single string, it looks like this:

"From: 'Mark Twain' <mark.twain@gmail.com>\nTo: 'Edgar Allen Poe' <eap@gmail.com>\nSubject: RE:Hello!\n\nEd,\n\nI just read the Tell Tale Heart. You've got problems man.\n\nSincerely,\nMarky Mark\n\nFrom: 'Edgar Allen Poe' <eap@gmail.com>\nTo: 'Mark Twain' <mark.twain@gmail.com>\nSubject: RE: Hello!\n\nMark,\n\nThe world is crushing my soul, and so are you.\n\nRegards,\nEdgar"

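I am trying to parse out the messages it contains. Ultimately, I would like a list or dictionary that holds the From and To fields, followed by the message body, so that I can run some analysis on it.

I tried parsing it by lowercasing everything and then splitting the string.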

text = text.lower()
text = text.translate(string.punctuation)
text_list = text.split('+')
text_list = [x for x in text_list if len(x) != 0]

Is there a better way to do this?

2 Answers

Stack Overflow user

Answered 2018-08-03 23:48:25

You can use the re module to split the emails. The result is a list of dictionaries with the keys 'from', 'to', 'subject', and 'message':

text = """From: 'Mark Twain' <mark.twain@gmail.com>
To: 'Edgar Allen Poe' <eap@gmail.com>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <eap@gmail.com>
To: 'Mark Twain' <mark.twain@gmail.com>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""

import re
from pprint import pprint

groups = re.findall(r'^From:(.*?)To:(.*?)Subject:(.*?)$(.*?)(?=^From:|\Z)', text, flags=re.DOTALL|re.M)
emails = []
for g in groups:
    d = {}
    d['from'] = g[0].strip()
    d['to'] = g[1].strip()
    d['subject'] = g[2].strip()
    d['message'] = g[3].strip()
    emails.append(d)

pprint(emails)

Prints:

[{'from': "'Mark Twain' <mark.twain@gmail.com>",
  'message': 'Ed,\n'
             '\n'
             "I just read the Tell Tale Heart. You've got problems man.\n"
             '\n'
             'Sincerely,\n'
             'Marky Mark',
  'subject': 'RE:Hello!',
  'to': "'Edgar Allen Poe' <eap@gmail.com>"},
 {'from': "'Edgar Allen Poe' <eap@gmail.com>",
  'message': 'Mark,\n'
             '\n'
             'The world is crushing my soul, and so are you.\n'
             '\n'
             'Regards,\n'
             'Edgar',
  'subject': 'RE: Hello!',
  'to': "'Mark Twain' <mark.twain@gmail.com>"}]
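
The pattern above is dense, so here is a sketch of the same expression written in verbose mode with one comment per part. The group names, the `parse_chain` helper, and the `sample` chain are illustrative assumptions, not part of the original answer.

```python
import re

# Same pattern as in the answer, spread out with re.VERBOSE.
# Group names are hypothetical and chosen for readability.
EMAIL_RE = re.compile(
    r"""
    ^From:(?P<sender>.*?)       # sender: lazily up to the To: label
    To:(?P<recipient>.*?)       # recipient: lazily up to the Subject: label
    Subject:(?P<subject>.*?)$   # subject: the rest of the header line
    (?P<message>.*?)            # body: everything after the subject line
    (?=^From:|\Z)               # stop before the next From: line or at the end
    """,
    flags=re.DOTALL | re.MULTILINE | re.VERBOSE,
)


def parse_chain(text):
    """Return one dict per email found in the chain (hypothetical helper)."""
    emails = []
    for match in EMAIL_RE.finditer(text):
        emails.append({
            'from': match.group('sender').strip(),
            'to': match.group('recipient').strip(),
            'subject': match.group('subject').strip(),
            'message': match.group('message').strip(),
        })
    return emails


# Made-up two-message chain used only to exercise the pattern.
sample = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Hi\n"
    "\n"
    "Hello B.\n"
    "\n"
    "From: b@example.com\n"
    "To: a@example.com\n"
    "Subject: RE: Hi\n"
    "\n"
    "Hello A."
)

emails = parse_chain(sample)
```

re.DOTALL lets the lazy groups span newlines, and re.MULTILINE makes `^` and `$` match at line boundaries, which is what allows the lookahead to stop each body at the next `From:` line.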
Votes: 3

Stack Overflow user

Answered 2018-08-04 23:48:10

If all you want to do is parse a string containing emails in the standard format, you can use the email.parser module; it is part of the standard library.
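
As a minimal sketch of what that looks like for a single message, assuming the text holds exactly one email (the `raw` string below is a shortened copy of the first message from the question):

```python
from email import policy
from email.parser import Parser

# One email only: headers, a blank line, then the body.
raw = (
    "From: 'Mark Twain' <mark.twain@gmail.com>\n"
    "To: 'Edgar Allen Poe' <eap@gmail.com>\n"
    "Subject: RE:Hello!\n"
    "\n"
    "Ed,\n"
    "\n"
    "I just read the Tell Tale Heart.\n"
)

message = Parser(policy=policy.default).parsestr(raw)

sender = message['from']        # header lookup is case-insensitive
subject = message['subject']
body = message.get_payload()    # plain-text body as a str for this simple message
```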

You still need to separate the individual emails within the larger text, but the From: ... header can help with that, using a regular expression:

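import re
from email import policy
from email.parser import Parser

# A blank line followed by a "From: " header marks the start of the next email.
email_start = re.compile(r'(?<=\n)\n(?=From:\s+)')

# Use a distinct name so the parser instance does not shadow the email.parser module.
email_parser = Parser(policy=policy.default)

# `text` is the email chain string from the question.
for email_text in email_start.split(text):
    message = email_parser.parsestr(email_text)
    to, from_ = message['to'], message['from']
    body = message.get_payload()
    # do something with the email details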

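The regular expression matches any newline that is immediately preceded by another newline (so there is an empty line) and immediately followed by the text From: and at least one whitespace character (so the next line looks like an email From: header).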
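
To see the split in isolation, here is a small sketch on a made-up two-message chain (the `chain` string and its addresses are assumptions for illustration). Only the newline that completes the blank line before a `From:` header is consumed, so blank lines inside a message are left alone.

```python
import re

email_start = re.compile(r'(?<=\n)\n(?=From:\s+)')

# Made-up chain: two messages separated by a blank line.
chain = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Hi\n"
    "\n"
    "Hello B.\n"
    "\n"
    "From: b@example.com\n"
    "To: a@example.com\n"
    "Subject: RE: Hi\n"
    "\n"
    "Hello A."
)

chunks = email_start.split(chain)

for chunk in chunks:
    # Show each chunk's first line to confirm it starts at a From: header.
    print(chunk.splitlines()[0])
```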

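Trying to get at these same parts by removing or replacing punctuation is not a very effective way to obtain this information, even if you had used those tools correctly.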
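
As a hedged illustration of that point, the sketch below compares what `text.translate(string.punctuation)` does in Python 3 with what a punctuation-deleting table built by `str.maketrans` does. The `line` sample is a shortened excerpt of the question's text; the variable names are illustrative.

```python
import string

line = "From: 'Mark Twain' <mark.twain@gmail.com>\nSubject: RE:Hello!"

# str.translate indexes its table by code point. Given the plain string
# string.punctuation (32 characters), only code points 0-31 are remapped,
# so "\n" (code point 10) turns into string.punctuation[10], which is "+".
# Every other character falls outside the table and is left unchanged.
accidental = line.translate(string.punctuation)

# Deleting punctuation needs a table built with str.maketrans.
delete_punctuation = str.maketrans('', '', string.punctuation)
stripped = line.lower().translate(delete_punctuation)

print(repr(accidental))
print(repr(stripped))
```

Even the corrected call removes the colons, angle brackets, and dots that mark header names and email addresses, so the structure needed to tell From, To, and the body apart is gone.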

Demo of the email.parser approach:

>>> import re
>>> from email import parser, policy
>>> email_start = re.compile(r'(?<=\n)\n(?=From:\s+)')
>>> parser = parser.Parser(policy=policy.default)
>>> for email_text in email_start.split(text):
...     message = parser.parsestr(email_text)
...     to, from_ = message['to'], message['from']
...     body = message.get_payload()
...     print('Email from:', from_)
...     print('Email to:', to)
...     print('Third line:', body.splitlines(True)[2])
...
Email from: 'Mark Twain' <mark.twain@gmail.com>
Email to: 'Edgar Allen Poe' <eap@gmail.com>
Third line: I just read the Tell Tale Heart. You've got problems man.

Email from: 'Edgar Allen Poe' <eap@gmail.com>
Email to: 'Mark Twain' <mark.twain@gmail.com>
Third line: The world is crushing my soul, and so are you.
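
The question asks for a list or dictionary holding From, To, and the message body. One possible way to collect the parsed messages into that shape is sketched below; the `parse_email_chain` helper, the dictionary keys, and the `sample` chain are assumptions for illustration, not part of the original answer.

```python
import re
from email import policy
from email.parser import Parser

EMAIL_START = re.compile(r'(?<=\n)\n(?=From:\s+)')


def parse_email_chain(text):
    """Split a chain on blank-line + From: boundaries and parse each email."""
    email_parser = Parser(policy=policy.default)
    records = []
    for chunk in EMAIL_START.split(text):
        message = email_parser.parsestr(chunk)
        records.append({
            # .get() with a default avoids None when a header is missing
            'from': str(message.get('from', '')),
            'to': str(message.get('to', '')),
            'subject': str(message.get('subject', '')),
            'body': message.get_payload().strip(),
        })
    return records


# Made-up two-message chain used only to exercise the helper.
sample = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Hi\n"
    "\n"
    "Hello B.\n"
    "\n"
    "From: b@example.com\n"
    "To: a@example.com\n"
    "Subject: RE: Hi\n"
    "\n"
    "Hello A."
)

records = parse_email_chain(sample)
```

The resulting list can be passed straight to further analysis, for example by iterating over `records` and tokenizing each `'body'` value.
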
Votes: 2
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/51676027
