Q: Writing HTML files from an mbox

Stack Overflow user

Asked 2021-01-25 10:16:50

Answers: 1  Views: 158  Followers: 0  Votes: 0

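Before Yahoo Groups shut down, you could download a group's contents as an mbox file. I am trying to convert the mbox file into a series of HTML files, one per message. My problem is handling encodings and special characters in the HTML. Here is my attempt: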

import mailbox

the_dir = "/path/to/file"

mbox = mailbox.mbox(the_dir + "12394334.mbox")

html_header = """<!DOCTYPE html>
<html>
<head>
<title>Email message</title>
</head>
<body>"""    
html_footer = '</body></html>'

for message in mbox:
    mess_from = message['from']
    subject = message['subject']
    time_received = message['date']
    if message.is_multipart():
        content = ''.join(str(part.get_payload(decode=True)) for part in message.get_payload())
    else:
        content = message.get_payload(decode=True)
    
    content = str(content)[2:].replace('\\n', '<br/>')
    subject.replace('/', '-')
    fname = subject + " " + time_received + '.html'
        
    with open(the_dir + 'html/' + fname , 'w') as the_file:
        the_file.write(html_header)
        the_file.write('<br/>' + 'From: ' + mess_from)
        the_file.write('<br/>' + 'Subject: ' + subject)
        the_file.write('<br/>' + 'Received: ' + time_received + '<br/><br/>')
        the_file.write(content)

The message content ends up with backslashes before apostrophes and other special characters, for example:

star rating, currently \xa311.99. Ideal Christmas present. If you don't have a decent book on small boats, then go overboard on the publicity.

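My question is: what is the best way to get the email content and write it to an HTML file with the correct characters? I can't be the first person to run into this.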

python

html

mbox

1 Answer

Stack Overflow user

Accepted answer

Posted 2021-01-26 13:46:21

I found the answer to this question.

First, I needed to identify the HTML parts by their subtype (part.get_content_subtype()). That is how I know a given part is HTML.
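As a rough illustration of that first step, the sketch below walks a message and keeps only the leaf parts whose subtype is html. The helper name html_parts and the sample message are made up for the example; only walk(), get_content_maintype() and get_content_subtype() come from the standard email package.

```python
from email.message import EmailMessage


def html_parts(message):
    """Return the leaf parts of `message` whose content subtype is 'html'."""
    found = []
    for part in message.walk():
        if part.get_content_maintype() == 'multipart':
            continue  # containers have no body of their own
        if part.get_content_subtype() == 'html':
            found.append(part)
    return found


# Hypothetical sample: a multipart/alternative message with text and HTML bodies
msg = EmailMessage()
msg['Subject'] = 'demo'
msg.set_content('plain body')
msg.add_alternative('<p>html body</p>', subtype='html')

found_types = [p.get_content_type() for p in html_parts(msg)]
print(found_types)
```

Messages read from mailbox.mbox support the same three calls, so the helper should work on them unchanged.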

Then I needed to get the character set using part.get_charsets(). There is also a part.get_charset(), but it always returned None, so I used the first element of get_charsets().
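The difference between these calls can be seen on a small hand-built part. This is only a sketch: the raw bytes are an invented sample, and get_content_charset() is an alternative that the answer above does not use but which reads the charset parameter of the Content-Type header directly.

```python
from email import message_from_bytes

# Hypothetical single-part message declaring its charset in Content-Type
raw = (b'Content-Type: text/html; charset="iso-8859-1"\r\n'
       b'Content-Transfer-Encoding: quoted-printable\r\n'
       b'\r\n'
       b'<p>currently =A311.99</p>\r\n')
part = message_from_bytes(raw)

# get_charset() reports a Charset object attached via set_charset(),
# which the parser never does, so parsed messages give None here.
charset_obj = part.get_charset()

# get_charsets() returns one entry per part (a one-element list for a leaf part).
charset_list = part.get_charsets()

# get_content_charset() returns the declared charset, lower-cased, or None.
declared = part.get_content_charset()

print(charset_obj, charset_list, declared)
```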

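get_payload seems backwards with the decode=True parameter: it only undoes the Content-Transfer-Encoding (base64 or quoted-printable) and returns bytes, so the text itself is still not decoded. I then decode those bytes using the character set obtained earlier. If no character set is declared, I fetch the payload with decode=False instead.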
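The backslash sequences described in the question most likely come from calling str() on that bytes object, which yields its repr rather than the text. A minimal sketch, using an invented quoted-printable sample and an assumed latin-1 fallback for parts that declare no charset:

```python
from email import message_from_bytes

# Hypothetical part: "=A3" is the pound sign in ISO-8859-1, quoted-printable encoded
raw = (b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
       b'Content-Transfer-Encoding: quoted-printable\r\n'
       b'\r\n'
       b'currently =A311.99\r\n')
part = message_from_bytes(raw)

payload = part.get_payload(decode=True)  # bytes, transfer encoding undone

# str() on bytes gives the repr, which is where the b'...' prefix and \xa3 come from
as_repr = str(payload)

# Decoding with the declared charset gives the real characters
# (falling back to latin-1 is an assumption, not part of the original answer)
as_text = payload.decode(part.get_content_charset() or 'latin-1')

print(as_repr)
print(as_text)
```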

If the message is plain text, I replace the line feeds and similar characters, add an HTML header, and then write it to the file.
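One possible way to do the plain-text step is sketched below. It differs from the answer's code in that it escapes the text with html.escape() before inserting the line breaks, so characters such as < in a sender address are not read as markup. The function name text_to_html and the sample values are hypothetical.

```python
import html


def text_to_html(body, sender, subject, received):
    """Wrap a plain-text body in a minimal HTML page, escaping special characters."""
    header_rows = [('From', sender), ('Subject', subject), ('Received', received)]
    head = '<br/>'.join(f'{name}: {html.escape(value)}' for name, value in header_rows)
    # Escape first, then turn line breaks into <br/> so the tags survive
    body_html = '<br/>'.join(html.escape(body).splitlines())
    return ('<!DOCTYPE html>\n<html>\n<head>\n<meta charset="utf-8">\n'
            f'<title>{html.escape(subject)}</title>\n</head>\n<body>\n'
            f'{head}<br/><br/>\n{body_html}\n</body></html>')


# Hypothetical sample values
page = text_to_html('Price < \u00a312\nsecond line',
                    'A <a@example.com>', 'Hi', 'Mon, 25 Jan 2021')
print(page)
```

Writing the result with open(path, "w", encoding="utf-8") keeps the file consistent with the meta charset tag.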

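Next jobs

- Use BeautifulSoup to add the sender information to the HTML messages.
- Work out how to handle attachments and link the HTML files to them.
- Some characters still do not display correctly, such as “to” and so on.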
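For the attachment item, one possible starting point is sketched here. It is not part of the original answer: the function name save_attachments, the output directory, and the sample message are all hypothetical, and it only handles parts that carry a filename.

```python
import html
import os
import tempfile
import urllib.parse
from email.message import EmailMessage


def save_attachments(message, out_dir):
    """Write each named attachment to out_dir and return HTML links to the files."""
    links = []
    for part in message.walk():
        filename = part.get_filename()
        if not filename:
            continue  # containers and inline bodies have no filename
        safe = os.path.basename(filename)  # drop any directory components
        data = part.get_payload(decode=True)
        if data is None:
            continue
        with open(os.path.join(out_dir, safe), 'wb') as fh:
            fh.write(data)
        href = urllib.parse.quote(safe)
        links.append(f'<a href="{href}">{html.escape(safe)}</a>')
    return links


# Hypothetical sample message with one binary attachment
msg = EmailMessage()
msg.set_content('see attached')
msg.add_attachment(b'hello', maintype='application',
                   subtype='octet-stream', filename='note.bin')

out_dir = tempfile.mkdtemp()
links = save_attachments(msg, out_dir)
print(links)
```

The returned links could then be appended to the message's HTML page, assuming the attachments are saved next to it.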

import mailbox

the_dir = "/path/to/mbox/"

mbox = mailbox.mbox(the_dir + "12394334.mbox")

html_footer = "</body></html>"

for message in mbox:
    # str() guards against missing headers, which come back as None
    mess_from = str(message['from'])
    subject = str(message['subject'])
    time_received = str(message['date'])
    fname = subject + " " + time_received
    fname = fname.replace('/', '-')

    html_flag = False
    contents_text = []
    contents_html = []

    if message.is_multipart():
        for part in message.walk():
            maintype = part.get_content_maintype()
            subtype = part.get_content_subtype()
            if maintype == 'multipart' or maintype == 'message':
                # Reject containers
                continue
            if subtype == 'html':
                enc = part.get_charsets()
                if enc[0] is not None:
                    # decode=True returns bytes; decode them with the declared charset
                    contents_html.append(part.get_payload(decode=True).decode(enc[0]))
                else:
                    contents_html.append(part.get_payload(decode=False))
            elif subtype == 'plain':
                # text/plain parts have the subtype 'plain', not 'text'
                contents_text.append(part.get_payload(decode=False))
            else:       # I will use this to process attachments in the future
                continue

        if len(contents_html) > 0:
            if len(contents_html) > 1:
                print('multiple html')      # This hasn't happened yet
            html_flag = True
            content = '\n\n'.join(contents_html)
    else:
        contents_text.append(message.get_payload(decode=False))

    if not html_flag:
        # Plain text (single part, or multipart without an HTML part)
        content = '\n\n'.join(contents_text)
        content = content.replace('\\n', '<br/>')
        content = content.replace('=\n', '')    # quoted-printable soft line break: join the lines
        content = content.replace('\n', '<br/>')
        content = content.replace('=20', '')
        html_header = f"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>{fname}</title>
</head>
<body>"""
        content = (html_header + '<br/>' +
                   'From: ' + mess_from + '<br/>'
                   + 'Subject: ' + subject + '<br/>' +
                   'Received: ' + time_received + '<br/><br/>' +
                   content + html_footer)

    # Write as UTF-8 so decoded non-ASCII characters survive on any platform
    with open(the_dir + "html/" + fname + ".html", "w", encoding="utf-8") as the_file:
        the_file.write(content)

print("Done!")

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/65882780
