Converting HTML data to JSON using Python

Stack Overflow user
Asked on 2022-09-21 23:18:03
Answers: 2, Views: 112, Followers: 0, Votes: 0

I am trying to convert the data in an HTML file to JSON using the code below.

import json

import html_to_json


def htmltojson():
    # Raw string so the backslashes in the Windows path are not read as escapes.
    with open(r"C:\Extraction\Sample.html", "r") as html_file:
        html = html_file.read()
    # Capture element text but not element attributes.
    output_json = html_to_json.convert(
        html, capture_element_attributes=False, capture_element_values=True
    )
    with open("Final.json", "w") as outfile:
        json.dump(output_json, outfile, indent=4)
    print(output_json)

The JSON I get contains html, span, and other tags; I only want the keys and their values.

The JSON output I got:

{
    "html": [
        {
            "head": [
                {
                    "meta": [
                        {},
                        {},
                        {},
                        {}
                    ],
                    "link": [
                        {},
                        {},
                        {},
                        {}
                    ],
                    "title": [
                        {
                            "_value": "252"
                        }
                    ],
                    "_values": [
                        "[if gte mso 9]><xml>\n <o:DocumentProperties>\n  <o:Author>Sharon Kaufmann</o:Author>\n  <o:Template>Normal</o:Template>\n  <o:LastAuthor>Aman Pawar</o:LastAuthor>\n  <o:Revision>2</o:Revision>\n  <o:TotalTime>339</o:TotalTime>\n  <o:LastPrinted>2019-11-07T16:41:00Z</o:LastPrinted>\n  <o:Created>2022-09-21T22:16:00Z</o:Created>\n  <o:LastSaved>2022-09-21T22:16:00Z</o:LastSaved>\n  <o:Pages>1</o:Pages>\n  <o:Words>1756</o:Words>\n  <o:Characters>10014</o:Characters>\n  <o:Company>AMS Inc</o:Company>\n  <o:Lines>83</o:Lines>\n  <o:Paragraphs>23</o:Paragraphs>\n  <o:CharactersWithSpaces>11747</o:CharactersWithSpaces>\n  <o:Version>16.00</o:Version>\n </o:DocumentProperties>\n <o:CustomDocumentProperties>\n  <o:_NewReviewCycle dt:dt=\"string\"></o:_NewReviewCycle>\n </o:CustomDocumentProperties>\n <o:OfficeDocumentSettings>\n  <o:RelyOnVML/>\n  <o:AllowPNG/>\n </o:OfficeDocumentSettings>\n</xml><![endif]",
                        "[if gte mso 9]><xml>\n <w:WordDocument>\n  <w:DocumentProtectionNotEnforced>ReadOnly</w:DocumentProtectionNotEnforced>\n  <w:TrackMoves/>\n  <w:TrackFormatting/>\n  <w:DoNotHyphenateCaps/>\n  <w:PunctuationKerning/>\n  <w:DrawingGridHorizontalSpacing>5 pt</w:DrawingGridHorizontalSpacing>\n  <w:DrawingGridVerticalSpacing>6 pt</w:DrawingGridVerticalSpacing>\n  <w:DisplayHorizontalDrawingGridEvery>0</w:DisplayHorizontalDrawingGridEvery>\n  <w:DisplayVerticalDrawingGridEvery>3</w:DisplayVerticalDrawingGridEvery>\n  <w:ValidateAgainstSchemas/>\n  <w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>\n  <w:IgnoreMixedContent>false</w:IgnoreMixedContent>\n  <w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>\n  <w:DoNotPromoteQF/>\n  <w:LidThemeOther>EN-US</w:LidThemeOther>\n  <w:LidThemeAsian>X-NONE</w:LidThemeAsian>\n  <w:LidThemeComplexScript>AR-SA</w:LidThemeComplexScript>\n  <w:Compatibility>\n   <w:BreakWrappedTables/>\n   <w:SnapToGridInCell/>\n   <w:WrapTextWithPunct/>\n   <w:UseAsianBreakRules/>\n   <w:DontGrowAutofit/>\n   <w:SplitPgBreakAndParaMark/>\n   <w:EnableOpenTypeKerning/>\n   <w:DontFlipMirrorIndents/>\n   <w:OverrideTableStyleHps/>\n  </w:Compatibility>\n  <m:mathPr>\n   <m:mathFont m:val=\"Cambria Math\"/>\n   <m:brkBin m:val=\"before\"/>\n   <m:brkBinSub m:val=\"&#45;-\"/>\n   <m:smallFrac m:val=\"off\"/>\n   <m:dispDef/>\n   <m:lMargin m:val=\"0\"/>\n   <m:rMargin m:val=\"0\"/>\n   <m:defJc m:val=\"centerGroup\"/>\n   <m:wrapIndent m:val=\"1440\"/>\n   <m:intLim m:val=\"subSup\"/>\n   <m:naryLim m:val=\"undOvr\"/>\n  </m:mathPr></w:WordDocument>\n</xml><![endif]",],
            "body": [
                {
                    "div": [
                        {
                            "p": [
                                {
                                    "a": [
                                        {},
                                        {},
                                        {
                                            "span": [
                                                {
                                                    "span": [
                                                        {
                                                            "span": [
                                                                {
                                                                    "_value": "Performance Work Statement"
                                                                }
                                                            ]
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                },
                                {
                                    "span": [
                                        {
                                            "span": [
                                                {
                                                    "span": [
                                                        {
                                                            "span": [
                                                                {
                                                                    "_value": "UNITED STATES NAVAL ACADEMY (USNA)"
                                                                }
                                                            ]
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                },
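
One way to get from a structure like the one above to plain text is to walk it recursively and keep only the "_value" / "_values" entries. The following is a minimal sketch using only the standard library; `iter_values` and the trimmed-down `sample` dict are hypothetical names made up for illustration, and the sample only mimics the shape of the output shown above. Because the converter groups children by tag name, the traversal order is not guaranteed to match the order of the original document.

```python
from typing import Any, Iterator


def iter_values(node: Any) -> Iterator[str]:
    """Yield every "_value" / "_values" string in a nested dict/list structure."""
    if isinstance(node, dict):
        for key, child in node.items():
            if key == "_value" and isinstance(child, str):
                yield child
            elif key == "_values" and isinstance(child, list):
                # "_values" holds a list of text fragments.
                yield from (item for item in child if isinstance(item, str))
            else:
                yield from iter_values(child)
    elif isinstance(node, list):
        for item in node:
            yield from iter_values(item)


# Hypothetical, shortened stand-in for the converter output shown above.
sample = {
    "html": [{
        "head": [{
            "title": [{"_value": "252"}],
            "_values": ["[if gte mso 9]>...<![endif]"],
        }],
        "body": [{"div": [{"p": [
            {"a": [{}, {}, {"span": [{"span": [{"span": [
                {"_value": "Performance Work Statement"}]}]}]}]},
            {"span": [{"span": [
                {"_value": "UNITED STATES NAVAL ACADEMY (USNA)"}]}]},
        ]}]}],
    }]
}

# Restrict the walk to "body" so the Word conditional comments in "head" are skipped.
texts = list(iter_values(sample["html"][0]["body"]))
print(texts)
```

Walking only the "body" subtree is a deliberate choice here: in the output above, the "_values" under "head" contain Office metadata rather than document text.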

The expected output has roughly the following form.

Sample expected format:

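[
    { "key": "1", "value": "", "children": [] },
    { "key": "2", "value": "", "children": [
        { "key": "2.1", "value": "", "children": [] },
        { "key": "2.2", "value": "", "children": [] }
    ] },
    { "key": "3", "value": "", "children": [
        { "key": "2.1", "value": "", "children": [
            { "key": "2.1.1", "value": "", "children": [] }
        ] }
    ] }
]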
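
A nested shape like this can be built from a flat list of numbered sections by using the dotted number to find each parent. The sketch below is only an illustration under stated assumptions: the English field names "key", "value" and "children", the helper `build_tree`, and the `sections` sample data are all hypothetical, and it assumes the section numbers and their text have already been extracted from the HTML as (number, text) pairs with parents listed before their children.

```python
import json
from typing import Dict, List, Tuple


def build_tree(items: List[Tuple[str, str]]) -> List[dict]:
    """Nest (number, text) pairs by dotted number, e.g. "2.1" goes under "2"."""
    roots: List[dict] = []
    by_key: Dict[str, dict] = {}
    for key, value in items:
        node = {"key": key, "value": value, "children": []}
        by_key[key] = node
        # "2.1.1" -> "2.1"; a top-level number such as "2" -> "".
        parent_key = key.rpartition(".")[0]
        parent = by_key.get(parent_key)
        if parent is None:
            # Top-level section, or a section whose parent was never seen.
            roots.append(node)
        else:
            parent["children"].append(node)
    return roots


# Hypothetical sample data; the real pairs would come from the parsed document.
sections = [
    ("1", "Introduction"),
    ("2", "Scope"),
    ("2.1", "Tasks"),
    ("2.1.1", "Task details"),
    ("2.2", "Deliverables"),
    ("3", "Schedule"),
]

tree = build_tree(sections)
print(json.dumps(tree, indent=4))
```

A section whose parent number never appears is kept as a top-level entry rather than dropped, which seemed the safer default for a sketch.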

2 Answers

Stack Overflow user

Answered on 2022-09-21 23:22:21

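Have you tried something like this? Just look up the elements you want: https://www.w3schools.com/python/gloss_python_json_parse.asp

The Python documentation may also help: https://docs.python.org/3/library/json.html

May I ask why you want to encode HTML as JSON?

Votes: 0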

Stack Overflow user

Answered on 2022-10-04 13:28:11

OK, in case anyone wants a solution, I solved it with the following logic.

import json

import bleach
from bleach.css_sanitizer import CSSSanitizer
from html_to_draftjs import html_to_draftjs


def htmltodraftjson():
    with open("WorkStatement.html", "r") as html_file:
        html = html_file.read()

    # Note: css_sanitizer is created but not passed to bleach.clean below,
    # so it currently has no effect on the result.
    css_sanitizer = CSSSanitizer(allowed_css_properties=["color", "font-weight"])
    # Keep only these tags; strip=True removes all other tags instead of escaping them.
    allowed_tags = ['div', 'p', 'ul', 'ol', 'blockquote', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'strong', 'b', 'em', 'i', 'img', 'a', 'br']
    output_json = html_to_draftjs(bleach.clean(html, tags=allowed_tags, strip=True))

    with open('WorkStatement.json', 'w') as outfile:
        json.dump(output_json, outfile, indent=4)
    print(output_json)
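
If installing bleach is not an option, the tag-stripping step can be roughly approximated with the standard library's html.parser. This is only a sketch of the general idea, not a replacement for the code above: `TextExtractor` and `sample_html` are hypothetical names, and the sample assumes the Word metadata sits inside `<!--[if gte mso 9]> ... <![endif]-->` comments, which the parser routes to its comment handler rather than to the text handler.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring tags, comments, and some head-only elements."""

    SKIP = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and drop whitespace-only fragments.
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.chunks.append(text)


# Hypothetical sample resembling a Word-exported page.
sample_html = """<html><head><title>252</title>
<!--[if gte mso 9]><xml><o:Author>Someone</o:Author></xml><![endif]-->
<style>p { color: red; }</style></head>
<body><div>
<p><a><span><span>Performance Work Statement</span></span></a></p>
<p><span>UNITED STATES NAVAL ACADEMY (USNA)</span></p>
</div></body></html>"""

parser = TextExtractor()
parser.feed(sample_html)
parser.close()
print(parser.chunks)
```

Unlike the Draft.js conversion above, this keeps no formatting at all; it only yields the text fragments, which would still need to be grouped into whatever key/value structure is wanted.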
Votes: 0
Original page content provided by Stack Overflow; translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/73807822
