I am trying to convert the data in an HTML file to JSON using the code below.
import html_to_json
import json

def htmltojson():
    # Raw string so the backslashes in the Windows path are not treated as escapes
    with open(r"C:\Extraction\Sample.html", "r") as html_file:
        html = html_file.read()
    # Drop element attributes, keep only element text values
    output_json = html_to_json.convert(html, capture_element_attributes=False, capture_element_values=True)
    with open('Final.json', 'w') as outfile:
        json.dump(output_json, outfile, indent=4)
    print(output_json)
The JSON I get contains html, span, and other tags; I only want the keys and their values.
The JSON output I got:
{
"html": [
{
"head": [
{
"meta": [
{},
{},
{},
{}
],
"link": [
{},
{},
{},
{}
],
"title": [
{
"_value": "252"
}
],
"_values": [
"[if gte mso 9]><xml>\n <o:DocumentProperties>\n <o:Author>Sharon Kaufmann</o:Author>\n <o:Template>Normal</o:Template>\n <o:LastAuthor>Aman Pawar</o:LastAuthor>\n <o:Revision>2</o:Revision>\n <o:TotalTime>339</o:TotalTime>\n <o:LastPrinted>2019-11-07T16:41:00Z</o:LastPrinted>\n <o:Created>2022-09-21T22:16:00Z</o:Created>\n <o:LastSaved>2022-09-21T22:16:00Z</o:LastSaved>\n <o:Pages>1</o:Pages>\n <o:Words>1756</o:Words>\n <o:Characters>10014</o:Characters>\n <o:Company>AMS Inc</o:Company>\n <o:Lines>83</o:Lines>\n <o:Paragraphs>23</o:Paragraphs>\n <o:CharactersWithSpaces>11747</o:CharactersWithSpaces>\n <o:Version>16.00</o:Version>\n </o:DocumentProperties>\n <o:CustomDocumentProperties>\n <o:_NewReviewCycle dt:dt=\"string\"></o:_NewReviewCycle>\n </o:CustomDocumentProperties>\n <o:OfficeDocumentSettings>\n <o:RelyOnVML/>\n <o:AllowPNG/>\n </o:OfficeDocumentSettings>\n</xml><![endif]",
"[if gte mso 9]><xml>\n <w:WordDocument>\n <w:DocumentProtectionNotEnforced>ReadOnly</w:DocumentProtectionNotEnforced>\n <w:TrackMoves/>\n <w:TrackFormatting/>\n <w:DoNotHyphenateCaps/>\n <w:PunctuationKerning/>\n <w:DrawingGridHorizontalSpacing>5 pt</w:DrawingGridHorizontalSpacing>\n <w:DrawingGridVerticalSpacing>6 pt</w:DrawingGridVerticalSpacing>\n <w:DisplayHorizontalDrawingGridEvery>0</w:DisplayHorizontalDrawingGridEvery>\n <w:DisplayVerticalDrawingGridEvery>3</w:DisplayVerticalDrawingGridEvery>\n <w:ValidateAgainstSchemas/>\n <w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>\n <w:IgnoreMixedContent>false</w:IgnoreMixedContent>\n <w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>\n <w:DoNotPromoteQF/>\n <w:LidThemeOther>EN-US</w:LidThemeOther>\n <w:LidThemeAsian>X-NONE</w:LidThemeAsian>\n <w:LidThemeComplexScript>AR-SA</w:LidThemeComplexScript>\n <w:Compatibility>\n <w:BreakWrappedTables/>\n <w:SnapToGridInCell/>\n <w:WrapTextWithPunct/>\n <w:UseAsianBreakRules/>\n <w:DontGrowAutofit/>\n <w:SplitPgBreakAndParaMark/>\n <w:EnableOpenTypeKerning/>\n <w:DontFlipMirrorIndents/>\n <w:OverrideTableStyleHps/>\n </w:Compatibility>\n <m:mathPr>\n <m:mathFont m:val=\"Cambria Math\"/>\n <m:brkBin m:val=\"before\"/>\n <m:brkBinSub m:val=\"--\"/>\n <m:smallFrac m:val=\"off\"/>\n <m:dispDef/>\n <m:lMargin m:val=\"0\"/>\n <m:rMargin m:val=\"0\"/>\n <m:defJc m:val=\"centerGroup\"/>\n <m:wrapIndent m:val=\"1440\"/>\n <m:intLim m:val=\"subSup\"/>\n <m:naryLim m:val=\"undOvr\"/>\n </m:mathPr></w:WordDocument>\n</xml><![endif]",],
"body": [
{
"div": [
{
"p": [
{
"a": [
{},
{},
{
"span": [
{
"span": [
{
"span": [
{
"_value": "Performance Work Statement"
}
]
}
]
}
]
}
]
},
{
"span": [
{
"span": [
{
"span": [
{
"span": [
{
"_value": "UNITED STATES NAVAL ACADEMY (USNA)"
}
]
}
]
}
]
}
]
},
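The expected output has roughly the following form.
Sample expected format:
[{"key": "1", "value": "", "children": []}, {"key": "2", "value": "", "children": [{"key": "2.1", "value": "", "children": []}, {"key": "2.2", "value": "", "children": []}]}, {"key": "3", "value": "", "children": [{"key": "2.1", "value": "", "children": [{"key": "2.1.1", "value": "", "children": []}]}]}]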
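Answered 2022-09-21 23:22:21
Have you tried something like this? Just look up the elements you are after: https://www.w3schools.com/python/gloss_python_json_parse.asp
The Python documentation may also help: https://docs.python.org/3/library/json.html
May I ask why you want to encode HTML as JSON?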
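To illustrate "just look up the elements you are after": a minimal sketch, assuming the nested dict/list shape shown in the question's output, where element text sits under the "_value" key. The helper name collect_values and the sample data are made up for illustration and are not part of html_to_json.

```python
def collect_values(node):
    """Recursively gather every "_value" string from html_to_json-style output."""
    found = []
    if isinstance(node, dict):
        for key, child in node.items():
            if key == "_value":
                found.append(child)
            else:
                found.extend(collect_values(child))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item))
    # Bare strings (e.g. the comment text stored under "_values") are ignored
    return found


# Hypothetical, shortened version of the output shown in the question
sample = {
    "html": [{
        "head": [{"title": [{"_value": "252"}], "_values": ["[if gte mso 9]>..."]}],
        "body": [{"div": [{"p": [{"span": [{"span": [
            {"_value": "Performance Work Statement"}
        ]}]}]}]}],
    }]
}

texts = collect_values(sample)
print(texts)
```

This only strips the tag nesting; it does not by itself produce the key/value/children layout asked for in the question.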
Answered 2022-10-04 13:28:11
OK, in case anyone wants a solution, I solved it with the logic below.
import json

import bleach
from bleach.css_sanitizer import CSSSanitizer
from html_to_draftjs import html_to_draftjs

# Tags kept by bleach; everything else is stripped before conversion
ALLOWED_TAGS = ['div', 'p', 'ul', 'ol', 'blockquote', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
                'strong', 'b', 'em', 'i', 'img', 'a', 'br']

def htmltodraftjson():
    with open("WorkStatement.html", "r") as html_file:
        html = html_file.read()
    # Note: css_sanitizer is created here but is not passed to bleach.clean below
    css_sanitizer = CSSSanitizer(allowed_css_properties=["color", "font-weight"])
    output_json = html_to_draftjs(bleach.clean(html, tags=ALLOWED_TAGS, strip=True))
    with open('WorkStatement.json', 'w') as outfile:
        json.dump(output_json, outfile, indent=4)
    print(output_json)
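
For the key/value/children layout requested in the question, one possible follow-up step is sketched below. It assumes the cleaned text has already been split into lines and that section headings start with a dotted number such as "2.1"; the question does not confirm either assumption. build_tree, NUMBERED, and the sample lines are hypothetical names and data, not part of html_to_draftjs or bleach.

```python
import re

# Assumed heading shape: dotted number, optional trailing dot, whitespace, text
NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(.*)$")


def build_tree(lines):
    """Nest numbered lines into key/value/children dicts by numbering depth."""
    roots = []
    stack = []  # (depth, node) pairs for the currently open ancestors
    for line in lines:
        match = NUMBERED.match(line.strip())
        if not match:
            continue  # lines without a leading number are skipped in this sketch
        key, value = match.groups()
        node = {"key": key, "value": value, "children": []}
        depth = key.count(".") + 1
        # Close sections that are at the same depth or deeper than this one
        while stack and stack[-1][0] >= depth:
            stack.pop()
        parent_children = stack[-1][1]["children"] if stack else roots
        parent_children.append(node)
        stack.append((depth, node))
    return roots


# Made-up lines for illustration only
lines = ["1 Scope", "2 Requirements", "2.1 Staffing", "2.1.1 Hours",
         "2.2 Reporting", "Unnumbered text"]
tree = build_tree(lines)
print(tree)
```

Nesting here follows the depth of the number (count of dots), so a heading is attached to the most recent heading that is one level shallower.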
https://stackoverflow.com/questions/73807822