Extracting a nested dictionary from HTML

Stack Overflow user
Asked 2021-09-14 15:30:24
Answers: 1 · Views: 44 · Followers: 0 · Votes: 1

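I have an HTML file like the one shown (a KEGG Mapper result), and I want to build a table with three columns: "Pathway", "KO", and "Query". The "Pathway" column should contain "01100 Metabolic pathways", the "KO" column should contain "K00166", and the "Query" column should contain "Trinity_GG_60253_c0_g1_i9.p2". Here is the HTML source: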

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<!-- saved from url=(0050)https://www.genome.jp/kegg-bin/find_pathway_object -->
<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"><title>KEGG Mapper Reconstruction Result</title>
<meta name="http-equiv" content="Content-Type">
<script type="text/javascript" src="./KEGGResult_files/jquery.min.js.download"></script>
<script type="text/javascript" src="./KEGGResult_files/jquery-ui.min.js.download"></script>
<link rel="stylesheet" type="text/css" href="./KEGGResult_files/jquery-ui.css">
<link rel="stylesheet" type="text/css" href="./KEGGResult_files/mapper2.css">
<script language="JavaScript">
<!---

</style></head>
<body>
<h3>KEGG Mapper Reconstruction Result</h3>
<div class="box1">
<ul class="menu">
<form method="POST" name="form2">
<li class="on"><a href="https://www.genome.jp/kegg-bin/find_pathway_object#">Pathway (4)</a></li>
<li class="off"><a href="javascript:submit_mapper(&#39;find_brite_object&#39;,2)">Brite (2)</a></li>
<li class="off"><a href="javascript:submit_mapper(&#39;find_britetable_object&#39;,1)">Brite Table (1)</a></li>
<li class="off"><a href="javascript:submit_mapper(&#39;find_module_object&#39;,0)">Module (0)</a></li>
<input type="hidden" name="uploadfile" value="1631631910128046/mapper.args">
<input type="hidden" name="module_complete_file" value="1631631910128046/module_complete.list">
<input type="hidden" name="target" value="">
<input type="hidden" name="pathway_count" value="4">
<input type="hidden" name="brite_count" value="2">
<input type="hidden" name="brite_table_count" value="1">
<input type="hidden" name="module_count" value="0">
<input type="hidden" name="pathway_module_count" value="0">
</form>
</ul>
</div>
<div class="box2">
<form method="POST" name="form1" action="https://www.genome.jp/kegg-bin/find_pathway_object">
<input type="hidden" name="uploadfile" value="1631631910128046/mapper.args">
<input type="hidden" name="module_complete_file" value="1631631910128046/module_complete.list">
<input type="hidden" name="sort" value="object">
<input type="hidden" name="target" value="">
<input type="hidden" name="pathway_count" value="4">
<input type="hidden" name="brite_count" value="2">
<input type="hidden" name="brite_table_count" value="1">
<input type="hidden" name="module_count" value="0">
<input type="hidden" name="pathway_module_count" value="0">
</form>
<p>
</p><div id="all_status"><a href="javascript:display_all(&#39;none&#39;)">Hide matched objects</a></div>
<p>
</p><div id="list">
<!-- -->
<b>Metabolism</b>
<ul>
 Global and overview maps
  <ul>
<li><a href="https://www.genome.jp/kegg-bin/show_pathway?1631631910128046/map01100.coords+reference" target="_blank">01100</a> Metabolic pathways&nbsp;(<a href="javascript:display(&#39;map01100&#39;)">1</a>)
<div id="objectmap01100" class="object" style="display: inline;"><p>
</p><dl>
<dt><a href="https://www.genome.jp/dbget-bin/www_bget?K00166" target="_blank">K00166</a></dt>
<dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
</dl>
</div></li><li><a href="https://www.genome.jp/kegg-bin/show_pathway?1631631910128046/map01110.coords+reference" target="_blank">01110</a> Biosynthesis of secondary metabolites&nbsp;(<a href="javascript:display(&#39;map01110&#39;)">1</a>)
<div id="objectmap01110" class="object" style="display: inline;"><p>
</p><dl>
<dt><a href="https://www.genome.jp/dbget-bin/www_bget?K00166" target="_blank">K00166</a></dt>
<dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
</dl>
</div></li>  </ul>
 Carbohydrate metabolism
  <ul>
<li><a href="https://www.genome.jp/kegg-bin/show_pathway?1631631910128046/map00640.coords+reference" target="_blank">00640</a> Propanoate metabolism&nbsp;(<a href="javascript:display(&#39;map00640&#39;)">1</a>)
<div id="objectmap00640" class="object" style="display: inline;"><p>
</p><dl>
<dt><a href="https://www.genome.jp/dbget-bin/www_bget?K00166" target="_blank">K00166</a></dt>
<dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
</dl>
</div></li>  </ul>
 Amino acid metabolism
  <ul>
<li><a href="https://www.genome.jp/kegg-bin/show_pathway?1631631910128046/map00280.coords+reference" target="_blank">00280</a> Valine, leucine and isoleucine degradation&nbsp;(<a href="javascript:display(&#39;map00280&#39;)">1</a>)
<div id="objectmap00280" class="object" style="display: inline;"><p>
</p><dl>
<dt><a href="https://www.genome.jp/dbget-bin/www_bget?K00166" target="_blank">K00166</a></dt>
<dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
</dl>
</div></li></ul></ul></div></div>
</body></html>


1 Answer

Stack Overflow user

Answered 2021-09-14 16:58:16

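I took your data as HTML and located the text based on the surrounding tags.

The first column is the pathway name (for example "Metabolic pathways"). That text sits outside any tag, so I find the next tag after the first link in each list item and then take the text node that comes just before it.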
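To make that navigation step concrete, here is a minimal sketch on a single list item trimmed from the page in the question. The `href="#"` placeholders and the use of Python's built-in `html.parser` (rather than `lxml`) are assumptions made to keep the example self-contained.

```python
from bs4 import BeautifulSoup

# One pathway entry, trimmed from the question's HTML (hrefs shortened to "#")
snippet = (
    '<li><a href="#">01100</a> Metabolic pathways&nbsp;'
    '(<a href="#">1</a>)</li>'
)
li = BeautifulSoup(snippet, "html.parser").li

first_link = li.find("a")                # <a>01100</a>
second_link = first_link.find_next("a")  # next <a> in document order: <a>1</a>
label = second_link.previous_element     # bare text node just before that tag

# The raw text still carries a non-breaking space and the opening "("
print(repr(label))
print(label.replace("(", "").strip())
```

Here the bare text node is also the first link's `next_sibling`, so either direction of navigation reaches it.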

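import pandas as pd
from bs4 import BeautifulSoup

# `html` is assumed to hold the page source as a string
soup = BeautifulSoup(html, "lxml")

# Only the <li> items inside <div id="list"> are pathway entries;
# the menu <li> items at the top of the page have no <dt>/<dd> children.
rows = []
for li in soup.find("div", id="list").find_all("li"):
    # The pathway name is a bare text node between the two <a> tags,
    # so step back from the second tag to the text just before it.
    name = li.find("a").find_next().previous_element
    rows.append([
        name.replace("(", " ").strip(),
        li.find("dt").get_text(),
        li.find("dd").get_text(),
    ])

df = pd.DataFrame(columns=["Pathway", "KO", "Query"], data=rows)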

Output:

    Pathway                                     KO      Query
0   Metabolic pathways                          K00166  Trinity_GG_60253_c0_g1_i9.p2
1   Biosynthesis of secondary metabolites       K00166  Trinity_GG_60253_c0_g1_i9.p2
2   Propanoate metabolism                       K00166  Trinity_GG_60253_c0_g1_i9.p2
3   Valine, leucine and isoleucine degradation  K00166  Trinity_GG_60253_c0_g1_i9.p2
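
The question asked for the pathway ID in the first column as well ("01100 Metabolic pathways"), and a pathway entry on a real result page may list more than one KO. The following is a sketch of one way to cover both cases, not the answerer's code. The function name `extract_rows`, the trimmed `sample` markup, the second `K00167` / `query_b` pair (made up for illustration), and the use of `html.parser` are all assumptions.

```python
from bs4 import BeautifulSoup

# Trimmed sample in the same shape as the KEGG Mapper page in the question.
# The second <dt>/<dd> pair (K00167, query_b) is invented for illustration.
sample = """
<ul class="menu"><li class="on"><a href="#">Pathway (1)</a></li></ul>
<div id="list"><ul><li>
<a href="#">01100</a> Metabolic pathways&nbsp;(<a href="#">2</a>)
<div class="object"><dl>
<dt><a href="#">K00166</a></dt><dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
<dt><a href="#">K00167</a></dt><dd>query_b</dd>
</dl></div></li></ul></div>
"""


def extract_rows(html):
    """Return one dict per (pathway, KO, query) triple."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Scope the search to <div id="list"> so the menu <li> items are skipped
    for li in soup.find("div", id="list").find_all("li"):
        link = li.find("a")
        # The pathway name is the text node right after the pathway-ID link
        name = link.next_sibling.replace("(", "").strip()
        pathway = f"{link.get_text()} {name}"
        # Pair every <dt> (KO) with the <dd> (query) that follows it
        for dt in li.find_all("dt"):
            dd = dt.find_next_sibling("dd")
            rows.append({
                "Pathway": pathway,
                "KO": dt.get_text(strip=True),
                "Query": dd.get_text(strip=True),
            })
    return rows


rows = extract_rows(sample)
for row in rows:
    print(row)
```

Because `rows` is a list of dicts with identical keys, it can be passed straight to `pd.DataFrame(rows)` to get the three-column table.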
Votes: 1
Original page content provided by Stack Overflow; translation support provided by Tencent Cloud's IT-domain translation engine.
Original link:

https://stackoverflow.com/questions/69180618
