I'm new to XML and BeautifulSoup, and I'm trying to pull a dataset of clinical trials using the new ClinicalTrials.gov API, which returns the list of trials as XML. I tried using find_all() the way I normally would with HTML, but I haven't had the same luck. I've tried a few other approaches, like converting everything to a string and splitting it (very messy), but I don't want to clutter my code with failed attempts.
Bottom line: I want to extract all of the NCTIds (I know I could convert the whole thing to a string and use regex, but I'd like to learn how to parse it properly) and the official title of each clinical trial. Any help is greatly appreciated!
import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results

Answered on 2021-11-17 21:21:04

You can filter on the attribute like this:
m1_nctid = soup.find_all("field", {"name": "NCTId"})
m1_officialtitle = soup.find_all("field", {"name": "OfficialTitle"})

Then iterate over each result to get the text, for example:
official_titles = [result.text for result in m1_officialtitle]

For more information, you can check the BeautifulSoup documentation.
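Putting that together, a minimal end-to-end sketch (using the same query URL as in the question) might look like the following. Note that pairing the two lists by position assumes every study in the response carries exactly one NCTId and one OfficialTitle field, which is worth sanity-checking for your own query:

import requests
from bs4 import BeautifulSoup

url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

# The lxml HTML parser lowercases tag and attribute names,
# so <Field Name="..."> is matched as "field" with a "name" attribute
nct_ids = [tag.text for tag in soup.find_all("field", {"name": "NCTId"})]
titles = [tag.text for tag in soup.find_all("field", {"name": "OfficialTitle"})]

# Positional pairing -- assumes one NCTId and one OfficialTitle per study
for nct_id, title in zip(nct_ids, titles):
    print(nct_id, "-", title)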
Answered on 2021-11-17 21:01:35
You can search for the lowercase field tag and pass name as an attribute via attrs. This works with BeautifulSoup alone; there's no need for etree.
import requests
from bs4 import BeautifulSoup
url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})https://stackoverflow.com/questions/70011389
Source: https://stackoverflow.com/questions/70011389