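I am trying to create a CSV file containing all the protein names, their PDB (Protein Data Bank) IDs, and the experimental methods, based on an RCSB advanced search query. There are 444 search results, and I would like to collect them in a tidy CSV file. Here is the link to the search.
I wrote the script below to extract information about the first search result, but the output shows `None`.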
import requests
from bs4 import BeautifulSoup
source = requests.get(url) # url is same as mentioned above
soup = BeautifulSoup(source.text, 'lxml')
item1 = soup.find('div', class_='row results-item')
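The page's HTML seems to be highly nested and messy.
TL;DR: I am trying to get the following into a CSV, but the HTML is highly nested :(
1) PDB ID (4-character alphanumeric code)
2) Name of the protein complex (e.g. the FKBP51 domain of FKBP51)
3) Method: X-ray diffraction, NMR, etc.
Any help or suggestions would be greatly appreciated!
Thanks in advance :)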
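Posted on 2020-06-03 16:02:57
Actually, you can't scrape this kind of website with BeautifulSoup alone. The site uses an internal CDN to render the data on the webpage, so the results are not in the HTML that `requests` downloads. However, I came up with a solution that fetches the data in JSON format.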
import requests
# Browser-like User-Agent; adjacent string literals are joined into one value.
headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/80.0.3987.162 Mobile Safari/537.36"
}
# Search query in the form the site's query builder sends it.
payload = {
    "query": {"type": "group", "logical_operator": "and", "nodes": [
        {"type": "group", "logical_operator": "and", "nodes": [
            {"type": "group", "logical_operator": "and", "nodes": [
                {"type": "terminal", "service": "text",
                 "parameters": {"negation": False, "value": "plasmodium falciparum"},
                 "node_id": 0},
                {"type": "group", "logical_operator": "and", "nodes": [
                    {"type": "terminal", "service": "text",
                     "parameters": {"operator": "exact_match", "negation": False,
                                    "value": "Homosapiens",
                                    "attribute": "rcsb_entity_source_organism.ncbi_scientific_name"},
                     "node_id": 1}]}]}],
         "label": "text"}],
        "label": "query-builder"},
    "return_type": "entry",
    "request_options": {"scoring_strategy": "combined",
                        "sort": [{"sort_by": "score", "direction": "desc"}],
                        "pager": {"start": 0, "rows": 100}},
    "request_info": {"src": "ui", "query_id": "6878ab86935e083352a6914232c8b2e5"},
}
# Send the query as a JSON body and print the decoded JSON response.
response = requests.post('https://www.rcsb.org/search/data', headers=headers, json=payload)
print(response.json())
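Once the JSON is available, the three requested columns can be written with the standard `csv` module. The following is only a minimal sketch: the layout of the JSON response is not shown here, so the key names `pdb_id`, `name` and `method` are hypothetical, and the sample row is made up purely to show the expected shape. Mapping the real response fields onto these keys is left as an assumption.

```python
import csv
import io

# Hypothetical column names; adjust them to whatever the response really contains.
FIELDS = ["pdb_id", "name", "method"]


def write_entries_csv(entries, fileobj):
    """Write dicts carrying the FIELDS keys to an already open text file."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for entry in entries:
        # Missing keys become empty cells instead of raising an error.
        writer.writerow({key: entry.get(key, "") for key in FIELDS})


# Made-up sample row, not real search output.
sample = [
    {"pdb_id": "1ABC", "name": "Example protein, with a comma", "method": "X-RAY DIFFRACTION"},
]

buffer = io.StringIO()
write_entries_csv(sample, buffer)
csv_text = buffer.getvalue()
```

To write a real file instead of an in-memory buffer, open it with `open("results.csv", "w", newline="")` so the `csv` module controls the line endings. Names containing commas are quoted automatically.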
You can also change the payload values to alter the responses. Hope this helps!
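For example, the `pager` entry in the payload asks for 100 rows starting at offset 0, so covering all 444 results would presumably mean sending the same query several times with a different `start`. Below is a rough sketch of building those payloads. The helper names `page_starts` and `with_pager` are made up for illustration, and whether the endpoint accepts every offset or caps `rows` is an assumption I have not verified.

```python
import copy


def page_starts(total, rows):
    """Return the start offsets needed to cover `total` results, `rows` per page."""
    return list(range(0, total, rows))


def with_pager(payload, start, rows):
    """Return a copy of `payload` with its pager replaced; the input is left untouched."""
    paged = copy.deepcopy(payload)
    paged.setdefault("request_options", {})["pager"] = {"start": start, "rows": rows}
    return paged


# Stand-in for the full payload; only the part that matters for paging is shown.
base = {"request_options": {"pager": {"start": 0, "rows": 100}}}

starts = page_starts(444, 100)
payloads = [with_pager(base, start, 100) for start in starts]
```

Each item in `payloads` can then be passed as `json=...` to `requests.post`, ideally with a short pause between requests.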
https://stackoverflow.com/questions/62176592