我正在尝试制作一个Python程序,它可以用Beautiful收集基因HTML信息,但是我在制作URL时经常会出错。我的代码是:
# import library for requests
import urllib.request as urllib
# import library for reading html /
from bs4 import BeautifulSoup
def fresh_soup(url):
'''
Collects and parses the page source from a given url, returns the parsed page source
- url : the url you wish to scrape
'''
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib.Request(url,headers=hdr)
source = urllib.urlopen(req,timeout=10).read()
soup = BeautifulSoup(source,"lxml")
return soup
###
import csv
result = []
for line in open("C:/Projects/NCBI Scraper project/geneAccNumbers.txt"):
result.append(line.split('/t'))
csv = open("C:/Projects/NCBI Scraper project/geneAccNumbers.txt", 'r')
for gene in csv.readline().split('/t'):
url = 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene + '.1?report=fasta'
def build_url(gene):
return 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene + '.1?report=fasta'
genes_urls = [build_url(gene) for gene in csv]
print(genes_urls)
import requests
for url in genes_urls:
r = requests.get(url)
import urllib.request
for url in genes_urls:
with urllib.request.urlopen(url) as response:
html = response.read()
soup = fresh_soup(url)
result = soup.find_all('pre')
result = result[0]
result = result.text
results +=[result]
我一直得到urllib.error.HTTPError: HTTP错误400:糟糕的请求,尽管生成的每个单独的URL(打印后将它们复制到浏览器中时)似乎都能工作。这就是他们的样子:
/*
* 提示:该行代码过长,系统自动注释不进行高亮。一键复制会移除系统注释
* ['https://www.ncbi.nlm.nih.gov/nuccore/AY348795\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348740\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348741\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348742\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776060\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776010\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776113\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348743\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776061\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776011\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776114\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348745\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147811\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776115\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348746\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147812\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776116\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348747\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147814\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348748\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147815\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776062\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776012\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776117\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348749\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147816\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348750\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147818\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776118\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348751\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348752\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147819\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348753\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147820\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348754\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147821\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776119\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776063\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776013\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776120\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348755\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348756\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348757\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348758\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147822\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348759\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147823\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776064\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776014\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776121\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348761\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147825\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776122\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776065\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776015\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776123\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776066\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776016\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776124\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776067\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776017\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348763\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776068\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776018\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776125\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348764\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147828\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348765\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147829\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776126\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348766\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147830\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348767\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147831\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776127\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348768\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348769\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147832\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348770\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147833\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348771\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147834\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776069\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776019\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776128\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776070\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776020\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776129\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348773\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147836\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776130\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348774\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147837\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348776\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147838\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776071\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776021\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776131\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348777\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348778\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147841\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776132\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776072\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776022\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776133\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348780\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348781\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147843\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348782\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147844\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348783\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147846\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348784\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147847\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776073\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776023\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776134\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348785\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348786\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348787\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776074\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776024\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776135\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776075\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776025\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776136\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348790\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348791\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147849\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348792\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AB043642\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348793\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776076\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776027\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776077\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776028\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776137\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348796\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147851\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348797\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147852\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348798\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776029\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776138\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348799\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147853\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348800\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776078\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776030\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776079\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776031\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776139\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348802\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147855\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776080\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776032\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776140\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348803\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147856\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348804\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776081\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776033\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776141\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776082\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776034\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776142\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776083\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776035\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776143\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348805\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776084\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776036\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348806\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147858\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348807\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147859\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776085\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776144\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348809\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348810\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147860\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776086\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776037\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776145\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348811\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147861\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776146\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776087\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776038\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776147\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348812\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147862\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776088\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776039\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776148\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776089\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776040\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776149\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776090\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776041\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776150\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776091\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776042\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776151\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776092\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147864\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776152\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348814\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147865\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348815\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348816\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147866\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348817\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776094\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776153\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776093\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776043\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348818\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147867\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776154\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348819\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147868\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776095\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776044\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776155\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348820\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147870\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776096\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776026\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776156\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348821\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776045\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776157\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348822\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147871\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776097\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776046\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776158\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348823\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147872\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776098\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147873\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776159\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348824\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776047\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776160\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348825\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147874\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348827\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348828\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147876\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776161\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348829\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776099\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147877\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776162\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776100\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776048\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776163\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348830\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147878\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776101\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776049\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348832\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147879\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348833\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147880\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776164\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776102\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776050\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776165\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348835\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147881\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348836\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348837\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776103\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776051\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776166\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776104\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776052\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776167\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348838\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147882\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348839\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348840\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147883\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348841\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776168\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776105\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776053\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776169\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/FJ826677\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776106\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776054\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776170\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348843\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147885\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776107\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776055\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776171\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348844\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147886\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776108\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776056\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776172\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348845\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147887\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776173\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776109\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AB043527\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776174\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348847\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147890\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348848\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348849\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147892\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AB043641\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776110\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776057\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776175\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776111\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776058\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776176\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348850\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147893\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776112\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776059\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776177\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348852\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348853\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147895\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147897\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348855\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147898.1?report=fasta']
*/
我能做些什么才能使它正确地请求URL和刮伤?
发布于 2018-08-09 13:40:27
在你的网址里你有\n这需要去掉。HTML中没有pre,所以在这个示例中,我找到了第二个h1标记来测试。
import requests
from bs4 import BeautifulSoup
# In your function you need to strip out "\n" as it has no place in your URLs.
def build_url(gene):
return 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene.rstrip() + '.1?report=fasta'
csv = open("C:/Projects/NCBI Scraper project/geneAccNumbers.txt", 'r')
genes_urls = [build_url(gene) for gene in csv]
results = []
for url in genes_urls:
r = requests.get(url)
# Using html.parser but you can use lxml if you like.
soup = BeautifulSoup(r.text,"html.parser")
# there is no <pre> tag in the soup so we will find the second occurrence of H1 for testing.
result = soup.find_all('h1')[1].text
print (result)
results +=[result]
print (results)
产出:
Impatiens amoena internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
Impatiens amphorata internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
Impatiens andohahelae internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
Impatiens andringitrensis internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
Impatiens angulata voucher S.X. Yu 3777 internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence
Impatiens angulata voucher S.X. Yu 3777 atpB-rbcL intergenic spacer, partial sequence; chloroplast
Impatiens angulata voucher S.X. Yu 3777 tRNA-Leu (trnL) gene, partial sequence; trnL-trnF intergenic spacer, complete sequence; and tRNA-Phe (trnF) gene, partial sequence; plastid
Impatiens anovensis internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
Impatiens apalophylla voucher S.X. Yu 4042 internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence
....
更新由JavaScript发出的XHR请求生成pre标记。你可以这样模拟。
import requests
from bs4 import BeautifulSoup
# In your function you need to strip out "\n" as it has no place in your URLs.
def build_url(gene):
return 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene.rstrip() + '.1?report=fasta'
csv = open("C:/Projects/NCBI Scraper project/geneAccNumbers.txt", 'r')
genes_urls = [build_url(gene) for gene in csv]
results = []
for url in genes_urls:
r = requests.get(url)
# Using html.parser but you can use lxml if you like.
soup = BeautifulSoup(r.text,"html.parser")
# You need to get the vale of content in <meta content="38155510" name="ncbi_uidlist"/>
content = soup.find('meta', {'name':"ncbi_uidlist"})['content']
# Simulate the XHR request using "content"
result = requests.get("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=" + content + "&db=nuccore&report=fasta&extrafeat=null&conwithfeat=on&retmode=ht").text
print (result)
results +=[result]
print (results)
产出:
>AY348795.1 Impatiens amoena internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
TCGAAAACTATTTCAAACAACCAGTGAACATAATAATAAATCTTGTGTTGAGATTGACTTTTGTTTAATC
TCTTCCTATTAATGTACTTGGAGTGCTTGCTTGGCAACAAATTTGTATGCCATTTTGTAGGTTCCCTCAA
CTCATAAACAAACCCCGGCGTAAACCGCCAAGGAATGTTAAAAACAATTGCCATTATTTTACCCATTTAT
ATGGGATGAAATTTTGGTTTTAGTTATCAATAAACTAAAATGACTCTCGACAACGGATATCTCGGCTCTC
GCATCGATGAAGAACGTAGCAAAATGCGATACTTGGTGTGAATTGCAGAATCCCGTGAACCATCGAGTTT
TTGAACCCAAGTTGCGCCTGAAGCTATTAGGTTGAAGGCACGTCTGCCTGGGCGTCTCGCTTCGTGTCGT
CTCATTTCATCTATTATGGGACGGATAATGGCCTCCTGTACGTTTATATATCGAGCAGTTGGTTGAAATA
TAAGTCCATATTATAGGACACACGGTTAGTGGTGGTTGAAAAAACTGTTTCAAACCCGTGTTGTAACTTA
ATTTGGATTGATTGACCCTTCTTGTGCCTTTAATGGTGCATCGTTTGC
>AY348740.1 Impatiens amphorata internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
TTCATCACCGNCGAACTTGTTATTAAAATCGGGCTGCGATTGGCCTTTGGNCGGTCGCTTCCCATCATGC
GGTTGGGGTGCACGGTGTTGTATTCTATCTTGGGTACAATCGCGTGTTCCCCCNACTCATAAACAAACCC
CGGCGTAAACCGCCAAGGAATGTTAAAAAGGACTTCCCATACCAGACCCATTTTATTTTTGGGGGATGCG
TAATGGTGTTAGTTTTCCATAAACATAACGACTCTCGACAACGGATATCTCGGCTCTCGCATCGATGAAG
AACGTAGCAAAATGCGATACTTGGTGTGAATTGCARAATTCCCGTGAACCATCGAGTTTTTGAACGCAAG
TTGCGCCTGAAGCCATTAGGTTGAGGGCACGTCTGCCTGGGCGTCTCGCTTCGTGTCGCCCCATTTCATA
ACTGTTTTGGGACGTATAATGGCCTCCTGTGCAATACCCATGCAGCAGTTGGCCGAAATAGAAGTCCATA
TGATAGGACACACGGTTAGTGGTGGTTGARAAACTGTTTC
...
发布于 2018-08-09 01:34:06
试试这个:
import requests, bs4
results = []
for url in genes_urls:
html = requests.get(url.strip(), headers={'User-Agent': 'Mozilla/5.0'}).text
soup = bs4.BeautifulSoup(html, "lxml")
results += [soup.find('pre').text]
并请删除所有其他网络代码,因为它只是可怕的副本粘贴在不同的位置。使用上面的代码从urls列表接收数据。
https://stackoverflow.com/questions/51757421
复制相似问题