Scraping with Beautiful Soup using the correct request - HTML error 400

Stack Overflow user
Asked on 2018-08-09 00:48:23
Answers: 2, Views: 1.4K, Followers: 0, Votes: 2

I am trying to write a Python program that collects gene HTML information with Beautiful Soup, but I keep running into errors when building the URLs. My code is:

# import library for requests 
import urllib.request as urllib
# import library for reading html /
from bs4 import BeautifulSoup

def fresh_soup(url):    
    '''
    Collects and parses the page source from a given url, returns the parsed page source 
    - url : the url you wish to scrape
    '''
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib.Request(url,headers=hdr) 
    source = urllib.urlopen(req,timeout=10).read() 
    soup = BeautifulSoup(source,"lxml")  

    return soup
    ###


import csv

result = []
for line in open("C:/Projects/NCBI Scraper project/geneAccNumbers.txt"):
    result.append(line.split('/t'))

csv = open("C:/Projects/NCBI Scraper project/geneAccNumbers.txt", 'r')
for gene in csv.readline().split('/t'):
    url = 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene + '.1?report=fasta'


def build_url(gene):
    return 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene + '.1?report=fasta'

genes_urls = [build_url(gene) for gene in csv]


print(genes_urls)

import requests

for url in genes_urls:
    r = requests.get(url)

import urllib.request
for url in genes_urls:
    with urllib.request.urlopen(url) as response:
        html = response.read()
    soup = fresh_soup(url)
    result = soup.find_all('pre')
    result = result[0]
    result = result.text
    results +=[result]

I keep getting urllib.error.HTTPError: HTTP Error 400: Bad Request, even though each individual generated URL seems to work when I copy it into the browser after printing. This is what they look like:

/*
* ['https://www.ncbi.nlm.nih.gov/nuccore/AY348795\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348740\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348741\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348742\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776060\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776010\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776113\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348743\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776061\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776011\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776114\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348745\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147811\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776115\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348746\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147812\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776116\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348747\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147814\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348748\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147815\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776062\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776012\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776117\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348749\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147816\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348750\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147818\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776118\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348751\n.1?report=fasta', 
'https://www.ncbi.nlm.nih.gov/nuccore/AY348752\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147819\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348753\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147820\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348754\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147821\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776119\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776063\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776013\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776120\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348755\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348756\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348757\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348758\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147822\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348759\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147823\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776064\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776014\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776121\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348761\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147825\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776122\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776065\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776015\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776123\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776066\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776016\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776124\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776067\n.1?report=fasta', 
'https://www.ncbi.nlm.nih.gov/nuccore/KP776017\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348763\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776068\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776018\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776125\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348764\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147828\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348765\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147829\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776126\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348766\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147830\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348767\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147831\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776127\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348768\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348769\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147832\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348770\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147833\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348771\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147834\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776069\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776019\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776128\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776070\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776020\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776129\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348773\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147836\n.1?report=fasta', 
'https://www.ncbi.nlm.nih.gov/nuccore/KP776130\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348774\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147837\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348776\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147838\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776071\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776021\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776131\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348777\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348778\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147841\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776132\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776072\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776022\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776133\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348780\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348781\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147843\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348782\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147844\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348783\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147846\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348784\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147847\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776073\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776023\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776134\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348785\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348786\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348787\n.1?report=fasta', 
'https://www.ncbi.nlm.nih.gov/nuccore/KP776074\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776024\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776135\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776075\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776025\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776136\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348790\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348791\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147849\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348792\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AB043642\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348793\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776076\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776027\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776077\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776028\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776137\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348796\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147851\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348797\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147852\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348798\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776029\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776138\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348799\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147853\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348800\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776078\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776030\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776079\n.1?report=fasta', 
'https://www.ncbi.nlm.nih.gov/nuccore/KP776031\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776139\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348802\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147855\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776080\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776032\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776140\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348803\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147856\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348804\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776081\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776033\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776141\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776082\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776034\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776142\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776083\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776035\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776143\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348805\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776084\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776036\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348806\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147858\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348807\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147859\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776085\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776144\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348809\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348810\n.1?report=fasta', 
'https://www.ncbi.nlm.nih.gov/nuccore/DQ147860\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776086\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776037\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776145\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348811\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147861\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776146\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776087\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776038\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776147\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348812\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147862\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776088\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776039\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776148\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776089\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776040\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776149\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776090\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776041\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776150\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776091\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776042\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776151\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776092\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147864\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776152\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348814\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147865\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348815\n.1?report=fasta', 
'https://www.ncbi.nlm.nih.gov/nuccore/AY348816\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147866\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348817\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776094\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776153\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776093\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776043\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348818\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147867\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776154\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348819\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147868\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776095\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776044\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776155\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348820\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147870\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776096\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776026\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776156\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348821\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776045\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776157\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348822\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147871\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776097\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776046\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776158\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348823\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147872\n.1?report=fasta', 
'https://www.ncbi.nlm.nih.gov/nuccore/KP776098\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147873\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776159\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348824\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776047\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776160\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348825\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147874\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348827\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348828\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147876\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776161\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348829\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776099\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147877\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776162\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776100\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776048\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776163\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348830\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147878\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776101\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776049\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348832\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147879\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348833\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147880\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776164\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776102\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776050\n.1?report=fasta', 
'https://www.ncbi.nlm.nih.gov/nuccore/KP776165\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348835\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147881\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348836\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348837\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776103\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776051\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776166\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776104\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776052\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776167\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348838\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147882\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348839\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348840\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147883\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348841\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776168\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776105\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776053\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776169\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/FJ826677\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776106\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776054\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776170\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348843\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147885\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776107\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776055\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776171\n.1?report=fasta', 
'https://www.ncbi.nlm.nih.gov/nuccore/AY348844\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147886\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776108\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776056\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776172\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348845\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147887\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776173\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776109\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AB043527\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776174\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348847\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147890\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348848\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348849\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147892\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AB043641\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776110\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776057\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776175\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776111\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776058\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776176\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348850\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147893\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776112\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776059\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/KP776177\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348852\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348853\n.1?report=fasta', 
'https://www.ncbi.nlm.nih.gov/nuccore/DQ147895\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147897\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/AY348855\n.1?report=fasta', 'https://www.ncbi.nlm.nih.gov/nuccore/DQ147898.1?report=fasta']
*/

What can I do to make it request the URLs correctly and scrape them?

2 Answers

Stack Overflow user

Accepted answer

Posted on 2018-08-09 13:40:27

Your URLs contain \n, which needs to be stripped out. There is no pre tag in the HTML, so in this example I find the second h1 tag for testing.

import requests
from bs4 import BeautifulSoup

# In your function you need to strip out "\n" as it has no place in your URLs.
def build_url(gene):
    return 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene.rstrip() + '.1?report=fasta'

# Open the file in a with block so it is closed once the URLs are built.
with open("C:/Projects/NCBI Scraper project/geneAccNumbers.txt", 'r') as gene_file:
    genes_urls = [build_url(gene) for gene in gene_file]

results = []
for url in genes_urls:
    r = requests.get(url)
    # Using html.parser but you can use lxml if you like.
    soup = BeautifulSoup(r.text,"html.parser") 
    # there is no <pre> tag in the soup so we will find the second occurrence of H1 for testing.
    result = soup.find_all('h1')[1].text
    print (result)
    results +=[result]

print (results)

Output:

Impatiens amoena internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
Impatiens amphorata internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
Impatiens andohahelae internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
Impatiens andringitrensis internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
Impatiens angulata voucher S.X. Yu 3777 internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence
Impatiens angulata voucher S.X. Yu 3777 atpB-rbcL intergenic spacer, partial sequence; chloroplast
Impatiens angulata voucher S.X. Yu 3777 tRNA-Leu (trnL) gene, partial sequence; trnL-trnF intergenic spacer, complete sequence; and tRNA-Phe (trnF) gene, partial sequence; plastid
Impatiens anovensis internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
Impatiens apalophylla voucher S.X. Yu 4042 internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence
....
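
The stray \n comes from iterating over the file line by line: every line except possibly the last keeps its newline, which is why only the final URL in the question's list looks clean. Below is a minimal sketch of cleaning the accession numbers before building the URLs, assuming one accession number per line. `clean_accessions` is a hypothetical helper name, and the in-memory sample stands in for geneAccNumbers.txt.

```python
import io

BASE_URL = 'https://www.ncbi.nlm.nih.gov/nuccore/'

def clean_accessions(lines):
    """Strip surrounding whitespace from each line and skip blank lines."""
    return [line.strip() for line in lines if line.strip()]

def build_url(gene):
    return BASE_URL + gene + '.1?report=fasta'

# Stand-in for the real file: iterating a file yields lines that still end in "\n".
sample_file = io.StringIO("AY348795\nAY348740\n\nDQ147898")
genes_urls = [build_url(gene) for gene in clean_accessions(sample_file)]
print(genes_urls)
```

Printing a list shows the `repr` of each string, so a leftover newline would appear as a literal `\n` inside the URL, exactly as in the question's output.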

Update: the pre tag is generated by an XHR request made by JavaScript. You can simulate it like this.

import requests
from bs4 import BeautifulSoup

# In your function you need to strip out "\n" as it has no place in your URLs.
def build_url(gene):
    return 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene.rstrip() + '.1?report=fasta'

# Open the file in a with block so it is closed once the URLs are built.
with open("C:/Projects/NCBI Scraper project/geneAccNumbers.txt", 'r') as gene_file:
    genes_urls = [build_url(gene) for gene in gene_file]

results = []
for url in genes_urls:
    r = requests.get(url)
    # Using html.parser but you can use lxml if you like.
    soup = BeautifulSoup(r.text,"html.parser")
    # You need to get the value of content in <meta content="38155510" name="ncbi_uidlist"/>
    content = soup.find('meta', {'name':"ncbi_uidlist"})['content']
    # Simulate the XHR request using "content"
    result = requests.get("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=" + content + "&db=nuccore&report=fasta&extrafeat=null&conwithfeat=on&retmode=ht").text
    print (result)
    results +=[result]

print (results)

Output:

>AY348795.1 Impatiens amoena internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
TCGAAAACTATTTCAAACAACCAGTGAACATAATAATAAATCTTGTGTTGAGATTGACTTTTGTTTAATC
TCTTCCTATTAATGTACTTGGAGTGCTTGCTTGGCAACAAATTTGTATGCCATTTTGTAGGTTCCCTCAA
CTCATAAACAAACCCCGGCGTAAACCGCCAAGGAATGTTAAAAACAATTGCCATTATTTTACCCATTTAT
ATGGGATGAAATTTTGGTTTTAGTTATCAATAAACTAAAATGACTCTCGACAACGGATATCTCGGCTCTC
GCATCGATGAAGAACGTAGCAAAATGCGATACTTGGTGTGAATTGCAGAATCCCGTGAACCATCGAGTTT
TTGAACCCAAGTTGCGCCTGAAGCTATTAGGTTGAAGGCACGTCTGCCTGGGCGTCTCGCTTCGTGTCGT
CTCATTTCATCTATTATGGGACGGATAATGGCCTCCTGTACGTTTATATATCGAGCAGTTGGTTGAAATA
TAAGTCCATATTATAGGACACACGGTTAGTGGTGGTTGAAAAAACTGTTTCAAACCCGTGTTGTAACTTA
ATTTGGATTGATTGACCCTTCTTGTGCCTTTAATGGTGCATCGTTTGC

>AY348740.1 Impatiens amphorata internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
TTCATCACCGNCGAACTTGTTATTAAAATCGGGCTGCGATTGGCCTTTGGNCGGTCGCTTCCCATCATGC
GGTTGGGGTGCACGGTGTTGTATTCTATCTTGGGTACAATCGCGTGTTCCCCCNACTCATAAACAAACCC
CGGCGTAAACCGCCAAGGAATGTTAAAAAGGACTTCCCATACCAGACCCATTTTATTTTTGGGGGATGCG
TAATGGTGTTAGTTTTCCATAAACATAACGACTCTCGACAACGGATATCTCGGCTCTCGCATCGATGAAG
AACGTAGCAAAATGCGATACTTGGTGTGAATTGCARAATTCCCGTGAACCATCGAGTTTTTGAACGCAAG
TTGCGCCTGAAGCCATTAGGTTGAGGGCACGTCTGCCTGGGCGTCTCGCTTCGTGTCGCCCCATTTCATA
ACTGTTTTGGGACGTATAATGGCCTCCTGTGCAATACCCATGCAGCAGTTGGCCGAAATAGAAGTCCATA
TGATAGGACACACGGTTAGTGGTGGTTGARAAACTGTTTC
...
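
The two steps above (read the uid from the meta tag, then build the viewer URL) can also be sketched with only the standard library, which makes them easy to check offline. This is an illustration, not the code used in the answer: `UidListParser` and `viewer_url` are hypothetical names, the sample HTML is the single meta tag quoted in the comment above, and the query parameters are copied unchanged from the URL in the answer.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

class UidListParser(HTMLParser):
    """Records the content attribute of <meta name="ncbi_uidlist">."""

    def __init__(self):
        super().__init__()
        self.uid = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'meta' and attrs.get('name') == 'ncbi_uidlist':
            self.uid = attrs.get('content')

def viewer_url(uid):
    """Build the viewer URL; parameters are taken as-is from the answer."""
    query = urlencode({
        'id': uid,
        'db': 'nuccore',
        'report': 'fasta',
        'extrafeat': 'null',
        'conwithfeat': 'on',
        'retmode': 'ht',
    })
    return 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?' + query

parser = UidListParser()
parser.feed('<meta content="38155510" name="ncbi_uidlist"/>')
print(viewer_url(parser.uid))
```

With requests, the same dictionary could instead be passed through the `params=` argument of `requests.get`, which encodes the query string for you.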
Votes: 1

Stack Overflow user

Posted on 2018-08-09 01:34:06

Try this:

import requests, bs4

results = []
for url in genes_urls:
    html = requests.get(url.strip(), headers={'User-Agent': 'Mozilla/5.0'}).text
    soup = bs4.BeautifulSoup(html, "lxml")
    results += [soup.find('pre').text]

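And please delete all the other network code, because it is just the same request copy-pasted in several places. Use the code above to fetch the data from the list of URLs.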
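
One way to keep a single request path is to put the fetch in one function and loop over the URLs once. The sketch below is an assumption-laden illustration rather than part of the original answer: `fetch_with_urllib`, `fetch_all`, and `fake_fetch` are hypothetical names, the fetcher is passed in as a parameter so the loop can be exercised without network access, and the example.org URLs are placeholders.

```python
import urllib.error
import urllib.request

def fetch_with_urllib(url):
    """Default fetcher: one GET with the User-Agent header used in the question."""
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req, timeout=10) as response:
        return response.read().decode('utf-8', errors='replace')

def fetch_all(urls, fetch=fetch_with_urllib):
    """Fetch each URL once; collect failures instead of stopping at the first error."""
    pages, failures = {}, {}
    for url in urls:
        url = url.strip()  # remove stray whitespace such as a trailing "\n"
        try:
            pages[url] = fetch(url)
        except (urllib.error.URLError, ValueError) as exc:
            failures[url] = str(exc)
    return pages, failures

# Offline demonstration with a fake fetcher (for illustration only).
def fake_fetch(url):
    if 'BAD' in url:
        raise urllib.error.URLError('simulated failure')
    return '<pre>ok</pre>'

pages, failures = fetch_all(
    ['https://example.org/GOOD\n', 'https://example.org/BAD'],
    fetch=fake_fetch,
)
print(sorted(pages), sorted(failures))
```

Because `urllib.error.HTTPError` is a subclass of `URLError`, a 400 response from the real fetcher would end up in `failures` together with the URL that caused it, which makes a malformed URL easy to spot.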

Votes: 0
The original content of this page is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/51757421
