My name is Ethan, and I am trying to build an API that scrapes technical papers for developers. Right now it only works for arXiv, but I would greatly appreciate some guidance or a code review of my repo. I am a new developer and want my code to be of professional quality.
Repository: https://github.com/evader110/ArXivPully
The source is also provided below:
from falcon import API
from urllib import request
from bs4 import BeautifulSoup


class ArXivPully:
    # Removes rogue newline characters from the title and abstract
    def cleanText(self, text):
        return ' '.join(text.split('\n'))

    def pullFromArXiv(self, search_query, num_results=10):
        # Fix input if it has spaces in it
        split_query = search_query.split(' ')
        if len(split_query) > 1:
            search_query = '%20'.join(split_query)
        url = 'https://export.arxiv.org/api/query?search_query=all:' + search_query + '&start=0&max_results=' + str(num_results)
        data = request.urlopen(url).read()
        output = []
        soup = BeautifulSoup(data, 'html.parser')
        titles = soup.find_all('title')
        # ArXiv populates the first title value as the search query
        titles.pop(0)
        bodies = soup.find_all('summary')
        links = soup.find_all('link', title='pdf')
        for i in range(len(titles)):
            title = self.cleanText(titles[i].text.strip())
            body = self.cleanText(bodies[i].text.strip())
            pdf_link = links[i]['href']
            output.append([pdf_link, title, body])
        return output

    def on_get(self, req, resp):
        """Handles GET requests"""
        output = []
        for item in req.params.items():
            output.append(self.pullFromArXiv(item[0], item[1]))
        resp.media = output


api = API()
api.add_route('/api/query', ArXivPully())
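For context, the route can be exercised locally with Falcon's built-in test client; a minimal sketch (assuming the code above lives in a module named arxiv_pully, which is a hypothetical name, and noting that this performs a live request to the arXiv API):

from falcon import testing
from arxiv_pully import api  # hypothetical module name for the code above

client = testing.TestClient(api)
# Each query-string key is treated as a search query, its value as num_results
result = client.simulate_get('/api/query', params={'machine learning': '2'})
print(result.json)  # nested list: [[[pdf_link, title, abstract], ...]]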
A few notes on the design: I use the Falcon API framework and run it on Google's platform, since both options are free for me and were the simplest to implement. Some known issues are already posted in the repo, but I want to get a better handle on software development skills, best practices, and so on. I would appreciate tips of any size, and I am happy to change this source code substantially to make it more robust.
Posted on 2019-04-15 09:46:50
- Use lower_case names for functions and methods, as Python's style guide (PEP 8) recommends.
- Use requests.get instead of urllib.request; it can take care of encoding the parameters for you.
- pull_from_arxiv can become a generator to save a few lines of code.
- BeautifulSoup can be sped up by using the lxml parser.
- on_get can then be reduced to a single list comprehension.
- I am not sure whether cleanText is really needed. In any case, I would use str.replace instead of str.split and str.join (see the quick check after this list).
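To make the last point concrete, the two spellings are interchangeable here; a quick check (my own illustration, not part of the original answer):

text = "A title\nsplit over\nlines"
assert ' '.join(text.split('\n')) == text.replace('\n', ' ')
print(text.replace('\n', ' '))  # A title split over lines

Applying all of the points above, the reworked version looks like this: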
import requests
from bs4 import BeautifulSoup


class ArXivPully:
    def pull_from_arxiv(self, search_query, num_results=10):
        url = "https://export.arxiv.org/api/query"
        params = {"search_query": f"all:{search_query}",
                  "start": 0,
                  "max_results": num_results}
        data = requests.get(url, params=params).text
        soup = BeautifulSoup(data, 'lxml')
        # ArXiv populates the first title value as the search query
        titles = soup.find_all('title')[1:]
        bodies = soup.find_all('summary')
        links = soup.find_all('link', title='pdf')
        for title, body, link in zip(titles, bodies, links):
            yield (link['href'],
                   title.text.strip().replace("\n", " "),
                   body.text.strip().replace("\n", " "))

    def on_get(self, req, resp):
        """Handles GET requests"""
        resp.media = [list(self.pull_from_arxiv(*item))
                      for item in req.params.items()]
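A side benefit of the generator is that the scraping logic can now be reused outside the web layer. A standalone sketch (again a live request, with the class above in scope):

pully = ArXivPully()
for pdf_link, title, abstract in pully.pull_from_arxiv("machine learning", num_results=3):
    print(f"{title}\n  {pdf_link}")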
Side note: using this returns completely different results from entering the search string in the search field on the arxiv website. Not sure why. But the same is true for your original query (the only difference being that characters such as + and : are encoded by requests.get, e.g. : becomes %3a).
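To see exactly what requests sends, you can inspect a prepared request; a quick illustration (not part of the original answer):

import requests

prepared = requests.Request(
    "GET", "https://export.arxiv.org/api/query",
    params={"search_query": "all:machine learning", "start": 0, "max_results": 10},
).prepare()
print(prepared.url)
# https://export.arxiv.org/api/query?search_query=all%3Amachine+learning&start=0&max_results=10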
https://codereview.stackexchange.com/questions/217461