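Q: How do I query arXiv for a specific year?

Stack Overflow user

Asked 2020-09-24 21:16:32

1 answer · 449 views · 0 followers · 0 votes

I am using the code shown below to retrieve papers from arXiv. I want to retrieve papers whose titles contain the words "machine" and "learning". The number of papers is large, so I would like to slice the results by year (published).

How can I request only the 2020 and 2019 records in search_query? Note that I am not interested in post-filtering.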

import time
import urllib.parse    # needed for urllib.parse.quote below
import urllib.request

import feedparser

# Base API query URL
base_url = 'http://export.arxiv.org/api/query?'

# Search parameters
search_query = urllib.parse.quote("ti:machine learning")
start = 0
total_results = 5000
results_per_iteration = 1000
wait_time = 3

papers = []

print('Searching arXiv for %s' % search_query)

for i in range(start,total_results,results_per_iteration):
    
    print("Results %i - %i" % (i,i+results_per_iteration))
    
    query = 'search_query=%s&start=%i&max_results=%i' % (search_query,
                                                         i,
                                                         results_per_iteration)

    # perform a GET request using the base_url and query
    response = urllib.request.urlopen(base_url+query).read()

    # parse the response using feedparser
    feed = feedparser.parse(response)

    # Run through each entry, and print out information
    for entry in feed.entries:
        #print('arxiv-id: %s' % entry.id.split('/abs/')[-1])
        #print('Title:  %s' % entry.title)
        #feedparser v4.1 only grabs the first author
        #print('First Author:  %s' % entry.author)
        paper = {}
        paper["date"] = entry.published
        paper["title"] = entry.title
        paper["first_author"] = entry.author
        paper["summary"] = entry.summary
        papers.append(paper)
    
    # Sleep a bit before calling the API again
    print('Bulk: %i' % 1)
    time.sleep(wait_time)

Tags: api, urllib, feedparser, python

1 Answer

Stack Overflow user

Accepted answer

Answered 2020-09-24 21:28:30

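According to the arXiv documentation, there is no published or date field available to query on.

What you can do is sort the results by date (by adding &sortBy=submittedDate&sortOrder=descending to the query parameters) and stop making requests once you reach 2018.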
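As a minimal sketch of how those two parameters fit into the request, the query string can be assembled with urllib.parse.urlencode instead of manual string formatting. The helper name build_query_url and the constant BASE_URL are hypothetical names introduced here for illustration; the parameter names are the ones given in this answer.

```python
from urllib.parse import urlencode

# Same endpoint as in the question's code
BASE_URL = "http://export.arxiv.org/api/query?"


def build_query_url(search_query, start, max_results):
    """Return an arXiv API URL that asks for the newest submissions first.

    build_query_url is an illustrative helper, not part of any library.
    """
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    # urlencode percent-encodes the values and joins them with '&'
    return BASE_URL + urlencode(params)


url = build_query_url("ti:machine learning", 0, 1000)
print(url)
```

Note that urlencode encodes the space in the search term as "+", whereas urllib.parse.quote in the question's code produces "%20"; both are valid encodings of a space in a query string.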

Basically, your code should be modified like this:

import time
import urllib.parse
import urllib.request

import feedparser

# Base API query URL
base_url = 'http://export.arxiv.org/api/query?'

# Search parameters
search_query = urllib.parse.quote("ti:machine learning")
i = 0
results_per_iteration = 1000
wait_time = 3
cutoff_year = 2018  # stop requesting once paper dates reach this year
papers = []
reached_cutoff = False
print('Searching arXiv for %s' % search_query)

while not reached_cutoff:
    print("Results %i - %i" % (i, i + results_per_iteration))

    query = ('search_query=%s&start=%i&max_results=%i'
             '&sortBy=submittedDate&sortOrder=descending'
             % (search_query, i, results_per_iteration))

    # Perform a GET request using the base_url and query
    response = urllib.request.urlopen(base_url + query).read()

    # Parse the response using feedparser
    feed = feedparser.parse(response)
    if not feed.entries:
        break  # no results left; avoids looping forever

    # Results are sorted newest first, so stop at the first entry
    # dated in the cutoff year or earlier
    for entry in feed.entries:
        if int(entry.published[0:4]) <= cutoff_year:
            reached_cutoff = True
            break
        paper = {}
        paper["date"] = entry.published
        paper["title"] = entry.title
        # feedparser v4.1 only grabs the first author
        paper["first_author"] = entry.author
        paper["summary"] = entry.summary
        papers.append(paper)

    # Sleep a bit before calling the API again
    i += results_per_iteration
    time.sleep(wait_time)
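The stopping rule can also be pulled out into a small function so that it can be checked offline, without calling the API. The sketch below assumes each entry exposes a "published" string beginning with a four-digit year, as the entries in the code above do (feedparser entries allow dictionary-style access). The function name take_newer_than and the sample batch are hypothetical.

```python
def take_newer_than(entries, cutoff_year):
    """Split a newest-first batch at the first entry from cutoff_year or earlier.

    Returns (kept, reached_cutoff): the entries newer than cutoff_year, and
    whether the cutoff was hit (meaning no further requests are needed).
    """
    kept = []
    for entry in entries:
        if int(entry["published"][:4]) <= cutoff_year:
            return kept, True
        kept.append(entry)
    return kept, False


# Made-up batch, ordered newest first as with sortOrder=descending
batch = [
    {"published": "2020-03-01T00:00:00Z"},
    {"published": "2019-07-15T00:00:00Z"},
    {"published": "2018-12-31T00:00:00Z"},
    {"published": "2018-01-01T00:00:00Z"},
]
kept, done = take_newer_than(batch, 2018)
print(len(kept), done)
```

Comparing the year numerically, rather than testing for equality with "2018", means the loop still terminates if a page happens to end on an entry from an earlier year.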

For the "post-filtering" approach, once enough results have been collected, I would do something like this:

papers2019 = [item for item in papers if item["date"][0:4] == "2019"]
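If several years are needed at once (the question asks for 2019 and 2020), the same idea can be extended by grouping the collected papers by year in a single pass. This is only a sketch: group_by_year is a hypothetical helper, and the sample records are made up to mirror the "date" and "title" keys used above.

```python
from collections import defaultdict


def group_by_year(papers):
    """Group paper dicts by the four-digit year at the start of paper['date']."""
    by_year = defaultdict(list)
    for paper in papers:
        by_year[paper["date"][:4]].append(paper)
    return dict(by_year)


# Made-up records in the same shape as the 'papers' list above
sample = [
    {"date": "2020-09-01T12:00:00Z", "title": "A"},
    {"date": "2019-05-20T08:30:00Z", "title": "B"},
    {"date": "2019-01-02T10:00:00Z", "title": "C"},
]
grouped = group_by_year(sample)
papers2019 = grouped.get("2019", [])
papers2020 = grouped.get("2020", [])
print(len(papers2019), len(papers2020))
```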

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/64047299
