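I have a list of authors of Google Scholar papers: Zoe Pikramenou, James H. R. Tucker, Alison Rodger, Timothy Dafforn. I want to extract and print the titles of the papers that have at least 3 of them as authors.
You can get a dictionary of paper information for each author using scholarly: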
from scholarly import scholarly

AuthorList = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
for Author in AuthorList:
    search_query = scholarly.search_author(Author)
    author = next(search_query).fill()
    print(author)
The output looks something like this (just a small excerpt from one author):
{'bib': {'cites': '69',
         'title': 'Chalearn looking at people and faces of the world: Face '
                  'analysis workshop and challenge 2016',
         'year': '2016'},
 'filled': False,
 'id_citations': 'ZhUEBpsAAAAJ:_FxGoFyzp5QC',
 'source': 'citations'},
{'bib': {'cites': '21',
         'title': 'The NoXi database: multimodal recordings of mediated '
                  'novice-expert interactions',
         'year': '2017'},
 'filled': False,
 'id_citations': 'ZhUEBpsAAAAJ:0EnyYjriUFMC',
 'source': 'citations'},
{'bib': {'cites': '11',
         'title': 'Automatic habitat classification using image analysis and '
                  'random forest',
         'year': '2014'},
 'filled': False,
 'id_citations': 'ZhUEBpsAAAAJ:qjMakFHDy7sC',
 'source': 'citations'},
{'bib': {'cites': '10',
         'title': 'AutoRoot: open-source software employing a novel image '
                  'analysis approach to support fully-automated plant '
                  'phenotyping',
         'year': '2017'},
 'filled': False,
 'id_citations': 'ZhUEBpsAAAAJ:hqOjcs7Dif8C',
 'source': 'citations'}
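How can I collect the bib, and specifically the title, of the papers that have three or more of the four authors?
EDIT: in fact, as has been pointed out, id_citations is not unique to each paper, my mistake. It is better to just use the title itself.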
Posted on 2020-07-01 19:50:12
Expanding on my comment, you can do this with Pandas:
import pandas as pd
from scholarly import scholarly

AuthorList = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
frames = []

for Author in AuthorList:
    search_query = scholarly.search_author(Author)
    author = next(search_query).fill()
    # creating DataFrame with authors
    df = pd.DataFrame([x.__dict__ for x in author.publications])
    df['author'] = Author
    frames.append(df.copy())

# joining all author DataFrames; ignore_index gives every row a unique label,
# so the bib columns assigned below line up with the rows they came from
df = pd.concat(frames, axis=0, ignore_index=True)

# taking bib dict into separate columns, selected by name rather than position
bib_df = pd.DataFrame(df.bib.to_list())
df[['title', 'cites', 'year']] = bib_df[['title', 'cites', 'year']]

# counting unique authors attached to each title
n_authors = df.groupby('title').author.nunique()

# locating the unique titles for all publications with n_authors >= 2
output = n_authors[n_authors >= 2].index
This found 202 papers (out of 774) that have 2 or more of the authors in this list. Here is a sample of the output:
Index(['1, 1′-Homodisubstituted ferrocenes containing adenine and thymine nucleobases: synthesis, electrochemistry, and formation of H-bonded arrays',
'722: Iron chelation by biopolymers for an anti-cancer therapy; binding up the'ferrotoxicity'in the colon',
'A Luminescent One-Dimensional Copper (I) Polymer',
'A Unidirectional Energy Transfer Cascade Process in a Ruthenium Junction Self-Assembled by r-and-Cyclodextrins',
'A Zinc(II)-Cyclen Complex Attached to an Anthraquinone Moiety that Acts as a Redox-Active Nucleobase Receptor in Aqueous Solution',
'A ditopic ferrocene receptor for anions and cations that functions as a chromogenic molecular switch',
'A ferrocene nucleic acid oligomer as an organometallic structural mimic of DNA',
'A heterodifunctionalised ferrocene derivative that self-assembles in solution through complementary hydrogen-bonding interactions',
'A locking X-ray window shutter and collimator coupling to comply with the new Health and Safety at Work Act',
'A luminescent europium hairpin for DNA photosensing in the visible, based on trimetallic bis-intercalators',
...
'Up-Conversion Device Based on Quantum Dots With High-Conversion Efficiency Over 6%',
'Vectorial Control of Energy‐Transfer Processes in Metallocyclodextrin Heterometallic Assemblies',
'Verteporfin selectively kills hypoxic glioma cells through iron-binding and increased production of reactive oxygen species',
'Vibrational Absorption from Oxygen-Hydrogen (Oi-H2) Complexes in Hydrogenated CZ Silicon',
'Virginia review of sociology',
'Wildlife use of log landings in the White Mountain National Forest',
'Yttrium 1995',
'ZUSCHRIFTEN-Redox-Switched Control of Binding Strength in Hydrogen-Bonded Metallocene Complexes Stichworter: Carbonsauren. Elektrochemie. Metallocene. Redoxchemie …',
'[2] Rotaxanes comprising a macrocylic Hamilton receptor obtained using active template synthesis: synthesis and guest complexation',
'pH-controlled delivery of luminescent europium coated nanoparticles into platelets'],
dtype='object', name='title', length=202)
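Because all the data is in Pandas, you can also explore which of the listed authors are attached to each paper, as well as all the other information in the author.publications array that you can access through scholarly.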
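As a rough illustration of that kind of exploration, the sketch below uses a small hand-made DataFrame with the same 'title' and 'author' columns (the rows are made up, not Google Scholar data). It lists which authors are attached to each title and applies the question's threshold of three authors. The names toy, authors_per_title and three_or_more are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for the concatenated per-author DataFrame.
toy = pd.DataFrame(
    {
        "title": ["Paper one", "Paper one", "Paper one",
                  "Paper two", "Paper two", "Paper three"],
        "author": ["Author A", "Author B", "Author C",
                   "Author A", "Author D", "Author B"],
    }
)

# For each title, collect the sorted list of distinct authors attached to it.
authors_per_title = toy.groupby("title")["author"].agg(lambda s: sorted(set(s)))

# Keep only the titles shared by three or more of the listed authors.
n_authors = toy.groupby("title")["author"].nunique()
three_or_more = n_authors[n_authors >= 3].index.tolist()

print(authors_per_title)
print(three_or_more)
```

Changing the 3 in the filter back to 2 gives the looser "two or more authors" selection used in this answer.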
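Posted on 2020-07-01 18:49:29
First, let's convert this into a friendlier format. You said id_citations is unique for each paper, so we'll use it as the hashtable/dict key.
Then we can map each id_citations to the bibs and authors it appears with, as a list of tuples (bib, author_name).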
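from scholarly import scholarly

author_list = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
bibs = {}
for author_name in author_list:
    search_query = scholarly.search_author(author_name)
    # search_author yields author profiles, so take the first match and
    # load its publication list before looping over the papers
    author = next(search_query).fill()
    for bib in author.publications:
        # publications are objects here, so id_citations is read as an attribute
        bibs.setdefault(bib.id_citations, []).append((bib, author_name))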
After that, we can sort the keys in bibs by how many authors are attached to them:
most_cited = sorted(bibs.items(), key=lambda k: len(k[1]))
# most_cited is now a list of tuples (key, value)
# which maps to (id_citation, [(bib1, author1), (bib2, author2), ...])
and/or filter that list down to only the citations that appear three or more times:
cited_enough = [tup[1][0][0] for tup in most_cited if len(tup[1]) >= 3]
# using key [0] in the middle is arbitrary. It can be anything in the
# list, provided the bib objects are identical, but index 0 is guaranteed
# to be there.
# otherwise, the first index is to grab the list rather than the id_citation,
# and the last index is to grab the bib, rather than the author_name
Now we can retrieve the paper titles from there:
# each entry is a publication object, so its bib dict is read as an attribute
paper_titles = [bib.bib['title'] for bib in cited_enough]
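Since the edit to the question notes that id_citations is not unique to each paper, the same grouping can be keyed on the title instead. The following is a minimal sketch of that idea, not something run against Google Scholar: titles_by_author is hypothetical stand-in data (in practice it would be filled from each author's publications), and normalize and shared_titles are illustrative helper names, not part of scholarly.

```python
from collections import defaultdict

# Hypothetical stand-in data: author name -> titles of that author's papers.
titles_by_author = {
    "Author A": ["Paper one", "Paper two", "Paper three"],
    "Author B": ["Paper One", "Paper three"],
    "Author C": ["paper one ", "Paper four"],
    "Author D": ["Paper two", "Paper one"],
}


def normalize(title):
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(title.lower().split())


def shared_titles(titles_by_author, min_authors=3):
    """Return the titles listed under at least `min_authors` distinct authors."""
    authors_by_title = defaultdict(set)  # normalized title -> author names
    display_title = {}                   # normalized title -> first spelling seen
    for author_name, titles in titles_by_author.items():
        for title in titles:
            key = normalize(title)
            authors_by_title[key].add(author_name)
            display_title.setdefault(key, title)
    return sorted(
        display_title[key]
        for key, authors in authors_by_title.items()
        if len(authors) >= min_authors
    )


print(shared_titles(titles_by_author))
```

Storing the authors in a set means a title repeated within one author's own list is counted only once. The normalization step is an assumption about how titles might differ between profiles; stricter or looser matching may be needed for real data.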
https://stackoverflow.com/questions/62683258