Python: Automatically Downloading Papers from Sci-Hub

生信菜鸟团

Published 2022-02-17 11:41:38


Included in the column: 生信菜鸟团

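Sci-Hub is a simple and convenient way to get papers, but each paper still has to be downloaded by hand, which becomes tedious when the list is long. Automatic batch downloading is a better fit, and it is exactly the kind of task the Python requests library handles well.

This article is intended only for academic exchange and for exploring how to use the requests library.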

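Requirements and feasibility analysis

We use the Sci-Hub site https://sci-hub.se/ as the example. The test paper is titled Concise Review: MSC-Derived Exosomes for Cell-Free Therapy., and the browser is Chrome.

After typing the paper title into the Sci-Hub search box and clicking search, the page jumps to the paper's detail page. The left side of that page holds the download button together with the paper's annotation, links, and related information; the right side shows the PDF version of the paper, loaded automatically.

Press F12 to open the developer tools and inspect the page structure. Click the element picker in the top-left corner, then click the "Download" button on the page; the tools jump to the markup for that button. As the figure below shows, it is a button element whose onclick attribute contains the real address of the PDF: https://twin.sci-hub.se/6288/3e57817d8f436407d5477c3b0affc56d/phinney2017.pdf. Opening this link directly gives the source PDF file.

So once we have the detail page, parsing the onclick attribute of this button gives the address of the PDF file.

Next we need to know how to build the HTTP request with requests that fetches this detail page. With the developer tools open on the "Network" tab, reload the Sci-Hub site https://sci-hub.se/ and search for Concise Review: MSC-Derived Exosomes for Cell-Free Therapy. again. The result is shown in the figure below: clicking the first request shows that it is a POST request with status 302 (a redirect), so this is clearly the request we need to reproduce.

The second request (stem.2575) returns the source of the detail page.

The address, data, and other details of this request are shown in the figures below.

The overall approach is therefore: send a POST request to https://sci-hub.se/, let it redirect to the paper's detail page, then parse the PDF's source address out of that page and download it. Both steps can be done with the requests library.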

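Testing the code step by step

Besides the endpoint address, a POST request also needs its accompanying data, which is the Payload in the figure above. Two parameters are submitted: sci-hub-plugin-check and request. sci-hub-plugin-check has no value, while request is the important one: it carries the title of the paper to look up.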
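To make the payload concrete, here is a small offline sketch of how such a dictionary is form-encoded into the body of a POST request. It uses urlencode from the standard library purely as an illustration; requests applies the same style of form encoding when a dict is passed as data, but the exact bytes it sends are not shown in this article, so treat the output as illustrative.

```python
from urllib.parse import urlencode, parse_qs

# The two form fields seen in the Payload panel
data = {
    "sci-hub-plugin-check": "",
    "request": "Concise Review: MSC-Derived Exosomes for Cell-Free Therapy.",
}

# Form-encode the fields the way an HTML form submission would
body = urlencode(data)
print(body)

# Decoding the body again recovers the original fields;
# keep_blank_values=True keeps the empty sci-hub-plugin-check field
decoded = parse_qs(body, keep_blank_values=True)
print(decoded)
```

Spaces become + and the colon becomes %3A, which is why the title can be passed as an ordinary string without any manual escaping.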

import requests
import re

url = 'https://sci-hub.se/'

data = {
    'sci-hub-plugin-check': '',
    'request': 'Concise Review: MSC-Derived Exosomes for Cell-Free Therapy.'
}

headers = {
    'referer': 'https://sci-hub.se/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
}

# Send the POST request
res = requests.post(
    url = url,
    headers = headers,
    data = data
)

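res is the returned response and res.content is the raw data it carries; for text data, res.text decodes it automatically into a string. The redirect that was followed is recorded in res.history, which confirms that the redirect succeeded and that res holds the detail page.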

res.history
#[<Response [302]>]

print(res.text)
#<!DOCTYPE html>
#<html>
#    <head>
#        <title>Sci-Hub | Concise Review: MSC-Derived Exosomes for Cell-Free Therapy. STEM CELLS, 35(4), 851–858 | 10.1002/stem.2575</title>
#        <meta charset="UTF-8">
#        <meta name="viewport" content="width=device-width">
#        <script src="//sci-hub.se/misc/js/jquery-3.6.0.min.js"></script>
#    </head>
#    <body>
#    <script type = "text/javascript">
#...<omitted>...

res.content and res.text hold the same content; one is the raw bytes and the other is the decoded string. For this page, decoding res.content as UTF-8 gives the same result as res.text.

res.content.decode("utf-8") == res.text
#True
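The relationship between the two attributes can be reproduced without any network access. The sketch below uses a hand-written byte string (an assumption standing in for res.content) to show that the match holds only when the bytes are decoded with the encoding they were written in. requests chooses the encoding for res.text from the response, so the comparison above is True here because the page is served as UTF-8.

```python
# Stand-in for res.content: part of the page title, encoded as UTF-8 bytes
title = "STEM CELLS, 35(4), 851–858"
raw = title.encode("utf-8")

# Decoding with the right codec recovers the text exactly
as_utf8 = raw.decode("utf-8")

# Decoding with latin-1 does not raise an error; it silently garbles the
# en dash, which is the kind of mismatch to watch for
as_latin1 = raw.decode("latin-1")

print(as_utf8 == title)
print(as_latin1 == title)
```

If res.text ever looks garbled, checking res.encoding and decoding res.content explicitly is a reasonable first step.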

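Now we need to extract the PDF's download address from res.text. We already know it sits in the onclick attribute of a button element, so a regular expression can find it: capture the string that follows location.href=' up to and including pdf: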

<button onclick = "location.href='https://twin.sci-hub.se/6288/3e57817d8f436407d5477c3b0affc56d/phinney2017.pdf?download=true'">↓ 下載</button>

pat = re.compile("location.href='(.*?pdf)")
pat.findall(res.text)
#['https://twin.sci-hub.se/6288/3e57817d8f436407d5477c3b0affc56d/phinney2017.pdf']
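The pattern above stops at the first occurrence of the letters pdf, which works for this page but would cut the address short if pdf happened to appear earlier in the path. The following is a slightly stricter sketch, not the article's method: extract_pdf_url, STRICT_PATTERN, and the base parameter are hypothetical names introduced for illustration, and the handling of protocol-relative addresses (starting with //) is an assumption about what some pages might return.

```python
import re
from urllib.parse import urljoin

# The button markup from the detail page, used here as offline test input
html = """<button onclick = "location.href='https://twin.sci-hub.se/6288/3e57817d8f436407d5477c3b0affc56d/phinney2017.pdf?download=true'">↓ 下載</button>"""

# Escape the dots and never read past the closing single quote
STRICT_PATTERN = re.compile(r"location\.href='([^']+?\.pdf)")

def extract_pdf_url(page, base="https://sci-hub.se/"):
    """Return the PDF address found in the page, or None if there is none."""
    match = STRICT_PATTERN.search(page)
    if match is None:
        return None
    # urljoin leaves absolute URLs untouched and completes relative ones
    return urljoin(base, match.group(1))

print(extract_pdf_url(html))
print(extract_pdf_url("<p>no download button</p>"))
```

Returning None for a page without a match makes the failure case explicit, instead of raising an IndexError when indexing an empty findall result.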

That gives us the source address of the paper. All that remains is to send a GET request to this address; the content of the response is the paper itself.

# Fetch the paper data
pdf_path = pat.findall(res.text)[0]

pdf_res = requests.get(
    url = pdf_path
)

# Write pdf_res.content to a file
# The title contains characters such as ":" that Windows does not allow in file names, so remove them from the name first
pdf_name = f"{re.sub('[:@#$&%.+]', '', data['request'])}.pdf"
with open(pdf_name, "wb") as f:
    f.write(pdf_res.content)
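The file-name cleanup can be isolated and tried offline. This is a minimal sketch under stated assumptions: safe_pdf_name is a hypothetical helper that is not part of the article's code, and it targets the characters Windows forbids in file names rather than the exact character set used above.

```python
import re

def safe_pdf_name(title):
    """Turn a paper title into a file name that Windows accepts."""
    # Drop the characters Windows forbids in file names: \ / : * ? " < > |
    cleaned = re.sub(r'[\\/:*?"<>|]', "", title)
    # Trim trailing dots and spaces so the name does not end up as "Title..pdf"
    cleaned = cleaned.strip().rstrip(". ")
    # Fall back to a placeholder if nothing is left after cleaning
    return f"{cleaned or 'untitled'}.pdf"

print(safe_pdf_name("Concise Review: MSC-Derived Exosomes for Cell-Free Therapy."))
```

Keeping this step in one function means the single-paper test and the batch script can share the same naming rule.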

The PDF file now appears in the current folder, downloaded as expected:

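Complete implementation

Tidying up the steps above gives a script that can download papers in batches.

Sci-Hub does not actually require headers; they are provided here as an option for flexibility.

papers lists the papers to download. The fourth entry is set to "Error paper name" as a failure test; titles that fail are saved to log.txt.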

import requests
import re

# Necessary variable settings
url = 'https://sci-hub.se/'

papers = [
        "Concise Review: MSC-Derived Exosomes for Cell-Free Therapy.",
        "MSC exosome works through a protein-based mechanism of action",
        "Mammalian MSC from selected species: Features and applications",
        "Error paper name"
    ]

headers = {
    'referer': 'https://sci-hub.se/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
}


# function definition
def get_html(url, papers = None, headers = None):
    if not url:
        raise Exception("url is None, please check...")

    if isinstance(papers, list):
        data = [{'sci-hub-plugin-check': '', 'request':p} for p in papers]
        res = [requests.post(url, data = d, headers = headers) for d in data]
        res_text = [r.text for r in res]
        return res_text
    else:
        data = {'sci-hub-plugin-check': '', 'request':papers}
        res = requests.post(url, data = data, headers=headers)
        return res.text

def get_pdf_path(html, pattern = "location.href='(.*?pdf)"):
    pat = re.compile(pattern)

    if isinstance(html, list):
        pdf_path = [pat.findall(h) for h in html]
    else:
        pdf_path = pat.findall(html)

    return pdf_path

def get_pdf(path):
    res = []
    if isinstance(path, list):

        for p in path:
            try:
                pdf = requests.get(p).content
            except Exception:
                pdf = None
            res.append(pdf)
    else:
        res.append(requests.get(path).content)
    return res


def main(url, papers = None, headers = None, pattern = "location.href='(.*?pdf)"):
    # Normalize a single title to a list first, so that every helper below
    # returns a list and the per-paper indexing stays consistent
    if not isinstance(papers, list):
        papers = [papers]

    print("===== Get html =====")
    html = get_html(url, papers, headers)

    print("===== Get pdf path =====")
    pdf = get_pdf_path(html, pattern = pattern)
    # Keep the first match for each paper, or None when nothing matched
    pdf_path = [p[0] if len(p) > 0 else None for p in pdf]

    print("===== Get pdf content =====")
    pdf_content = get_pdf(pdf_path)

    with open("log.txt", "w") as f_log:
        for idx,p in enumerate(pdf_content):
            print(f"{idx+1}: {papers[idx]} \n   downloading...")
            if p is None:
                print("   failed and log into log_file...")
                f_log.write(f"{papers[idx]}\n")
            else:
                print("   success")
                # Remove characters that Windows does not allow in file names
                pdf_name = re.sub(r'[\\/:?*"<>|]', '', papers[idx])
                pdf_name = f"{pdf_name}.pdf"

                with open(pdf_name, "wb") as f_pdf:
                    f_pdf.write(p)

# Run main() only when executed as a script, not when imported as a module
if __name__ == "__main__":
    main(url, papers = papers, headers = headers)
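One thing the script does not check is whether the bytes it saves are really a PDF: if the server answers with an HTML error page, that page would still be written to a .pdf file and reported as a success. A minimal sketch of a guard follows, assuming a hypothetical helper named looks_like_pdf that is not part of the original script. It relies on the fact that PDF files begin with the marker %PDF-.

```python
def looks_like_pdf(content):
    """Return True when the value is bytes that start with the PDF signature."""
    # None (a failed download) and non-bytes values are rejected outright
    return isinstance(content, bytes) and content.startswith(b"%PDF-")

# Hand-written samples standing in for downloaded content
samples = {
    "real pdf header": b"%PDF-1.7\n...",
    "html error page": b"<!DOCTYPE html><html>...",
    "failed download": None,
}
for label, content in samples.items():
    print(label, looks_like_pdf(content))
```

In main, the check for a failed download could be widened to use this helper, so that responses which are not PDFs also end up in log.txt.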

The output is as follows:

===== Get html =====
===== Get pdf path =====
===== Get pdf content =====
1: Concise Review: MSC-Derived Exosomes for Cell-Free Therapy.
   downloading...
   success
2: MSC exosome works through a protein-based mechanism of action
   downloading...
   success
3: Mammalian MSC from selected species: Features and applications
   downloading...
   success
4: Error paper name
   downloading...
   failed and log into log_file...

The files are saved in the current folder: