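Q: How to read the contents of txt files in different directories and, based on ...

Stack Overflow user

Asked 2015-07-20 17:11:36

Answers: 1 | Views: 349 | Followers: 0 | Votes: 2

I have just started using Python 3 and ran into the following problem:

For my thesis I downloaded a large number of PDFs from different journals, but they are all named by their DOI rather than in the format "Author (Year) - Title". The files are stored in different directories according to journal name and volume, for example: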

/Journal 1/
    /Vol. 1/
        file1.pdf
        file1.txt
        file2.pdf
        file2.txt
        filen.pdf
        filen.txt
    /Vol. 2/
        file1.pdf
        file1.txt
/Journal 2/
    ...

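Since I don't know how to read the content of a PDF with Python, I wrote a very short bash script that converts the PDFs into plain TXT files. Each PDF and its TXT file share the same name and differ only in the file extension.

I want to rename all the PDF files. Fortunately, the running text of every file contains a string I can use. This variable string sits between two static strings:

"Cite this article as: " AUTHOR/YEAR/TITLE ", Journal name".

How can I make Python enter each directory, read the content of the TXT/PDF, extract the variable string between the two fixed strings, and then rename the corresponding PDF file?

If anyone knows how to do this with Python 3, I would be very grateful.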

python

pdf

iteration

rename

1 Answer

Stack Overflow user

Accepted answer

Posted 2015-07-24 23:12:24

Finally got it working:

#__author__ = 'Telefonmann'
# -*- coding: utf-8 -*-

import os, re, ntpath, shutil

# compiled once, as a raw string so that \s is a regex escape and not a string escape
# finds the substring in between "To cite this article: " and "Journal of Quantitative Linguistics,"
# also finds typos like "Journal ofQuantitative ..."
r = re.compile(r'To\s*cite\s*this\s*article:\s*(.*?),\s*Journal\s*of\s*Quantitative\s*Linguistics\s*,')

for root, dirs, files in os.walk(os.getcwd()):
    for file in files: # loops through directories and files
        if file.endswith('.txt'): # only processes txt files
            full_path = ntpath.splitdrive(ntpath.join(root, file))[1]
            # builds correct path under Win 7 (and probably other NT systems)

            with open(full_path, 'r', encoding='utf-8') as f:
                content = f.read().replace('\n', '') # remove newlines

            m = r.search(content)
            if not m:
                # no citation string found: skip this file instead of
                # failing or reusing the title of the previous file
                print('No title found in: ' + full_path)
                continue
            full_title = m.group(1)

            print("full_title: {0}".format(full_title))
            full_title = (full_title.replace('<','') # removes/replaces forbidden characters in Windows file names
                .replace('>','')
                .replace(':',' -')
                .replace('"','')
                .replace('/','')
                .replace('\\','')
                .replace('|','')
                .replace('?','')
                .replace('*',''))

            pdf_name = ntpath.splitext(full_path)[0] + '.pdf'
            # txt and pdf files only differ in their extension, so swap just the extension
            # (a plain replace('txt', 'pdf') would also change "txt" elsewhere in the path)

            print('File: '+ file)
            print('Full Path: ' + full_path)
            print('Full Title: ' + full_title)
            print('PDF Name: ' + pdf_name)
            print('....................................')
            # for troubleshooting

            dirname = ntpath.dirname(pdf_name)
            new_path = ntpath.join(dirname, "{0}.pdf".format(full_title))

            if ntpath.exists(pdf_name): # check the pdf that is about to be copied
                print("all paths found")
                shutil.copy(pdf_name, new_path)
                # makes a copy of the pdf file with the new name in the respective directory
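
The same steps (walk the tree, pull the title out from between the two fixed strings, clean it up, pair each TXT with its PDF) can also be split into small functions that are easy to try out on their own. What follows is only a sketch under assumptions, not the answerer's code: the names `extract_title`, `sanitize` and `plan_renames` are made up for illustration, `pathlib` is swapped in for `os.walk`/`ntpath`, the pattern is copied from the answer above, and line breaks inside the title are collapsed to single spaces rather than removed.

```python
import re
from pathlib import Path

# Assumption: same fixed strings as in the answer above; adjust the journal name as needed.
CITE_RE = re.compile(
    r'To\s*cite\s*this\s*article:\s*(.*?),\s*Journal\s*of\s*Quantitative\s*Linguistics\s*,',
    re.DOTALL,  # let the title span several lines of the txt file
)
# Characters Windows forbids in file names (':' is handled separately below).
FORBIDDEN = re.compile(r'[<>"/\\|?*]')


def extract_title(text):
    """Return the text between the two fixed strings, or None if it is absent."""
    m = CITE_RE.search(text)
    if m is None:
        return None
    # Collapse line breaks and runs of whitespace inside the captured title.
    return ' '.join(m.group(1).split())


def sanitize(title):
    """Replace ':' with ' -' and drop the other forbidden characters."""
    return FORBIDDEN.sub('', title.replace(':', ' -')).strip()


def plan_renames(top):
    """Yield (pdf_path, new_pdf_path) for every txt/pdf pair under top with a usable title."""
    for txt in sorted(Path(top).rglob('*.txt')):
        pdf = txt.with_suffix('.pdf')  # same name, different extension
        if not pdf.exists():
            continue
        title = extract_title(txt.read_text(encoding='utf-8'))
        if title:
            yield pdf, pdf.with_name(sanitize(title) + '.pdf')


# Dry run (prints only; call old.rename(new) once the output looks right):
# for old, new in plan_renames('.'):
#     print(old, '->', new)
```

Keeping the planning step separate from the actual rename makes it possible to inspect the proposed names first, which matters here because a bad match would otherwise overwrite or misname a file.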

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/31522367
