Trouble separating extracted text from PowerPoint files

Stack Overflow user
Asked 2019-04-28 00:58:26
Answers: 2, Votes: 1

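I have a function that extracts text from PowerPoint files. However, the output is one big list containing all the text from every PowerPoint file. How can I keep the text separate, so that the two PowerPoint files I extract from produce two lists of text?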

text_runs = []

def pptx_collect(x):
    for file in pptx_files:
        prs = Presentation(file)
        for slide in prs.slides:
            for shape in slide.shapes:
                if not shape.has_text_frame:
                    continue
                for paragraph in shape.text_frame.paragraphs:
                    for run in paragraph.runs:
                        text_runs.append(run.text)
    return(text_runs)

def Powerpoint(pptx_files):
        for name in pptx_files:
                #print(name)
                IP_list = (pptx_collect(name))
                for item in IP_list:
                        #print(item)
                    keyword = re.findall(inp,item)
                    keyword1 = re.findall(inp1,item)
                    keyword2 = re.findall(word_search,item)
                #print(ip_test)
                file_dict['keyword'].append(keyword+keyword1+keyword2)
                file_dict['name'].append(name.name[0:])
                file_dict['created'].append(time.ctime(name.stat().st_ctime))
                file_dict['modified'].append(time.ctime(name.stat().st_mtime))
                file_dict['path'].append(name)
                file_dict["content"].append(IP_list) #<--- This is where the 
                                                            #problem is.
                #print(file_dict)
        return(file_dict)
Powerpoint(pptx_files)

The output I get is:

['Billy’s ', 'pii', 'Just a test', '04/15/1991', '04.15.1991', '234-23-6456-billys ', 'SSN', 'Address: 58 bonnie ', 'rd', ', 'mass 07037', 'Text from second 2 ', 'Text from second ', 'powerpoint', ' ', '(second page)',  'Text from second 2 ', 'Text from second ', 'powerpoint', ' ', '(second page)', 'FOUO Test', 'Secret', 'This is a test to check ', 'for keywords']

What I want to get is:

['Billy’s ', 'pii', 'Just a test', '04/15/1991', '04.15.1991', '234-23-6456-billys ', 'SSN', 'Address: 58 bonnie ', 'rd', ', Boston, mass 07037', 'Text from second 2 '] 

['Text from second ', 'powerpoint', ' ', '(second page)',  'Text from second 2 ', 'Text from second ', 'powerpoint', ' ', '(second page)', 'FOUO Test', 'Secret', 'This is a test to check ', 'for keywords']

2 Answers

Stack Overflow user

Answered 2019-04-28 01:27:07

Your pptx_collect() function loops over all of the files on every call, ignoring the argument it is given. Try this instead:

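from pptx import Presentation

def pptx_collect(x):
    # Process only the file passed in, not every file in pptx_files.
    text_runs = []  # local list, so each call starts empty
    prs = Presentation(x)
    for slide in prs.slides:
        for shape in slide.shapes:
            # Skip shapes that cannot hold text (pictures, etc.).
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    text_runs.append(run.text)
    return text_runs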
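
One detail worth spelling out: if the list that collects the text lives at module level, as text_runs does in the question, every call appends to and returns the same list object, so text from earlier files shows up in later results. The following is only a small illustration of that behaviour with plain lists; the names collect_shared and collect_fresh are made up for this sketch and are not part of python-pptx or the original code.

```python
shared = []  # module-level list, playing the role of text_runs in the question


def collect_shared(items):
    """Append to the module-level list and return it (same object every call)."""
    shared.extend(items)
    return shared


def collect_fresh(items):
    """Build and return a new list on every call."""
    result = []
    result.extend(items)
    return result


a = collect_shared(["file1 text"])
b = collect_shared(["file2 text"])
# a and b are the very same list, and it now holds the text of both "files".

c = collect_fresh(["file1 text"])
d = collect_fresh(["file2 text"])
# c and d are independent lists, one per "file".
```

This is why creating the list inside the function matters when one list per file is wanted.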
Votes: 0

Stack Overflow user

Answered 2019-04-28 04:38:52

def pptx_collect(x):
    for file in pptx_files:
        inner_list = []
        prs = Presentation(file)
        for slide in prs.slides:
            for shape in slide.shapes:
                if not shape.has_text_frame:
                    continue
                for paragraph in shape.text_frame.paragraphs:
                    for run in paragraph.runs:
                        inner_list.append(run.text)
        text_runs.append(inner_list)
    return(text_runs)

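I would also suggest defining text_runs inside the function, so that each call starts with an empty list instead of appending to a shared global one.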
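
Putting both suggestions together, here is a minimal sketch of one way to get a separate list per file. It is not the code from either answer: the helper names collect_runs and collect_per_file are hypothetical, and returning a dict keyed by file path is an assumption about what is convenient for the caller. Only the python-pptx calls already used in the question (Presentation, slides, shapes, has_text_frame, text_frame.paragraphs, runs) appear here.

```python
def collect_runs(prs):
    """Return a fresh list of run texts for one presentation object."""
    runs = []  # created per call, so nothing leaks between files
    for slide in prs.slides:
        for shape in slide.shapes:
            # Shapes without a text frame (e.g. pictures) have no runs.
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    runs.append(run.text)
    return runs


def collect_per_file(paths):
    """Map each file path (as a string) to its own list of run texts."""
    # Imported here so collect_runs can be tried without python-pptx installed.
    from pptx import Presentation

    return {str(path): collect_runs(Presentation(str(path))) for path in paths}
```

With this shape, something like collect_per_file(pptx_files) would give one entry per PowerPoint file, and the Powerpoint() function from the question could append each file's own list to file_dict["content"] rather than the combined one.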

Votes: 0
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/55882853