How can I use Python to extract the text from ".odt" and ".doc" files located at a URL? I tried searching but could not find anything.
Any leads would be helpful.
from odf import text, teletype
from odf.opendocument import load

# Load a local .odt file and print the text of every paragraph.
textdoc = load(r"C:\Users\OMS\Downloads\sample1.odt")
allparas = textdoc.getElementsByType(text.P)
for para in allparas:
    print(teletype.extractText(para))
This works for a local .odt file, but now I need to read the file from
"https://abc.s3.ap-south-1.amazonaws.com/sample1.odt"
Assume the connection to AWS S3 has already been set up with boto3.
Posted on 2021-01-21 14:26:06
The following was tested with Python 3.6 and a sample .odt file:
import boto3
import io
from odf import text, teletype
from odf.opendocument import load

s3_resource = boto3.resource('s3')  # TODO: change aws connection logic as per your setup


# TODO: refactor name, readability
def get_contents(file_name):
    # Download the object into memory and hand odfpy a file-like object.
    obj = s3_resource.Object('s3_bucket_name', file_name)  # TODO: change aws s3 bucket name as per your setup
    body = obj.get()['Body'].read()
    return load(io.BytesIO(body))


textdoc = get_contents("test.odt")  # TODO: change odt file name as per your setup
allparas = textdoc.getElementsByType(text.P)
for para in allparas:
    print(teletype.extractText(para))
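If the object is publicly readable over plain HTTPS, or odfpy is not installed, a standard-library-only variant is also possible, because an .odt file is a ZIP archive whose body text lives in content.xml. The sketch below is not the odfpy approach used above: the helper names `fetch_bytes` and `odt_bytes_to_paragraphs` are hypothetical, and it only collects the text inside `text:p` and `text:h` elements, so special elements such as tabs or line breaks are not rendered.

```python
import io
import urllib.request
import zipfile
import xml.etree.ElementTree as ET

# ODF namespace for <text:p> (paragraph) and <text:h> (heading) elements.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def fetch_bytes(url):
    """Download a URL and return its raw bytes (assumes it is publicly readable)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def odt_bytes_to_paragraphs(data):
    """Return the text of each paragraph and heading of an .odt given as bytes."""
    # An .odt file is a ZIP archive; the document body is stored in content.xml.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    wanted = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}
    # itertext() also picks up text nested in child elements such as <text:span>.
    return ["".join(el.itertext()) for el in root.iter() if el.tag in wanted]
```

With the URL from the question this could be called as `odt_bytes_to_paragraphs(fetch_bytes("https://abc.s3.ap-south-1.amazonaws.com/sample1.odt"))`, assuming the object allows anonymous reads; for a private bucket, the bytes returned by `obj.get()['Body'].read()` can be passed in instead. The legacy binary .doc format is not a ZIP archive, so this sketch does not apply to it.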
https://stackoverflow.com/questions/65824602