
Counting English Words in a PDF with Python

阿黎逸阳
Published 2023-09-20 08:46:46
Included in the column: 阿黎逸阳的代码

This article shows how to count the words in a PDF with Python.

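1. The PDF to be counted

First, take a look at the PDF whose words we want to count.

For simplicity and clarity, this article uses a two-page English PDF as the example; the code can be applied directly to English PDFs with any number of pages.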

2. Extracting the text from the PDF

Next, use the pdfplumber library to extract the text from the PDF. The code is as follows:
import pdfplumber as plb

file_path = r'F:\公众号\74_pdf英文翻译\murphy1996.pdf'
# Extract and print the text of every page
with plb.open(file_path) as pdf:
    for k, page in enumerate(pdf.pages, start=1):
        print(' ')
        print('Page', k)
        print(page.extract_text())
Parameter details:
file_path: path to the PDF file.
plb.open: opens the PDF file.
page.extract_text(): returns the text content of the page.

Output:
Page 1
Medical and Pediatric Oncology 27:62-63 (1996)
Ecthyma Gangrenosum Occurring at Sites of Iatrogenic Trauma in Pediatric
Oncology Patients
0.M urphy, MB, BCh, BAO, MRCPI, P.J. Marsh, BSC, MB, ChB, MRCPath,
s.1.
j. Gray, MB, ChB, MRCP, MARCPath, Pedler, MB, ChB, MRCPath, and
j. Kernahan, MB, BS, FRcP(Ed) DCH
We report two cases of ecthyma gan- mary skin lesion. Both required prolonged
grenosum which occurred at sites of iatro- courses of antibiotics and one patient died.
genic trauma. The first case developed due The different pathogenic mechanisms and
to metastatic seeding with Pseudornonas outcomes associated with this condition are
aeruginosa during an episode of septicaemia discussed. 01996 Wiley-Liss, Inc.
and the second case occurred as a pri-
Key words: ecthyma gangrenosum, Pseudomonas aeruginosa, iatrogenic
INTRODUCTION ate. No further lesions developed during the remainder of
her treatment.
Ecthyma gangrenosum (EG) is a well recognized cuta-
neous manifestation of P.a eruginosa infections in immu- Case 2
nocompromised patients [ 11. We report two cases of EG
A 13-month-old girl was admitted for investigation of
occurring at sites of iatrogenic trauma in pediatric oncol-
pancytopenia. A diagnosis of aplastic anaemia was made
ogy patients and demonstrate important pathogenic and
following left iliac crest marrow aspirate and trephine
clinical features of this condition.
bone biopsy. She became pyrexial on day 10 following
admission but repeated blood cultures were negative. On
day 24, a 1 cm2 sloughing necrotic area surrounded by
CASE REPORTS purplish erythema was noted at the bone marrow Sam-
pling site. At this time her Hb was 6.6 g/dl and WCC was
Case 1
2.4 X 109/L( neutrophils 0.6 X 109/L). She was treated
A 2-year-old girl with acute lymphoblastic leukaemia
empirically with azlocillin and gentamicin. P. aerugi-
was admitted with a fever, 2 weeks after a course of
nma was isolated from the lesion swab and a diagnosis of
chemotherapy which included intrathecal methotrexate.
EG was made. Blood cultures remained sterile and radio-
She was profoundly neutropenic (WCC 2.2 X lo9 /L, no logical examination did not reveal any evidence of bony
neutrophils). Physical, examination revealed a swollen,
involvement. Despite prolonged antibiotic and topical
erythematous area with a central black eschar over the
therapy, the iliac crest lesion failed to improve. On day
lumbar puncture site. She was commenced empirically
32, she became pyrexial and Enterobacter sp. was iso-
on imipenem-cilastatin and teicoplanin. Following isola-
lated from two blood cultures. She was treated with intra-
tion of P. aeruginosa from both blood cultures and lesion
venous gentamicin and ciprofloxacin. Throughout her
swab, a diagnosis of EG was made and therapy was
illness she required numerous transfusions with platelets
changed to ceftazidime and amikacin. Radiological as-
and red blood cells. A suitable bone marrow donor could
sessment of the lumbar spine did not reveal any evidence
of bony involvement. She became apyrexial on day 3 as
her neutropenia began to recover. She did not require
From the Departments of Microbiology (O.M., P.J.M., J.G., S .J.P.),
treatment with colony stimulating factors. Antimicrobials
and Child Health (J.K.), Royal Victoria Infirmary, Newcastle upon
were discontinued on day 17. Topical silver sulphadia- Tyne, UK.
zine was continued for a further 4 weeks as the lesion
Received April 6, 1995; accepted August 21, 1995
healed slowly by granulation from the base.
Address reprint requests to 0. Murphy, M.B., B.Ch., B.A.O.,
For subsequent chemotherapy, high dose intravenous
M.R.C.P.I., Department of Microbiology, Royal Victoria Infirmary,
methotrexate was substituted for intrathecal methotrex- Queen Victoria Road, Newcastle upon Tyne NEl 4LP, UK.
0 1996 Wiley-Liss, Inc.
 
Page 2
EG at Sites of Iatrogenic Trauma 63
not be found. A 2-week course of GMCSF was started on In case 1, we believe that seeding to an area of trauma-
day 53 but no improvement in her haematological param- tised skin occurred during bacteraemia. Early recognition
eters was seen and her general condition continued to and aggressive treatment may have played a role in con-
deteriorate. On day 85, she again became pyrexial and a trolling the primary septicaemia but recovery of the pa-
1. O X 1.5 cm ulcer on her right labium majus was noted. tient’s bone marrow probably contributed more to the
Her WCC was 0.4 X 109/L. P. aeruginosa was isolated long-term outcome. In case 2, repeated negative blood
from blood cultures for the first time. Despite aggressive cultures suggest that EG occurred as a primary lesion at a
antibiotic and antifungal treatment, further lesions devel- site of prior skin trauma. Despite aggressive treatment,
oped on her face and chest and she subsequently died. persistent profound neutropenia was associated with fail-
ure of the lesion to resolve and the development of a
secondary bacteraemia and further lesions.
DISCUSSION
Paediatric oncology patients are frequently subject to
Although not pathognomic, ecthyma gangrenosum is a invasive procedures involving minor skin trauma which
well recognised manifestation of P. aeruginosa infection may predispose them to infection with various organisms
in immunocompromised patients. Factors such as neutro- including P. aeruginosa. EG is an extremely difficult
penia, use of bread spectrum antibiotics, loss of skin condition to treat and a high index of suspicion in this
integrity, and moist conditions have been shown to pre- at-risk population is required to ensure early diagnosis
dispose to infection with P. aeruginosa and the develop- and optimum treatment.
ment of EG [2]. Two possible pathogenic mechanisms in
the development of this condition have been postulated
[2,3]. In classic or bacteraemic EG, the lesion is consid-
ered to represent blood-borne metastatic seeding of P.
aeruginosa to the skin. In non-bacteraemic or primary REFERENCES
EG, the lesion is located at the site of entry of the organ-
1. Dorff GJ, Geimer NF, Rosenthal DR, et al.: Pseudornonas septice-
ism into the skin. In these cases the lesions have been
mia: illustrated evolution of its skin lesions. Arch Intern Med 128:
found to occur more commonly in the distribution of 591, 1971.
exocrine glands and secondary bacteraemia has rarely 2 El Baze P, Thyss A, Vinti H, Deville A, Dellamonica P, Ortonne
been reported. Early diagnosis and aggressive therapy are J-P: A study of nineteen immunocompromised patients with exten-
sive skin lesions caused by Pseudomonas aeruginosa with and
important in the management of these patients. Although
without bacteraemia. Acta Derm Venereol (Stockh) 71:411-415,
patients with non-bacteraemic lesions have generally
1991.
been found to have a better prognosis than those with 3. Huminer D, Siegman-Igra Y, Morduchowicz G, Pitlik SD: Ec-
bacteraemic EG [3,4], our experience of survival ulti- thyma gangrenosum without bacteraemia. Report of six cases and a
mately being determined by recovery of neutrophils con- review of the literature. Arch Intern Med 147:299-301, 1987.
4. Fergie JE, Patrick CP, Lott L: Pseudomonas aeruginosa cellulitis
firms that of others [5].
and ecthyma gangrenosum in imrnunocompromised children. Pedi-
To our knowledge, these are the first reports of EG atr Infect Dis J 10:496-500, 1991.
occurring at sites of iatrogenic trauma in paediatric oncol- 5. Greene SL, Daniel Su WP, Muller SA: Ecthyma gangrenosurn:
ogy patients. The only previous report was in an adult report of clinical, histopathologic, and bacteriologic aspects of
with AML who developed EG at the site of placement of eight cases. J Am Acad Dermatol 11:781-787, 1984.
6. Klepflish A, Bembi A. Ecthyma gangrenosum caused by a roving
an ECG electrode [6]. In this case, skin trauma coincided
chest electrode in an acute myeloid leukaemia patient with
with a documented P. aeruginosa septicaemia and meta-
Pseudomonas septicaemia [Letterl. J Am Acad Dermatol 18585-
static seeding was felt to have occurred. 586, 1988.
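
The page loop above can also be wrapped in a small helper that collects each page's text instead of printing it. The following is only a sketch under stated assumptions, not part of the original code: `page_texts` is a hypothetical helper name, and it relies only on the `pages` attribute and the `extract_text()` method used above. It substitutes an empty string when a page yields no text, on the assumption that `extract_text()` may return `None` or an empty string for such a page. The stand-in classes exist only so that the example runs without a PDF file.

```python
def page_texts(pdf):
    """Return a list of (page_number, text) pairs, numbering pages from 1.

    `pdf` is any object with a `pages` sequence whose items provide
    `extract_text()`, such as the object opened with pdfplumber above.
    """
    result = []
    for number, page in enumerate(pdf.pages, start=1):
        # Assumption: a page without extractable text may give None or ''.
        text = page.extract_text() or ''
        result.append((number, text))
    return result


# Hypothetical stand-ins so the sketch can be run without a PDF file.
class FakePage:
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


class FakePdf:
    def __init__(self, texts):
        self.pages = [FakePage(t) for t in texts]


demo = page_texts(FakePdf(['first page', None]))
print(demo)
```

With a real file, the intended use would be to call `page_texts(pdf)` inside the `with plb.open(file_path) as pdf:` block.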






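3. Defining a function that counts the words on a single page

Use a regular expression to split the page's text into a list, filter out the empty strings with the filter function, and then count what remains. The code is as follows: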
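import re

def wd_num(pg):
    # Split on whitespace and common punctuation, then drop empty strings.
    # Inside [...] every character is a literal separator, so no "|" is needed.
    # Hyphens are not separators: "2-year-old" counts as one word.
    words = list(filter(None, re.split(r"[\s,!.();/:]", pg)))
    return len(words)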

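Parameter details:

re.split(r"[\s,!.();/:]", pg): splits pg into a list, using whitespace (including newlines), commas, exclamation marks, periods, parentheses, semicolons, slashes, and colons as separators.

filter(None, ...): removes the empty strings from the list.

len: returns the length of the list.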
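
Two details of the separator pattern are easy to miss, and the short sketch below illustrates them using only the standard `re` module. Inside square brackets every character is a literal member of the set, so `|` does not mean "or" there; and a set that does not contain the hyphen keeps hyphenated words together. The helper name `split_words` and the sample strings are made up for illustration.

```python
import re

def split_words(text, pattern=r"[\s,!.();/:]"):
    """Split text on the separator set and drop the empty strings."""
    return [w for w in re.split(pattern, text) if w]

# Inside [...] the bar is an ordinary character: this set contains
# 'a', '|' and 'b', so the string is split at the literal '|'.
bar_demo = re.split(r"[a|b]", "x|y")

# The hyphen is not in the separator set, so '2-year-old' stays whole,
# while the decimal number '2.2' is cut at the period into two tokens.
sample = "A 2-year-old girl (WCC 2.2) was admitted."
tokens = split_words(sample)

print(bar_demo)
print(tokens, len(tokens))
```

Because numbers and fragments such as the two halves of `2.2` are counted as tokens, the result is best read as an approximate word count.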

To make this easier to follow, the single-page word count is built up step by step, from the innermost call outward.

First, the re.split call:
pg = '''Ecthyma Gangrenosum Occurring at Sites of Iatrogenic Trauma in Pediatric
    Oncology Patients'''
re.split(r"[\s,!.]", pg)
Output:
['Ecthyma',
 'Gangrenosum',
 'Occurring',
 'at',
 'Sites',
 'of',
 'Iatrogenic',
 'Trauma',
 'in',
 'Pediatric',
 '',
 '',
 '',
 '',
 'Oncology',
 'Patients']
As you can see, the function splits the string into a list at the specified separators.
Next, filter out the empty strings in the list:
list(filter(None, re.split(r"[\s,!.();/:]", pg)))
Output:
['Ecthyma',
 'Gangrenosum',
 'Occurring',
 'at',
 'Sites',
 'of',
 'Iatrogenic',
 'Trauma',
 'in',
 'Pediatric',
 'Oncology',
 'Patients']
Finally, take the length of this list, which is the number of words in the string:
len(list(filter(None, re.split(r"[\s,!.();/:]", pg))))
Output:
12
A manual count gives the same result.
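
Splitting on separators counts every non-empty token, including numbers. If only alphabetic words should count, one alternative is to match the words directly with `re.findall`. This is a different technique from the one used in this article and is shown only as a sketch: the function name `alpha_word_count` and the pattern are assumptions chosen for illustration, and they treat a run of ASCII letters, optionally joined by hyphens or apostrophes, as one word.

```python
import re

# One word = letters, optionally continued by "-letters" or "'letters".
WORD_RE = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")

def alpha_word_count(text):
    """Count alphabetic words, ignoring numbers and punctuation."""
    return len(WORD_RE.findall(text))

title = '''Ecthyma Gangrenosum Occurring at Sites of Iatrogenic Trauma in Pediatric
    Oncology Patients'''
sample = "She was treated with imipenem-cilastatin on day 3."
words = WORD_RE.findall(sample)

print(alpha_word_count(title))
print(words)
```

On the title string both approaches agree; they differ on text that contains numbers, which this variant skips.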


4. Counting the words in the whole PDF

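Finally, loop over the pages to count the words on each page and in the whole PDF. The code is as follows:
with plb.open(file_path) as pdf:
    sum_wd_num = 0
    for k, page in enumerate(pdf.pages, start=1):
        print(' ')
        pg = page.extract_text()
        n = wd_num(pg)  # count once per page and reuse the value
        sum_wd_num += n
        print('Page', k, 'has', n, 'words')
    print(' ')
    print('Total:', sum_wd_num, 'words')
Output:
Page 1 has 611 words
Page 2 has 674 words
Total: 1285 words
That completes the walkthrough of counting the words in a PDF with Python. If you need it, try working through the code yourself.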
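
For reuse, the per-page loop can be expressed as a function that returns the numbers instead of printing them. The following is a minimal sketch under stated assumptions: `summarize_counts` and `count_words` are hypothetical names, the function takes the page texts as plain strings (for example those returned by `page.extract_text()`), and it declares its own separator-based counter in the style of `wd_num` so that the block runs on its own. The sample page texts are made up.

```python
import re

def count_words(text):
    """Count non-empty tokens after splitting on whitespace and punctuation."""
    return len(list(filter(None, re.split(r"[\s,!.();/:]", text))))

def summarize_counts(page_texts):
    """Return (per_page_counts, total) for an iterable of page strings."""
    # Treat a missing page text (None) as an empty page.
    per_page = [count_words(text or '') for text in page_texts]
    return per_page, sum(per_page)

# Made-up page texts, used only to show the return value.
pages = ["We report two cases.", "Early diagnosis is important; see [2].", None]
per_page, total = summarize_counts(pages)
for k, n in enumerate(per_page, start=1):
    print('Page', k, 'has', n, 'words')
print('Total:', total, 'words')
```

Returning the counts keeps the counting logic separate from the printing, which makes it easier to check on small strings before running it on a real PDF.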

Originally published 2023-09-10 09:05 on the WeChat official account 阿黎逸阳的代码.
