Q: Extracting data from a Tesseract hOCR XHTML file

Stack Overflow user

Asked 2018-06-05 14:10:46

Answers: 1 · Views: 3.5K · Followers: 0 · Votes: 4

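I'm trying to use Python to extract data from Tesseract's hOCR output file. We're limited to Tesseract version 3.04, so neither the image_to_data function nor TSV output is available. I've been able to do this with Beautiful Soup and with R, but neither is available in the environment where this needs to be deployed. I'm just trying to extract each word and its confidence ("x_wconf"). Below is an example output file, for which I'd be happy to get back the lists [90, 87, 89, 89, 89] and ['The', '(quick)', '[brown]', '{fox}', 'jumps!'].

lxml is the only XML parser available in the environment besides ElementTree, so I'm at a bit of a loss.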

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
  <title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name='ocr-system' content='tesseract 3.05.00dev' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head>
<body>
  <div class='ocr_page' id='page_1' title='image "./testing/eurotext.png"; bbox 0 0 1024 800; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 98 66 918 661">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 98 66 918 661">
     <span class='ocr_line' id='line_1_1' title="bbox 105 66 823 113; baseline 0.015 -18; x_size 39; x_descenders 7; x_ascenders 9"><span class='ocrx_word' id='word_1_1' title='bbox 105 66 178 97; x_wconf 90'>The</span> <span class='ocrx_word' id='word_1_2' title='bbox 205 67 347 106; x_wconf 87'><strong>(quick)</strong></span> <span class='ocrx_word' id='word_1_3' title='bbox 376 69 528 109; x_wconf 89'>[brown]</span> <span class='ocrx_word' id='word_1_4' title='bbox 559 71 663 110; x_wconf 89'>{fox}</span> <span class='ocrx_word' id='word_1_5' title='bbox 687 73 823 113; x_wconf 89'>jumps!</span> 
     </span>
    </p>
   </div>
  </div>
 </body>
</html>
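
For readers unfamiliar with the format: in the sample above, each ocrx_word span stores its properties in the title attribute as semicolon-separated "name value ..." entries (for example, bbox followed by four integers and x_wconf followed by one). The sketch below shows one way such a string could be split apart. The helper name parse_title is hypothetical, and the code assumes values contain no embedded spaces or semicolons, which holds for the word-level titles here but not necessarily for the quoted image path on the ocr_page element.

```python
def parse_title(title):
    """Split an hOCR title string into a {property: [values]} dict."""
    props = {}
    for chunk in title.split(';'):
        parts = chunk.split()
        if parts:  # skip empty chunks, e.g. after a trailing semicolon
            props[parts[0]] = parts[1:]
    return props


# Title string copied from word_1_1 in the sample file.
props = parse_title('bbox 105 66 178 97; x_wconf 90')
bbox = [int(v) for v in props['bbox']]
confidence = int(props['x_wconf'][0])
print(bbox, confidence)  # [105, 66, 178, 97] 90
```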

Tags: tesseract, hocr, python, xhtml

1 Answer

Stack Overflow user

Accepted answer

Answered 2018-06-05 15:35:51

Figured out a (gross) way to do it using XPath.

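from lxml import etree
import pandas as pd


def hocr_to_dataframe(fp):
    # Parse the hOCR file; fp is a path or file object (not the literal string 'fp').
    doc = etree.parse(fp)
    words = []
    wordConf = []

    # Visit every element and keep those whose attributes mark them as OCR words.
    for path in doc.xpath('//*'):
        if 'ocrx_word' in path.values():
            # The title attribute looks like "bbox 105 66 178 97; x_wconf 90".
            conf = [x for x in path.values() if 'x_wconf' in x][0]
            wordConf.append(int(conf.split('x_wconf ')[1]))
            # itertext() also collects text nested in child tags such as <strong>.
            words.append(''.join(path.itertext()))

    dfReturn = pd.DataFrame({'word': words,
                             'confidence': wordConf})

    return dfReturn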
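
Since the question mentions that ElementTree is also available and asks only for two plain lists, a variant without lxml or pandas may be worth considering. The following is a minimal sketch, not the answerer's method: the function name hocr_words, the XHTML_NS constant, and the shortened SAMPLE string are assumptions made for illustration. It matches span elements in the XHTML namespace by their class attribute and pulls the number after x_wconf with a regular expression.

```python
import re
import xml.etree.ElementTree as ET

# hOCR output is XHTML, so element tags carry this namespace prefix.
XHTML_NS = '{http://www.w3.org/1999/xhtml}'

# Shortened copy of the sample file from the question (illustration only).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<span class='ocr_line' id='line_1_1' title="bbox 105 66 823 113">
<span class='ocrx_word' id='word_1_1' title='bbox 105 66 178 97; x_wconf 90'>The</span>
<span class='ocrx_word' id='word_1_2' title='bbox 205 67 347 106; x_wconf 87'><strong>(quick)</strong></span>
<span class='ocrx_word' id='word_1_3' title='bbox 376 69 528 109; x_wconf 89'>[brown]</span>
</span>
</body>
</html>"""


def hocr_words(root):
    """Return (words, confidences) for every ocrx_word span under root."""
    words, confidences = [], []
    for span in root.iter(XHTML_NS + 'span'):
        if span.get('class') != 'ocrx_word':
            continue
        match = re.search(r'x_wconf\s+(\d+)', span.get('title', ''))
        if match is None:
            continue  # no confidence recorded for this word
        # itertext() includes text inside nested tags such as <strong>.
        words.append(''.join(span.itertext()))
        confidences.append(int(match.group(1)))
    return words, confidences


words, confidences = hocr_words(ET.fromstring(SAMPLE))
print(words)        # ['The', '(quick)', '[brown]']
print(confidences)  # [90, 87, 89]
```

For a file on disk, the root could presumably be obtained with ET.parse(path).getroot() instead of ET.fromstring; the regular expression also tolerates extra properties after x_wconf in the title string.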

Votes: 3

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/50702264
