I know this isn't a perfect answer, but I'd like to share a video from PyCon 2013 that might be applicable. It is light on implementation details, but it may give you some guidance or inspiration on how to approach or improve your problem.

<a href="http://pyvideo.org/video/1702/building-an-image-processing-pipeline-with-python" rel="nofollow noreferrer">Link to Video</a>

<a href="https://us.pycon.org/2013/schedule/presentation/107/" rel="nofollow noreferrer">Link to Presentation</a>

And if you do decide to use ImageMagick to pre-process your source images a little, <a href="https://stackoverflow.com/questions/7895278/can-i-access-imagemagick-api-with-python">here</a> is a question that points you to nice Python bindings for it.

On a side note, one quite important thing about Tesseract: you need to train it, otherwise it won't be nearly as accurate as it is capable of being.


I'm not sure whether your intent is commercial use, but this works wonders if you're performing OCR on a batch of similar images.

<a href="http://www.fmwconcepts.com/imagemagick/textcleaner/index.php" rel="noreferrer">http://www.fmwconcepts.com/imagemagick/textcleaner/index.php</a>

ORIGINAL
<img src="https://i.stack.imgur.com/2vt1O.jpg" alt="ORIGINAL">

After Pre-Processing with given arguments.

<img src="https://i.stack.imgur.com/yzSrQ.png" alt="After Pre-Processing with given arguments.">


As it turns out, the <a href="https://github.com/tesseract-ocr/tesseract" rel="nofollow noreferrer"><code>tesseract</code></a> wiki has an article that answers this question in the best way I can imagine:

<ul>
<li>Illustrated guide about <a href="https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality" rel="nofollow noreferrer">"Improving the quality of the [OCR] output"</a>.</li>
<li>Question <a href="https://stackoverflow.com/a/10034214/2419207">"image processing to improve tesseract OCR accuracy"</a> may also be of interest.</li>
</ul>

<hr>

(initial answer, just for the record)

I haven't used <code>PyTesser</code>, but I have done some experiments with <code>tesseract</code> (version: <code>3.02.02</code>).

If you invoke tesseract on a color image, it first applies global <a href="https://en.wikipedia.org/wiki/Otsu%27s_method" rel="nofollow noreferrer">Otsu's method</a> to binarize it, and then the actual character recognition is run on the binary (black and white) image.

Image from: <a href="http://scikit-image.org/docs/dev/auto_examples/plot_local_otsu.html" rel="nofollow noreferrer">http://scikit-image.org/docs/dev/auto_examples/plot_local_otsu.html</a>

<img src="https://i.stack.imgur.com/t8QqZ.png" alt="Otsu&#39;s threshold illustration">

As can be seen, 'global Otsu' may not always produce a desirable result.

To better understand what tesseract 'sees', apply Otsu's method to your image and then look at the resulting image.

In conclusion: the most straightforward way to improve the recognition rate is to binarize the images yourself (most likely you will have to find a good threshold by trial and error) and then pass those binarized images to <code>tesseract</code>.
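To see roughly what a global Otsu threshold does to an image, here is a minimal pure-Python sketch. It is not tesseract's own implementation; the function names `otsu_threshold` and `binarize` are made up for illustration, and the input is assumed to be a flat sequence of 8-bit grayscale values (0 to 255).

```python
def otsu_threshold(pixels):
    """Return a global threshold t chosen by Otsu's method.

    Pixels <= t form one class and pixels > t the other; t is the value
    that maximizes the between-class variance of the two classes.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = sum(hist)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of the gray values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break             # upper class is empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Map every pixel above the threshold to white (255), the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]


if __name__ == "__main__":
    # Hypothetical toy "image": dark text pixels and a bright background.
    sample = [10] * 50 + [200] * 50
    t = otsu_threshold(sample)
    print("threshold:", t)
    print("white pixels:", binarize(sample, t).count(255))
```

With Pillow, the pixel values could be obtained with something like `list(Image.open(path).convert("L").getdata())`. For trial and error, skip `otsu_threshold` and pass your own threshold to `binarize`, then save the result and hand that file to `tesseract`.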

Somebody was kind enough to publish <a href="http://fossies.org/dox/tesseract-ocr-3.02.02/" rel="nofollow noreferrer">API docs for tesseract</a>, so it is possible to verify the previous statements about the processing pipeline: <a href="http://fossies.org/dox/tesseract-ocr-3.02.02/group__AdvancedAPI.html#ga09be3b61fd89f7803fe37cc420b92b30" rel="nofollow noreferrer">ProcessPage</a> -> <a href="http://fossies.org/dox/tesseract-ocr-3.02.02/group__AdvancedAPI.html#gaee19c9ea78a647420bbe99a447569995" rel="nofollow noreferrer">GetThresholdedImage</a> -> <a href="http://fossies.org/dox/tesseract-ocr-3.02.02/classtesseract_1_1ImageThresholder.html#a8240c360cff397784e7e9f635d9ed7a3" rel="nofollow noreferrer">ThresholdToPix</a> -> <a href="http://fossies.org/dox/tesseract-ocr-3.02.02/classtesseract_1_1ImageThresholder.html#a9bbeac96aad481ce652816d8780b6e00" rel="nofollow noreferrer">OtsuThresholdRectToPix</a>

I have been experimenting with PyTesser for the past couple of hours and it is a really nice tool. A couple of things I noticed about the accuracy of PyTesser:

<ol>
<li>File with icons, images and text - 5-10% accurate</li>
<li>File with only text (images and icons erased) - 50-60% accurate</li>
<li>File with stretching (and this is the best part) - stretching the file
in 2) above along the x or y axis increased the accuracy by 10-20%</li>
</ol>

So apparently PyTesser does not compensate for font dimensions or image stretching. Although there is much theory to be read about image processing and OCR, are there any standard image cleanup procedures (apart from erasing icons and images) that need to be done before applying PyTesser or other libraries, irrespective of the language?

...........

Wow, this post is quite old now. I started my research on OCR again over the last couple of days. This time I dropped PyTesser and used the Tesseract engine with ImageMagick instead. Coming straight to the point, this is what I found:

<pre><code>1) You can increase the resolution with ImageMagick (there are a bunch of simple shell commands you can use).
2) After increasing the resolution, the accuracy went up by 80-90%.
</code></pre>

So the Tesseract engine is without doubt the best open source OCR engine on the market. No prior image cleaning was required here. The caveat is that it does not work on files with a lot of embedded images, and I couldn't figure out a way to train Tesseract to ignore them. Also, the text layout and formatting in the image make a big difference. It works great with images that contain just text. Hope this helps.
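The post does not say which shell commands were used, so the following is only a sketch of one possible upscale-then-OCR pipeline driven from Python. It assumes the ImageMagick `convert` command (named `magick` in ImageMagick 7) and the `tesseract` command are installed and on the PATH; the helper names, the file names, and the 300% scale factor are all hypothetical choices, not values from the original experiments.

```python
import subprocess


def upscale_command(src, dst, percent=300):
    """Build an ImageMagick command that enlarges src by percent and writes dst."""
    return ["convert", src, "-resize", f"{percent}%", dst]


def tesseract_command(image, output_base):
    """Build a tesseract command; tesseract writes its text to output_base + '.txt'."""
    return ["tesseract", image, output_base]


def run_pipeline(src, work_image, output_base, percent=300):
    """Enlarge the image, run OCR on the enlarged copy, and return the text file path."""
    subprocess.run(upscale_command(src, work_image, percent), check=True)
    subprocess.run(tesseract_command(work_image, output_base), check=True)
    return output_base + ".txt"


if __name__ == "__main__":
    # Only print the commands here; call run_pipeline() once both tools are installed.
    print(" ".join(upscale_command("scan.png", "scan_big.png")))
    print(" ".join(tesseract_command("scan_big.png", "scan_text")))
```

Keeping the command construction separate from `subprocess.run` makes it easy to try different scale factors and compare the recognized text for each one.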

Image cleaning before OCR application
