Nougat: Combining Optical Neural Networks to Lead Intelligent Parsing of Academic PDF Documents and Unlock the Value of Academic Paper PDFs

Author: 汀丶人工智能
Published: 2023-12-14 13:23:51
Column: NLP/KG

This is the official repository for Nougat, an academic document PDF parser that understands LaTeX math and tables.

Project page: https://facebookresearch.github.io/nougat/

1. Installation

From pip:

pip install nougat-ocr

From repository:

pip install git+https://github.com/facebookresearch/nougat

Note for Windows: if you want to use a GPU, first install the correct PyTorch version by following the official PyTorch installation instructions.

Calling the model from an API or generating a dataset requires extra dependencies. Install them with

pip install "nougat-ocr[api]" or pip install "nougat-ocr[dataset]"

1.2 Getting predictions for a PDF

1.2.1 CLI

To get predictions for a PDF, run:

$ nougat path/to/file.pdf -o output_directory

A directory, or a file in which each line is the path to a PDF, can also be passed as the positional argument:

$ nougat path/to/directory -o output_directory
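
The list-file form can be prepared with a few lines of Python. The following is a minimal sketch, not part of Nougat: the helper name write_pdf_list and the file name pdf_list.txt are hypothetical, and it relies only on what the text above states, namely that each line of the file is the path to one PDF.

```python
from pathlib import Path


def write_pdf_list(pdf_dir, list_file):
    """Write the path of every PDF found under pdf_dir to list_file, one per line.

    Hypothetical helper for illustration; it is not part of the nougat package.
    """
    # Collect PDFs recursively and sort them so the list is reproducible.
    pdf_paths = sorted(p for p in Path(pdf_dir).rglob("*.pdf") if p.is_file())
    Path(list_file).write_text(
        "".join(f"{p}\n" for p in pdf_paths), encoding="utf-8"
    )
    return pdf_paths


# Example usage (paths are placeholders):
#   write_pdf_list("path/to/directory", "pdf_list.txt")
# The resulting file could then be passed as the positional argument:
#   nougat pdf_list.txt -o output_directory
```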
usage: nougat [-h] [--batchsize BATCHSIZE] [--checkpoint CHECKPOINT] [--model MODEL] [--out OUT]
              [--recompute] [--markdown] [--no-skipping] pdf [pdf ...]

positional arguments:
  pdf                   PDF(s) to process.

options:
  -h, --help            show this help message and exit
  --batchsize BATCHSIZE, -b BATCHSIZE
                        Batch size to use.
  --checkpoint CHECKPOINT, -c CHECKPOINT
                        Path to checkpoint directory.
  --model MODEL_TAG, -m MODEL_TAG
                        Model tag to use.
  --out OUT, -o OUT     Output directory.
  --recompute           Recompute already computed PDF, discarding previous predictions.
  --full-precision      Use float32 instead of bfloat16. Can speed up CPU conversion for some setups.
  --no-markdown         Do not add postprocessing step for markdown compatibility.
  --markdown            Add postprocessing step for markdown compatibility (default).
  --no-skipping         Don't apply failure detection heuristic.
  --pages PAGES, -p PAGES
                        Provide page numbers like '1-4,7' for pages 1 through 4 and page 7. Only works for single PDFs.
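
To make the --pages syntax concrete, here is a small sketch of how a specification such as '1-4,7' expands into individual page numbers. It illustrates the format described in the help text only; the function name parse_page_spec is hypothetical and this is not Nougat's own parsing code.

```python
def parse_page_spec(spec):
    """Expand a spec such as '1-4,7' into a sorted list of 1-based page numbers.

    Illustrative only; the name and error handling are assumptions.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(first, last + 1))  # ranges include both ends
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_spec("1-4,7"))  # [1, 2, 3, 4, 7]
```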

The default model tag is 0.1.0-small. If you want to use the base model, use 0.1.0-base.

$ nougat path/to/file.pdf -o output_directory -m 0.1.0-base

Each PDF is saved in the output directory as a .mmd file, a lightweight markup language that is mostly compatible with Mathpix Markdown (Nougat makes use of LaTeX tables).

Note: On some devices the failure detection heuristic is not working properly. If you experience a lot of [MISSING_PAGE] responses, try to run with the --no-skipping flag. Related: #11, #67
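
One way to decide whether a rerun with --no-skipping is worthwhile is to count the markers in the generated .mmd files. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and the marker is matched by the prefix "[MISSING_PAGE" because the exact marker text written by Nougat is assumed here rather than confirmed.

```python
from pathlib import Path

# Assumption: every missing-page marker starts with this prefix.
MISSING_MARKER = "[MISSING_PAGE"


def count_missing_pages(mmd_text):
    """Return how many missing-page markers appear in the text of one .mmd file."""
    return mmd_text.count(MISSING_MARKER)


def files_to_retry(output_dir, threshold=1):
    """List (file name, marker count) for .mmd files with at least `threshold` markers."""
    retry = []
    for mmd in sorted(Path(output_dir).glob("*.mmd")):
        n = count_missing_pages(mmd.read_text(encoding="utf-8"))
        if n >= threshold:
            retry.append((mmd.name, n))
    return retry


# Example usage (the directory is a placeholder):
#   for name, n in files_to_retry("output_directory"):
#       print(f"{name}: {n} missing page(s); consider rerunning with --no-skipping")
```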

1.2.2 API

With the extra dependencies installed, you can use app.py to start an API. Call:

$ nougat_api

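To get a prediction for a PDF file, send a POST request to http://127.0.0.1:8503/predict/. The endpoint also accepts the parameters start and stop to limit the computation to a selected range of page numbers (boundaries included).

The response is a string containing the markup text of the document.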

curl -X 'POST' \
  'http://127.0.0.1:8503/predict/' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@<PDFFILE.pdf>;type=application/pdf'

To limit the conversion to pages 1 to 5, use the start/stop parameters in the request URL: http://127.0.0.1:8503/predict/?start=1&stop=5
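
The same request can be issued from Python. The sketch below mirrors the curl call and the start/stop query parameters described above. The function names are hypothetical, it uses the third-party requests package (not mentioned in the original), and decoding the body with response.json() is an assumption based on the accept: application/json header in the curl example.

```python
from pathlib import Path
from urllib.parse import urlencode


def build_predict_url(base_url="http://127.0.0.1:8503", start=None, stop=None):
    """Build the /predict/ URL, adding start/stop only when they are given."""
    params = {k: v for k, v in (("start", start), ("stop", stop)) if v is not None}
    url = base_url.rstrip("/") + "/predict/"
    return f"{url}?{urlencode(params)}" if params else url


def predict(pdf_path, start=None, stop=None):
    """POST one PDF to a locally running nougat_api server and return the markup."""
    import requests  # third-party dependency: pip install requests

    with open(pdf_path, "rb") as fh:
        response = requests.post(
            build_predict_url(start=start, stop=stop),
            headers={"accept": "application/json"},
            # Multipart upload under the form field "file", as in the curl call.
            files={"file": (Path(pdf_path).name, fh, "application/pdf")},
            timeout=600,  # arbitrary choice; conversion can be slow
        )
    response.raise_for_status()
    return response.json()  # assumption: the body is a JSON-encoded string


print(build_predict_url(start=1, stop=5))
# http://127.0.0.1:8503/predict/?start=1&stop=5
```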

2. Dataset

2.1 Generating a dataset

To generate a dataset you need

  1. A directory containing the PDFs
  2. A directory containing the .html files (processed .tex files by LaTeXML) with the same folder structure
  3. A binary file of pdffigures2 and a corresponding environment variable export PDFFIGURES_PATH="/path/to/binary.jar"
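
The prerequisites in the list above can be checked before starting a long run. The following is a minimal sketch with hypothetical helper names. It assumes that "same folder structure" means each PDF has an .html file at the same relative path with the same file stem, which the original does not spell out.

```python
import os
from pathlib import Path


def pdfs_without_html(pdf_root, html_root):
    """Return relative paths of PDFs that lack a matching .html file.

    Assumption: a/b/paper.pdf pairs with a/b/paper.html under html_root.
    """
    pdf_root, html_root = Path(pdf_root), Path(html_root)
    missing = []
    for pdf in sorted(pdf_root.rglob("*.pdf")):
        rel = pdf.relative_to(pdf_root)
        if not (html_root / rel).with_suffix(".html").is_file():
            missing.append(rel)
    return missing


def pdffigures_configured():
    """Check that PDFFIGURES_PATH is set and points to an existing file."""
    jar = os.environ.get("PDFFIGURES_PATH")
    return bool(jar) and Path(jar).is_file()


# Example usage (paths are placeholders):
#   print(pdfs_without_html("path/pdf/root", "path/html/root"))
#   print(pdffigures_configured())
```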

Next run

python -m nougat.dataset.split_htmls_to_pages --html path/html/root --pdfs path/pdf/root --out path/paired/output --figure path/pdffigures/outputs

Additional arguments include

Argument              Description
--recompute           Recompute all splits
--markdown MARKDOWN   Markdown output dir
--workers WORKERS     How many processes to use
--dpi DPI             What resolution the pages will be saved at
--timeout TIMEOUT     Max time per paper in seconds
--tesseract           Tesseract OCR prediction for each page

Finally create a jsonl file that contains all the image paths, markdown text and meta information.

python -m nougat.dataset.create_index --dir path/paired/output --out index.jsonl

For each jsonl file you also need to generate a seek map for faster data loading:

python -m nougat.dataset.gen_seek file.jsonl
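
The idea behind a seek map is to store the byte offset at which each record of the jsonl file begins, so that a loader can jump to record i without reading the lines before it. The sketch below illustrates that idea only. The in-memory list of offsets and the helper names are assumptions, and the on-disk format written by nougat.dataset.gen_seek may differ.

```python
import json


def build_seek_map(jsonl_path):
    """Return the byte offset of the start of every non-empty line."""
    offsets = []
    with open(jsonl_path, "rb") as fh:  # binary mode keeps offsets exact
        while True:
            pos = fh.tell()
            line = fh.readline()
            if not line:
                break  # end of file
            if line.strip():
                offsets.append(pos)
    return offsets


def read_record(jsonl_path, offsets, index):
    """Load a single record by seeking straight to its offset."""
    with open(jsonl_path, "rb") as fh:
        fh.seek(offsets[index])
        return json.loads(fh.readline().decode("utf-8"))


# Example usage (the file name is a placeholder):
#   offsets = build_seek_map("train.jsonl")
#   record = read_record("train.jsonl", offsets, 0)
```

Random access of this kind matters mostly for large index files, where scanning from the start for every sample would dominate loading time.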

The resulting directory structure can look as follows:

root/
├── images
├── train.jsonl
├── train.seek.map
├── test.jsonl
├── test.seek.map
├── validation.jsonl
└── validation.seek.map

Note that the .mmd and .json files in path/paired/output (here images) are no longer required. Removing them halves the number of files, which can be useful when pushing to an S3 bucket.

2.2 Training

To train or fine-tune a Nougat model, run:

python train.py --config config/train_nougat.yaml

2.3 Evaluation

Run

python test.py --checkpoint path/to/checkpoint --dataset path/to/test.jsonl --save_path path/to/results.json

To get the results for the different text modalities, run

python -m nougat.metrics path/to/results.json

2.4 FAQ

  • Why am I only getting [MISSING_PAGE]? Nougat was trained on scientific papers from arXiv and PMC. Is the document you're processing similar? What language is it in? Nougat works best with English papers; other Latin-based languages might work, but Chinese, Russian, Japanese, etc. will not. If these requirements are met, the cause may be false positives in the failure detection when computing on CPU or older GPUs (#11). Try passing the --no-skipping flag for now.
  • Where can I download the model checkpoint from? Checkpoints are uploaded to the release section on GitHub. They can also be downloaded during the first execution of the program. Choose the preferred model by passing --model 0.1.0-{base,small}.

Reference: https://github.com/facebookresearch/nougat

Shared from the author's personal site/blog. Originally published: 2023-12-13.