Question: Split a single large PDF file into n PDF files based on content and rename each split file (in Bash)

Unix & Linux user

Asked 2018-06-06 13:55:49

2 answers · 1.1K views · score 0

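I'm working on a way to split a single big PDF file (it represents the monthly settlements of a credit card). It is built for printing, but we'd like to split it into individual files for later use. Each settlement has a variable length: 2 pages, 3 pages, 4 pages... So we need to "read" each page, find "Page 1 of X", and split off the chunk up to the page where the next "Page 1 of X" appears. Also, each resulting file has to have a unique ID (which is also contained in the "Page 1 of X" page).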
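The splitting rule described above can be sketched independently of any PDF tool: given one line per page that says whether the page is a "Page 1 of X" page, emit the page range of each settlement. The function name `group_ranges` and the one-flag-per-line input format are assumptions made for illustration; producing the flags from real pages (for example with `pdftotext`) is a separate step.

```shell
#!/bin/bash
# Hypothetical helper: reads one line per page from stdin, in page order.
# A line equal to "1" marks a "Page 1 of X" page; anything else is a
# continuation page. Prints one "first-last" page range per settlement.
group_ranges() {
  local page=0 start=0 flag
  while IFS= read -r flag; do
    page=$((page + 1))
    if [[ $flag == 1 ]]; then
      # a new settlement begins: close the previous one, if any
      if (( start > 0 )); then
        printf '%d-%d\n' "$start" "$((page - 1))"
      fi
      start=$page
    fi
  done
  # close the final settlement
  if (( start > 0 )); then
    printf '%d-%d\n' "$start" "$page"
  fi
}

# Example: a 2-page, a 3-page and a 1-page settlement
printf '%s\n' 1 0 1 0 0 1 | group_ranges
# prints:
# 1-2
# 3-5
# 6-6
```

Each printed range could then be handed to a tool that extracts pages, for instance as the page range of a `pdftk ... cat` call.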

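While researching, I found a tool called "PDF Content Split SA" that does exactly what we need. But I'm sure there is a way to do this in Linux (we're moving towards OpenSource+Libre).

Thank you for reading. Any help will be extremely useful.

EDIT

So far, I've found this Nautilus script that could do exactly what we need, but I can't make it work.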

#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits pdf file to multiple pages based on search criteria while renaming the output files using the search criteria and some of the pdf text.

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset $IFS

# process files
for file in "${filelist[@]}"; do
 pagecount=`pdfinfo $file | grep "Pages" | awk '{ print $2 }'`
 # MY SEARCH CRITERIA is a 10 digit long ID number that begins with number 8: 
 storedid=`pdftotext -f 1 -l 1 $file - | egrep '8?[0-9]{9}'`
 pattern=''
 pagetitle=''
 datestamp=''

 for (( pageindex=1; pageindex<=$pagecount; pageindex+=1 )); do

  header=`pdftotext -f $pageindex -l $pageindex $file - | head -n 1`
  pageid=`pdftotext -f $pageindex -l $pageindex $file - | egrep '8?[0-9]{9}'`
  let "datestamp =`date +%s%N`" # to avoid overwriting with same new name

  # match ID found on the page to the stored ID
  if [[ $pageid == $storedid ]]; then
   pattern+="$pageindex " # adds number as text to variable separated by spaces
   pagetitle+="$header+"

   if [[ $pageindex == $pagecount ]]; then #process last output of the file 
    pdftk $file cat $pattern output "$storedid $pagetitle $datestamp.pdf"
    storedid=0
    pattern=''
    pagetitle=''
   fi
  else 
   #process previous set of pages to output
   pdftk $file cat $pattern output "$storedid $pagetitle $datestamp.pdf"
   storedid=$pageid
   pattern="$pageindex "
   pagetitle="$header+"
  fi
 done
done

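I've already edited the search criteria, and the script is correctly placed in the Nautilus scripts folder, but it doesn't work. I've tried debugging with the activity log from the console and by adding marks to the code; apparently there's a conflict with the value returned by pdfinfo, but I have no idea how to solve it.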
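The question does not show the actual error, so the following is only a guess at where a `pdfinfo` conflict could come from: `$file` is unquoted in the script, so a path containing spaces is split into several arguments, and `grep "Pages"` matches any line containing that word rather than only the page-count field. A more defensive sketch (the function names `parse_page_count` and `page_count` are hypothetical) anchors the match to the `Pages:` field and passes the path as a quoted argument:

```shell
#!/bin/bash
# Hypothetical helper: extracts the page count from pdfinfo-style output on
# stdin. Only a line whose first field is exactly "Pages:" is considered.
parse_page_count() {
  awk '$1 == "Pages:" { print $2; exit }'
}

# Hypothetical wrapper: quoting "$1" keeps paths with spaces intact.
page_count() {
  pdfinfo "$1" | parse_page_count
}

# Demonstration on sample text, so it runs without a PDF at hand
printf 'Title:          Pages of History\nPages:          12\nPage size:      595 x 842 pts\n' | parse_page_count
# prints: 12
```

In the script itself this would amount to `pagecount=$(page_count "$file")`, with `"$file"` quoted in the `pdftotext` and `pdftk` calls as well.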

linux

command-line

pdf

split

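2 Answers

Unix & Linux user

Accepted answer

Answered 2018-06-07 18:11:48

I did it. At least, it works. But now I'd like to optimize the process: it takes 40 minutes to process 1000 items in a single massive PDF.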

#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits pdf file to multiple pages based on search criteria while renaming the output files using the search criteria and some of the pdf text.

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS")

# process files
for file in "${filelist[@]}"; do
 pagecount=$(pdfinfo "$file" | grep "Pages" | awk '{ print $2 }')
 # MY SEARCH CRITERIA is the text "RESUMEN DE CUENTA Nº" followed by an 8 digit number:
 #storedid=`pdftotext -f 1 -l 1 $file - | egrep '8?[0-9]{9}'`
 storedid=$(pdftotext -f 1 -l 1 "$file" - | grep -E 'RESUMEN DE CUENTA Nº ?[0-9]{8}')
 pattern=''
 pagetitle=''
 datestamp=''

 # the loop runs one page past the end, presumably so that a group starting on the last page is still written out
 #for (( pageindex=1; pageindex <= $pagecount; pageindex+=1 )); do
 for (( pageindex=1; pageindex <= pagecount+1; pageindex+=1 )); do

  header=$(pdftotext -f $pageindex -l $pageindex "$file" - | head -n 1)
  pageid=$(pdftotext -f $pageindex -l $pageindex "$file" - | grep -E 'RESUMEN DE CUENTA Nº ?[0-9]{8}')
  echo "$pageid"
  let "datestamp = $(date +%s%N)" # to avoid overwriting with same new name

  # match ID found on the page to the stored ID
  if [[ $pageid == "$storedid" ]]; then
   pattern+="$pageindex " # adds number as text to variable separated by spaces
   pagetitle+="$header+"

   if [[ $pageindex == $pagecount ]]; then #process last output of the file
    # $pattern is left unquoted on purpose: each page number must be a separate argument
#   pdftk $file cat $pattern output "$storedid $pagetitle $datestamp.pdf"
    pdftk "$file" cat $pattern output "$storedid.pdf"
    storedid=0
    pattern=''
    pagetitle=''
   fi
  else
   #process previous set of pages to output
#  pdftk $file cat $pattern output "$storedid $pagetitle $datestamp.pdf"
   pdftk "$file" cat $pattern output "$storedid.pdf"
   storedid=$pageid
   pattern="$pageindex "
   pagetitle="$header+"
  fi
 done
done
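Much of the running time probably comes from starting `pdftotext` twice for every page. A possible speed-up, sketched under the assumption that `pdftotext` ends each page of its output with a form feed character (its default behaviour), is to extract the text once and derive all page ranges in a single `awk` pass. The function names `id_ranges` and `split_pdf`, the `noid` placeholder, and the choice to name files after the digits that follow `RESUMEN DE CUENTA` are illustrative, not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper: reads form-feed separated page text on stdin and
# prints "ID first-last" for each run of consecutive pages with the same ID.
id_ranges() {
  awk 'BEGIN { RS = "\f" }
  {
    id = "noid"                      # placeholder for pages without a match
    if (match($0, /RESUMEN DE CUENTA N[^0-9]*[0-9]+/)) {
      id = substr($0, RSTART, RLENGTH)
      gsub(/[^0-9]/, "", id)         # keep only the digits
    }
    if (NR == 1) {
      start = 1
    } else if (id != prev) {
      print prev, start "-" (NR - 1) # close the previous group
      start = NR
    }
    prev = id
  }
  END { if (NR > 0) print prev, start "-" NR }'
}

# Hypothetical driver: one pdftotext call per file, one pdftk call per group.
# Requires pdftotext and pdftk; "$1" is the input PDF.
split_pdf() {
  local id range
  pdftotext "$1" - | id_ranges | while read -r id range; do
    # stdin is redirected so pdftk cannot consume the list of ranges
    pdftk "$1" cat "$range" output "$id.pdf" < /dev/null
  done
}

# Demonstration on sample text (three pages, two IDs), no PDF needed
printf 'RESUMEN DE CUENTA Nº 11111111\nx\fRESUMEN DE CUENTA Nº 11111111\fRESUMEN DE CUENTA Nº 22222222\n\f' | id_ranges
# prints:
# 11111111 1-2
# 22222222 3-3
```

Unlike the script above, this names each file after the digits only rather than the whole matched line, and it writes two groups with the same ID to the same file name, so a suffix would be needed if IDs can repeat.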

Score: 0

Unix & Linux user

Answered 2018-06-06 14:25:08

Is a quick Python script an option? The PyPDF2 package will let you do exactly what you want.

Score: 1

Original page content provided by Unix & Linux. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://unix.stackexchange.com/questions/448212
