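I have output from iTextSharp that PDF readers can parse, but I want to take the binary content and parse it by hand. I tried taking the text between the markers <</Length 256/Filter/FlateDecode>>stream and endstream and decompressing it with the .NET DeflateStream class, which threw this exception: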
System.IO.InvalidDataException: Block length does not match with its complement.
   at System.IO.Compression.Inflater.DecodeUncompressedBlock(Boolean& end_of_block)
   at System.IO.Compression.Inflater.Decode()
   at System.IO.Compression.Inflater.Inflate(Byte[] bytes, Int32 offset, Int32 length)
   at System.IO.Compression.DeflateStream.Read(Byte[] array, Int32 offset, Int32 count)
   at System.IO.Stream.InternalCopyTo(Stream destination, Int32 bufferSize)
   at FlateDecodeTest.Decompress(Byte[] data)
My code is:
using System;
using System.Security.Cryptography;
using System.Text;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;

public class FlateDecodeTest
{
    public static void Main()
    {
        string s = @"xœuÁN!E÷|Å...";
        byte[] b = Decompress(GetBytes(s));
        Console.WriteLine(GetString(b));
    }

    public static byte[] Decompress(byte[] data)
    {
        Console.WriteLine(data.Length);
        byte[] decompressedArray = null;
        try
        {
            using (MemoryStream decompressedStream = new MemoryStream())
            {
                using (MemoryStream compressStream = new MemoryStream(data))
                {
                    using (DeflateStream deflateStream = new DeflateStream(compressStream, CompressionMode.Decompress))
                    {
                        deflateStream.CopyTo(decompressedStream);
                    }
                }
                decompressedArray = decompressedStream.ToArray();
            }
        }
        catch (Exception exception)
        {
            Console.WriteLine(exception);
        }
        return decompressedArray;
    }

    static byte[] GetBytes(string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    static string GetString(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }
}
Posted 2016-11-23 10:53:49
Don't use the DeflateStream class. If you are interested in the content stream of a page (for example page 1), you can use:
byte[] streamBytes = reader.GetPageContent(1);
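where reader is an instance of the PdfReader class. Of course, this is not enough if the page's resource dictionary contains form XObjects. In that case you have to work with PRStream objects. For example, if a form XObject (or any other stream object) has object number 23, you can get the PRStream object like this: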
PRStream str = (PRStream)reader.GetPdfObject(23);
byte[] bytes = PdfReader.GetStreamBytes(str);
Unlike the GetStreamBytesRaw() method, which gives you the raw, still-compressed bytes, the GetStreamBytes() method decodes the stream. See also: iTextSharp: Convert PdfObject to PdfStream
If you don't know the number of the object you want to inspect, you can walk the PDF object tree, for example using the methods of PdfDictionary, PdfArray, and so on.
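As one possible way to walk part of the object tree, the following sketch looks up the form XObjects in a page's resource dictionary and decodes each of them. It assumes iTextSharp 5.x and that the resources are reachable from the page dictionary; the class and method names are hypothetical, and nested XObjects (forms that reference other forms) are not followed.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

public static class XObjectWalkSketch
{
    // Collects the decoded bytes of every form XObject referenced directly
    // from the resource dictionary of the given page, keyed by resource name.
    public static Dictionary<string, byte[]> ReadFormXObjects(PdfReader reader, int pageNumber)
    {
        var result = new Dictionary<string, byte[]>();

        PdfDictionary page = reader.GetPageN(pageNumber);
        PdfDictionary resources = page.GetAsDict(PdfName.RESOURCES);
        if (resources == null) return result;

        PdfDictionary xobjects = resources.GetAsDict(PdfName.XOBJECT);
        if (xobjects == null) return result;

        foreach (PdfName name in xobjects.Keys)
        {
            // GetAsStream resolves the indirect reference for us.
            PdfStream stream = xobjects.GetAsStream(name);
            if (stream == null) continue;

            // Skip image XObjects; only /Subtype /Form is of interest here.
            if (!PdfName.FORM.Equals(stream.GetAsName(PdfName.SUBTYPE))) continue;

            result[name.ToString()] = PdfReader.GetStreamBytes((PRStream)stream);
        }
        return result;
    }
}
```

A caller would create a PdfReader for its own file and pass it in, for example `XObjectWalkSketch.ReadFormXObjects(reader, 1)`. To cover nested forms, the same lookup could be repeated on each form's own /Resources entry.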
https://stackoverflow.com/questions/40762055