Q: How can I parse a large XML file in Haskell with limited resources?

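Stack Overflow user

Asked 2015-04-04 19:12:25

Answers: 1 · Views: 689 · Followers: 0 · Votes: 5

I want to extract information from a large XML file (about 20 GB) in Haskell. Because the file is so large, I used the SAX parsing functions from hexpat.

Here is a simple piece of code I tested: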

import qualified Data.ByteString.Lazy as L
import Data.List (foldl')
import Data.Text (Text)
import Text.XML.Expat.SAX as Sax

-- processEvent is not shown in the question
parse :: FilePath -> IO ()
parse path = do
    inputText <- L.readFile path
    let saxEvents = Sax.parse defaultParseOptions inputText :: [SAXEvent Text Text]
    let txt = foldl' processEvent "" saxEvents
    putStrLn txt

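After enabling profiling in Cabal, the report says that parse.saxEvents accounts for 85% of the allocated memory. I also tried foldr, and the result is the same.

If processEvent gets complex enough, the program crashes with a stack space overflow error.

What am I doing wrong?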

profiling

haskell

xml-parsing

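Stack Overflow user

Accepted answer

Posted 2015-04-05 20:59:31

You don't say what processEvent looks like. In principle, running a strict left fold over lazily generated input read from a lazy ByteString should be unproblematic, so I'm not sure what is going wrong in your case. But when dealing with gigantic files, one ought to use proper streaming types!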
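As an illustration of the strict-left-fold remark: foldl' forces the accumulator only to weak head normal form, so an accumulator with lazy fields (a tuple, say) can still pile up unevaluated thunks over millions of events. Whether that is what happened in the question is unknown, since processEvent was not shown. The following is a minimal sketch using only base; `Event`, `Stats`, `step` and `summarize` are hypothetical stand-ins for SAXEvent and processEvent, not hexpat code.

```haskell
import Data.List (foldl')

-- Hypothetical stand-in for hexpat's SAXEvent; only the shape matters here.
data Event = Start String | End String | Chars String

-- Strict fields (the ! flags): every update is evaluated as soon as the
-- Stats value is forced, instead of being stored as a growing thunk chain.
data Stats = Stats { starts :: !Int, charCount :: !Int }
  deriving (Eq, Show)

-- Hypothetical replacement for processEvent: count start tags and characters.
step :: Stats -> Event -> Stats
step (Stats s c) (Start _) = Stats (s + 1) c
step (Stats s c) (Chars t) = Stats s (c + length t)
step acc         (End _)   = acc

-- foldl' forces the Stats constructor at each step; the strict fields
-- then force the counters themselves.
summarize :: [Event] -> Stats
summarize = foldl' step (Stats 0 0)

main :: IO ()
main = print (summarize events)
  where
    -- A lazily generated event list, consumed incrementally by the fold.
    events = concat (replicate 1000000 [Start "a", Chars "xy", End "a"])
```

If the fields of `Stats` were lazy, the same fold would still type-check, but each counter would grow as a chain of suspended additions until the final print.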

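In fact, hexpat does have a "streaming" interface (just like xml-conduit). It uses the not-very-well-known List library and the rather ugly List class that library defines. In principle, the types from the List package should work fine. I gave up on them quickly because of the lack of combinators, and instead wrote an appropriate instance of the List class for a wrapped version of Pipes.ListT, which I then used to export ordinary Pipes.Producer functions such as parseProducer. The trivial manipulations needed for this are appended below as PipesSax.hs.

Once we have parseProducer, we can convert a ByteString producer into a producer of SAXEvents with Text or ByteString components. Here are some simple operations. I was using a 238 MB "input.xml"; judging from top, these programs never need more than 6 MB of memory.

-- Sax.hs: most of the IO actions use the registryIds pipe defined at the bottom, which is tailored to a huge XML file of which this is a valid 1000-segment excerpt: http://sprunge.us/WaQK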

{-#LANGUAGE OverloadedStrings #-}
import PipesSax ( parseProducer )
import Data.ByteString ( ByteString )
import Text.XML.Expat.SAX 
import Pipes  -- cabal install pipes pipes-bytestring 
import Pipes.ByteString (toHandle, fromHandle, stdin, stdout )
import qualified Pipes.Prelude as P
import qualified System.IO as IO
import qualified Data.ByteString.Char8 as Char8

sax :: MonadIO m => Producer ByteString m () 
                 -> Producer (SAXEvent ByteString ByteString) m ()
sax =  parseProducer defaultParseOptions

-- stream xml from stdin, yielding hexpat tagstream to stdout;
main0 :: IO ()
main0 =  runEffect $ sax stdin >-> P.print

-- stream the extracted 'IDs' from stdin to stdout
main1 :: IO ()
main1 = runEffect $ sax stdin >-> registryIds >-> stdout

-- write all IDs to a file
main2 =  
 IO.withFile "input.xml" IO.ReadMode $ \inp -> 
 IO.withFile "output.txt" IO.WriteMode $ \out -> 
   runEffect $ sax (fromHandle inp) >-> registryIds >-> toHandle out 

-- folds:
-- print number of IDs
main3 =  IO.withFile "input.xml" IO.ReadMode $ \inp -> 
           do n <- P.length $ sax (fromHandle inp) >-> registryIds
              print n

-- sum the meaningful part of the IDs - a dumb fold for illustration
main4 =  IO.withFile "input.xml" IO.ReadMode $ \inp ->
         do let pipeline =  sax (fromHandle inp) >-> registryIds >-> P.map readIntId
            n <- P.fold (+) 0 id pipeline
            print n
  where
   readIntId :: ByteString -> Integer
   readIntId = maybe 0 (fromIntegral.fst) . Char8.readInt . Char8.drop 2

-- my xml has tags with attributes that appear via hexpat thus:
-- StartElement "FacilitySite" [("registryId","110007915364")] 
-- and the like. This is just an arbitrary demo stream manipulation.
registryIds :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m ()
registryIds = do 
  e <- await  -- we look for a 'SAXEvent'
  case e of -- if it matches, we yield, else we go to the next event
    StartElement "FacilitySite" [("registryId",a)] -> do yield a
                                                         yield "\n"
                                                         registryIds
    _ -> registryIds
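
One caveat about the pattern in registryIds: `[("registryId",a)]` matches only when the element carries exactly that one attribute. A more tolerant extraction can use `lookup` on the attribute list. The sketch below keeps to base only, so `Event` is a hypothetical stand-in that mirrors the constructors of hexpat's SAXEvent with String in place of ByteString; `registryIdOf` and `registryIdsOf` are illustrative names that do not appear in the original answer.

```haskell
import Data.Maybe (mapMaybe)

-- Hypothetical stand-in mirroring the shape of hexpat's SAXEvent.
data Event
  = StartElement String [(String, String)]
  | EndElement String
  | CharacterData String

-- Return the registryId attribute of a FacilitySite start tag, wherever
-- it appears among the attributes; Nothing for every other event.
registryIdOf :: Event -> Maybe String
registryIdOf (StartElement "FacilitySite" attrs) = lookup "registryId" attrs
registryIdOf _                                   = Nothing

-- Keep only the IDs, dropping events that do not carry one.
registryIdsOf :: [Event] -> [String]
registryIdsOf = mapMaybe registryIdOf

main :: IO ()
main = mapM_ putStrLn (registryIdsOf sample)
  where
    sample =
      [ StartElement "FacilitySite" [("registryId", "110007915364")]
      , CharacterData "ignored"
      , EndElement "FacilitySite"
      , StartElement "FacilitySite" [("state", "NY"), ("registryId", "42")]
      , StartElement "Other" [("registryId", "7")]
      ]
```

Inside the pipe, the same idea could be expressed with a pattern guard on the case branch, for example `StartElement "FacilitySite" attrs | Just a <- lookup "registryId" attrs -> ...`.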

-- 'library': PipesSax.hs

This just newtypes Pipes.ListT to get the appropriate instances. We don't export anything to do with List or ListT, and use only the standard Pipes.Producer concept.

{-#LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving #-}
module PipesSax (parseProducerLocations, parseProducer) where 
import Data.ByteString (ByteString)
import Text.XML.Expat.SAX
import Data.List.Class
import Control.Monad
import Control.Applicative
import Pipes  
import qualified Pipes.Internal as I

parseProducer
  :: (Monad m, GenericXMLString tag, GenericXMLString text) 
  => ParseOptions tag text
  -> Producer ByteString m () 
  -> Producer (SAXEvent tag text) m ()
parseProducer opt  = enumerate . enumerate_ 
                     . parseG opt 
                     . Select_ . Select

parseProducerLocations
  :: (Monad m, GenericXMLString tag, GenericXMLString text) 
  => ParseOptions tag text
  -> Producer ByteString m () 
  -> Producer (SAXEvent tag text, XMLParseLocation) m ()
parseProducerLocations opt = 
  enumerate . enumerate_ . parseLocationsG opt . Select_ . Select  

newtype ListT_ m a = Select_ { enumerate_ :: ListT m a }
    deriving (Functor, Monad, MonadPlus, MonadIO
             , Applicative, Alternative, Monoid, MonadTrans)

instance Monad m => List (ListT_ m) where
 type ItemM (ListT_ m) = m
 joinL = Select_ . Select . I.M . liftM (enumerate . enumerate_) 
 runList   = liftM emend  . next  . enumerate . enumerate_
   where 
     emend (Right (a,q)) = Cons a (Select_ (Select q))
     emend _ = Nil

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/29450397
