首页
学习
活动
专区
工具
TVP
发布
社区首页 >问答首页 >Haskell/GHC每线程内存成本

Haskell/GHC每线程内存成本
EN

Stack Overflow用户
提问于 2015-10-15 21:02:57
回答 1查看 907关注 0票数 20

我试图理解Haskell ( OS 10.10.5上的GHC 7.10.1 )中的(绿色)线程到底有多昂贵。我知道,与真正的OS线程相比,无论是内存使用还是CPU使用,它都非常便宜。

是的,所以我开始用forks n (绿色)线程(使用优秀的async库)编写一个超级简单的程序,然后让每个线程休眠m秒。

好吧,这很简单:

代码语言:javascript
复制
$ cat PerTheadMem.hs 
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (mapConcurrently)
import System.Environment (getArgs)

main = do
    args <- getArgs
    let (numThreads, sleep) = case args of
                                numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
                                _ -> error "wrong args"
    mapConcurrently (\_ -> threadDelay (sleep*1000*1000)) [1..numThreads]

首先,让我们编译并运行它:

代码语言:javascript
复制
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.1
$ ghc -rtsopts -O3 -prof -auto-all -caf-all PerTheadMem.hs
$ time ./PerTheadMem 100000 10 +RTS -sstderr

这应该会派生100k个线程,并在每个线程中等待10秒,然后打印一些信息:

代码语言:javascript
复制
$ time ./PerTheadMem 100000 10 +RTS -sstderr
340,942,368 bytes allocated in the heap
880,767,000 bytes copied during GC
164,702,328 bytes maximum residency (11 sample(s))
21,736,080 bytes maximum slop
350 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed)  Avg pause  Max pause
Gen  0       648 colls,     0 par    0.373s   0.415s     0.0006s    0.0223s
Gen  1        11 colls,     0 par    0.298s   0.431s     0.0392s    0.1535s

INIT    time    0.000s  (  0.000s elapsed)
MUT     time   79.062s  ( 92.803s elapsed)
GC      time    0.670s  (  0.846s elapsed)
RP      time    0.000s  (  0.000s elapsed)
PROF    time    0.000s  (  0.000s elapsed)
EXIT    time    0.065s  (  0.091s elapsed)
Total   time   79.798s  ( 93.740s elapsed)

%GC     time       0.8%  (0.9% elapsed)

Alloc rate    4,312,344 bytes per MUT second

Productivity  99.2% of total user, 84.4% of total elapsed


real    1m33.757s
user    1m19.799s
sys 0m2.260s

它花费了相当长的时间(1m33.757s),因为每个线程应该只等待10s,但我们已经构建了非线程,所以现在足够公平了。总而言之,我们使用了350MB,这还不错,每个线程有3.5KB。给定初始堆栈大小(-ki is 1 KB)。

对,但现在让我们在线程模式下编译,看看是否可以更快:

代码语言:javascript
复制
$ ghc -rtsopts -O3 -prof -auto-all -caf-all -threaded PerTheadMem.hs
$ time ./PerTheadMem 100000 10 +RTS -sstderr
3,996,165,664 bytes allocated in the heap
2,294,502,968 bytes copied during GC
3,443,038,400 bytes maximum residency (20 sample(s))
14,842,600 bytes maximum slop
3657 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed)  Avg pause  Max pause
Gen  0      6435 colls,     0 par    0.860s   1.022s     0.0002s    0.0028s
Gen  1        20 colls,     0 par    2.206s   2.740s     0.1370s    0.3874s

TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT    time    0.000s  (  0.001s elapsed)
MUT     time    0.879s  (  8.534s elapsed)
GC      time    3.066s  (  3.762s elapsed)
RP      time    0.000s  (  0.000s elapsed)
PROF    time    0.000s  (  0.000s elapsed)
EXIT    time    0.074s  (  0.247s elapsed)
Total   time    4.021s  ( 12.545s elapsed)

Alloc rate    4,544,893,364 bytes per MUT second

Productivity  23.7% of total user, 7.6% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

real    0m12.565s
user    0m4.021s
sys 0m1.154s

哇,快多了,现在只有12秒,好多了。在Activity Monitor中,我发现100k绿色线程大致使用了4个OS线程,这是有道理的。

但是,3657MB总内存!这比使用的非线程版本多了10倍...

到目前为止,我还没有使用-prof-hy之类的工具进行性能分析。为了深入研究,我在单独的运行中执行了一些堆分析(-hy)。内存使用量在这两种情况下都没有变化,堆性能分析图看起来非常不同(左:非线程,右:线程),但我找不到10倍差异的原因。

通过分析分析输出(.prof文件),我也找不到任何真正的区别。

因此,我的问题是:内存使用量的10倍差异来自哪里?

EDIT:只想提一下:当程序甚至没有使用性能分析支持进行编译时,同样的区别也适用。因此,在ghc -rtsopts -threaded -fforce-recomp PerTheadMem.hs上运行time ./PerTheadMem 100000 10 +RTS -sstderr是3559MB。而使用ghc -rtsopts -fforce-recomp PerTheadMem.hs则是395MB。

编辑2:在Linux (Linux 3.13.0-32-generic #57-Ubuntu SMP, x86_64上的GHC 7.10.2)上也会发生同样的情况:非线程化的460MB需要1m28.538秒,而线程化的是3483MB是12.604s。/usr/bin/time -v ...分别报告了Maximum resident set size (kbytes): 413684Maximum resident set size (kbytes): 1645384

EDIT 3:也将程序改为直接使用forkIO

代码语言:javascript
复制
import Control.Concurrent (threadDelay, forkIO)
import Control.Concurrent.MVar
import Control.Monad (mapM_)
import System.Environment (getArgs)

main = do
    args <- getArgs
    let (numThreads, sleep) = case args of
                                numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
                                _ -> error "wrong args"
    mvar <- newEmptyMVar
    mapM_ (\_ -> forkIO $ threadDelay (sleep*1000*1000) >> putMVar mvar ())
          [1..numThreads]
    mapM_ (\_ -> takeMVar mvar) [1..numThreads]

而且它不会改变任何东西:非线程:152MB,线程:3308MB。

EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2015-10-16 13:46:54

嗯,罪魁祸首是threadDelay。*threadDelay**占用大量内存。这是一个与你的程序相当的程序,它在内存中表现得更好。它通过长时间运行的计算来确保所有线程同时运行。

代码语言:javascript
复制
uBound = 38
lBound = 34

doSomething :: Integer -> Integer
doSomething 0 = 1
doSomething 1 = 1
doSomething n | n < uBound && n > 0 = let
                  a = doSomething (n-1) 
                  b = doSomething (n-2) 
                in a `seq` b `seq` (a + b)
              | otherwise = doSomething (n `mod` uBound )

e :: Chan Integer -> Int -> IO ()
e mvar i = 
    do
        let y = doSomething . fromIntegral $ lBound + (fromIntegral i `mod` (uBound - lBound) ) 
        y `seq` writeChan mvar y

main = 
    do
        args <- getArgs
        let (numThreads, sleep) = case args of
                                    numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
                                    _ -> error "wrong args"
            dld = (sleep*1000*1000) 
        chan <- newChan
        mapM_ (\i -> forkIO $ e chan i) [1..numThreads]
        putStrLn "All threads created"
        mapM_ (\_ -> readChan chan >>= putStrLn . show ) [1..numThreads]
        putStrLn "All read"

下面是计时统计数据:

代码语言:javascript
复制
 $ ghc -rtsopts -O -threaded  test.hs
 $ ./test 200 10 +RTS -sstderr -N4

 133,541,985,480 bytes allocated in the heap
     176,531,576 bytes copied during GC
         356,384 bytes maximum residency (16 sample(s))
          94,256 bytes maximum slop
               4 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     64246 colls, 64246 par    1.185s   0.901s     0.0000s    0.0274s
  Gen  1        16 colls,    15 par    0.004s   0.002s     0.0001s    0.0002s

  Parallel GC work balance: 65.96% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.003s elapsed)
  MUT     time   63.747s  ( 16.333s elapsed)
  GC      time    1.189s  (  0.903s elapsed)
  EXIT    time    0.001s  (  0.000s elapsed)
  Total   time   64.938s  ( 17.239s elapsed)

  Alloc rate    2,094,861,384 bytes per MUT second

  Productivity  98.2% of total user, 369.8% of total elapsed

gc_alloc_block_sync: 98548
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 2

每个线程的最大驻留大小约为1.5kb。我对线程的数量和计算的运行长度进行了一些调整。由于线程在forkIO之后立即开始执行任务,因此创建100000个线程实际上需要很长时间。但是结果保持了1000个线程。

这是另一个threadDelay被“分解”出来的程序,这个程序不使用任何CPU,可以很容易地在100000个线程上执行:

代码语言:javascript
复制
e :: MVar () -> MVar () -> IO ()
e start end = 
    do
        takeMVar start
        putMVar end ()

main = 
    do
        args <- getArgs
        let (numThreads, sleep) = case args of
                                    numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
                                    _ -> error "wrong args"
        starts <- mapM (const newEmptyMVar ) [1..numThreads]
        ends <- mapM (const newEmptyMVar ) [1..numThreads]
        mapM_ (\ (start,end) -> forkIO $ e start end) (zip starts ends)
        mapM_ (\ start -> putMVar start () ) starts
        putStrLn "All threads created"
        threadDelay (sleep * 1000 * 1000)
        mapM_ (\ end -> takeMVar end ) ends
        putStrLn "All done"

结果是:

代码语言:javascript
复制
     129,270,632 bytes allocated in the heap
     404,154,872 bytes copied during GC
      77,844,160 bytes maximum residency (10 sample(s))
      10,929,688 bytes maximum slop
             165 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       128 colls,   128 par    0.178s   0.079s     0.0006s    0.0152s
  Gen  1        10 colls,     9 par    0.367s   0.137s     0.0137s    0.0325s

  Parallel GC work balance: 50.09% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    0.189s  ( 10.094s elapsed)
  GC      time    0.545s  (  0.217s elapsed)
  EXIT    time    0.001s  (  0.002s elapsed)
  Total   time    0.735s  ( 10.313s elapsed)

  Alloc rate    685,509,460 bytes per MUT second

  Productivity  25.9% of total user, 1.8% of total elapsed

在我的i5上,创建100000个线程并放置"start“mvar所需的时间不到1秒。峰值驻留在每个线程778字节左右,一点也不坏!

检查threadDelay的实现,我们可以看到线程化和非线程化的情况实际上是不同的:

https://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Conc.IO.html#threadDelay

然后在这里:https://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Event.TimerManager.html

看起来很清白。但是对于那些调用threadDelay的人来说,旧版本的base有一个神秘的(内存) doom的拼写:

https://hackage.haskell.org/package/base-4.4.0.0/docs/src/GHC-Event-Manager.html#line-121

如果仍然存在问题,很难说。然而,人们总是希望一个“现实生活”的并发程序不需要同时有太多的线程在threadDelay上等待。就我个人而言,从现在开始,我将密切关注我对threadDelay的使用。

票数 10
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/33149324

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档