Debugging Simple Memory Leaks in Go

Original author: dm03514

SRE: Debugging Simple Memory Leaks in Go

Memory leaks are a class of bugs where memory is not released even after it is no longer needed. They are often explicit and highly visible, which makes them a great candidate for learning debugging. Go is particularly well suited to identifying memory leaks because its powerful toolchain ships with amazingly capable tools (pprof) that make pinpointing memory usage easy.

I’m hoping this post will illustrate how to visually identify a memory leak, narrow it down to a specific process, correlate the process leak with work, and finally find the source of the leak using pprof. This post is contrived in order to allow for simple identification of the root cause of the memory leak. The pprof overview is intentionally brief; it aims to illustrate what the tool is capable of and is not an exhaustive overview of its features.

The service used to generate the data in this post is available here.

What is a memory leak?

If memory grows unbounded and never reaches a steady state then there is probably a leak. The key here is that memory grows without ever reaching a steady state, and eventually causes problems through explicit crashes or by impacting system performance.

Memory leaks can happen for any number of reasons. There can be logical leaks where data structures grow unbounded, leaks arising from the complexities of poor object reference handling, or any number of other causes. Regardless of the source, many memory leaks elicit a visually noticeable pattern: the “sawtooth”.
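
To make the “unbounded data structure” case concrete, here is a minimal sketch (my own illustration, not code from the post) of a handler that retains data in a package-level map and never evicts it:

package main

import "net/http"

// leakyCache grows on every request and is never trimmed or evicted,
// so the heap keeps growing for as long as traffic arrives.
var leakyCache = map[string][]byte{}

func handler(w http.ResponseWriter, r *http.Request) {
	// Each distinct URL adds an entry that is retained forever.
	leakyCache[r.URL.String()] = make([]byte, 10*1024)
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}

Plotted over time, the resident memory of a process like this produces exactly the sawtooth pattern discussed below.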

Debug Process

This blog post is focused on exploring how to identify and pinpoint the root cause of a Go memory leak. We’ll be focusing primarily on characteristics of memory leaks, how to identify them, and how to determine their root cause using Go. Because of this, our actual debug process will be relatively superficial and informal.

The goal of our analysis is to progressively narrow the scope of the problem by whittling away possibilities until we have enough information to form and propose a hypothesis. After we have enough data and a reasonable scope for the cause, we should form a hypothesis and try to invalidate it with data.

Each step will try to either pinpoint a cause of an issue or invalidate a non-cause. Along the way we’ll be forming a series of hypotheses; they will necessarily be general at first and then progressively more specific. This is loosely based on the scientific method. Brendan Gregg does an amazing job of covering different methodologies for system investigation (primarily focused on performance).

Just to reiterate, we’ll try to:

  • Ask a question
  • Form a hypothesis
  • Analyze the hypothesis
  • Repeat until the root cause is found

Identification

How do we even know if there is a problem (i.e. a memory leak)? Explicit errors are direct indicators of an issue. For memory leaks, common indicators are OOM errors or explicit system crashes.

OOM errors

Errors are the most explicit indicator of a problem. User-generated errors have the potential to produce false positives if the logic is off; OOM, on the other hand, is the OS literally indicating that something is using too much memory. In the error below this manifests as cgroup limits being reached and the container being killed.

dmesg

[14808.063890] main invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=0
[14808.063893] main cpuset=34186d9bd07706222bd427bb647ceed81e8e108eb653ff73c7137099fca1cab6 mems_allowed=0
[14808.063899] CPU: 2 PID: 11345 Comm: main Not tainted 4.4.0-130-generic #156-Ubuntu
[14808.063901] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[14808.063902]  0000000000000286 ac45344c9134371f ffff8800b8727c88 ffffffff81401c43
[14808.063906]  ffff8800b8727d68 ffff8800b87a5400 ffff8800b8727cf8 ffffffff81211a1e
[14808.063908]  ffffffff81cdd014 ffff88006a355c00 ffffffff81e6c1e0 0000000000000206
[14808.063911] Call Trace:
[14808.063917]  [<ffffffff81401c43>] dump_stack+0x63/0x90
[14808.063928]  [<ffffffff81211a1e>] dump_header+0x5a/0x1c5
[14808.063932]  [<ffffffff81197dd2>] oom_kill_process+0x202/0x3c0
[14808.063936]  [<ffffffff81205514>] ? mem_cgroup_iter+0x204/0x3a0
[14808.063938]  [<ffffffff81207583>] mem_cgroup_out_of_memory+0x2b3/0x300
[14808.063941]  [<ffffffff8120836d>] mem_cgroup_oom_synchronize+0x33d/0x350
[14808.063944]  [<ffffffff812033c0>] ? kzalloc_node.constprop.49+0x20/0x20
[14808.063947]  [<ffffffff81198484>] pagefault_out_of_memory+0x44/0xc0
[14808.063967]  [<ffffffff8106d622>] mm_fault_error+0x82/0x160
[14808.063969]  [<ffffffff8106dae9>] __do_page_fault+0x3e9/0x410
[14808.063972]  [<ffffffff8106db32>] do_page_fault+0x22/0x30
[14808.063978]  [<ffffffff81855c58>] page_fault+0x28/0x30
[14808.063986] Task in /docker/34186d9bd07706222bd427bb647ceed81e8e108eb653ff73c7137099fca1cab6 killed as a result of limit of /docker/34186d9bd07706222bd427bb647ceed81e8e108eb653ff73c7137099fca1cab6
[14808.063994] memory: usage 204800kB, limit 204800kB, failcnt 4563
[14808.063995] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[14808.063997] kmem: usage 7524kB, limit 9007199254740988kB, failcnt 0
[14808.063998] Memory cgroup stats for /docker/34186d9bd07706222bd427bb647ceed81e8e108eb653ff73c7137099fca1cab6: cache:108KB rss:197168KB rss_huge:0KB mapped_file:4KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:197168KB inactive_file:88KB active_file:4KB unevictable:0KB
[14808.064008] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[14808.064117] [10517]     0 10517    74852     4784      32       5        0             0 go
[14808.064121] [11344]     0 11344    97590    46185     113       5        0             0 main
[14808.064125] Memory cgroup out of memory: Kill process 11344 (main) score 904 or sacrifice child
[14808.083306] Killed process 11344 (main) total-vm:390360kB, anon-rss:183712kB, file-rss:1016kB

Question: Is the error a regular repeating issue?

Hypothesis: OOM errors are significant enough that they should rarely occur. There is a memory leak in one of the processes.

Prediction: Either the process memory limit has been set too low and there was an uncharacteristic bump, or there is a larger issue.

Test: Upon further inspection there are quite a few OOM errors, suggesting this is a serious issue and not a one-off. Check the system memory for a historical view of memory usage.

System Memory

The next step after identifying a potential problem is to get an idea of system wide memory usage. Memory Leaks frequently display a “sawtooth” pattern. The spikes correspond to the application running while the dips correspond to a service restart.

A sawtooth characterizes a memory leak, especially when it corresponds with service deploys. I’m using a test project to illustrate memory leaks, but even a slow leak would look like a sawtooth if the time range is zoomed out far enough. With a smaller time range it would look like a gradual rise and then a drop-off on process restart.

The graph above shows an example of sawtooth memory growth. Memory continually grows without flatlining. This is a smoking gun for memory issues.

Question: Which process (or processes) is (are) responsible for the memory growth?

Test: Analyze per-process memory. There could be information in the dmesg logs to indicate a process or class of processes that are the targets of the OOM killer.

Per Process Memory

Once a memory leak is suspected, the next step is to identify a process that is contributing to, or causing, the system memory growth. Having per-process historical memory metrics is a crucial requirement (container-based system resources are available through a tool like cAdvisor). Go’s Prometheus client provides per-process memory metrics by default, which is where the graph below gets its data.
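
For reference, here is a minimal sketch of exposing those default per-process metrics with the Prometheus Go client (the github.com/prometheus/client_golang import path and port 2112 are illustrative choices, not taken from the post):

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// The default registry served by promhttp.Handler() already includes the
	// process and Go runtime collectors, e.g. process_resident_memory_bytes
	// and go_memstats_heap_alloc_bytes, which is what graphs like the one
	// below are built from.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":2112", nil)
}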

The below graph shows a process which is very similar to the system sawtooth memory leak graph above: continual growth until process restarts.

Memory is a critical resource: it can be used to indicate abnormal resource usage, or as a dimension for scaling. Additionally, having memory stats helps inform how to set container-based (cgroup) memory limits. The specifics of the graph values above can be found in the metric source code. After the process has been identified, it’s time to dig in and find out which specific part of the code is responsible for this memory growth.


Root Cause Analysis / Source Analysis

Go Memory Analysis

Once again Prometheus gives us detailed information about the Go runtime and what our process is doing. The chart shows that bytes are continually allocated to the heap until a restart. Each dip corresponds with a service process restart.
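
If a Prometheus dashboard isn’t available, roughly the same heap figure can be sampled directly from the runtime. A minimal sketch (my own, not from the post):

package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	// Log the heap size every 30 seconds; a value that climbs between
	// samples and never levels off mirrors the chart described above.
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heap_alloc=%d bytes heap_objects=%d", m.HeapAlloc, m.HeapObjects)
		time.Sleep(30 * time.Second)
	}
}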

Question: Which part(s) of the application is(are) leaking memory?

Hypothesis: There’s a memory leak in a routine which is continually allocating memory to the heap (a global variable or pointer, potentially visible through escape analysis).

Test: Correlate the memory usage with an event.

Correlation With Work

Establishing a correlation will help to partition the problem space by answering: is this happening online (in relation to transactions) or in the background?

One way to determine this could be to start the service and let it idle without applying any transactional load. Is the service leaking? If so it could be the framework or a shared library. Our example happens to have a strong correlation with transactional workload.
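
One way to apply that controlled load is a small client loop. A sketch, assuming the service is listening on localhost:8080; the request path is a placeholder for whichever handler the service actually exposes:

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Issue a steady stream of requests so memory growth can be lined up
	// against the HTTP request count in the graphs that follow.
	for {
		resp, err := http.Get("http://localhost:8080/")
		if err != nil {
			log.Println(err)
		} else {
			resp.Body.Close()
		}
		time.Sleep(10 * time.Millisecond)
	}
}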

HTTP Request Count

The above graph shows the count of HTTP requests. These directly match the system memory growth in time, and establish HTTP request handling as a good place to start digging.

Question: Which parts of the application are responsible for the heap allocations?

Hypothesis: There is an HTTP handler that is continually allocating memory to the heap.

Test: Periodically analyze heap allocations while the program is running in order to track memory growth.

Go Memory Allocations

In order to inspect how much memory is being allocated and the source of those allocations, we’ll use pprof. pprof is an absolutely amazing tool and one of the main reasons that I personally use Go. In order to use it we’ll have to first enable it and then take some snapshots. If you’re already using HTTP, enabling it is literally as easy as:

import _ "net/http/pprof"
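
The blank import above registers the handlers on http.DefaultServeMux. If the service uses its own mux, a sketch like the following (using the exported handlers from net/http/pprof) wires them up explicitly:

package main

import (
	"net/http"
	"net/http/pprof"
)

func main() {
	mux := http.NewServeMux()
	// pprof.Index also serves the named profiles, e.g. /debug/pprof/heap.
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	http.ListenAndServe(":8080", mux)
}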

Once pprof is enabled we’ll periodically take heap snapshots throughout the life of process memory growth. Taking a heap snapshot is just as trivial:

curl http://localhost:8080/debug/pprof/heap > heap.0.pprof
sleep 30
curl http://localhost:8080/debug/pprof/heap > heap.1.pprof
sleep 30
curl http://localhost:8080/debug/pprof/heap > heap.2.pprof
sleep 30
curl http://localhost:8080/debug/pprof/heap > heap.3.pprof

The goal is to get an idea of how memory is growing throughout the life of the program. Let’s inspect the most recent heap snapshot:

$ go tool pprof pprof/heap.3.pprof
Local symbolization failed for main: open /tmp/go-build598947513/b001/exe/main: no such file or directory
Some binary filenames not available. Symbolization may be incomplete.
Try setting PPROF_BINARY_PATH to the search path for local binaries.
File: main
Type: inuse_space
Time: Jul 30, 2018 at 6:11pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) svg
Generating report in profile002.svg
(pprof) top20
Showing nodes accounting for 410.75MB, 99.03% of 414.77MB total
Dropped 10 nodes (cum <= 2.07MB)
     flat  flat%   sum%        cum   cum%
 408.97MB 98.60% 98.60%   408.97MB 98.60%  bytes.Repeat
   1.28MB  0.31% 98.91%   410.25MB 98.91%  main.(*RequestTracker).Track
   0.50MB  0.12% 99.03%   414.26MB 99.88%  net/http.(*conn).serve
        0     0% 99.03%   410.25MB 98.91%  main.main.func1
        0     0% 99.03%   410.25MB 98.91%  net/http.(*ServeMux).ServeHTTP
        0     0% 99.03%   410.25MB 98.91%  net/http.HandlerFunc.ServeHTTP
        0     0% 99.03%   410.25MB 98.91%  net/http.serverHandler.ServeHTTP

This is absolutely amazing. pprof defaults to Type: inuse_space which displays all the objects that are currently in memory at the time of the snapshot. We can see here that bytes.Repeat is directly responsible for 98.60% of all of our memory!!!

The line below the bytes.Repeat entry shows:

1.28MB  0.31% 98.91%   410.25MB 98.91%  main.(*RequestTracker).Track

This is really interesting: it shows that Track itself holds 1.28MB (0.31%) flat but is cumulatively responsible for 98.91% of all in-use memory! Furthermore, we can see that the http machinery holds even less memory directly, yet is responsible for even more than Track (since Track is called from it).

pprof exposes many ways to introspect and visualize memory (in-use memory size, in-use number of objects, allocated memory size, allocated number of objects). It allows listing the Track method and showing how much memory each line is responsible for:

(pprof) list Track
Total: 414.77MB
ROUTINE ======================== main.(*RequestTracker).Track in /vagrant_data/go/src/github.com/dm03514/grokking-go/simple-memory-leak/main.go
   1.28MB   410.25MB (flat, cum) 98.91% of Total
        .          .     19:
        .          .     20:func (rt *RequestTracker) Track(req *http.Request) {
        .          .     21:   rt.mu.Lock()
        .          .     22:   defer rt.mu.Unlock()
        .          .     23:   // alloc 10KB for each track
   1.28MB   410.25MB     24:   rt.requests = append(rt.requests, bytes.Repeat([]byte("a"), 10000))
        .          .     25:}
        .          .     26:
        .          .     27:var (
        .          .     28:   requests RequestTracker
        .          .     29:

This directly pinpoints the culprit:

1.28MB   410.25MB     24:   rt.requests = append(rt.requests, bytes.Repeat([]byte("a"), 10000))

pprof also allows visual generation of the textual information above:

(pprof) svg
Generating report in profile003.svg

This clearly shows the current objects occupying the process memory. Now that we have the culprit Track we can verify that it is allocating memory to a global without ever cleaning it up, and fix the root issue.

Resolution: Memory was being continually allocated to a global variable on each HTTP request without ever being released; that unbounded growth is what should have been inspected.
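
A sketch of one possible shape of the fix, assuming only a per-request count (not the 10KB payload) actually needs to be retained; the count field is hypothetical and this is not the repository’s actual patch:

package tracker

import (
	"net/http"
	"sync"
)

// RequestTracker mirrors the type from the post, but the unbounded
// requests slice is replaced with a bounded counter.
type RequestTracker struct {
	mu    sync.Mutex
	count int
}

func (rt *RequestTracker) Track(req *http.Request) {
	rt.mu.Lock()
	defer rt.mu.Unlock()
	// Record that a request happened without retaining its payload.
	rt.count++
}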

I hope that this post was able to illustrate the power of visually identifying memory leaks and a loose process for progressively narrowing down their source. Finally, I hope that it was able to touch on the power of pprof for Go memory introspection and analysis. As always, I would love any feedback. Thank you for reading.

