SRE: Debugging Simple Memory Leaks in Go

Memory leaks are a class of bugs where memory is not released even after it is no longer needed. They are often explicit and highly visible, which makes them a great candidate for learning debugging. Go is particularly well suited to identifying memory leaks because of its powerful toolchain, which ships with amazingly capable tooling (pprof) that makes pinpointing memory usage easy.

I’m hoping this post will illustrate how to visually identify memory leaks, narrow a leak down to a specific process, correlate the process leak with work, and finally find the source of the leak using pprof. This post is contrived in order to allow for simple identification of the root cause. The pprof overview is intentionally brief: it aims to illustrate what the tool is capable of and isn’t an exhaustive tour of its features.

The service used to generate the data in this post is available here.

What is a memory leak?

If memory grows unbounded and never reaches a steady state then there is probably a leak. The key here is that memory grows without ever reaching a steady state, and eventually causes problems through explicit crashes or by impacting system performance.

Memory leaks can happen for any number of reasons: logical leaks where data structures grow unbounded, leaks caused by poor object-reference handling, and many others. Regardless of the source, many memory leaks elicit a visually noticeable pattern: the “sawtooth”.

Debug Process

This blog post is focused on identifying and pinpointing the root cause of a Go memory leak: the characteristics of memory leaks, how to recognize them, and how to determine their root cause using Go’s tooling. Because of this, our actual debug process will be relatively superficial and informal.

The goal of our analysis is to progressively narrow the scope of the problem by whittling away possibilities until we have enough information to propose a hypothesis. Once we have enough data and a reasonable scope for the cause, we should form a hypothesis and try to invalidate it with data.

Each step will try to either pinpoint a cause of the issue or invalidate a non-cause. Along the way we’ll be forming a series of hypotheses, necessarily general at first and then progressively more specific. This is loosely based on the scientific method. Brendan Gregg does an amazing job of covering different methodologies for system investigation (primarily focused on performance).

Just to reiterate we’ll try to:

  • Ask a question
  • Form a hypothesis
  • Analyze the hypothesis
  • Repeat until the root cause is found


How do we even know if there is a problem (i.e. a memory leak)? Explicit errors are direct indicators of an issue. For memory leaks, the common indicators are OOM errors or outright system crashes.

OOM errors

Errors are the most explicit indicator of a problem. User-generated errors have the potential to produce false positives if the logic is off, but an OOM is the OS itself indicating that something is using too much memory. In the error below this manifests as the cgroup limit being reached and the container being killed.


[14808.063890] main invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=0
[14808.063893] main cpuset=34186d9bd07706222bd427bb647ceed81e8e108eb653ff73c7137099fca1cab6 mems_allowed=0
[14808.063899] CPU: 2 PID: 11345 Comm: main Not tainted 4.4.0-130-generic #156-Ubuntu
[14808.063901] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[14808.063902]  0000000000000286 ac45344c9134371f ffff8800b8727c88 ffffffff81401c43
[14808.063906]  ffff8800b8727d68 ffff8800b87a5400 ffff8800b8727cf8 ffffffff81211a1e
[14808.063908]  ffffffff81cdd014 ffff88006a355c00 ffffffff81e6c1e0 0000000000000206
[14808.063911] Call Trace:
[14808.063917]  [<ffffffff81401c43>] dump_stack+0x63/0x90
[14808.063928]  [<ffffffff81211a1e>] dump_header+0x5a/0x1c5
[14808.063932]  [<ffffffff81197dd2>] oom_kill_process+0x202/0x3c0
[14808.063936]  [<ffffffff81205514>] ? mem_cgroup_iter+0x204/0x3a0
[14808.063938]  [<ffffffff81207583>] mem_cgroup_out_of_memory+0x2b3/0x300
[14808.063941]  [<ffffffff8120836d>] mem_cgroup_oom_synchronize+0x33d/0x350
[14808.063944]  [<ffffffff812033c0>] ? kzalloc_node.constprop.49+0x20/0x20
[14808.063947]  [<ffffffff81198484>] pagefault_out_of_memory+0x44/0xc0
[14808.063967]  [<ffffffff8106d622>] mm_fault_error+0x82/0x160
[14808.063969]  [<ffffffff8106dae9>] __do_page_fault+0x3e9/0x410
[14808.063972]  [<ffffffff8106db32>] do_page_fault+0x22/0x30
[14808.063978]  [<ffffffff81855c58>] page_fault+0x28/0x30
[14808.063986] Task in /docker/34186d9bd07706222bd427bb647ceed81e8e108eb653ff73c7137099fca1cab6 killed as a result of limit of /docker/34186d9bd07706222bd427bb647ceed81e8e108eb653ff73c7137099fca1cab6
[14808.063994] memory: usage 204800kB, limit 204800kB, failcnt 4563
[14808.063995] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[14808.063997] kmem: usage 7524kB, limit 9007199254740988kB, failcnt 0
[14808.063998] Memory cgroup stats for /docker/34186d9bd07706222bd427bb647ceed81e8e108eb653ff73c7137099fca1cab6: cache:108KB rss:197168KB rss_huge:0KB mapped_file:4KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:197168KB inactive_file:88KB active_file:4KB unevictable:0KB
[14808.064008] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[14808.064117] [10517]     0 10517    74852     4784      32       5        0             0 go
[14808.064121] [11344]     0 11344    97590    46185     113       5        0             0 main
[14808.064125] Memory cgroup out of memory: Kill process 11344 (main) score 904 or sacrifice child
[14808.083306] Killed process 11344 (main) total-vm:390360kB, anon-rss:183712kB, file-rss:1016kB

Question: Is the error a regular repeating issue?

Hypothesis: OOM errors are significant enough that they should rarely occur. There is a memory leak in one of the processes.

Prediction: Either the process memory limit has been set too low and there was an uncharacteristic bump, or there is a larger issue.

Test: Upon further inspection there are quite a few OOM errors, suggesting this is a serious issue and not a one-off. Check the system memory for a historical view into memory usage.

System Memory

The next step after identifying a potential problem is to get an idea of system-wide memory usage. Memory leaks frequently display a “sawtooth” pattern. The spikes correspond to the application running while the dips correspond to a service restart.

A sawtooth characterizes a memory leak, especially one corresponding with service deploys. I’m using a test project to illustrate memory leaks, but even a slow leak would look like a sawtooth if the time range is zoomed out far enough. With a smaller time range it would look like a gradual rise followed by a drop-off on process restart.

The graph above shows the example of a sawtooth memory growth. Memory continually grows without flatlining. This is a smoking gun for memory issues.

Question: Which process (or processes) is (are) responsible for the memory growth?

Test: Analyze per-process memory. There could be information in the dmesg logs to indicate a process, or class of processes, that are the targets of the OOM killer.

Per Process Memory

Once a memory leak is suspected, the next step is to identify the process that is contributing to, or causing, the system memory growth. Having per-process historical memory metrics is a crucial requirement (container-based system resources are available through a tool like cAdvisor). Go’s Prometheus client provides per-process memory metrics by default, which is where the graph below gets its data.

The below graph shows a process which is very similar to the system sawtooth memory leak graph above: continual growth until process restarts.

Memory is a critical resource: it can be used to indicate abnormal resource usage or as a dimension for scaling. Additionally, having memory stats helps inform how to set container-based (cgroups) memory limits. The specifics of the graph values above can be found in the metric source code. After the process has been identified, it’s time to dig in and find out which specific part of the code is responsible for this memory growth.


Root Cause Analysis / Source Analysis

Go Memory Analysis

Once again Prometheus gives us detailed insight into the Go runtime and what our process is doing. The chart shows that bytes are continually allocated to the heap until a restart. Each dip corresponds with the service process restarting.

Question: Which part(s) of the application is(are) leaking memory?


Hypothesis: There’s a memory leak in a routine which is continually allocating memory to the heap (global variable or pointer, potentially visible through escape analysis)

Test: Correlate the memory usage with an event.

Correlation With Work

Establishing a correlation will help to partition the problem space by answering: is this happening online (in relation to transactions) or in the background?

One way to determine this could be to start the service and let it idle without applying any transactional load. Is the service leaking? If so it could be the framework or a shared library. Our example happens to have a strong correlation with transactional workload.

HTTP Request Count

The above graph shows the count of HTTP requests. These directly match the system memory growth over time, establishing HTTP request handling as a good place to start digging.

Question: Which parts of the application are responsible for the heap allocations?

Hypothesis: There is an HTTP handler that is continually allocating memory to the heap.

Test: Periodically analyze heap allocations while the program runs in order to track memory growth.

Go Memory Allocations

In order to inspect how much memory is being allocated, and the source of those allocations, we’ll use pprof. pprof is an absolutely amazing tool and one of the main reasons I personally use Go. To use it we first have to enable it and then take some snapshots. If you’re already running an HTTP server, enabling it is literally as easy as:

import _ "net/http/pprof"

Once pprof is enabled we’ll periodically take heap snapshots throughout the life of process memory growth. Taking a heap snapshot is just as trivial:

curl http://localhost:8080/debug/pprof/heap > heap.0.pprof
sleep 30
curl http://localhost:8080/debug/pprof/heap > heap.1.pprof
sleep 30
curl http://localhost:8080/debug/pprof/heap > heap.2.pprof
sleep 30
curl http://localhost:8080/debug/pprof/heap > heap.3.pprof

The goal is to get an idea of how memory is growing throughout the life of the program. Let’s inspect the most recent heap snapshot:

$ go tool pprof pprof/heap.3.pprof
Local symbolization failed for main: open /tmp/go-build598947513/b001/exe/main: no such file or directory
Some binary filenames not available. Symbolization may be incomplete.
Try setting PPROF_BINARY_PATH to the search path for local binaries.
File: main
Type: inuse_space
Time: Jul 30, 2018 at 6:11pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) svg
Generating report in profile002.svg
(pprof) top20
Showing nodes accounting for 410.75MB, 99.03% of 414.77MB total
Dropped 10 nodes (cum <= 2.07MB)
     flat  flat%   sum%        cum   cum%
 408.97MB 98.60% 98.60%   408.97MB 98.60%  bytes.Repeat
   1.28MB  0.31% 98.91%   410.25MB 98.91%  main.(*RequestTracker).Track
   0.50MB  0.12% 99.03%   414.26MB 99.88%  net/http.(*conn).serve
        0     0% 99.03%   410.25MB 98.91%  main.main.func1
        0     0% 99.03%   410.25MB 98.91%  net/http.(*ServeMux).ServeHTTP
        0     0% 99.03%   410.25MB 98.91%  net/http.HandlerFunc.ServeHTTP
        0     0% 99.03%   410.25MB 98.91%  net/http.serverHandler.ServeHTTP

This is absolutely amazing. pprof defaults to Type: inuse_space, which displays all the objects that are currently in memory at the time of the snapshot. We can see here that bytes.Repeat is directly responsible for 98.60% of all of our memory!

The line below the bytes.Repeat entry shows:

1.28MB  0.31% 98.91%   410.25MB 98.91%  main.(*RequestTracker).Track

This is really interesting: it shows that Track itself holds only 1.28MB (0.31% flat) but is cumulatively responsible for 98.91% of all in-use memory! Furthermore, we can see that the http machinery holds even less memory directly but is responsible for even more than Track, since Track is called from it.

pprof exposes many ways to introspect and visualize memory (in-use memory size, in-use object count, allocated memory size, allocated object count). It can list the Track method and show how much memory each line is responsible for:

(pprof) list Track
Total: 414.77MB
ROUTINE ======================== main.(*RequestTracker).Track in /vagrant_data/go/src/github.com/dm03514/grokking-go/simple-memory-leak/main.go
   1.28MB   410.25MB (flat, cum) 98.91% of Total
        .          .     19:
        .          .     20:func (rt *RequestTracker) Track(req *http.Request) {
        .          .     21:   rt.mu.Lock()
        .          .     22:   defer rt.mu.Unlock()
        .          .     23:   // alloc 10KB for each track
   1.28MB   410.25MB     24:   rt.requests = append(rt.requests, bytes.Repeat([]byte("a"), 10000))
        .          .     25:}
        .          .     26:
        .          .     27:var (
        .          .     28:   requests RequestTracker
        .          .     29:

This directly pinpoints the culprit:

1.28MB   410.25MB     24:   rt.requests = append(rt.requests, bytes.Repeat([]byte("a"), 10000))

pprof also allows visual generation of the textual information above:

(pprof) svg
Generating report in profile003.svg

This clearly shows the current objects occupying the process memory. Now that we have the culprit Track we can verify that it is allocating memory to a global without ever cleaning it up, and fix the root issue.

Resolution: Memory was being continually allocated to a global variable on each HTTP request, and nothing ever released or trimmed it, so the heap grew without bound.

I hope that this post was able to illustrate the power of visually identifying memory leaks and a loose process for progressively narrowing down its source. Finally, I hope that it was able to touch on the power of pprof for go memory introspection and analysis. As always I would love any feedback. Thank you for reading.


Originally published on the WeChat public account Golang语言社区 (Golangweb).
