Memory Leak Caused by Querying the DB with Go's Standard Library

Problem Origin

Routine monitoring recently showed that a Go API service's memory usage was climbing steadily: it rose from an initial 70MB to over 1GB, until the process ran out of memory and crashed. This clearly pointed to a memory leak. Since memory leaks are rare in languages with built-in garbage collection, when one does appear it is usually a significant problem, so I carried out the detailed investigation below to build up experience for handling similar issues in the future.

Analysis and Investigation

Goroutine Leak

I had heard the famous saying: “Nine out of ten memory leaks in Go are goroutine leaks.” Is that really the case here? Periodically sampling runtime.NumGoroutine() can tell us, or, if pprof is enabled in production, we can visit /debug/pprof and check the goroutine count there. I used expvar to expose the total number of goroutines and collected the production data locally at regular intervals.

// server.go
// Expose statistical data via expvar.Handler to the HTTP service
mux := http.NewServeMux()
mux.Handle("/debug/vars", expvar.Handler())


// main.go
func sampleInterval() chan struct{} {
    done := make(chan struct{})
    go func() {
        ticker := time.NewTicker(1 * time.Second)
        numG := expvar.NewInt("runtime.goroutines")
        defer ticker.Stop()

        for {
            select {
            case <-ticker.C:
                numG.Set(int64(runtime.NumGoroutine()))
            case <-done:
                return
            }
        }
    }()
    return done
}

// Run locally
// curl http://host:port/debug/vars
{
    "runtime.goroutines": 48
}

The done channel is returned so that we can exit gracefully when shutting down the service. After collecting data for a while and charting it, I found the goroutine count was stable, so this was not a goroutine leak. If it had been, we could have inspected the state of each goroutine by visiting /debug/pprof/goroutine?debug=1 for further analysis; either way, by process of elimination, goroutines were clearly not the cause of the issue.
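
Since the service registers its own ServeMux rather than relying on http.DefaultServeMux, the net/http/pprof handlers have to be wired up explicitly for those /debug/pprof endpoints to exist. A minimal sketch, assuming the mux variable from server.go above (the handler functions are the ones exported by the standard net/http/pprof package):

// server.go (additions)
import "net/http/pprof"

// Register the pprof endpoints on the custom mux so that
// /debug/pprof/ and /debug/pprof/goroutine?debug=1 are reachable.
mux.HandleFunc("/debug/pprof/", pprof.Index)
mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
mux.HandleFunc("/debug/pprof/trace", pprof.Trace)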

Long-term Reference Holding

A new question naturally emerges: if the goroutine count rises but always falls back, non-resident goroutines are finishing normally, so we can rule out leaks caused by goroutines blocking and never exiting. But can we conclude that the memory leak is completely unrelated to goroutines? No, because goroutines may still keep some of our objects reachable. Objects that are still referenced during the mark phase obviously cannot be garbage collected, and if something keeps holding them while new allocations keep piling up, that is the second major cause of continuously growing memory.

So the problem shifts to how objects, or groups of objects, are allocated and held. It could be a large slice that keeps appending elements while staying referenced for its owner's whole lifecycle, or an object holding pointers to other objects, like the links of a linked list; if that reference chain keeps growing, it will leak memory just the same. A minimal illustration is sketched below.
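
As a minimal, hypothetical illustration (the cache variable and handler are invented for this example, not taken from the real service): a package-level map that only ever grows keeps every value reachable, so the GC can never reclaim them even though every goroutine exits cleanly.

// A hypothetical in-process cache that is only ever appended to.
var cache = map[string][]byte{} // lives as long as the process

func handleRequest(id string, payload []byte) {
    // Every request parks its payload in the long-lived map.
    // Nothing ever deletes these entries, so the heap grows
    // monotonically even though no goroutine is stuck.
    cache[id] = payload
}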

Using pprof to Sample and Analyze Heap Memory

By obtaining pprof heap profiles, we can directly generate snapshots of the current heap memory allocation online and persist them in the form of .pb.gz compressed packages for further offline analysis.

go tool pprof http://[domain]/debug/pprof/heap

This command fetches a heap snapshot, persists it locally as a .pb.gz file, and drops us into the pprof interactive command line. I captured two snapshots from the production environment at different times for analysis:

File: binary
Type: inuse_space
Time: Feb 27, 2020 at 12:06pm (CST)
(pprof) top
Showing nodes accounting for 44174.92kB, 93.50% of 47247.81kB total
Showing top 10 nodes out of 114
flat flat% sum% cum cum%
35842.64kB 75.86% 75.86% 35842.64kB 75.86% myfuncA
1825.78kB 3.86% 79.73% 1825.78kB 3.86% github.com/mozillazg/go-pinyin.init
1805.17kB 3.82% 83.55% 2354.01kB 4.98% compress/flate.NewWriter
1024.14kB 2.17% 85.71% 2048.25kB 4.34% database/sql.(*DB).prepareDC
1024.08kB 2.17% 87.88% 1024.08kB 2.17% net.(*file).getLineFromData
561.50kB 1.19% 89.07% 561.50kB 1.19% html.populateMaps
548.84kB 1.16% 90.23% 548.84kB 1.16% compress/flate.(*compressor).initDeflate
516.76kB 1.09% 91.32% 516.76kB 1.09% net/http.(*http2Framer).WriteDataPadded
513.31kB 1.09% 92.41% 513.31kB 1.09% regexp.onePassCopy
512.69kB 1.09% 93.50% 512.69kB 1.09% vendor/golang.org/x/net/http2/hpack.(*headerFieldTable).addEntry
File: binary
Type: inuse_space
Time: Feb 27, 2020 at 5:58pm (CST)
(pprof) top
Showing nodes accounting for 80.65MB, 83.62% of 96.45MB total
Showing top 10 nodes out of 145
flat flat% sum% cum cum%
35MB 36.29% 36.29% 35MB 36.29% myfuncA
8.50MB 8.81% 45.10% 11MB 11.41% github.com/lib/pq.parseStatementRowDescribe
7.05MB 7.31% 52.42% 7.05MB 7.31% database/sql.(*DB).addDepLocked
6MB 6.22% 58.64% 6MB 6.22% encoding/json.(*decodeState).literalStore
5MB 5.18% 63.82% 19.50MB 20.22% github.com/lib/pq.(*conn).prepareTo
4.59MB 4.76% 68.58% 4.59MB 4.76% reflect.unsafe_NewArray
4.50MB 4.67% 73.25% 11MB 11.41% queryfuncB
4MB 4.15% 77.40% 18.56MB 19.24% database/sql.(*DB).prepareDC
3.50MB 3.63% 81.03% 3.50MB 3.63% github.com/lib/pq.decideColumnFormats

Using top, we can view the functions holding the most memory, sorted accordingly. Comparing the two snapshots, I found that although myfuncA held a lot of memory, its usage was unchanged between them, while the total grew from 47MB to 96MB, so myfuncA can be ruled out as the source of the leak. Note that not all of the extra 50MB is leaked memory: deserializing DB query results also needs significant allocation, and under the Go runtime's allocation strategy the GC neither reclaims this memory immediately nor quickly returns idle memory to the OS; it lingers for a while in the allocator's multi-level memory pools. So there is still some noise interfering with the analysis.

By the way, regarding reflect.unsafe_NewArray: some people have suspected it of causing memory leaks. In fact, whenever JSON is deserialized into a slice, this function is used to allocate the backing array. Its caller is reflect.MakeSlice, which creates a slice from type information obtained through reflection, essentially a constructor for a generic slice.
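
In other words, something as ordinary as the following will show up under reflect.unsafe_NewArray in a heap profile (a generic illustration; MyRow and data are placeholders, not code from the service):

type MyRow struct{ /* fields matching the JSON */ }

var rows []MyRow
// Decoding a JSON array into a slice allocates the backing array via
// reflect.MakeSlice, which in turn calls reflect.unsafe_NewArray.
if err := json.Unmarshal(data, &rows); err != nil {
    // handle the error
}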

Comparing Differences Between Two Snapshots

Eyeballing the differences between the two snapshots still leaves a lot of noise. It is clearer to use the first snapshot as a baseline and diff the second snapshot against it.

go tool pprof -base pprof.binary.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz pprof.binary.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz
File: binary
Type: inuse_space
Time: Feb 27, 2020 at 12:06pm (CST)
(pprof) top
Showing nodes accounting for 37.65MB, 76.19% of 49.42MB total
Dropped 1 node (cum <= 0.25MB)
Showing top 10 nodes out of 118
flat flat% sum% cum cum%
6.50MB 13.16% 13.16% 9MB 18.21% github.com/lib/pq.parseStatementRowDescribe
5.05MB 10.23% 23.38% 5.05MB 10.23% database/sql.(*DB).addDepLocked
5MB 10.12% 33.50% 5MB 10.12% encoding/json.(*decodeState).literalStore
4.58MB 9.28% 42.78% 4.58MB 9.28% reflect.unsafe_NewArray
4MB 8.10% 50.87% 8MB 16.19% funcB (callByA)
4MB 8.10% 58.97% 16MB 32.38% github.com/lib/pq.(*conn).prepareTo
3MB 6.07% 65.04% 3MB 6.07% github.com/lib/pq.decideColumnFormats
2.50MB 5.06% 70.10% 2.50MB 5.06% github.com/lib/pq.(*readBuf).string
1.51MB 3.05% 73.15% 1.51MB 3.05% github.com/lib/pq.textDecode
1.50MB 3.04% 76.19% 39.10MB 79.12% ctrlfuncA

By specifying the baseline with -base and running top again, we can list precisely what makes up the extra 50MB. Categorizing the listed methods: funcA is the controller function, and funcB is the M-layer method called by A that talks directly to PostgreSQL to fetch data. The rest are standard library functions for deserializing JSON data fields and functions related to database connections. At this point we can already be fairly certain that the memory leak lives somewhere in this call chain.

In the interactive command line, list methodName shows per-line allocations, letting us walk down the call chain step by step. Filtering out the noise from memory allocated for deserialized objects and cross-checking with snapshot 2, the allocations point at the *DB and *conn objects, and the calls lead to the Preparex method provided by sqlx and the Get method of the Stmt that Preparex produces.
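
For reference, the drill-down inside the interactive prompt is just a series of list calls along the chain (the function names below are the ones from the top output above); each one prints the flat/cum allocations attributed to every line of the matching function:

(pprof) list ctrlfuncA
(pprof) list queryfuncB
(pprof) list prepareDC

The allocations converge on the repository method shown next: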

func (r repoXXX) FindByID(id string) (*T, error) {
    stmt, err := r.Preparex(`MY SQL QUERY`) // ???
    if err != nil {
        return nil, err
    }

    t := new(T)
    err = stmt.Get(t, id) // ???
    if err != nil {
        if err == sql.ErrNoRows {
            return nil, errno.TorNotFound
        }
        return nil, err
    }
    return t, nil
}

Finding the Culprit

sqlx is just a simple wrapper library for database/sql. To understand what these two methods actually do, we need to start from database/sql.

DB Creation

func Open(driverName, dataSourceName string) (*DB, error) {
    // Check if the database driver is already registered
    driversMu.RLock()
    driveri, ok := drivers[driverName]
    driversMu.RUnlock()
    // Not registered: report an error
    if !ok {
        return nil, fmt.Errorf("sql: unknown driver %q (forgotten import?)", driverName)
    }

    if driverCtx, ok := driveri.(driver.DriverContext); ok {
        // Pass configuration information for connecting to the database via the DSN/URI
        connector, err := driverCtx.OpenConnector(dataSourceName)
        if err != nil {
            return nil, err
        }
        // Initialize the DB object and return it
        return OpenDB(connector), nil
    }

    // Default connector
    return OpenDB(dsnConnector{dsn: dataSourceName, driver: driveri}), nil
}

OpenDB() is clearly the DB constructor; let's continue:

func OpenDB(c driver.Connector) *DB {
    ctx, cancel := context.WithCancel(context.Background())
    db := &DB{
        connector:    c,
        openerCh:     make(chan struct{}, connectionRequestQueueSize),
        resetterCh:   make(chan *driverConn, 50),
        lastPut:      make(map[*driverConn]string),
        connRequests: make(map[uint64]chan connRequest),
        stop:         cancel,
    }

    go db.connectionOpener(ctx)
    go db.connectionResetter(ctx)

    return db
}

Several key components:

  • openerCh caches requests to create new connections; it has a single purpose and is consumed first-in-first-out. It is defined as a chan struct{}, which is perfect for this use case.
  • resetterCh is used to asynchronously reset a connection's context and hand back the current connection's query execution error.
  • Two goroutines asynchronously carry out the two operations above.
  • lastPut is used for debugging; we'll skip it.
  • connRequests queues requests asynchronously when there are no idle database connections and no new connections can be created. The key is an incrementing counter, and the value is a buffered channel so delivering a connection does not block; the map's length conveniently doubles as a counter.

func (db *DB) putConnDBLocked(dc *driverConn, err error) bool {
    // ...
    // There are requests waiting in the queue
    if c := len(db.connRequests); c > 0 {
        var req chan connRequest
        var reqKey uint64
        for reqKey, req = range db.connRequests {
            break
        }
        delete(db.connRequests, reqKey) // Remove it from the queue
        if err == nil {
            dc.inUse = true
        }
        // Hand this request a connection
        req <- connRequest{
            conn: dc, // *driverConn is the database connection
            err:  err,
        }
        return true
    }
    // ...
}

So far we know that connRequest represents a pending request for a connection, and connRequests is the global queue of such requests. Whenever a connection becomes available, DB tries to hand it to a waiting request, reusing the existing driverConn. Note that these are requests for connections, not query requests.
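
For context, this is roughly what the requesting side inside (*DB).conn looks like when the pool is exhausted (a condensed paraphrase of the standard library, not the verbatim source): the caller parks a channel in connRequests and blocks until putConnDBLocked, shown above, hands it a connection.

// Inside (*DB).conn, once maxOpen connections are already in use (paraphrased):
req := make(chan connRequest, 1)
reqKey := db.nextRequestKeyLocked() // monotonically increasing key
db.connRequests[reqKey] = req
db.mu.Unlock()

select {
case <-ctx.Done():
    // Stop waiting; the pending request is cleaned up elsewhere.
    return nil, ctx.Err()
case ret, ok := <-req:
    if !ok {
        return nil, errDBClosed
    }
    return ret.conn, ret.err
}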

Recording Resource Dependencies

The DB object’s conn method is responsible for obtaining connections. Once we have a connection, we can send our statements to the database. Taking a SELECT statement query as an example:

func (db *DB) query(ctx context.Context, query string, args []interface{}, strategy connReuseStrategy) (*Rows, error) {
    // Get a connection, or return an error on failure
    dc, err := db.conn(ctx, strategy)
    if err != nil {
        return nil, err
    }

    // The familiar stmt and Rows objects are created here, bound to the driverConn for execution, and returned
    return db.queryDC(ctx, nil, dc, dc.releaseConn, query, args)
}

Let's not get too far ahead, since we're not using this method to access the database. But the call above helps clarify how the standard library organizes these underlying objects. Now, back to sqlx's Preparex method, which is where our problem begins. It ultimately calls the PrepareContext method of the standard library's DB:

func (db *DB) PrepareContext(ctx context.Context, query string) (*Stmt, error) {
    var stmt *Stmt
    var err error
    // Set a retry count for failures; a flag controls the connection acquisition strategy (prefer cached connections)
    for i := 0; i < maxBadConnRetries; i++ {
        stmt, err = db.prepare(ctx, query, cachedOrNewConn)
        if err != driver.ErrBadConn {
            break
        }
    }
    if err == driver.ErrBadConn {
        return db.prepare(ctx, query, alwaysNewConn)
    }
    return stmt, err
}
func (db *DB) prepare(ctx context.Context, query string, strategy connReuseStrategy) (*Stmt, error) {
    // Likewise tries to get a connection, or returns an error on failure
    dc, err := db.conn(ctx, strategy)
    if err != nil {
        return nil, err
    }
    return db.prepareDC(ctx, dc, dc.releaseConn, nil, query)
}

The prepare method here is very similar to the query method mentioned earlier. We’ll discuss their differences later. For now, let’s continue following the prepareDC clue.

// cg being nil means the connection pool is enabled! (emphasizing this point)
func (db *DB) prepareDC(ctx context.Context, dc *driverConn, release func(error), cg stmtConnGrabber, query string) (*Stmt, error) {
    var ds *driverStmt
    var err error
    defer func() {
        release(err)
    }()
    withLock(dc, func() {
        ds, err = dc.prepareLocked(ctx, cg, query)
    })
    if err != nil {
        return nil, err
    }
    stmt := &Stmt{
        db:    db,
        query: query,
        cg:    cg,
        cgds:  ds,
    }

    // This part is the real purpose of Preparex: using the connection pool
    if cg == nil {
        // Prepare saves the obtained connection and statement up front
        stmt.css = []connStmt{{dc, ds}}
        // Record the total number of connections closed so far, used when purging closed connections
        stmt.lastNumClosed = atomic.LoadUint64(&db.numClosed)
        // Add a dependency reference (not yet clear what it's for), but stmt is definitely being held
        db.addDep(stmt, stmt)
    }
    return stmt, nil
}

The returned stmt is the product of Preparex; we then call Get on it. Through sqlx's wrapper, Get ultimately calls the standard library's func (s *Stmt) QueryContext and sqlx's func (r *Row) scanAny.

What is certain is that after calling Preparex, the stmt reference is held by db. As long as the db object lives, stmt can only be released if some code explicitly removes that reference. Let's briefly look at QueryContext to see whether there is a removeDep-like operation of the kind we're looking for:

func (s *Stmt) QueryContext(ctx context.Context, args ...interface{}) (*Rows, error) {
    // ...
    rowsi, err = rowsiFromStatement(ctx, dc.ci, ds, args...)
    if err == nil {
        rows := &Rows{
            dc:    dc,
            rowsi: rowsi,
        }
        s.db.addDep(s, rows) // HERE

        rows.releaseConn = func(err error) {
            releaseConn(err)
            s.db.removeDep(s, rows) // HERE
        }
        var txctx context.Context
        if s.cg != nil {
            txctx = s.cg.txCtx()
        }
        rows.initContextClose(ctx, txctx)
        return rows, nil
    }

    releaseConn(err)
    // ...
}

Yes, there is one, but this code only records the reference to rows and recycles the database connection; it never releases the stmt object. Likewise, sqlx's scanAny mainly calls Scan to deserialize the data returned by the database into the fields of the struct passed to Get.

The findings above confirm the culprit of this memory leak: db.dep, a map used for reference tracking to ensure that dependencies are released first. An entry can only be released after everything that depends on it has been released, at which point its finalClose method is called to free the resources it holds.
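
To make the mechanism concrete, here is a toy model of that dependency map (hypothetical and heavily simplified; the real implementation in sql.go holds db.mu and defers the finalClose call until after the lock is released):

// A toy model of database/sql's dependency tracking.
type finalCloser interface{ finalClose() error }

type depTracker struct {
    dep map[finalCloser]map[interface{}]bool
}

// addDep records that dep keeps x alive.
// prepareDC calls addDep(stmt, stmt); QueryContext calls addDep(stmt, rows).
func (t *depTracker) addDep(x finalCloser, dep interface{}) {
    if t.dep == nil {
        t.dep = make(map[finalCloser]map[interface{}]bool)
    }
    if t.dep[x] == nil {
        t.dep[x] = make(map[interface{}]bool)
    }
    t.dep[x][dep] = true
}

// removeDep drops one dependency; only when x's set becomes empty is
// x.finalClose() called, which actually frees the resources x holds.
func (t *depTracker) removeDep(x finalCloser, dep interface{}) error {
    delete(t.dep[x], dep)
    if len(t.dep[x]) == 0 {
        delete(t.dep, x)
        return x.finalClose()
    }
    return nil
}

In our case, prepareDC registered the self-dependency addDep(stmt, stmt), and nothing in the query path ever removed it, so every prepared statement stayed reachable forever.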

Solving the Leak

Therefore, to fix this memory leak we must make dep drop its reference to stmt. A quick search through the sql.go file shows that the release of the stmt object is hidden in its Close method.

func (s *Stmt) Close() error {
    // ...
    if s.cg == nil {
        return s.db.removeDep(s, s) // Actually releases stmt; db no longer holds its reference
    }
    // ...
}

One line of code can solve it:

stmt, _ := r.Preparex(`MY SQL QUERY`)
// This line is essential!!!
defer stmt.Close()
// ...
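
Putting it together, the repository method from earlier would look like this after the fix (a sketch reusing the identifiers from the excerpt above; note the Preparex error is checked before Close is deferred):

func (r repoXXX) FindByID(id string) (*T, error) {
    stmt, err := r.Preparex(`MY SQL QUERY`)
    if err != nil {
        return nil, err
    }
    // Release the prepared statement so db.dep drops its reference
    // and the GC can reclaim it.
    defer stmt.Close()

    t := new(T)
    if err := stmt.Get(t, id); err != nil {
        if err == sql.ErrNoRows {
            return nil, errno.TorNotFound
        }
        return nil, err
    }
    return t, nil
}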

After redeployment and sampling the memory usage, the results are shown in the following figure:

[Figure: Before and After Optimization Comparison]

GC can now properly reclaim the stmt object. Problem solved!

Conclusion

Lessons learned:

  • Early stage: Perform load tests, solve issues offline if possible; make good use of pprof; write good documentation to avoid others falling into the same trap
  • Production: Ensure monitoring and alerting are in place, memory usage exceeding a certain threshold must be taken seriously, make good use of http/pprof
  • Regular practices: Become more familiar with frequently used packages and standard libraries, especially the standard library, as it hides many details that can easily lead to issues if overlooked

Other Potential Leaks (Bonus)

Honestly, this is not the first memory leak I've run into with Go, though Go genuinely doesn't suffer from them often. Reflecting on the cases I've seen, goroutine-caused leaks can usually be found quickly through pprof or monitoring; it's the long-held references that are easy to overlook. The warning signs often show up in:

  • Manually maintained object pools
  • Manually controlled system resources, such as timers, connections, and IO buffer streams (see the short sketch after this list)
  • Numeric limits with no default upper bound, such as db.maxOpen
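
Two hedged examples of those last patterns (illustrative only, not taken from the service in question): always release manually controlled resources such as tickers, and give the connection pool explicit bounds, since database/sql does not cap open connections by default.

// Timers: a ticker's underlying timer keeps firing (and stays reachable)
// until Stop is called.
ticker := time.NewTicker(time.Second)
defer ticker.Stop()

// Connection pool: set explicit limits instead of relying on the defaults
// (MaxOpenConns is unlimited by default, and connections have no max lifetime).
db.SetMaxOpenConns(50)
db.SetMaxIdleConns(10)
db.SetConnMaxLifetime(30 * time.Minute)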