Best Practices for Wrapping CPP Library with CGO

Background

Recently, our business needed to reuse a client SDK library written in CPP. To enable our team’s primary language, Golang, to seamlessly integrate with this SDK, we used CGO bridging technology to wrap the C++11 SDK library into a production-ready Golang SDK. After reviewing most of the online resources about CGO in both Chinese and English, I found there are many important details to consider when implementing Go calls to CPP libraries. Most Chinese materials only briefly mention simple examples of wrapping C++ STL library functions. Building on previous work, this article summarizes a series of best practices for wrapping complex CPP libraries, aiming to fill the gap in related documentation.

Unless Necessary, Avoid Using CGO

I list this as the first point as a warning, because many CGO beginners fall into the misconception that replacing regular Go function calls with CGO calls will significantly improve single-core CPU performance. This actually needs case-by-case analysis. According to the source code of cgocall in the standard library, each CGO call essentially involves a function call stack switch (the parameters, PC, and SP need to be set up for the external C program), which adds overhead compared to a native Go call. For a simple computational function, a single CGO call can be hundreds of times slower than the equivalent native call.

// Call from Go to C.
func cgocall(fn, arg unsafe.Pointer) int32 {
	// ...
	// Reset traceback.
	mp.cgoCallers[0] = 0

	// Announce that we are entering a system call, so the scheduler can hand
	// the P off to another M (or spin one up) if this call blocks
	entersyscall()

	osPreemptExtEnter(mp)

	mp.incgo = true
	errno := asmcgocall(fn, arg)

	// ...
	osPreemptExtExit(mp)

	exitsyscall()

	// ...

	// C may call back into Go; keep these objects alive so the GC does not
	// collect them before the call completes
	KeepAlive(fn)
	KeepAlive(arg)
	KeepAlive(mp)

	return errno
}

So how do we determine when to use CGO? In my experience, it comes down to return on investment. If CGO lets you heavily reuse existing C code and reduce future development and maintenance costs, or if the performance gain on CPU-intensive functions clearly outweighs CGO's side effects, then CGO is a good choice. Note that if a third-party library has already reached around 80-90% of the C library's performance using pure Go and Plan 9 assembly, a CGO-bridged library is not recommended, especially when users will make many concurrent calls to the library from Go goroutines. The scheduling scalability of goroutines multiplexed onto reusable system threads in user space is usually worth far more than a slightly faster CGO call that drags in thread-level switching.

Follow the Minimalist Principle for C ABI Interface Encapsulation

If we need to manually wrap a CPP library, the actual call chain is Go runtime -> cgocall -> C ABI -> CPP func. ABI is the Application Binary Interface - as long as we follow the specifications and conventions of this binary bridging interface, we can successfully implement external language calls to C interfaces. Therefore, before implementing the bridging interface, it’s recommended to first define structures and methods that conform to Go language semantics, then map Go objects to corresponding C structures or CPP objects, and map Go methods to corresponding C functions and CPP methods. This helps determine the C library header files for bridging. After determining the function signatures for the intermediate bridging layer, then implement the specific logic of the header file using C++ language. Finally, compile and link the CPP implementation and C header files together as a whole, which is the underlying logic for Go to call CPP.

Don’t Use SWIG to Generate C Bridging Code

To avoid manually maintaining C interfaces, some may consider using SWIG to automatically generate stub functions for CPP libraries, which is a common approach for Python calling C++ shared libraries. We actually had similar thoughts when preparing to write bridging code initially, but eventually abandoned this approach. One important reason is that SWIG’s Go generator is not well-maintained in the community and doesn’t support many C++11 syntax features, such as smart pointers. Moreover, the large amount of generated bridging code is unreadable, which increases the cognitive burden for later maintenance and troubleshooting, and may also risk memory leaks. Therefore, in a production environment, it is still recommended to manually encapsulate.

Batching Your Calls

Since each CGO call has a certain additional overhead, this overhead has a linear relationship with the number of calls. A natural optimization approach is to merge multiple CGO calls into one, thereby reducing the additional overhead of CGO calls while processing the same amount of content. When N calls are reduced to 1, the additional overhead of the call itself in a certain period is approximately reduced to 1/N of the original, which can effectively reduce the overall call latency during frequent calls. For example, the CPP calls I encapsulated involve a lot of large string processing, parsing, compression, and serialization operations. Therefore, processing an 8MB large string slice once usually has a lower overall processing latency than processing 1MB eight times.

Reduce Unnecessary Data Copying

CGO provides some common helper functions by default to convert Go’s built-in types to corresponding data types in C. When ensuring that the input parameters are read-only, we can often do some zero-copy tricks.

For example, to pass in a read-only byte slice without data copying:

b := buf.Bytes()
rc := C.the_function((*C.char)(unsafe.Pointer(&b[0])), C.int(buf.Len()))

This is effectively equivalent to:

str := buf.String()
p := C.CString(str)
defer C.free(unsafe.Pointer(p))
rc = C.the_function(p, C.int(len(str)))

The first approach obviously avoids the additional overhead of copying when dealing with large slices.

Be Careful of Memory Leaks

Copy-passing parameters across languages is generally a safe approach. However, when mixing garbage-collected Go language with manually managed C memory and RAII-based CPP, it’s essential to prevent heap memory leaks in C or CPP layers. Go calling C interfaces generally follows the principle that whoever allocates is responsible for freeing.

For example, if Go allocates heap memory for a C char* and copies the contents of a string variable to the memory space corresponding to the C char*, then the Go layer needs to actively free the memory.

package main

/*
#include <stdlib.h>

extern char* my_c_func(char* s);
*/
import "C"

import (
	"errors"
	"unsafe"
)

func HelloWorld() error {
	cs := C.CString("Hello World!")
	defer C.free(unsafe.Pointer(cs)) // the Go layer explicitly calls C's free to release the memory

	// cs can now be used safely in CGO calls;
	// the bridging layer doesn't need to worry about freeing parameters
	err := C.my_c_func(cs)
	if err != nil {
		// a non-NULL C string return value is freed by the receiver
		defer C.free(unsafe.Pointer(err))
		return errors.New(C.GoString(err))
	}
	return nil
}

By extension, when passing parameters down through successive layers, ensure that the receiver never needs to worry about freeing the input parameters' memory, thus avoiding the missed or double frees that could crash the entire program with a core dump.

Note that when returning results upward, it’s the opposite: the caller who returns is responsible for allocating memory, while the receiver is responsible for actively freeing memory.

Handle Smart Pointers Correctly

In many CPP libraries, especially those written in C++11, smart pointer features are widely used. Many third-party libraries habitually wrap newly created object pointers with shared_ptr for automatic reference counting, so that objects can be correctly destroyed by calling the destructor and freeing the underlying heap memory when they leave their scope. However, in CGO, the reference counting of smart pointers is almost useless because the CPP layer cannot sense the references in the Go layer, let alone correctly count and automatically destroy objects. Therefore, if you want to correctly release object resources when a Go structure holds a CPP object, the best approach is to implement smart pointer reference counting in the Go layer, and finally have the Go layer actively release the object.

If CPP objects and Go objects have a one-to-one mapping, we can elegantly implement CPP object reference counting that’s transparent to the application layer. Since the number of references to a Go object holding a CPP object’s raw pointer equals the number of references to the CPP object, runtime.SetFinalizer comes in perfectly handy. It serves as a callback hook when a Go object is garbage collected, allowing us to easily ensure that the underlying CPP object is safely released when the Go object is released.

package myClient

/*
#include <stdlib.h>

// Declarations from wrapper.h; the header itself wraps them in
// #ifdef __cplusplus ... extern "C" { ... } for the C++ side.
void* client_create(const char* a); // calls the CPP constructor and returns the object pointer to the upper layer
void client_release(void* obj);     // calls delete in the CPP layer to free the object
*/
import "C"
import (
	"runtime"
	"unsafe"
)

type Client struct {
	_ptr unsafe.Pointer
}

func New(a string) *Client {
	_a := C.CString(a)
	defer C.free(unsafe.Pointer(_a))

	_ptr := C.client_create(_a)

	cli := &Client{
		_ptr: _ptr,
	}

	runtime.SetFinalizer(cli, func(c *Client) {
		// ensure the CPP object is destroyed when the Client object is garbage collected
		c.free()
	})

	return cli
}

func (c *Client) free() {
	C.client_release(c._ptr)
}

Note that C language doesn’t have concepts like classes and objects, so to ensure ABI compatibility when bridging, we need to use void* universal pointers to define header files. Only during implementation in .cpp can we cast it to the corresponding C++ object pointer and use it further.

Use Exceptions to Improve Code Robustness

Since the underlying implementation uses C++ language, we can wrap try-catch statements during calls to catch exceptions and handle them. If they can’t be handled, we can return error information to the upper-level caller.

func (c *Client) DoSomething() error {
	err := C.do_something(c._ptr)
	if err != nil {
		defer C.free(unsafe.Pointer(err))
		return errors.New(C.GoString(err))
	}

	return nil
}
The corresponding C++ implementation of the bridging function:
#include <string.h>
#include <exception>

#define CAST_T(_T) reinterpret_cast<_T>

const char* do_something(void* obj) {
	try {
		CAST_T(CppClient*)(obj)->Do();
	}
	catch (MyException& e) {
		// handle the failure
		// ...
		return NULL;
	}
	catch (const std::exception& e) {
		return strdup(e.what());
	}

	return NULL;
}

CGO Compilation Engineering

Our CPP SDK is usually maintained in another codebase, so the static and dynamic libraries that CGO compilation depends on are maintained in a separate branch in the CPP library. This ensures that upgrades to the main library don’t affect the maintenance of the CGO bridging CPP code while also making it easy to get updates to the main library.

The approximate directory structure of the Go SDK package is as follows:

➜  myclient tree
.
├── lib
├── client.go
├── client_mock.go
└── wrapper.h

For example, if the encapsulated Go SDK package is named myclient, the objects related to the Go layer and the encapsulation of CGO calls would be implemented in the client.go file. The C ABI header file definition used for bridging would generally be placed in wrapper.h, and the lib directory would be used to store the dependencies needed to compile and link the wrapper.h header file, including static libraries and related dynamic libraries. The dependencies are obtained from a specific branch compiled from the CPP library and copied to the directory in advance. The entire project directory would look clearer and more natural. We choose to maintain the CPP implementation of wrapper.h in a branch of the original CPP library, which is equivalent to completely decoupling the service caller and service provider through interfaces defined in C.

The client_mock.go is used for conditional compilation to adapt to local Mac environment for compilation and debugging, because introducing CGO may break the convenience of cross-platform compilation and debugging. Using conditional compilation can, to some extent, solve the problem of platform barriers causing the entire binary program to fail to compile.

Debugging and pprof Optimization

This might be the point most complained about by CGO users. Currently there isn't a particularly good way to debug CGO programs. For CPU profiling, flame graphs generated with perf are still usable for analysis; when the C side is compiled with GCC at a low optimization level, the call stack information is preserved.

If there’s a memory leak in a CGO call, things get more complicated. I haven’t found a good method for debugging or memory analysis of CGO programs, and welcome suggestions from everyone.

Real-World Pitfalls

When running the wrapped Go SDK in a production environment, we found an abnormal CPU usage maxing out across all cores. Through Go’s CPU pprof, we finally pinpointed the issue to a single computationally intensive CGO call, with approximately hundreds of goroutines concurrently calling it from the upper layer.

go func() {
	C.heavy_process_func()
}()

However, when we actually analyzed the process CPU with perf top, we found that the operation consuming the most CPU time was, surprisingly, the kernel's _raw_spin_lock. Nowhere in heavy_process_func did our business logic use a spin lock, and the calling goroutines shared no state with one another. So where did this lock contention come from?

Finally, by actively outputting all exception logs from the CPP layer, the truth surfaced: when C++ throws a large number of exceptions concurrently in multiple threads, stack unwinding grabs a global lock in glibc. Therefore, the problem was finally resolved after properly containing the exceptions. In newer versions of glibc, this issue of grabbing a global lock when throwing exceptions in multiple threads has been fixed. For details, see the corresponding issue and patch.

Bug - Concurrently throwing exceptions is not scalable

Commercial reproduction requires authorization from the author; for non-commercial reproduction, please indicate the source. Thank you for your cooperation!

Contact: [email protected]

Author: 马克鱼
Posted on 2022-03-27 · Updated on 2025-10-12