Virtual memory leak

I run TensorFlow on Linux (Ubuntu 20). TensorFlow calls my C++ functions for graph compilation and destruction.

The virtual memory consumption of the process grows until it runs out of memory (>40GB) and the process is killed.

I track malloc/free and mmap/munmap with an LD_PRELOAD hook and compare the totals with the process virtual memory size from /proc/self/status (VmSize). Each graph compilation increases both the malloc-allocated size and the process virtual memory by almost the same amount.
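A minimal sketch of the /proc/self/status side of this comparison, assuming Linux. The helper name read_status_kb is made up for illustration; it is not the code used for the measurements in this post.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the value (in kB) of a field such as
// "VmSize" or "VmRSS" from /proc/self/status, or -1 if it is not found.
long read_status_kb(const std::string& field) {
    assert(!field.empty());
    std::ifstream status("/proc/self/status");
    const std::string prefix = field + ":";
    std::string line;
    while (std::getline(status, line)) {
        // Lines look like "VmSize:\t  123456 kB".
        if (line.compare(0, prefix.size(), prefix) == 0) {
            std::istringstream value(line.substr(prefix.size()));
            long kb = -1;
            if (value >> kb) return kb;
            return -1;
        }
    }
    return -1;
}
```

Calling read_status_kb("VmSize") before and after each compile/destroy step gives the [process] column; read_status_kb("VmRSS") gives the resident size.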

Graph destruction decreases the malloc-allocated size but not the process virtual memory.

So although the malloc-allocated memory stays stable overall, the process virtual memory grows fast.

e.g.:

before compile: 41MB[mmap]/3320MB[malloc]/12428MB[process]
after  compile: 46MB[mmap]/7434MB[malloc]/16529MB[process]
before destroy: 46MB[mmap]/7436MB[malloc]/16593MB[process]
after  destroy: 46MB[mmap]/3250MB[malloc]/16593MB[process]

graphDestroy does not destroy everything by design, so a small leftover is expected.

I tried tuning mallopt(M_MMAP_THRESHOLD), with no result.
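For reference, a sketch of what such a call can look like with glibc. The wrapper name set_mmap_threshold and the 64KB value in the usage note are illustrative assumptions, not the settings that were actually tried.

```cpp
#include <cassert>
#include <malloc.h>

// Hypothetical wrapper: requests of at least `bytes` are then served by
// mmap instead of the heap. Returns true if glibc accepted the value.
bool set_mmap_threshold(int bytes) {
    assert(bytes >= 0);
    // mallopt returns 1 on success and 0 on error.
    return mallopt(M_MMAP_THRESHOLD, bytes) == 1;
}
```

For example, set_mmap_threshold(64 * 1024) early in main. Note that setting this parameter explicitly also turns off glibc's dynamic adjustment of the mmap threshold.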

What else can I do to find the leak?

UPDATE:

I am adding the steps I tried and the approach that worked; maybe it will be useful to someone.

The functions themselves are tested with sanitizers in unit tests. Valgrind crashes the app before the main training loop starts, so this direction was a dead end.

I wanted to collect memory stats from glibc. Unfortunately mallinfo is useless (its counters are plain int and wrap at these sizes), mallinfo2 is not available on Ubuntu 20 (it was added in glibc 2.33) and malloc_info prints too much. So I tried jemalloc and its malloc_stats_print function for stats.
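If the volume of malloc_info output is the main obstacle, one option is to capture the XML in memory and pull out a single number. The sketch below assumes glibc; the function names are made up, and it relies on the assumption that the last <system type="current" .../> element in the report is the process-wide total, which is how the glibc versions I have looked at lay out the report.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <malloc.h>
#include <string>

// Captures the XML report written by glibc's malloc_info() into a string.
// Returns an empty string on failure.
std::string capture_malloc_info() {
    char* buf = nullptr;
    size_t len = 0;
    FILE* stream = open_memstream(&buf, &len);
    if (stream == nullptr) return std::string();
    const int rc = malloc_info(0, stream);
    std::fclose(stream);  // finalizes buf and len
    std::string xml;
    if (rc == 0 && buf != nullptr) xml.assign(buf, len);
    std::free(buf);
    return xml;
}

// Assumption: the last <system type="current" size="N"/> element is the
// summary over all arenas. Returns N in bytes, or -1 if it is not present.
long long total_system_bytes(const std::string& xml) {
    const std::string key = "<system type=\"current\" size=\"";
    const std::size_t pos = xml.rfind(key);
    if (pos == std::string::npos) return -1;
    return std::atoll(xml.c_str() + pos + key.size());
}
```

Logging total_system_bytes(capture_malloc_info()) next to VmSize after each graph destruction shows how much memory malloc still holds from the kernel, as opposed to how much the application has allocated.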

The stats looked OK, but the app's behavior changed: virtual memory still grew (up to 75GB), but resident memory stayed stable (~20GB) and the app ran with no memory issues.

Then I ran my app without jemalloc but with a periodic call to malloc_trim(0), and it behaved the same way as with jemalloc (virtual memory grew but resident memory stayed stable).
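A sketch of one way to make the malloc_trim(0) call periodic, assuming glibc. The PeriodicTrimmer class and the choice of counting iterations (rather than using a timer) are illustrative assumptions, not the exact code from the app. As far as I understand, malloc_trim hands free heap pages back to the kernel, which lowers resident memory, while the address range itself mostly stays mapped; that would fit VmSize not shrinking.

```cpp
#include <cassert>
#include <cstddef>
#include <malloc.h>

// Hypothetical helper: calls malloc_trim(0) once every `interval` ticks,
// e.g. once every few graph destructions or training steps.
class PeriodicTrimmer {
public:
    explicit PeriodicTrimmer(std::size_t interval) : interval_(interval) {
        assert(interval > 0);
    }

    // Call once per iteration. Returns true if malloc_trim ran on this tick.
    bool tick() {
        if (++count_ < interval_) return false;
        count_ = 0;
        // malloc_trim returns 1 if some memory was released, 0 otherwise.
        last_released_ = (malloc_trim(0) == 1);
        ++trims_;
        return true;
    }

    std::size_t trims() const { return trims_; }
    bool last_released() const { return last_released_; }

private:
    std::size_t interval_;
    std::size_t count_ = 0;
    std::size_t trims_ = 0;
    bool last_released_ = false;
};
```

A PeriodicTrimmer trimmer(100) created before the training loop, with trimmer.tick() called at the end of each iteration, trims once per 100 iterations. The interval is a trade-off: malloc_trim walks the heap, so calling it on every iteration may cost noticeable time.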

Conclusion: sometimes malloc_trim can fix an issue that looks like a leak.

a good article about it


