Simplify backtracing to not ignore any frames, and compensate for this
in pprof in order to increase flexibility with respect to function-based
refactoring even in the presence of non-deterministic inlining. Modify
pprof to blacklist all jemalloc allocation entry points including
non-standard ones like mallocx(), and ignore all allocator-internal
frames. Prior to this change, pprof excluded the specifically
blacklisted functions from backtraces, but it left allocator-internal
frames intact.
Forcefully disable tcache if running inside Valgrind, and remove
Valgrind calls in tcache-specific code.
Restructure Valgrind-related code to move most Valgrind calls out of the
fast path functions.
Take advantage of static knowledge to elide some branches in
JEMALLOC_VALGRIND_REALLOC().
Make dss non-optional on all platforms which support sbrk(2).
Fix the "arena.<i>.dss" mallctl to return an error if "primary" or
"secondary" precedence is specified, but sbrk(2) is not supported.
Make promotion of sampled small objects to large objects mandatory, so
that profiling metadata can always be stored in the chunk map, rather
than requiring one pointer per small region in each small-region page
run. In practice the non-prof-promote code was only useful when using
jemalloc to track all objects and report them as leaks at program exit.
However, Valgrind is at least as good a tool for this particular use
case.
Furthermore, the non-prof-promote code is getting in the way of
some optimizations that will make heap profiling much cheaper for the
predominant use case (sampling a small representative proportion of all
allocations).
When you call free() we load chunk->arena even though that
data isn't used on the tcache hot path.
In profiling some FB applications, I found that ~30% of the
dTLB misses in the free() function come from this line. With
4 MB chunks, the arena_chunk_t->map is ~ 32 KB (1024 pages
in the chunk, 4 8 byte pointers in arena_chunk_map_t). This
means there's only a 1/8 chance of the page containing
chunk->arena also comtaining the map bits.
This happens when it fails to allocate a new chunk. Which
arena_chunk_alloc then passes into arena_avail_insert without any
checks. This then causes a crash when arena_avail_insert tries
to check chunk->ndirty.
This was introduced by the refactoring of arena_chunk_alloc
which previously would have returned NULL immediately after
calling chunk_alloc. This is now the return from
arena_chunk_init_hard so we need to check that return, and
not continue if it was NULL.
If mremap(2) is used for huge reallocation, physical pages are mapped to
new virtual addresses rather than data being copied to new pages. This
bypasses the normal junk filling that would happen during allocation, so
add junk filling that is specific to this case.
Refactor prof_dump() to use a two pass algorithm, and prof_leave() prior
to the second pass. This avoids write(2) system calls while holding
critical prof resources.
Fix prof_dump() to close the dump file descriptor for all relevant error
paths.
Minimize the size of prof-related static buffers when prof is disabled.
This saves roughly 65 KiB of application memory for non-prof builds.
Refactor prof_ctx_init() out of prof_lookup_global().
Refactor overly large functions by breaking out helper functions.
Refactor overly complex multi-purpose functions into separate more
specific functions.
Extract profiling code from malloc(), imemalign(), calloc(), realloc(),
mallocx(), rallocx(), and xallocx(). This slightly reduces the amount
of code compiled into the fast paths, but the primary benefit is the
combinatorial complexity reduction.
Simplify iralloc[t]() by creating a separate ixalloc() that handles the
no-move cases.
Further simplify [mrxn]allocx() (and by implication [mrn]allocm()) to
make request size overflows due to size class and/or alignment
constraints trigger undefined behavior (detected by debug-only
assertions).
Report ENOMEM rather than EINVAL if an OOM occurs during heap profiling
backtrace creation in imemalign(). This bug impacted posix_memalign()
and aligned_alloc().
Add unit tests for pow2_ceil(), malloc_strtoumax(), and
malloc_snprintf().
Fix numerous bugs in malloc_strotumax() error handling/reporting. These
bugs could have caused application-visible issues for some seldom used
(0X... and 0... prefixes) or malformed MALLOC_CONF or mallctl() argument
strings, but otherwise they had no impact.
Fix numerous bugs in malloc_snprintf(). These bugs were not exercised
by existing malloc_*printf() calls, so they had no impact.
Reduce rtree memory usage by storing booleans (1 byte each) rather than
pointers. The rtree code is only used to record whether jemalloc manages
a chunk of memory, so there's no need to store pointers in the rtree.
Increase rtree node size to 64 KiB in order to reduce tree depth from 13
to 3 on 64-bit systems. The conversion to more compact leaf nodes was
enough by itself to make the rtree depth 1 on 32-bit systems; due to the
fact that root nodes are smaller than the specified node size if
possible, the node size change has no impact on 32-bit systems (assuming
default chunk size).
Verify that freed regions are quarantined, and that redzone corruption
is detected.
Introduce a testing idiom for intercepting/replacing internal functions.
In this case the replaced function is ordinarily a static function, but
the idiom should work similarly for library-private functions.
Don't junk fill reallocations for which the request size is less than
the current usable size, but not enough smaller to cause a size class
change. Unlike malloc()/calloc()/realloc(), *allocx() contractually
treats the full usize as the allocation, so a caller can ask for zeroed
memory via mallocx() and a series of rallocx() calls that all specify
MALLOCX_ZERO, and be assured that all newly allocated bytes will be
zeroed and made available to the application without danger of allocator
mutation until the size class decreases enough to cause usize reduction.
Refactor such that arena_prof_ctx_set() receives usize as an argument,
and use it to determine whether to handle ptr as a small region, rather
than reading the chunk page map.
Implement the *allocx() API, which is a successor to the *allocm() API.
The *allocx() functions are slightly simpler to use because they have
fewer parameters, they directly return the results of primary interest,
and mallocx()/rallocx() avoid the strict aliasing pitfall that
allocm()/rallocx() share with posix_memalign(). The following code
violates strict aliasing rules:
foo_t *foo;
allocm((void **)&foo, NULL, 42, 0);
whereas the following is safe:
foo_t *foo;
void *p;
allocm(&p, NULL, 42, 0);
foo = (foo_t *)p;
mallocx() does not have this problem:
foo_t *foo = (foo_t *)mallocx(42, 0);
Add JEMALLOC_INLINE_C and use it instead of JEMALLOC_INLINE in .c files,
so that the annotated functions are always static.
Remove SFMT's inline-related macros and use jemalloc's instead, so that
there's no danger of interactions with jemalloc's definitions that
disable inlining for debug builds.
Refactor tests to use explicit testing assertions, rather than diff'ing
test output. This makes the test code a bit shorter, more explicitly
encodes testing intent, and makes test failure diagnosis more
straightforward.