This adds a new `sdallocx` function to the external API, allowing the
size to be passed by the caller. It avoids some extra reads in the
thread cache fast path. In the case where stats are enabled, this
avoids the work of calculating the size from the pointer.
An assertion validates the size that's passed in, so enabling debugging
will allow users of the API to debug cases where an incorrect size is
passed in.
The performance win for a contrived microbenchmark doing an allocation
and immediately freeing it is ~10%. It may have a different impact on a
real workload.
Closes#28
Optimize [nmd]alloc() fast paths such that the (flags == 0) case is
streamlined, flags decoding only happens to the minimum degree
necessary, and no conditionals are repeated.
__*_hook() is glibc, but on at least one glibc platform (homebrew),
the __GLIBC__ define isn't set correctly and we miss being able to
use these hooks.
Do a feature test for it during configuration so that we enable it
anywhere the hooks are actually available.
Rename data structures (prof_thr_cnt_t-->prof_tctx_t,
prof_ctx_t-->prof_gctx_t), and convert to storing a prof_tctx_t for
sampled objects.
Convert PROF_ALLOC_PREP() to prof_alloc_prep(), since precise backtrace
depth within jemalloc functions is no longer an issue (pprof prunes
irrelevant frames).
Implement mallctl's:
- prof.reset implements full sample data reset, and optional change of
sample interval.
- prof.lg_sample reads the current sample interval (opt.lg_prof_sample
was the permanent source of truth prior to prof.reset).
- thread.prof.name provides naming capability for threads within heap
profile dumps.
- thread.prof.active makes it possible to activate/deactivate heap
profiling for individual threads.
Modify the heap dump files to contain per thread heap profile data.
This change is incompatible with the existing pprof, which will require
enhancements to read and process the enriched data.
Treat prof_tdata_t's bt2cnt as a comprehensive map of the thread's
extant allocation samples (do not limit the total number of entries).
This helps prepare the way for per thread heap profiling.
Fix runs_dirty-based purging to also purge dirty pages in the spare
chunk.
Refactor runs_dirty manipulation into arena_dirty_{insert,remove}(), and
move the arena->ndirty accounting into those functions.
Remove the u.ql_link field from arena_chunk_map_t, and get rid of the
enclosing union for u.rb_link, since only rb_link remains.
Remove the ndirty field from arena_chunk_t.
Some platforms, such as Google's Portable Native Client, use Newlib and
thus lack access to madvise(2). In those instances, pages_purge() is
transformed into a no-op.
Some platforms (like those using Newlib) don't have ffs/ffsl. This
commit adds a check to configure.ac for __builtin_ffsl if ffsl isn't
found. __builtin_ffsl performs the same function as ffsl, and has the
added benefit of being available on any platform utilizing
Gcc-compatible compiler.
This change does not address the used of ffs in the MALLOCX_ARENA()
macro.
Add size class computation capability, currently used only as validation
of the size class lookup tables. Generalize the size class spacing used
for bins, for eventual use throughout the full range of allocation
sizes.
Refactor huge allocation to be managed by arenas (though the global
red-black tree of huge allocations remains for lookup during
deallocation). This is the logical conclusion of recent changes that 1)
made per arena dss precedence apply to huge allocation, and 2) made it
possible to replace the per arena chunk allocation/deallocation
functions.
Remove the top level huge stats, and replace them with per arena huge
stats.
Normalize function names and types to *dalloc* (some were *dealloc*).
Remove the --enable-mremap option. As jemalloc currently operates, this
is a performace regression for some applications, but planned work to
logarithmically space huge size classes should provide similar amortized
performance. The motivation for this change was that mremap-based huge
reallocation forced leaky abstractions that prevented refactoring.
Add new mallctl endpoints "arena<i>.chunk.alloc" and
"arena<i>.chunk.dealloc" to allow userspace to configure
jemalloc's chunk allocator and deallocator on a per-arena
basis.
Simplify backtracing to not ignore any frames, and compensate for this
in pprof in order to increase flexibility with respect to function-based
refactoring even in the presence of non-deterministic inlining. Modify
pprof to blacklist all jemalloc allocation entry points including
non-standard ones like mallocx(), and ignore all allocator-internal
frames. Prior to this change, pprof excluded the specifically
blacklisted functions from backtraces, but it left allocator-internal
frames intact.
Fix debug-only compilation failures introduced by changes to
prof_sample_accum_update() in:
6c39f9e059
refactor profiling. only use a bytes till next sample variable.
Forcefully disable tcache if running inside Valgrind, and remove
Valgrind calls in tcache-specific code.
Restructure Valgrind-related code to move most Valgrind calls out of the
fast path functions.
Take advantage of static knowledge to elide some branches in
JEMALLOC_VALGRIND_REALLOC().
Make dss non-optional on all platforms which support sbrk(2).
Fix the "arena.<i>.dss" mallctl to return an error if "primary" or
"secondary" precedence is specified, but sbrk(2) is not supported.
Make promotion of sampled small objects to large objects mandatory, so
that profiling metadata can always be stored in the chunk map, rather
than requiring one pointer per small region in each small-region page
run. In practice the non-prof-promote code was only useful when using
jemalloc to track all objects and report them as leaks at program exit.
However, Valgrind is at least as good a tool for this particular use
case.
Furthermore, the non-prof-promote code is getting in the way of
some optimizations that will make heap profiling much cheaper for the
predominant use case (sampling a small representative proportion of all
allocations).
When you call free() we load chunk->arena even though that
data isn't used on the tcache hot path.
In profiling some FB applications, I found that ~30% of the
dTLB misses in the free() function come from this line. With
4 MB chunks, the arena_chunk_t->map is ~ 32 KB (1024 pages
in the chunk, 4 8 byte pointers in arena_chunk_map_t). This
means there's only a 1/8 chance of the page containing
chunk->arena also comtaining the map bits.
The hash code, which has MurmurHash3 at its core, generates different
output depending on system endianness, so adapt the expected output on
big-endian systems. MurmurHash3 code also makes the assumption that
unaligned access is okay (not true on all systems), but jemalloc only
hashes data structures that have sufficient alignment to dodge this
limitation.
Add a cpp #define that removes 'restrict' keyword usage unless the
compiler definitely supports C99. As written, 'restrict' is only
enabled if the compiler supports the -std=gnu99 option (e.g. gcc and
llvm).
Reported by Tobias Hieta.
Avoid copying "jeprof" to a 1-byte buffer within prof_boot0() when heap
profiling is disabled. Although this is dead code under such
conditions, the compiler doesn't figure that part out.
Reported by Eduardo Silva.
Fix stress tests such that testlib code uses the jet_ allocator, but
test code uses libjemalloc.
Generate jemalloc_{rename,mangle}.h, the former because it's needed for
the stress test name mangling fix, and the latter for consistency. As
an artifact of this change, some (but not all) definitions related to
the experimental API are absent from the headers unless the feature is
enabled at configure time.
Refactor prof_dump() to use a two pass algorithm, and prof_leave() prior
to the second pass. This avoids write(2) system calls while holding
critical prof resources.
Fix prof_dump() to close the dump file descriptor for all relevant error
paths.
Minimize the size of prof-related static buffers when prof is disabled.
This saves roughly 65 KiB of application memory for non-prof builds.
Refactor prof_ctx_init() out of prof_lookup_global().
Refactor overly large functions by breaking out helper functions.
Refactor overly complex multi-purpose functions into separate more
specific functions.
Extract profiling code from malloc(), imemalign(), calloc(), realloc(),
mallocx(), rallocx(), and xallocx(). This slightly reduces the amount
of code compiled into the fast paths, but the primary benefit is the
combinatorial complexity reduction.
Simplify iralloc[t]() by creating a separate ixalloc() that handles the
no-move cases.
Further simplify [mrxn]allocx() (and by implication [mrn]allocm()) to
make request size overflows due to size class and/or alignment
constraints trigger undefined behavior (detected by debug-only
assertions).
Report ENOMEM rather than EINVAL if an OOM occurs during heap profiling
backtrace creation in imemalign(). This bug impacted posix_memalign()
and aligned_alloc().
Add unit tests for pow2_ceil(), malloc_strtoumax(), and
malloc_snprintf().
Fix numerous bugs in malloc_strotumax() error handling/reporting. These
bugs could have caused application-visible issues for some seldom used
(0X... and 0... prefixes) or malformed MALLOC_CONF or mallctl() argument
strings, but otherwise they had no impact.
Fix numerous bugs in malloc_snprintf(). These bugs were not exercised
by existing malloc_*printf() calls, so they had no impact.
Reduce rtree memory usage by storing booleans (1 byte each) rather than
pointers. The rtree code is only used to record whether jemalloc manages
a chunk of memory, so there's no need to store pointers in the rtree.
Increase rtree node size to 64 KiB in order to reduce tree depth from 13
to 3 on 64-bit systems. The conversion to more compact leaf nodes was
enough by itself to make the rtree depth 1 on 32-bit systems; due to the
fact that root nodes are smaller than the specified node size if
possible, the node size change has no impact on 32-bit systems (assuming
default chunk size).
Verify that freed regions are quarantined, and that redzone corruption
is detected.
Introduce a testing idiom for intercepting/replacing internal functions.
In this case the replaced function is ordinarily a static function, but
the idiom should work similarly for library-private functions.
Don't junk fill reallocations for which the request size is less than
the current usable size, but not enough smaller to cause a size class
change. Unlike malloc()/calloc()/realloc(), *allocx() contractually
treats the full usize as the allocation, so a caller can ask for zeroed
memory via mallocx() and a series of rallocx() calls that all specify
MALLOCX_ZERO, and be assured that all newly allocated bytes will be
zeroed and made available to the application without danger of allocator
mutation until the size class decreases enough to cause usize reduction.
Refactor such that arena_prof_ctx_set() receives usize as an argument,
and use it to determine whether to handle ptr as a small region, rather
than reading the chunk page map.
Move je_* definitions from jemalloc_macros.h.in to jemalloc_defs.h.in,
because only the latter is an autoconf header (#undef substitution
occurs).
Fix unit tests to use automatic mangling, so that e.g. mallocx is
macro-substituted to becom jet_mallocx.
Implement the *allocx() API, which is a successor to the *allocm() API.
The *allocx() functions are slightly simpler to use because they have
fewer parameters, they directly return the results of primary interest,
and mallocx()/rallocx() avoid the strict aliasing pitfall that
allocm()/rallocx() share with posix_memalign(). The following code
violates strict aliasing rules:
foo_t *foo;
allocm((void **)&foo, NULL, 42, 0);
whereas the following is safe:
foo_t *foo;
void *p;
allocm(&p, NULL, 42, 0);
foo = (foo_t *)p;
mallocx() does not have this problem:
foo_t *foo = (foo_t *)mallocx(42, 0);
Add mtx (mutex) to test infrastructure, in order to avoid bootstrapping
complications that would result from directly using malloc_mutex.
Rename test infrastructure's thread abstraction from je_thread to thd.
Fix some header ordering issues.
Add JEMALLOC_INLINE_C and use it instead of JEMALLOC_INLINE in .c files,
so that the annotated functions are always static.
Remove SFMT's inline-related macros and use jemalloc's instead, so that
there's no danger of interactions with jemalloc's definitions that
disable inlining for debug builds.
Add probabability distribution utility code that enables generation of
random deviates drawn from normal, Chi-square, and Gamma distributions.
Fix format strings in several of the assert_* macros (remove a %s).
Clean up header issues; it's critical that system headers are not
included after internal definitions potentially do things like:
#define inline
Fix the build system to incorporate header dependencies for the test
library C files.
Refactor tests to use explicit testing assertions, rather than diff'ing
test output. This makes the test code a bit shorter, more explicitly
encodes testing intent, and makes test failure diagnosis more
straightforward.
Unless heap profiling is enabled, disable floating point code and don't
link with libm. This, in combination with e.g. EXTRA_CFLAGS=-mno-sse on
x64 systems, makes it possible to completely disable floating point
register use. Some versions of glibc neglect to save/restore
caller-saved floating point registers during dynamic lazy symbol
loading, and the symbol loading code uses whatever malloc the
application happens to have linked/loaded with, the result being
potential floating point register corruption.
Refactor the test harness to support three types of tests:
- unit: White box unit tests. These tests have full access to all
internal jemalloc library symbols. Though in actuality all symbols
are prefixed by jet_, macro-based name mangling abstracts this away
from test code.
- integration: Black box integration tests. These tests link with
the installable shared jemalloc library, and with the exception of
some utility code and configure-generated macro definitions, they have
no access to jemalloc internals.
- stress: Black box stress tests. These tests link with the installable
shared jemalloc library, as well as with an internal allocator with
symbols prefixed by jet_ (same as for unit tests) that can be used to
allocate data structures that are internal to the test code.
Move existing tests into test/{unit,integration}/ as appropriate.
Split out internal parts of jemalloc_defs.h.in and put them in
jemalloc_internal_defs.h.in. This reduces internals exposure to
applications that #include <jemalloc/jemalloc.h>.
Refactor jemalloc.h header generation so that a single header file
results, and the prototypes can be used to generate jet_ prototypes for
tests. Split jemalloc.h.in into multiple parts (jemalloc_defs.h.in,
jemalloc_macros.h.in, jemalloc_protos.h.in, jemalloc_mangle.h.in) and
use a shell script to generate a unified jemalloc.h at configure time.
Change the default private namespace prefix from "" to "je_".
Add missing private namespace mangling.
Remove hard-coded private_namespace.h. Instead generate it and
private_unnamespace.h from private_symbols.txt. Use similar logic for
public symbols, which aids in name mangling for jet_ symbols.
Add test_warn() and test_fail(). Replace existing exit(1) calls with
test_fail() calls.
When using LinuxThreads pthread_setspecific triggers recursive
allocation on all threads. Work around this by creating a global linked
list of in-progress tsd initializations.
This modifies the _tsd_get_wrapper macro-generated function. When it has
to initialize an TSD object it will push the item to the linked list
first. If this causes a recursive allocation then the _get_wrapper
request is satisfied from the list. When pthread_setspecific returns the
item is removed from the list.
This effectively adds a very poor substitute for real TLS used only
during pthread_setspecific allocation recursion.
Signed-off-by: Crestez Dan Leonard <lcrestez@ixiacom.com>
Fix a Valgrind integration flaw that caused Valgrind warnings about
reads of uninitialized memory in internal zero-initialized data
structures (relevant to tcache and prof code).
Add the JEMALLOC_ALWAYS_INLINE_C macro and use it for always-inlined
functions declared in .c files. This fixes a function attribute
inconsistency for debug builds that resulted in (harmless) compiler
warnings about functions not being inlinable.
Reported by Ricardo Nabinger Sanchez.
Add no-op bodies to VALGRIND_*() macro stubs so that they can be used in
contexts like the following without generating a compiler warning about
the 'if' statement having an empty body:
if (config_valgrind)
VALGRIND_MAKE_MEM_UNDEFINED(ret, size);
Checking for __s390x__ means you work on s390x, but not s390 (32bit)
systems. So use __s390__ which works for both.
With this, `make check` passes on s390.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Avoid writing to uninitialized TLS as a side effect of deallocation.
Initializing TLS during deallocation is unsafe because it is possible
that a thread never did any allocation, and that TLS has already been
deallocated by the threads library, resulting in write-after-free
corruption. These fixes affect prof_tdata and quarantine; all other
uses of TLS are already safe, whether intentionally (as for tcache) or
unintentionally (as for arenas).
Update hash from MurmurHash2 to MurmurHash3, primarily because the
latter generates 128 bits in a single call for no extra cost, which
simplifies integration with cuckoo hashing.
Tighten valgrind integration such that immediately after memory is
validated or zeroed, valgrind is told to forget the memory's 'defined'
state. The only place newly allocated memory should be left marked as
'defined' is in the public functions (e.g. calloc() and realloc()).
Purge unused dirty pages in an order that first performs clean/dirty run
defragmentation, in order to mitigate available run fragmentation.
Remove the limitation that prevented purging unless at least one chunk
worth of dirty pages had accumulated in an arena. This limitation was
intended to avoid excessive purging for small applications, but the
threshold was arbitrary, and the effect of questionable utility.
Relax opt_lg_dirty_mult from 5 to 3. This compensates for increased
likelihood of allocating clean runs, given the same ratio of clean:dirty
runs, and reduces the potential for repeated purging in pathological
large malloc/free loops that push the active:dirty page ratio just over
the purge threshold.
Add the "arenas.extend" mallctl, so that it is possible to create new
arenas that are outside the set that jemalloc automatically multiplexes
threads onto.
Add the ALLOCM_ARENA() flag for {,r,d}allocm(), so that it is possible
to explicitly allocate from a particular arena.
Add the "opt.dss" mallctl, which controls the default precedence of dss
allocation relative to mmap allocation.
Add the "arena.<i>.dss" mallctl, which makes it possible to set the
default dss precedence on a per arena or global basis.
Add the "arena.<i>.purge" mallctl, which obsoletes "arenas.purge".
Add the "stats.arenas.<i>.dss" mallctl.