Fix tsd cleanup regressions that were introduced in
5460aa6f66 (Convert all tsd variables to
reside in a single tsd structure.). These regressions were twofold:
1) tsd_tryget() should never (and need never) return NULL. Rename it to
tsd_fetch() and simplify all callers.
2) tsd_*_set() must only be called when tsd is in the nominal state,
because cleanup happens during the nominal-->purgatory transition,
and re-initialization must not happen while in the purgatory state.
Add tsd_nominal() and use it as needed. Note that tsd_*{p,}_get()
can still be used as long as no re-initialization that would require
cleanup occurs. This means that e.g. the thread_allocated counter
can be updated unconditionally.
Implement/test/fix the opt.prof_thread_active_init,
prof.thread_active_init, and thread.prof.active mallctl's.
Test/fix the thread.prof.name mallctl.
Refactor opt_prof_active to be read-only and move mutable state into the
prof_active variable. Stop leaning on ctl-related locking for
protection.
Move small run metadata into the arena chunk header, with multiple
expected benefits:
- Lower run fragmentation due to reduced run sizes; runs are more likely
to completely drain when there are fewer total regions.
- Improved cache behavior. Prior to this change, run headers were
always page-aligned, which put extra pressure on some CPU cache sets.
The degree to which this was a problem was hardware dependent, but it
likely hurt some even for the most advanced modern hardware.
- Buffer overruns/underruns are less likely to corrupt allocator
metadata.
- Size classes between 4 KiB and 16 KiB become reasonable to support
without any special handling, and the runs are small enough that dirty
unused pages aren't a significant concern.
Fix a profile sampling race that was due to preparing to sample, yet
doing nothing to assure that the context remains valid until the stats
are updated.
These regressions were caused by
602c8e0971 (Implement per thread heap
profiling.), which did not make it into any releases prior to these
fixes.
This adds a new `sdallocx` function to the external API, allowing the
size to be passed by the caller. It avoids some extra reads in the
thread cache fast path. In the case where stats are enabled, this
avoids the work of calculating the size from the pointer.
An assertion validates the size that's passed in, so enabling debugging
will allow users of the API to debug cases where an incorrect size is
passed in.
The performance win for a contrived microbenchmark doing an allocation
and immediately freeing it is ~10%. It may have a different impact on a
real workload.
Closes#28
Optimize [nmd]alloc() fast paths such that the (flags == 0) case is
streamlined, flags decoding only happens to the minimum degree
necessary, and no conditionals are repeated.
Rename data structures (prof_thr_cnt_t-->prof_tctx_t,
prof_ctx_t-->prof_gctx_t), and convert to storing a prof_tctx_t for
sampled objects.
Convert PROF_ALLOC_PREP() to prof_alloc_prep(), since precise backtrace
depth within jemalloc functions is no longer an issue (pprof prunes
irrelevant frames).
Implement mallctl's:
- prof.reset implements full sample data reset, and optional change of
sample interval.
- prof.lg_sample reads the current sample interval (opt.lg_prof_sample
was the permanent source of truth prior to prof.reset).
- thread.prof.name provides naming capability for threads within heap
profile dumps.
- thread.prof.active makes it possible to activate/deactivate heap
profiling for individual threads.
Modify the heap dump files to contain per thread heap profile data.
This change is incompatible with the existing pprof, which will require
enhancements to read and process the enriched data.
Add size class computation capability, currently used only as validation
of the size class lookup tables. Generalize the size class spacing used
for bins, for eventual use throughout the full range of allocation
sizes.
Refactor huge allocation to be managed by arenas (though the global
red-black tree of huge allocations remains for lookup during
deallocation). This is the logical conclusion of recent changes that 1)
made per arena dss precedence apply to huge allocation, and 2) made it
possible to replace the per arena chunk allocation/deallocation
functions.
Remove the top level huge stats, and replace them with per arena huge
stats.
Normalize function names and types to *dalloc* (some were *dealloc*).
Remove the --enable-mremap option. As jemalloc currently operates, this
is a performace regression for some applications, but planned work to
logarithmically space huge size classes should provide similar amortized
performance. The motivation for this change was that mremap-based huge
reallocation forced leaky abstractions that prevented refactoring.
Add new mallctl endpoints "arena<i>.chunk.alloc" and
"arena<i>.chunk.dealloc" to allow userspace to configure
jemalloc's chunk allocator and deallocator on a per-arena
basis.
Forcefully disable tcache if running inside Valgrind, and remove
Valgrind calls in tcache-specific code.
Restructure Valgrind-related code to move most Valgrind calls out of the
fast path functions.
Take advantage of static knowledge to elide some branches in
JEMALLOC_VALGRIND_REALLOC().
Make promotion of sampled small objects to large objects mandatory, so
that profiling metadata can always be stored in the chunk map, rather
than requiring one pointer per small region in each small-region page
run. In practice the non-prof-promote code was only useful when using
jemalloc to track all objects and report them as leaks at program exit.
However, Valgrind is at least as good a tool for this particular use
case.
Furthermore, the non-prof-promote code is getting in the way of
some optimizations that will make heap profiling much cheaper for the
predominant use case (sampling a small representative proportion of all
allocations).
Extract profiling code from malloc(), imemalign(), calloc(), realloc(),
mallocx(), rallocx(), and xallocx(). This slightly reduces the amount
of code compiled into the fast paths, but the primary benefit is the
combinatorial complexity reduction.
Simplify iralloc[t]() by creating a separate ixalloc() that handles the
no-move cases.
Further simplify [mrxn]allocx() (and by implication [mrn]allocm()) to
make request size overflows due to size class and/or alignment
constraints trigger undefined behavior (detected by debug-only
assertions).
Report ENOMEM rather than EINVAL if an OOM occurs during heap profiling
backtrace creation in imemalign(). This bug impacted posix_memalign()
and aligned_alloc().
Verify that freed regions are quarantined, and that redzone corruption
is detected.
Introduce a testing idiom for intercepting/replacing internal functions.
In this case the replaced function is ordinarily a static function, but
the idiom should work similarly for library-private functions.
Implement the *allocx() API, which is a successor to the *allocm() API.
The *allocx() functions are slightly simpler to use because they have
fewer parameters, they directly return the results of primary interest,
and mallocx()/rallocx() avoid the strict aliasing pitfall that
allocm()/rallocx() share with posix_memalign(). The following code
violates strict aliasing rules:
foo_t *foo;
allocm((void **)&foo, NULL, 42, 0);
whereas the following is safe:
foo_t *foo;
void *p;
allocm(&p, NULL, 42, 0);
foo = (foo_t *)p;
mallocx() does not have this problem:
foo_t *foo = (foo_t *)mallocx(42, 0);
Refactor the test harness to support three types of tests:
- unit: White box unit tests. These tests have full access to all
internal jemalloc library symbols. Though in actuality all symbols
are prefixed by jet_, macro-based name mangling abstracts this away
from test code.
- integration: Black box integration tests. These tests link with
the installable shared jemalloc library, and with the exception of
some utility code and configure-generated macro definitions, they have
no access to jemalloc internals.
- stress: Black box stress tests. These tests link with the installable
shared jemalloc library, as well as with an internal allocator with
symbols prefixed by jet_ (same as for unit tests) that can be used to
allocate data structures that are internal to the test code.
Move existing tests into test/{unit,integration}/ as appropriate.
Split out internal parts of jemalloc_defs.h.in and put them in
jemalloc_internal_defs.h.in. This reduces internals exposure to
applications that #include <jemalloc/jemalloc.h>.
Refactor jemalloc.h header generation so that a single header file
results, and the prototypes can be used to generate jet_ prototypes for
tests. Split jemalloc.h.in into multiple parts (jemalloc_defs.h.in,
jemalloc_macros.h.in, jemalloc_protos.h.in, jemalloc_mangle.h.in) and
use a shell script to generate a unified jemalloc.h at configure time.
Change the default private namespace prefix from "" to "je_".
Add missing private namespace mangling.
Remove hard-coded private_namespace.h. Instead generate it and
private_unnamespace.h from private_symbols.txt. Use similar logic for
public symbols, which aids in name mangling for jet_ symbols.
Add test_warn() and test_fail(). Replace existing exit(1) calls with
test_fail() calls.