Fix a prof_tctx_t/prof_tdata_t cleanup race by storing a copy of thr_uid
in prof_tctx_t, so that the associated tdata need not be present during
tctx teardown.
atexit(3) can deadlock internally during its own initialization if
jemalloc calls atexit() during jemalloc initialization. Mitigate the
impact by restructuring prof initialization to avoid calling atexit()
unless the registered function will actually dump a final heap profile.
Additionally, disable prof_final by default so that this land mine is
opt-in rather than opt-out.
This resolves#144.
Fix prof regressions related to tdata (main per thread profiling data
structure) destruction:
- Deadlock. The fix for this was intended to be part of
20c31deaae (Test prof.reset mallctl and
fix numerous discovered bugs.) but the fix was left incomplete.
- Destruction race. Detaching tdata just prior to destruction without
holding the tdatas lock made it possible for another thread to destroy
the tdata out from under the thread that was on its way to doing so.
Fix tsd cleanup regressions that were introduced in
5460aa6f66 (Convert all tsd variables to
reside in a single tsd structure.). These regressions were twofold:
1) tsd_tryget() should never (and need never) return NULL. Rename it to
tsd_fetch() and simplify all callers.
2) tsd_*_set() must only be called when tsd is in the nominal state,
because cleanup happens during the nominal-->purgatory transition,
and re-initialization must not happen while in the purgatory state.
Add tsd_nominal() and use it as needed. Note that tsd_*{p,}_get()
can still be used as long as no re-initialization that would require
cleanup occurs. This means that e.g. the thread_allocated counter
can be updated unconditionally.
Implement/test/fix the opt.prof_thread_active_init,
prof.thread_active_init, and thread.prof.active mallctl's.
Test/fix the thread.prof.name mallctl.
Refactor opt_prof_active to be read-only and move mutable state into the
prof_active variable. Stop leaning on ctl-related locking for
protection.
Fix a race that caused a non-critical assertion failure. To trigger the
race, a thread had to be part way through initializing a new sample,
such that it was discoverable by the dumping thread, but not yet linked
into its gctx by the time a later dump phase would normally have reset
its state to 'nominal'.
Additionally, lock access to the state field during modification to
transition to the dumping state. It's not apparent that this oversight
could have caused an actual problem due to outer locking that protects
the dumping machinery, but the added locking pedantically follows the
stated locking protocol for the state field.
Don't use atomic_add_uint64(), because it isn't available on 32-bit
platforms.
Fix forking support functions to manage all prof-related mutexes.
These regressions were introduced by
602c8e0971 (Implement per thread heap
profiling.), which did not make it into any releases prior to these
fixes.
Fix a profile sampling race that was due to preparing to sample, yet
doing nothing to assure that the context remains valid until the stats
are updated.
These regressions were caused by
602c8e0971 (Implement per thread heap
profiling.), which did not make it into any releases prior to these
fixes.
Fix prof_tdata_get() to avoid dereferencing an invalid tdata pointer
(when it's PROF_TDATA_STATE_{REINCARNATED,PURGATORY}).
Fix prof_tdata_get() callers to check for invalid results besides NULL
(PROF_TDATA_STATE_{REINCARNATED,PURGATORY}).
These regressions were caused by
602c8e0971 (Implement per thread heap
profiling.), which did not make it into any releases prior to these
fixes.
Rename data structures (prof_thr_cnt_t-->prof_tctx_t,
prof_ctx_t-->prof_gctx_t), and convert to storing a prof_tctx_t for
sampled objects.
Convert PROF_ALLOC_PREP() to prof_alloc_prep(), since precise backtrace
depth within jemalloc functions is no longer an issue (pprof prunes
irrelevant frames).
Implement mallctl's:
- prof.reset implements full sample data reset, and optional change of
sample interval.
- prof.lg_sample reads the current sample interval (opt.lg_prof_sample
was the permanent source of truth prior to prof.reset).
- thread.prof.name provides naming capability for threads within heap
profile dumps.
- thread.prof.active makes it possible to activate/deactivate heap
profiling for individual threads.
Modify the heap dump files to contain per thread heap profile data.
This change is incompatible with the existing pprof, which will require
enhancements to read and process the enriched data.
Treat prof_tdata_t's bt2cnt as a comprehensive map of the thread's
extant allocation samples (do not limit the total number of entries).
This helps prepare the way for per thread heap profiling.
Simplify backtracing to not ignore any frames, and compensate for this
in pprof in order to increase flexibility with respect to function-based
refactoring even in the presence of non-deterministic inlining. Modify
pprof to blacklist all jemalloc allocation entry points including
non-standard ones like mallocx(), and ignore all allocator-internal
frames. Prior to this change, pprof excluded the specifically
blacklisted functions from backtraces, but it left allocator-internal
frames intact.
Make promotion of sampled small objects to large objects mandatory, so
that profiling metadata can always be stored in the chunk map, rather
than requiring one pointer per small region in each small-region page
run. In practice the non-prof-promote code was only useful when using
jemalloc to track all objects and report them as leaks at program exit.
However, Valgrind is at least as good a tool for this particular use
case.
Furthermore, the non-prof-promote code is getting in the way of
some optimizations that will make heap profiling much cheaper for the
predominant use case (sampling a small representative proportion of all
allocations).
Refactor prof_dump() to use a two pass algorithm, and prof_leave() prior
to the second pass. This avoids write(2) system calls while holding
critical prof resources.
Fix prof_dump() to close the dump file descriptor for all relevant error
paths.
Minimize the size of prof-related static buffers when prof is disabled.
This saves roughly 65 KiB of application memory for non-prof builds.
Refactor prof_ctx_init() out of prof_lookup_global().
Avoid writing to uninitialized TLS as a side effect of deallocation.
Initializing TLS during deallocation is unsafe because it is possible
that a thread never did any allocation, and that TLS has already been
deallocated by the threads library, resulting in write-after-free
corruption. These fixes affect prof_tdata and quarantine; all other
uses of TLS are already safe, whether intentionally (as for tcache) or
unintentionally (as for arenas).
Update hash from MurmurHash2 to MurmurHash3, primarily because the
latter generates 128 bits in a single call for no extra cost, which
simplifies integration with cuckoo hashing.
Add a library constructor for jemalloc that initializes the allocator.
This fixes a race that could occur if threads were created by the main
thread prior to any memory allocation, followed by fork(2), and then
memory allocation in the child process.
Fix the prefork/postfork functions to acquire/release the ctl, prof, and
rtree mutexes. This fixes various fork() child process deadlocks, but
one possible deadlock remains (intentionally) unaddressed: prof
backtracing can acquire runtime library mutexes, so deadlock is still
possible if heap profiling is enabled during fork(). This deadlock is
known to be a real issue in at least the case of libgcc-based
backtracing.
Reported by tfengjun.
Fix a potential deadlock that could occur during interval- and
growth-triggered heap profile dumps.
Fix an off-by-one heap profile statistics bug that could be observed in
interval- and growth-triggered heap profiles.
Fix heap profile dump filename sequence numbers (regression during
conversion to malloc_snprintf()).
Change the "opt.lg_prof_sample" default from 0 to 19 (1 B to 512 KiB).
Change the "opt.prof_accum" default from true to false.
Add the "opt.prof_final" mallctl, so that "opt.prof_prefix" need not be
abused to disable final profile dumping.
s/PAGE_SHIFT/LG_PAGE/g and s/PAGE_SIZE/PAGE/g.
Remove remnants of the dynamic-page-shift code.
Rename the "arenas.pagesize" mallctl to "arenas.page".
Remove the "arenas.chunksize" mallctl, which is redundant with
"opt.lg_chunk".
Remove ephemeral mutexes from the prof machinery, and remove
malloc_mutex_destroy(). This simplifies mutex management on systems
that call malloc()/free() inside pthread_mutex_{create,destroy}().
Add atomic_*_u() for operation on unsigned values.
Fix prof_printf() to call malloc_vsnprintf() rather than
malloc_snprintf().
Implement tsd, which is a TLS/TSD abstraction that uses one or both
internally. Modify bootstrapping such that no tsd's are utilized until
allocation is safe.
Remove malloc_[v]tprintf(), and use malloc_snprintf() instead.
Fix %p argument size handling in malloc_vsnprintf().
Fix a long-standing statistics-related bug in the "thread.arena"
mallctl that could cause crashes due to linked list corruption.
Implement malloc_vsnprintf() (a subset of vsnprintf(3)) as well as
several other printing functions based on it, so that formatted printing
can be relied upon without concern for inducing a dependency on floating
point runtime support. Replace malloc_write() calls with
malloc_*printf() where doing so simplifies the code.
Add name mangling for library-private symbols in the data and BSS
sections. Adjust CONF_HANDLE_*() macros in malloc_conf_init() to expose
all opt_* variable use to cpp so that proper mangling occurs.
Remove opt.lg_prof_bt_max, and hard code it to 7. The original
intention of this option was to enable faster backtracing by limiting
backtrace depth. However, this makes graphical pprof output very
difficult to interpret. In practice, decreasing sampling frequency is a
better mechanism for limiting profiling overhead.
Remove the opt.lg_prof_tcmax option and hard-code a cache size of 1024.
This setting is something that users just shouldn't have to worry about.
If lock contention actually ends up being a problem, the simple solution
available to the user is to reduce sampling frequency.
Convert configuration-related cpp conditional logic to use static
constant variables, e.g.:
#ifdef JEMALLOC_DEBUG
[...]
#endif
becomes:
if (config_debug) {
[...]
}
The advantage is clearer, more concise code. The main disadvantage is
that data structures no longer have conditionally defined fields, so
they pay the cost of all fields regardless of whether they are used. In
practice, this is only a minor concern; config_stats will go away in an
upcoming change, and config_prof is the only other major feature that
depends on more than a few special-purpose fields.
Fix prof_lookup() to artificially raise curobjs for all paths through
the code that creates a new entry in the per thread bt2cnt hash table.
This fixes a race condition that could corrupt memory if prof_accum were
false, and a non-default lg_prof_tcmax were used and/or threads were
destroyed.