Move small run metadata into the arena chunk header, with multiple
expected benefits:
- Lower run fragmentation due to reduced run sizes; runs are more likely
to completely drain when there are fewer total regions.
- Improved cache behavior. Prior to this change, run headers were
always page-aligned, which put extra pressure on some CPU cache sets.
The degree to which this was a problem was hardware dependent, but it
likely hurt some even for the most advanced modern hardware.
- Buffer overruns/underruns are less likely to corrupt allocator
metadata.
- Size classes between 4 KiB and 16 KiB become reasonable to support
without any special handling, and the runs are small enough that dirty
unused pages aren't a significant concern.
Fix a race that caused a non-critical assertion failure. To trigger the
race, a thread had to be part way through initializing a new sample,
such that it was discoverable by the dumping thread, but not yet linked
into its gctx by the time a later dump phase would normally have reset
its state to 'nominal'.
Additionally, lock access to the state field during modification to
transition to the dumping state. It's not apparent that this oversight
could have caused an actual problem due to outer locking that protects
the dumping machinery, but the added locking pedantically follows the
stated locking protocol for the state field.
It has an unused variable, so it was always failing (at least with gcc
4.9.1). Alternatively, the `-Werror` flag could be removed if it isn't
strictly necessary.
Don't use atomic_add_uint64(), because it isn't available on 32-bit
platforms.
Fix forking support functions to manage all prof-related mutexes.
These regressions were introduced by
602c8e0971 (Implement per thread heap
profiling.), which did not make it into any releases prior to these
fixes.
Fix irallocx_prof() sample logic to only update the threshold counter
after it knows what size the allocation ended up being. This regression
was caused by 6e73dc194e (Fix a profile
sampling race.), which did not make it into any releases prior to this
fix.
* assertion failure
* malloc_init failure
* malloc not already initialized (in malloc_init)
* running in valgrind
* thread cache disabled at runtime
Clang and GCC already consider a comparison with NULL or -1 to be cold,
so many branches (out-of-memory) are already correctly considered as
cold and marking them is not important.
Fix a profile sampling race that was due to preparing to sample, yet
doing nothing to assure that the context remains valid until the stats
are updated.
These regressions were caused by
602c8e0971 (Implement per thread heap
profiling.), which did not make it into any releases prior to these
fixes.
Fix prof_tdata_get() to avoid dereferencing an invalid tdata pointer
(when it's PROF_TDATA_STATE_{REINCARNATED,PURGATORY}).
Fix prof_tdata_get() callers to check for invalid results besides NULL
(PROF_TDATA_STATE_{REINCARNATED,PURGATORY}).
These regressions were caused by
602c8e0971 (Implement per thread heap
profiling.), which did not make it into any releases prior to these
fixes.
- Add a --thread N option to select profile for thread N (otherwise, all
threads will be printed)
- The $profile map now has a {threads} element that is a map from thread id to
a profile that has the same format as the {profile} element
- Refactor ReadHeapProfile into smaller components and use them to implement
ReadThreadedHeapProfile
This adds a new `sdallocx` function to the external API, allowing the
size to be passed by the caller. It avoids some extra reads in the
thread cache fast path. In the case where stats are enabled, this
avoids the work of calculating the size from the pointer.
An assertion validates the size that's passed in, so enabling debugging
will allow users of the API to debug cases where an incorrect size is
passed in.
The performance win for a contrived microbenchmark doing an allocation
and immediately freeing it is ~10%. It may have a different impact on a
real workload.
Closes#28
Optimize [nmd]alloc() fast paths such that the (flags == 0) case is
streamlined, flags decoding only happens to the minimum degree
necessary, and no conditionals are repeated.
Relax the "are we in a git repo?" check to succeed even if the top level
jemalloc directory is not at the top level of the git repo.
Add git tag filtering so that only version triplets match when
generating VERSION.
Add fallback bogus VERSION creation, so that in the worst case, rather
than generating empty values for e.g. JEMALLOC_VERSION_MAJOR,
configuration ends up generating useless constants.
Junk filling is done in arena_dalloc_bin_locked(), so arena_alloc_junk_small()
is redundant. Also, we should use arena_dalloc_junk_small() instead of
arena_alloc_junk_small().
__*_hook() is glibc, but on at least one glibc platform (homebrew),
the __GLIBC__ define isn't set correctly and we miss being able to
use these hooks.
Do a feature test for it during configuration so that we enable it
anywhere the hooks are actually available.
Rename data structures (prof_thr_cnt_t-->prof_tctx_t,
prof_ctx_t-->prof_gctx_t), and convert to storing a prof_tctx_t for
sampled objects.
Convert PROF_ALLOC_PREP() to prof_alloc_prep(), since precise backtrace
depth within jemalloc functions is no longer an issue (pprof prunes
irrelevant frames).
Implement mallctl's:
- prof.reset implements full sample data reset, and optional change of
sample interval.
- prof.lg_sample reads the current sample interval (opt.lg_prof_sample
was the permanent source of truth prior to prof.reset).
- thread.prof.name provides naming capability for threads within heap
profile dumps.
- thread.prof.active makes it possible to activate/deactivate heap
profiling for individual threads.
Modify the heap dump files to contain per thread heap profile data.
This change is incompatible with the existing pprof, which will require
enhancements to read and process the enriched data.