Refactor ctl_stats_t to be a demand-zeroed non-growing data structure.
To keep the size from being onerous (~60 MiB) on 32-bit systems, convert
the arenas field to contain pointers rather than directly embedded
ctl_arena_stats_t elements.
Add the MALLCTL_ARENAS_ALL cpp macro as a fixed index for use
in accessing the arena.<i>.{purge,decay,dss} and stats.arenas.<i>.*
mallctls, and deprecate access via the arenas.narenas index (to be
removed in 6.0.0).
Add/rename related mallctls:
- Add stats.arenas.<i>.base .
- Rename stats.arenas.<i>.metadata to stats.arenas.<i>.internal .
- Add stats.arenas.<i>.resident .
Modify the arenas.extend mallctl to take an optional (extent_hooks_t *)
argument so that it is possible for all base allocations to be serviced
by the specified extent hooks.
This resolves#463.
Split purging into lazy and forced variants. Use the forced variant for
zeroing dss.
Add support for NULL function pointers as an opt-out mechanism for the
dalloc, commit, decommit, purge_lazy, purge_forced, split, and merge
fields of extent_hooks_t.
Add short-circuiting checks in large_ralloc_no_move_{shrink,expand}() so
that no attempt is made if splitting/merging is not supported.
This resolves#268.
If virtual memory is retained, allocate extents such that their sizes
form an exponentially growing series. This limits the number of
disjoint virtual memory ranges so that extent merging can be effective
even if multiple arenas' extent allocation requests are highly
interleaved.
This resolves#462.
Add the --with-lg-hugepage configure option, but automatically configure
LG_HUGEPAGE even if it isn't specified.
Add the pages_[no]huge() functions, which toggle huge page state via
madvise(..., MADV_[NO]HUGEPAGE) calls.
These bugs were introduced by 0ba5b9b618
(Add "J" (JSON) support to malloc_stats_print().), which was backported
as b599b32280 (with the same bugs except
the inapplicable "metatata" misspelling) and first released in 4.3.0.
Rewrite arena_slab_regind() to provide sufficient constant data for
the compiler to perform division strength reduction. This replaces
more general manual strength reduction that was implemented before
arena_bin_info was compile-time-constant. It would be possible to
slightly improve on the compiler-generated division code by taking
advantage of range limits that the compiler doesn't know about.
Adds cpp bindings for jemalloc, along with necessary autoconf settings.
This is mostly to add sized deallocation support, which can't be added
from C directly. Sized deallocation is ~10% microbench improvement.
* Import ax_cxx_compile_stdcxx.m4 from the autoconf repo, seems like the
easiest way to get c++14 detection.
* Adds various other changes, like CXXFLAGS, to configure.ac.
* Adds new rules to Makefile.in for src/jemalloc-cpp.cpp, and a basic
unittest.
* Both new and delete are overridden, to ensure jemalloc is used for
both.
* TODO future enhancement of avoiding extra PLT thunks for new and
delete - sdallocx and malloc are publicly exported jemalloc symbols,
using an alias would link them directly. Unfortunately, was having
trouble getting it to play nice with jemalloc's namespace support.
Testing:
Tested gcc 4.8, gcc 5, gcc 5.2, clang 4.0. Only gcc >= 5 has sized
deallocation support, verified that the rest build correctly.
Tested mac osx and Centos.
Tested --with-jemalloc-prefix and --without-export.
This resolves#202.
Some versions of Android provide a pthreads library without providing
pthread_atfork(), so in practice a separate feature test is necessary
for the latter.
Add feature tests for the MADV_FREE and MADV_DONTNEED flags to
madvise(2), so that MADV_FREE is detected and used for Linux kernel
versions 4.5 and newer. Refactor pages_purge() so that on systems which
support both flags, MADV_FREE is preferred over MADV_DONTNEED.
This resolves#387.
Add extent serial numbers and use them where appropriate as a sort key
that is higher priority than address, so that the allocation policy
prefers older extents.
This resolves#147.
Do not call s2u() during alloc_size computation, since any necessary
ceiling increase is taken care of later by extent_first_best_fit() -->
extent_size_quantize_ceil(), and the s2u() call may erroneously cause a
higher quantization result.
Remove an overly strict overflow check that was added in
4a7852137d (Fix extent_recycle()'s
cache-oblivious padding support.).
Add padding *after* computing the size class, so that the optimal size
class isn't skipped during search for a usable extent. This regression
was caused by b46261d58b (Implement
cache-oblivious support for huge size classes.).
Add an "over-size" extent heap in which to store extents which exceed
the maximum size class (plus cache-oblivious padding, if enabled).
Remove psz2ind_clamp() and use psz2ind() instead so that trying to
allocate the maximum size class can in principle succeed. In practice,
this allows assertions to hold so that OOM errors can be successfully
generated.
Fix extent_alloc_cache[_locked]() to support decommitted allocation, and
use this ability in arena_stash_dirty(), so that decommitted extents are
not needlessly committed during purging. In practice this does not
happen on any currently supported systems, because both extent merging
and decommit must be implemented; all supported systems implement one
xor the other.
rtree_node_init spinlocks the node, allocates, and then sets the node.
This is under heavy contention at the top of the tree if many threads
start to allocate at the same time.
Instead, take a per-rtree sleeping mutex to reduce spinning. Tested
both pthreads and osx OSSpinLock, and both reduce spinning adequately
Previous benchmark time:
./ttest1 500 100
~15s
New benchmark time:
./ttest1 500 100
.57s
Fix zone_force_unlock() to reinitialize, rather than unlocking mutexes,
since OS X 10.12 cannot tolerate a child unlocking mutexes that were
locked by its parent.
Refactor; this was a side effect of experimenting with zone
{de,re}registration during fork(2).
The raw clock variant is slow (even relative to plain CLOCK_MONOTONIC),
whereas the coarse clock variant is faster than CLOCK_MONOTONIC, but
still has resolution (~1ms) that is adequate for our purposes.
This resolves#479.
Some applications wrap various system calls, and if they call the
allocator in their wrappers, unexpected reentry can result. This is not
a general solution (many other syscalls are spread throughout the code),
but this resolves a bootstrapping issue that is apparently common.
This resolves#443.
This works around malloc_conf not being properly initialized by at least
the cygwin toolchain. Prior build system changes to use
-Wl,--[no-]whole-archive may be necessary for malloc_conf resolution to
work properly as a non-weak symbol (not tested).
This is generally correct (no need for weak symbols since no jemalloc
library is involved in the link phase), and avoids linking problems
(apparently unininitialized non-NULL malloc_conf) when using cygwin with
gcc.
glibc defines its malloc implementation with several weak and strong
symbols:
strong_alias (__libc_calloc, __calloc) weak_alias (__libc_calloc, calloc)
strong_alias (__libc_free, __cfree) weak_alias (__libc_free, cfree)
strong_alias (__libc_free, __free) strong_alias (__libc_free, free)
strong_alias (__libc_malloc, __malloc) strong_alias (__libc_malloc, malloc)
The issue is not with the weak symbols, but that other parts of glibc
depend on __libc_malloc explicitly. Defining them in terms of jemalloc
API's allows the linker to drop glibc's malloc.o completely from the link,
and static linking no longer results in symbol collisions.
Another wrinkle: jemalloc during initialization calls sysconf to
get the number of CPU's. GLIBC allocates for the first time before
setting up isspace (and other related) tables, which are used by
sysconf. Instead, use the pthread API to get the number of
CPUs with GLIBC, which seems to work.
This resolves#442.
Rather than protecting dss operations with a mutex, use atomic
operations. This has negligible impact on synchronization overhead
during typical dss allocation, but is a substantial improvement for
extent_in_dss() and the newly added extent_dss_mergeable(), which can be
called multiple times during extent deallocations.
This change also has the advantage of avoiding tsd in deallocation paths
associated with purging, which resolves potential deadlocks during
thread exit due to attempted tsd resurrection.
This resolves#425.
Add spin_t and spin_{init,adaptive}(), which provide a simple
abstraction for adaptive spinning.
Adaptively spin during busy waits in bootstrapping and rtree node
initialization.
Simplify decay-based purging attempts to only be triggered when the
epoch is advanced, rather than every time purgeable memory increases.
In a correctly functioning system (not previously the case; see below),
this only causes a behavior difference if during subsequent purge
attempts the least recently used (LRU) purgeable memory extent is
initially too large to be purged, but that memory is reused between
attempts and one or more of the next LRU purgeable memory extents are
small enough to be purged. In practice this is an arbitrary behavior
change that is within the set of acceptable behaviors.
As for the purging fix, assure that arena->decay.ndirty is recorded
*after* the epoch advance and associated purging occurs. Prior to this
fix, it was possible for purging during epoch advance to cause a
substantially underrepresentative (arena->ndirty - arena->decay.ndirty),
i.e. the number of dirty pages attributed to the current epoch was too
low, and a series of unintended purges could result. This fix is also
relevant in the context of the simplification described above, but the
bug's impact would be limited to over-purging at epoch advances.
Instead, move the epoch backward in time. Additionally, add
nstime_monotonic() and use it in debug builds to assert that time only
goes backward if nstime_update() is using a non-monotonic time source.
Add missing #include <time.h>. The critical time facilities appear to
have been transitively included via unistd.h and sys/time.h, but in
principle this omission was capable of having caused
clock_gettime(CLOCK_MONOTONIC, ...) to have been overlooked in favor of
gettimeofday(), which in turn could cause spurious non-monotonic time
updates.
Refactor nstime_get() out of nstime_update() and add configure tests for
all variants.
Add CLOCK_MONOTONIC_RAW support (Linux-specific) and
mach_absolute_time() support (OS X-specific).
Do not fall back to clock_gettime(CLOCK_REALTIME, ...). This was a
fragile Linux-specific workaround, which we're unlikely to use at all
now that clock_gettime(CLOCK_MONOTONIC_RAW, ...) is supported, and if we
have no choice besides non-monotonic clocks, gettimeofday() is only
incrementally worse.
Avoid calling s2u() on raw extent sizes in extent_recycle().
Clamp psz2ind() (implemented as psz2ind_clamp()) when inserting/removing
into/from size-segregated extent heaps.
On OSX 10.12, malloc_default_zone returns a special zone that is not
present in the list of registered zones. That zone uses a "lite zone"
if one is present (apparently enabled when malloc stack logging is
enabled), or the first registered zone otherwise. In practice this
means unless malloc stack logging is enabled, the first registered
zone is the default.
So get the list of zones to get the first one, instead of relying on
malloc_default_zone.
847ff22 added a call to malloc_default_zone() before the main loop in
register_zone, effectively making malloc_default_zone() called twice
without any different outcome expected in the returned result.
It is also called once at the beginning, and a second time at the end
of the loop block.
Instead, call it only once per iteration.
Revert 245ae6036c (Support --with-lg-page
values larger than actual page size.), because it could cause VM map
fragmentation if the kernel grows mmap()ed memory downward.
This resolves#391.
Fix a fundamental extent_split_wrapper() bug in an error path.
Fix extent_recycle() to deregister unsplittable extents before leaking
them.
Relax xallocx() test assertions so that unsplittable extents don't cause
test failures.
rtree-based extent lookups remain more expensive than chunk-based run
lookups, but with this optimization the fast path slowdown is ~3 CPU
cycles per metadata lookup (on Intel Core i7-4980HQ), versus ~11 cycles
prior. The path caching speedup tends to degrade gracefully unless
allocated memory is spread far apart (as is the case when using a
mixture of sbrk() and mmap()).
rallocx() for an alignment-constrained request may end up with a
smaller-than-worst-case size if in-place reallocation succeeds due to
serendipitous alignment. In such cases, sampling may not happen.
When an allocation is large enough to trigger multiple dumps, use
modular math rather than subtraction to reset the interval counter.
Prior to this change, it was possible for a single allocation to cause
many subsequent allocations to all trigger profile dumps.
When updating usable size for a sampled object, try to cancel out
the difference between LARGE_MINCLASS and usable size from the interval
counter.
Fix huge_ralloc_no_move_expand() to update the extent's zeroed attribute
based on the intersection of the previous value and that of the newly
merged trailing extent.
Look up chunk metadata via the radix tree, rather than using
CHUNK_ADDR2BASE().
Propagate pointer's containing extent.
Minimize extent lookups by doing a single lookup (e.g. in free()) and
propagating the pointer's extent into nearly all the functions that may
need it.
This makes it possible to acquire short-term "ownership" of rtree
elements so that it is possible to read an extent pointer *and* read the
extent's contents with a guarantee that the element will not be modified
until the ownership is released. This is intended as a mechanism for
resolving rtree read/write races rather than as a way to lock extents.
Use pszind_t size classes rather than szind_t size classes, and always
reserve space for NPSIZES elements. This removes unused heaps that are
not multiples of the page size, and adds (currently) unused heaps for
all huge size classes, with the immediate benefit that the size of
arena_t allocations is constant (no longer dependent on chunk size).
These compute size classes and indices similarly to size2index(),
index2size() and s2u(), respectively, but using the subset of size
classes that are multiples of the page size. Note that pszind_t and
szind_t are not interchangeable.
Short-circuit commonly called witness functions so that they only
execute in debug builds, and remove equivalent guards from mutex
functions. This avoids pointless code execution in
witness_assert_lockless(), which is typically called twice per
allocation/deallocation function invocation.
Inline commonly called witness functions so that optimized builds can
completely remove calls as dead code.
b2c0d6322d (Add witness, a simple online
locking validator.) caused a broad propagation of tsd throughout the
internal API, but tsd_fetch() was designed to fail prior to tsd
bootstrapping. Fix this by splitting tsd_t into non-nullable tsd_t and
nullable tsdn_t, and modifying all internal APIs that do not critically
rely on tsd to take nullable pointers. Furthermore, add the
tsd_booted_get() function so that tsdn_fetch() can probe whether tsd
bootstrapping is complete and return NULL if not. All dangerous
conversions of nullable pointers are tsdn_tsd() calls that assert-fail
on invalid conversion.
This is a broader application of optimizations to malloc() and free() in
f4a0f32d34 (Fast-path improvement:
reduce # of branches and unnecessary operations.).
This resolves#321.
If the OS overcommits:
- Commit all mappings in pages_map() regardless of whether the caller
requested committed memory.
- Linux-specific: Specify MAP_NORESERVE to avoid
unfortunate interactions with heuristic overcommit mode during
fork(2).
This resolves#193.
Split arena_choose() into arena_[i]choose() and use arena_ichoose() for
arena lookup during internal allocation. This fixes huge_palloc() so
that it always succeeds during extent node allocation.
This regression was introduced by
66cd953514 (Do not allocate metadata via
non-auto arenas, nor tcaches.).
Change test-related mangling to simplify symbol filtering.
The following commands can be used to detect missing/obsolete symbol
mangling, with the caveat that the full set of symbols is based on the
union of symbols generated by all configurations, some of which are
platform-specific:
./autogen.sh --enable-debug --enable-prof --enable-lazy-lock
make all tests
nm -a lib/libjemalloc.a src/*.jet.o \
|grep " [TDBCR] " \
|awk '{print $3}' \
|sed -e 's/^\(je_\|jet_\(n_\)\?\)\([a-zA-Z0-9_]*\)/\3/g' \
|LC_COLLATE=C sort -u \
|grep -v \
-e '^\(malloc\|calloc\|posix_memalign\|aligned_alloc\|realloc\|free\)$' \
-e '^\(m\|r\|x\|s\|d\|sd\|n\)allocx$' \
-e '^mallctl\(\|nametomib\|bymib\)$' \
-e '^malloc_\(stats_print\|usable_size\|message\)$' \
-e '^\(memalign\|valloc\)$' \
-e '^__\(malloc\|memalign\|realloc\|free\)_hook$' \
-e '^pthread_create$' \
> /tmp/private_symbols.txt
During over-allocation in preparation for creating aligned mappings,
allocate one more page than necessary if PAGE is the actual page size,
so that trimming still succeeds even if the system returns a mapping
that has less than PAGE alignment. This allows compiling with e.g. 64
KiB "pages" on systems that actually use 4 KiB pages.
Note that for e.g. --with-lg-page=21, it is also necessary to increase
the chunk size (e.g. --with-malloc-conf=lg_chunk:22) so that there are
at least two "pages" per chunk. In practice this isn't a particularly
compelling configuration because so much (unusable) virtual memory is
dedicated to chunk headers.
Refactor ph to support configurable comparison functions. Use a cpp
macro code generation form equivalent to the rb macros so that pairing
heaps can be used for both run heaps and chunk heaps.
Remove per node parent pointers, and instead use leftmost siblings' prev
pointers to track parents.
Fix multi-pass sibling merging to iterate over intermediate results
using a FIFO, rather than a LIFO. Use this fixed sibling merging
implementation for both merge phases of the auxiliary twopass algorithm
(first merging the aux list, then replacing the root with its merged
children). This fixes both degenerate merge behavior and the potential
for deep recursion.
This regression was introduced by
6bafa6678f (Pairing heap).
This resolves#371.
Move chunk_dalloc_arena()'s implementation into chunk_dalloc_wrapper(),
so that if the dalloc hook fails, proper decommit/purge/retain cascading
occurs. This fixes three potential chunk leaks on OOM paths, one during
dss-based chunk allocation, one during chunk header commit (currently
relevant only on Windows), and one during rtree write (e.g. if rtree
node allocation fails).
Merge chunk_purge_arena() into chunk_purge_default() (refactor, no
change to functionality).
Variables s and slen are declared inside a switch statement, but outside
a case scope. clang reports these variable definitions as "unreachable",
though this is not really meaningful in this case. This is the only
-Wunreachable-code warning in jemalloc.
src/util.c:501:5 [-Wunreachable-code] code will never be executed
This resolves#364.
Use pairing heap instead of red black tree in arena runs_avail. The
extra links are unioned with the bitmap_t, so this change doesn't use
any extra memory.
Canaries show this change to be a 1% cpu win, and 2% latency win. In
particular, large free()s, and small bin frees are now O(1) (barring
coalescing).
I also tested changing bin->runs to be a pairing heap, but saw a much
smaller win, and it would mean increasing the size of arena_run_s by two
pointers, so I left that as an rb-tree for now.
Initial implementation of a twopass pairing heap with aux list.
Research papers linked in comments.
Where search/nsearch/last aren't needed, this gives much faster first(),
delete(), and insert(). Insert is O(1), and first/delete don't have to
walk the whole tree.
Also tested rb_old with parent pointers - it was better than the current
rb.h for memory loads, but still much worse than a pairing heap.
An array-based heap would be much faster if everything fits in memory,
but on a cold cache it has many more memory loads for most operations.
Add a cast to avoid comparing a ssize_t value to a uint64_t value that
is always larger than a 32-bit ssize_t. This silences an innocuous
compiler warning from e.g. gcc 4.2.1 about the comparison always having
the same result.
Prior to 767d85061a (Refactor arenas array
(fixes deadlock).), it was possible under some circumstances for
arena_get() to trigger recreation of the arenas cache during tsd
cleanup, and the arenas cache would then be leaked. In principle a
similar issue could still occur as a side effect of decay-based purging,
which calls arena_tdata_get(). Fix arenas_tdata_cleanup() by setting
tsd->arenas_tdata_bypass to true, so that arena_tdata_get() will
gracefully fail (an expected behavior) rather than recreating
tsd->arena_tdata.
Reported by Christopher Ferris <cferris@google.com>.
Add missing stats.arenas.<i>.{dss,lg_dirty_mult,decay_time}
initialization.
Fix stats.arenas.<i>.{pactive,pdirty} to read under the protection of
the arena mutex.
Fix stats.cactive accounting to always increase/decrease by multiples of
the chunk size, even for huge size classes that are not multiples of the
chunk size, e.g. {2.5, 3, 3.5, 5, 7} MiB with 2 MiB chunk size. This
regression was introduced by 155bfa7da1
(Normalize size classes.) and first released in 4.0.0.
This resolves#336.
This removes an implicit conversion from size_t to ssize_t. For cactive
decreases, the size_t value was intentionally underflowed to generate
"negative" values (actually positive values above the positive range of
ssize_t), and the conversion to ssize_t was undefined according to C
language semantics.
This regression was perpetuated by
1522937e9c (Fix the cactive statistic.)
and first release in 4.0.0, which in retrospect only fixed one of two
problems introduced by aa5113b1fd
(Refactor overly large/complex functions) and first released in 3.5.0.
For small bitmaps, a linear scan of the bitmap is slightly faster than
a tree search - bitmap_t is more compact, and there are fewer writes
since we don't have to propogate state transitions up the tree.
On x86_64 with the current settings, I'm seeing ~.5%-1% CPU improvement
in production canaries with this change.
The old tree code is left since 32bit sizes are much larger (and ffsl
smaller), and maybe the run sizes will change in the future.
This resolves#339.
Add HUGE_MAXCLASS overflow checks that are specific to heap profiling
code paths. This fixes test failures that were introduced by
0c516a00c4 (Make *allocx() size class
overflow behavior defined.).
Refactor the arenas array, which contains pointers to all extant arenas,
such that it starts out as a sparse array of maximum size, and use
double-checked atomics-based reads as the basis for fast and simple
arena_get(). Additionally, reduce arenas_lock's role such that it only
protects against arena initalization races. These changes remove the
possibility for arena lookups to trigger locking, which resolves at
least one known (fork-related) deadlock.
This resolves#315.
Fix arena_size arena_new() computation to incorporate
runs_avail_nclasses elements for runs_avail, rather than
(runs_avail_nclasses - 1) elements. Since offsetof(arena_t, runs_avail)
is used rather than sizeof(arena_t) for the first term of the
computation, all of the runs_avail elements must be added into the
second term.
This bug was introduced (by Jason Evans) while merging pull request #330
as 3417a304cc (Separate arena_avail
trees).
Merge of 3417a304cc looks like a small
bug: first_best_fit doesn't scan through all the classes, since ind is
offset from runs_avail_nclasses by run_avail_bias.
Attempt mmap-based in-place huge reallocation by plumbing new_addr into
chunk_alloc_mmap(). This can dramatically speed up incremental huge
reallocation.
This resolves#335.
Separate run trees by index, replacing the previous quantize logic.
Quantization by index is now performed only on insertion / removal from
the tree, and not on node comparison, saving some cpu. This also means
we don't have to dereference the miscelm* pointers, saving half of the
memory loads from miscelms/mapbits that have fallen out of cache. A
linear scan of the indicies appears to be fast enough.
The only cost of this is an extra tree array in each arena.
In practice this bug had limited impact (and then only by increasing
chunk fragmentation) because run_quantize_ceil() returned correct
results except for inputs that could only arise from aligned allocation
requests that required more than page alignment.
This bug existed in the original run quantization implementation, which
was introduced by 8a03cf039c (Implement
cache index randomization for large allocations.).
Use a single uint64_t in nstime_t to store nanoseconds rather than using
struct timespec. This reduces fragility around conversions between long
and uint64_t, especially missing casts that only cause problems on
32-bit platforms.
This is an alternative to the existing ratio-based unused dirty page
purging, and is intended to eventually become the sole purging
mechanism.
Add mallctls:
- opt.purge
- opt.decay_time
- arena.<i>.decay
- arena.<i>.decay_time
- arenas.decay_time
- stats.arenas.<i>.decay_time
This resolves#325.
When using LinuxThreads, malloc bootstrapping deadlocks, since
malloc_tsd_boot0() ends up calling pthread_setspecific(), which causes
recursive allocation. Fix it by moving the malloc_tsd_boot0() call to
malloc_init_hard_recursible().
The deadlock was introduced by 8bb3198f72
(Refactor/fix arenas manipulation.), when tsd_boot() was split and the
top half, tsd_boot0(), got an extra tsd_wrapper_set() call.
- Combine multiple runtime branches into a single malloc_slow check.
- Avoid calling arena_choose / size2index / index2size on fast path.
- A few micro optimizations.
Fix xallocx(..., MALLOCX_ZERO to zero the last full trailing page of
large allocations that have been randomly assigned an offset of 0 when
--enable-cache-oblivious configure option is enabled. This addresses a
special case missed in d260f442ce (Fix
xallocx(..., MALLOCX_ZERO) bugs.).
Zero all trailing bytes of large allocations when
--enable-cache-oblivious configure option is enabled. This regression
was introduced by 8a03cf039c (Implement
cache index randomization for large allocations.).
Zero trailing bytes of huge allocations when resizing from/to a size
class that is not a multiple of the chunk size.
Fix prof_tctx_dump_iter() to filter out nodes that were created after
heap profile dumping started. Prior to this fix, spurious entries with
arbitrary object/byte counts could appear in heap profiles, which
resulted in jeprof inaccuracies or failures.
Simplify imallocx_prof_sample() to always operate on usize rather than
sometimes using size. This avoids redundant usize computations and
more closely fits the style adopted by i[rx]allocx_prof_sample() to fix
sampling bugs.
Fix ixallocx_prof_sample() to never modify nor create sampled small
allocations. xallocx() is in general incapable of moving small
allocations, so this fix removes buggy code without loss of generality.
Add arena_prof_tctx_reset() and use it instead of arena_prof_tctx_set()
when resetting the tctx pointer during reallocation, which happens
whenever an originally sampled reallocated object is not sampled during
reallocation.
This regression was introduced by
594c759f37 (Optimize
arena_prof_tctx_set().)
Make one call to prof_active_get_unlocked() per allocation event, and
use the result throughout the relevant functions that handle an
allocation event. Also add a missing check in prof_realloc(). These
fixes protect allocation events against concurrent prof_active changes.
Fix heap profiling to distinguish among otherwise identical sample sites
with interposed resets (triggered via the "prof.reset" mallctl). This
bug could cause data structure corruption that would most likely result
in a segfault.
When junk filling is enabled, shrinking an allocation fills the bytes
that were previously allocated but now aren't. Purging the chunk before
doing that is just a waste of time.
This resolves#260.
Fix chunk purge hook calls for in-place huge shrinking reallocation to
specify the old chunk size rather than the new chunk size. This bug
caused no correctness issues for the default chunk purge function, but
was visible to custom functions set via the "arena.<i>.chunk_hooks"
mallctl.
This resolves#264.
Fix arenas_cache_cleanup() and arena_get_hard() to handle
allocation/deallocation within the application's thread-specific data
cleanup functions even after arenas_cache is torn down.
This is a more general fix that complements
45e9f66c28 (Fix arenas_cache_cleanup().).
Fix arenas_cache_cleanup() to handle allocation/deallocation within the
application's thread-specific data cleanup functions even after
arenas_cache is torn down.
Don't bitshift by negative amounts when encoding/decoding run sizes in
chunk header maps. This affected systems with page sizes greater than 8
KiB.
Reported by Ingvar Hagelund <ingvar@redpill-linpro.com>.
Only set the unzeroed flag when initializing the entire mapbits entry,
rather than mutating just the unzeroed bit. This simplifies the
possible mapbits state transitions.
Cascade from decommit to purge when purging unused dirty pages, so that
it is possible to decommit cleaned memory rather than just purging. For
non-Windows debug builds, decommit runs rather than purging them, since
this causes access of deallocated runs to segfault.
This resolves#251.
Fix arena_ralloc_large_grow() to properly account for large_pad, so that
in-place large reallocation succeeds when possible, rather than always
failing. This regression was introduced by
8a03cf039c (Implement cache index
randomization for large allocations.)
- Decorate public function with __declspec(allocator) and __declspec(restrict), just like MSVC 1900
- Support JEMALLOC_HAS_RESTRICT by defining the restrict keyword
- Move __declspec(nothrow) between 'void' and '*' so it compiles once more
Add the "arena.<i>.chunk_hooks" mallctl, which replaces and expands on
the "arena.<i>.chunk.{alloc,dalloc,purge}" mallctls. The chunk hooks
allow control over chunk allocation/deallocation, decommit/commit,
purging, and splitting/merging, such that the application can rely on
jemalloc's internal chunk caching and retaining functionality, yet
implement a variety of chunk management mechanisms and policies.
Merge the chunks_[sz]ad_{mmap,dss} red-black trees into
chunks_[sz]ad_retained. This slightly reduces how hard jemalloc tries
to honor the dss precedence setting; prior to this change the precedence
setting was also consulted when recycling chunks.
Fix chunk purging. Don't purge chunks in arena_purge_stashed(); instead
deallocate them in arena_unstash_purged(), so that the dirty memory
linkage remains valid until after the last time it is used.
This resolves#176 and #201.
Fix huge_ralloc_no_move() to succeed if an allocation request results in
the same usable size as the existing allocation, even if the request
size is smaller than the usable size. This bug did not cause
correctness issues, but it could cause unnecessary moves during
reallocation.
huge_ralloc() passes a size that may not be precisely a size class, so
make huge_palloc() handle the more general case of a size input rather
than usize.
This regression appears to have been introduced by the addition of
in-place huge reallocation; as such it was never incorporated into a
release.
Create and use FMT* macros that are equivalent to the PRI* macros that
inttypes.h defines. This allows uniform use of the Unix-specific format
specifiers, e.g. "%zu", as well as avoiding Windows-specific definitions
of e.g. PRIu64.
Add ffs()/ffsl() support for compiling with gcc.
Extract compatibility definitions of ENOENT, EINVAL, EAGAIN, EPERM,
ENOMEM, and ENORANGE into include/msvc_compat/windows_extra.h and
use the file for tests as well as for core jemalloc code.
Replace JEMALLOC_ATTR(format(printf, ...). with
JEMALLOC_FORMAT_PRINTF(), so that configuration feature tests can
omit the attribute if it would cause extraneous compilation warnings.
As per gcc documentation:
The alloc_size attribute is used to tell the compiler that the function
return value points to memory (...)
This resolves#245.
This effectively reverts 97c04a9383 (Use
first-fit rather than first-best-fit run/chunk allocation.). In some
pathological cases, first-fit search dominates allocation time, and it
also tends not to converge as readily on a steady state of memory
layout, since precise allocation order has a bigger effect than for
first-best-fit.
Add various function attributes to the exported functions to give the
compiler more information to work with during optimization, and also
specify throw() when compiling with C++ on Linux, in order to adequately
match what __THROW does in glibc.
This resolves#237.
Conditionally define ENOENT, EINVAL, etc. (was unconditional).
Add/use PRIzu, PRIzd, and PRIzx for use in malloc_printf() calls. gcc issued
(harmless) warnings since e.g. "%zu" should be "%Iu" on Windows, and the
alternative to this workaround would have been to disable the function
attributes which cause gcc to look for type mismatches in formatted printing
function calls.
- Set opt_lg_chunk based on run-time OS setting
- Verify LG_PAGE is compatible with run-time OS setting
- When targeting Windows Vista or newer, use SRWLOCK instead of CRITICAL_SECTION
- When targeting Windows Vista or newer, statically initialize init_lock
Fix size class overflow handling for malloc(), posix_memalign(),
memalign(), calloc(), and realloc() when profiling is enabled.
Remove an assertion that erroneously caused arena_sdalloc() to fail when
profiling was enabled.
This resolves#232.
This avoids the potential surprise of deallocating an object with one
tcache specified, and having the object cached in a different tcache
once it drains from the quarantine.
Now that small allocation runs have fewer regions due to run metadata
residing in chunk headers, an explicit minimum tcache count is needed to
make sure that tcache adequately amortizes synchronization overhead.
Extract szad size quantization into {extent,run}_quantize(), and .
quantize szad run sizes to the union of valid small region run sizes and
large run sizes.
Refactor iteration in arena_run_first_fit() to use
run_quantize{,_first,_next(), and add support for padded large runs.
For large allocations that have no specified alignment constraints,
compute a pseudo-random offset from the beginning of the first backing
page that is a multiple of the cache line size. Under typical
configurations with 4-KiB pages and 64-byte cache lines this results in
a uniform distribution among 64 page boundary offsets.
Add the --disable-cache-oblivious option, primarily intended for
performance testing.
This resolves#13.
This rename avoids installation collisions with the upstream gperftools.
Additionally, jemalloc's per thread heap profile functionality
introduced an incompatible file format, so it's now worthwhile to
clearly distinguish jemalloc's version of this script from the upstream
version.
This resolves#229.
Fix the shrinking case of huge_ralloc_no_move_similar() to purge the
correct number of pages, at the correct offset. This regression was
introduced by 8d6a3e8321 (Implement
dynamic per arena control over dirty page purging.).
Fix huge_ralloc_no_move_shrink() to purge the correct number of pages.
This bug was introduced by 9673983443
(Purge/zero sub-chunk huge allocations as necessary.).
Fix arena_get() calls that specify refresh_if_missing=false. In
ctl_refresh() and ctl.c's arena_purge(), these calls attempted to only
refresh once, but did so in an unreliable way.
arena_i_lg_dirty_mult_ctl() was simply wrong to pass
refresh_if_missing=false.
However, unlike before it was removed do not force --enable-ivsalloc
when Darwin zone allocator integration is enabled, since the zone
allocator code uses ivsalloc() regardless of whether
malloc_usable_size() and sallocx() do.
This resolves#211.
Add mallctls:
- arenas.lg_dirty_mult is initialized via opt.lg_dirty_mult, and can be
modified to change the initial lg_dirty_mult setting for newly created
arenas.
- arena.<i>.lg_dirty_mult controls an individual arena's dirty page
purging threshold, and synchronously triggers any purging that may be
necessary to maintain the constraint.
- arena.<i>.chunk.purge allows the per arena dirty page purging function
to be replaced.
This resolves#93.
Remove the prof_tctx_state_destroying transitory state and instead add
the tctx_uid field, so that the tuple <thr_uid, tctx_uid> uniquely
identifies a tctx. This assures that tctx's are well ordered even when
more than two with the same thr_uid coexist. A previous attempted fix
based on prof_tctx_state_destroying was only sufficient for protecting
against two coexisting tctx's, but it also introduced a new dumping
race.
These regressions were introduced by
602c8e0971 (Implement per thread heap
profiling.) and 764b00023f (Fix a heap
profiling regression.).
Add the prof_tctx_state_destroying transitionary state to fix a race
between a thread destroying a tctx and another thread creating a new
equivalent tctx.
This regression was introduced by
602c8e0971 (Implement per thread heap
profiling.).
a14bce85 made buferror not take an error code, and make the Windows
code path for buferror use GetLastError, while the alternative code
paths used errno. Then 2a83ed02 made buferror take an error code
again, and while it changed the non-Windows code paths to use that
error code, the Windows code path was not changed accordingly.
Fix prof_tctx_comp() to incorporate tctx state into the comparison.
During a dump it is possible for both a purgatory tctx and an otherwise
equivalent nominal tctx to reside in the tree at the same time.
This regression was introduced by
602c8e0971 (Implement per thread heap
profiling.).
This tends to more effectively pack active memory toward low addresses.
However, additional tree searches are required in many cases, so whether
this change stands the test of time will depend on real-world
benchmarks.
Treat sizes that round down to the same size class as size-equivalent
in trees that are used to search for first best fit, so that there are
only as many "firsts" as there are size classes. This comes closer to
the ideal of first fit.
Rename "dirty chunks" to "cached chunks", in order to avoid overloading
the term "dirty".
Fix the regression caused by 339c2b23b2
(Fix chunk_unmap() to propagate dirty state.), and actually address what
that change attempted, which is to only purge chunks once, and propagate
whether zeroed pages resulted into chunk_record().
Fix chunk_unmap() to propagate whether a chunk is dirty, and modify
dirty chunk purging to record this information so it can be passed to
chunk_unmap(). Since the broken version of chunk_unmap() claimed that
all chunks were clean, this resulted in potential memory corruption for
purging implementations that do not zero (e.g. MADV_FREE).
This regression was introduced by
ee41ad409a (Integrate whole chunks into
unused dirty page purging machinery.).
Extend per arena unused dirty page purging to manage unused dirty chunks
in aaddtion to unused dirty runs. Rather than immediately unmapping
deallocated chunks (or purging them in the --disable-munmap case), store
them in a separate set of trees, chunks_[sz]ad_dirty. Preferrentially
allocate dirty chunks. When excessive unused dirty pages accumulate,
purge runs and chunks in ingegrated LRU order (and unmap chunks in the
--enable-munmap case).
Refactor extent_node_t to provide accessor functions.
8ddc93293c switched this to over using the
address tree in order to avoid false negatives, so it now needs to check
that the size of the free extent is large enough to satisfy the request.
Migrate all centralized data structures related to huge allocations and
recyclable chunks into arena_t, so that each arena can manage huge
allocations and recyclable virtual memory completely independently of
other arenas.
Add chunk node caching to arenas, in order to avoid contention on the
base allocator.
Use chunks_rtree to look up huge allocations rather than a red-black
tree. Maintain a per arena unsorted list of huge allocations (which
will be needed to enumerate huge allocations during arena reset).
Remove the --enable-ivsalloc option, make ivsalloc() always available,
and use it for size queries if --enable-debug is enabled. The only
practical implications to this removal are that 1) ivsalloc() is now
always available during live debugging (and the underlying radix tree is
available during core-based debugging), and 2) size query validation can
no longer be enabled independent of --enable-debug.
Remove the stats.chunks.{current,total,high} mallctls, and replace their
underlying statistics with simpler atomically updated counters used
exclusively for gdump triggering. These statistics are no longer very
useful because each arena manages chunks independently, and per arena
statistics provide similar information.
Simplify chunk synchronization code, now that base chunk allocation
cannot cause recursive lock acquisition.
Add the MALLOCX_TCACHE() and MALLOCX_TCACHE_NONE macros, which can be
used in conjunction with the *allocx() API.
Add the tcache.create, tcache.flush, and tcache.destroy mallctls.
This resolves#145.
Recent huge allocation refactoring associates huge allocations with
arenas, but it remains necessary to quickly look up huge allocation
metadata during reallocation/deallocation. A global radix tree remains
a good solution to this problem, but locking would have become the
primary bottleneck after (upcoming) migration of chunk management from
global to per arena data structures.
This lock-free implementation uses double-checked reads to traverse the
tree, so that in the steady state, each read or write requires only a
single atomic operation.
This implementation also assures that no more than two tree levels
actually exist, through a combination of careful virtual memory
allocation which makes large sparse nodes cheap, and skipping the root
node on x64 (possible because the top 16 bits are all 0 in practice).
Refactor base_alloc() to guarantee that allocations are carved from
demand-zeroed virtual memory. This supports sparse data structures such
as multi-page radix tree nodes.
Enhance base_alloc() to keep track of fragments which were too small to
support previous allocation requests, and try to consume them during
subsequent requests. This becomes important when request sizes commonly
approach or exceed the chunk size (as could radix tree node
allocations).
Fix chunk_recycle()'s new_addr functionality to search by address rather
than just size if new_addr is specified. The functionality added by
a95018ee81 (Attempt to expand huge
allocations in-place.) only worked if the two search orders happened to
return the same results (e.g. in simple test cases).
The documentation for opt.lg_dirty_mult says:
Per-arena minimum ratio (log base 2) of active to dirty
pages. Some dirty unused pages may be allowed to accumulate,
within the limit set by the ratio (or one chunk worth of dirty
pages, whichever is greater) (...)
The restriction in parentheses currently doesn't happen. This makes
jemalloc aggressively madvise(), which in turns increases the amount
of page faults significantly.
For instance, this resulted in several(!) hundred(!) milliseconds
startup regression on Firefox for Android.
This may require further tweaking, but starting with actually doing
what the documentation says is a good start.
This feature makes it possible to toggle the gdump feature on/off during
program execution, whereas the the opt.prof_dump mallctl value can only
be set during program startup.
This resolves#72.
Avoid calling chunk_recycle() for mmap()ed chunks if config_munmap is
disabled, in which case there are never any recyclable chunks.
This resolves#164.
There are three categories of metadata:
- Base allocations are used for bootstrap-sensitive internal allocator
data structures.
- Arena chunk headers comprise pages which track the states of the
non-metadata pages.
- Internal allocations differ from application-originated allocations
in that they are for internal use, and that they are omitted from heap
profiles.
The metadata statistics comprise the metadata categories as follows:
- stats.metadata: All metadata -- base + arena chunk headers + internal
allocations.
- stats.arenas.<i>.metadata.mapped: Arena chunk headers.
- stats.arenas.<i>.metadata.allocated: Internal allocations. This is
reported separately from the other metadata statistics because it
overlaps with the allocated and active statistics, whereas the other
metadata statistics do not.
Base allocations are not reported separately, though their magnitude can
be computed by subtracting the arena-specific metadata.
This resolves#163.
Refactor bootstrapping to delay tsd initialization, primarily to support
integration with FreeBSD's libc.
Refactor a0*() for internal-only use, and add the
bootstrap_{malloc,calloc,free}() API for use by FreeBSD's libc. This
separation limits use of the a0*() functions to metadata allocation,
which doesn't require malloc/calloc/free API compatibility.
This resolves#170.
In addition to true/false, opt.junk can now be either "alloc" or "free",
giving applications the possibility of junking memory only on allocation
or deallocation.
This resolves#172.
Fix OOM cleanup in huge_palloc() to call idalloct() rather than
base_node_dalloc(). This bug is a result of incomplete refactoring, and
has no impact other than leaking memory during OOM.
This provides in-place expansion of huge allocations when the end of the
allocation is at the end of the sbrk heap. There's already the ability
to extend in-place via recycled chunks but this handles the initial
growth of the heap via repeated vector / string reallocations.
A possible future extension could allow realloc to go from the following:
| huge allocation | recycled chunks |
^ dss_end
To a larger allocation built from recycled *and* new chunks:
| huge allocation |
^ dss_end
Doing that would involve teaching the chunk recycling code to request
new chunks to satisfy the request. The chunk_dss code wouldn't require
any further changes.
#include <stdlib.h>
int main(void) {
size_t chunk = 4 * 1024 * 1024;
void *ptr = NULL;
for (size_t size = chunk; size < chunk * 128; size *= 2) {
ptr = realloc(ptr, size);
if (!ptr) return 1;
}
}
dss:secondary: 0.083s
dss:primary: 0.083s
After:
dss:secondary: 0.083s
dss:primary: 0.003s
The dss heap grows in the upwards direction, so the oldest chunks are at
the low addresses and they are used first. Linux prefers to grow the
mmap heap downwards, so the trick will not work in the *current* mmap
chunk allocator as a huge allocation will only be at the top of the heap
in a contrived case.
Fix quarantine to actually update tsd when expanding, and to avoid
double initialization (leaking the first quarantine) due to recursive
initialization.
This resolves#161.
* use sized deallocation in iralloct_realign
* iralloc and ixalloc always need the old size, so pass it in from the
caller where it's often already calculated
The size of the source allocation is known at this point, so reading the
chunk header can be avoided for the small size class fast path. This is
not very useful right now, but it provides a significant performance
boost with an alternate ralloc entry point taking the old size.
Purge trailing pages during shrinking huge reallocation when resulting
size is not a multiple of the chunk size. Similarly, zero pages if
necessary during growing huge reallocation when the resulting size is
not a multiple of the chunk size.
Add the 'util' column, which reports the proportion of available regions
that are currently in use for each small size class. Small run
utilization is the complement of external fragmentation. For example,
utilization of 0.75 indicates that 25% of small run memory is consumed
by external fragmentation, in other (more obtuse) words, 33% external
fragmentation overhead.
This resolves#27.
Add per size class huge allocation statistics, and normalize various
stats:
- Change the arenas.nlruns type from size_t to unsigned.
- Add the arenas.nhchunks and arenas.hchunks.<i>.size mallctl's.
- Replace the stats.arenas.<i>.bins.<j>.allocated mallctl with
stats.arenas.<i>.bins.<j>.curregs .
- Add the stats.arenas.<i>.hchunks.<j>.nmalloc,
stats.arenas.<i>.hchunks.<j>.ndalloc,
stats.arenas.<i>.hchunks.<j>.nrequests, and
stats.arenas.<i>.hchunks.<j>.curhchunks mallctl's.
Fix a prof_tctx_t/prof_tdata_t cleanup race by storing a copy of thr_uid
in prof_tctx_t, so that the associated tdata need not be present during
tctx teardown.
Remove code in arena_dalloc_bin_run() that preserved the "clean" state
of trailing clean pages by splitting them into a separate run during
deallocation. This was a useful mechanism for reducing dirty page
churn when bin runs comprised many pages, but bin runs are now quite
small.
Remove the nextind field from arena_run_t now that it is no longer
needed, and change arena_run_t's bin field (arena_bin_t *) to binind
(index_t). These two changes remove 8 bytes of chunk header overhead
per page, which saves 1/512 of all arena chunk memory.
Add:
--with-lg-page
--with-lg-page-sizes
--with-lg-size-class-group
--with-lg-quantum
Get rid of STATIC_PAGE_SHIFT, in favor of directly setting LG_PAGE.
Fix various edge conditions exposed by the configure options.
atexit(3) can deadlock internally during its own initialization if
jemalloc calls atexit() during jemalloc initialization. Mitigate the
impact by restructuring prof initialization to avoid calling atexit()
unless the registered function will actually dump a final heap profile.
Additionally, disable prof_final by default so that this land mine is
opt-in rather than opt-out.
This resolves#144.
This avoids grabbing the base mutex, as a step towards fine-grained
locking for huge allocations. The thread cache also provides a tiny
(~3%) improvement for serial huge allocations.
Abstract arenas access to use arena_get() (or a0get() where appropriate)
rather than directly reading e.g. arenas[ind]. Prior to the addition of
the arenas.extend mallctl, the worst possible outcome of directly
accessing arenas was a stale read, but arenas.extend may allocate and
assign a new array to arenas.
Add a tsd-based arenas_cache, which amortizes arenas reads. This
introduces some subtle bootstrapping issues, with tsd_boot() now being
split into tsd_boot[01]() to support tsd wrapper allocation
bootstrapping, as well as an arenas_cache_bypass tsd variable which
dynamically terminates allocation of arenas_cache itself.
Promote a0malloc(), a0calloc(), and a0free() to be generally useful for
internal allocation, and use them in several places (more may be
appropriate).
Abstract arena->nthreads management and fix a missing decrement during
thread destruction (recent tsd refactoring left arenas_cleanup()
unused).
Change arena_choose() to propagate OOM, and handle OOM in all callers.
This is important for providing consistent allocation behavior when the
MALLOCX_ARENA() flag is being used. Prior to this fix, it was possible
for an OOM to result in allocation silently allocating from a different
arena than the one specified.
Normalize size classes to use the same number of size classes per size
doubling (currently hard coded to 4), across the intire range of size
classes. Small size classes already used this spacing, but in order to
support this change, additional small size classes now fill [4 KiB .. 16
KiB). Large size classes range from [16 KiB .. 4 MiB). Huge size
classes now support non-multiples of the chunk size in order to fill (4
MiB .. 16 MiB).
This adds support for expanding huge allocations in-place by requesting
memory at a specific address from the chunk allocator.
It's currently only implemented for the chunk recycling path, although
in theory it could also be done by optimistically allocating new chunks.
On Linux, it could attempt an in-place mremap. However, that won't work
in practice since the heap is grown downwards and memory is not unmapped
(in a normal build, at least).
Repeated vector reallocation micro-benchmark:
#include <string.h>
#include <stdlib.h>
int main(void) {
for (size_t i = 0; i < 100; i++) {
void *ptr = NULL;
size_t old_size = 0;
for (size_t size = 4; size < (1 << 30); size *= 2) {
ptr = realloc(ptr, size);
if (!ptr) return 1;
memset(ptr + old_size, 0xff, size - old_size);
old_size = size;
}
free(ptr);
}
}
The glibc allocator fails to do any in-place reallocations on this
benchmark once it passes the M_MMAP_THRESHOLD (default 128k) but it
elides the cost of copies via mremap, which is currently not something
that jemalloc can use.
With this improvement, jemalloc still fails to do any in-place huge
reallocations for the first outer loop, but then succeeds 100% of the
time for the remaining 99 iterations. The time spent doing allocations
and copies drops down to under 5%, with nearly all of it spent doing
purging + faulting (when huge pages are disabled) and the array memset.
An improved mremap API (MREMAP_RETAIN - #138) would be far more general
but this is a portable optimization and would still be useful on Linux
for xallocx.
Numbers with transparent huge pages enabled:
glibc (copies elided via MREMAP_MAYMOVE): 8.471s
jemalloc: 17.816s
jemalloc + no-op madvise: 13.236s
jemalloc + this commit: 6.787s
jemalloc + this commit + no-op madvise: 6.144s
Numbers with transparent huge pages disabled:
glibc (copies elided via MREMAP_MAYMOVE): 15.403s
jemalloc: 39.456s
jemalloc + no-op madvise: 12.768s
jemalloc + this commit: 15.534s
jemalloc + this commit + no-op madvise: 6.354s
Closes#137
Fix an OOM-related regression in arena_tcache_fill_small() that caused
cache corruption that would almost certainly expose the application to
undefined behavior, usually in the form of an allocation request
returning an already-allocated region, or somewhat less likely, a freed
region that had already been returned to the arena, thus making it
available to the arena for any purpose.
This regression was introduced by
9c43c13a35 (Reverse tcache fill order.),
and was present in all releases from 2.2.0 through 3.6.0.
This resolves#98.
Fix prof regressions related to tdata (main per thread profiling data
structure) destruction:
- Deadlock. The fix for this was intended to be part of
20c31deaae (Test prof.reset mallctl and
fix numerous discovered bugs.) but the fix was left incomplete.
- Destruction race. Detaching tdata just prior to destruction without
holding the tdatas lock made it possible for another thread to destroy
the tdata out from under the thread that was on its way to doing so.
Fix tsd cleanup regressions that were introduced in
5460aa6f66 (Convert all tsd variables to
reside in a single tsd structure.). These regressions were twofold:
1) tsd_tryget() should never (and need never) return NULL. Rename it to
tsd_fetch() and simplify all callers.
2) tsd_*_set() must only be called when tsd is in the nominal state,
because cleanup happens during the nominal-->purgatory transition,
and re-initialization must not happen while in the purgatory state.
Add tsd_nominal() and use it as needed. Note that tsd_*{p,}_get()
can still be used as long as no re-initialization that would require
cleanup occurs. This means that e.g. the thread_allocated counter
can be updated unconditionally.
Implement/test/fix the opt.prof_thread_active_init,
prof.thread_active_init, and thread.prof.active mallctl's.
Test/fix the thread.prof.name mallctl.
Refactor opt_prof_active to be read-only and move mutable state into the
prof_active variable. Stop leaning on ctl-related locking for
protection.