Add spin_t and spin_{init,adaptive}(), which provide a simple
abstraction for adaptive spinning.
Adaptively spin during busy waits in bootstrapping and rtree node
initialization.
Use pszind_t size classes rather than szind_t size classes, and always
reserve space for NPSIZES elements. This removes unused heaps that are
not multiples of the page size, and adds (currently) unused heaps for
all huge size classes, with the immediate benefit that the size of
arena_t allocations is constant (no longer dependent on chunk size).
These compute size classes and indices similarly to size2index(),
index2size() and s2u(), respectively, but using the subset of size
classes that are multiples of the page size. Note that pszind_t and
szind_t are not interchangeable.
GCC 4.9.3 cross-compiled for sparc64 defines __sparc_v9__, not
__sparc64__ nor __sparcv9. This prevents LG_QUANTUM from being defined
properly. Adding this new value to the check solves the issue.
Some bug (either in the red-black tree code, or in the pgi compiler) seems to
cause red-black trees to become unbalanced. This issue seems to go away if we
don't use compact red-black trees. Since red-black trees don't seem to be used
much anymore, I opted for what seems to be an easy fix here instead of digging
in and trying to find the root cause of the bug.
Some context in case it's helpful:
I experienced a ton of segfaults while using pgi as Chapel's target compiler
with jemalloc 4.0.4. The little bit of debugging I did pointed me somewhere
deep in red-black tree manipulation, but I didn't get a chance to investigate
further. It looks like 4.2.0 replaced most uses of red-black trees with
pairing-heaps, which seems to avoid whatever bug I was hitting.
However, `make check_unit` was still failing on the rb test, so I figured the
core issue was just being masked. Here's the `make check_unit` failure:
```sh
=== test/unit/rb ===
test_rb_empty: pass
tree_recurse:test/unit/rb.c:90: Failed assertion: (((_Bool) (((uintptr_t) (left_node)->link.rbn_right_red) & ((size_t)1)))) == (false) --> true != false: Node should be black
test_rb_random:test/unit/rb.c:274: Failed assertion: (imbalances) == (0) --> 1 != 0: Tree is unbalanced
tree_recurse:test/unit/rb.c:90: Failed assertion: (((_Bool) (((uintptr_t) (left_node)->link.rbn_right_red) & ((size_t)1)))) == (false) --> true != false: Node should be black
test_rb_random:test/unit/rb.c:274: Failed assertion: (imbalances) == (0) --> 1 != 0: Tree is unbalanced
node_remove:test/unit/rb.c:190: Failed assertion: (imbalances) == (0) --> 2 != 0: Tree is unbalanced
<jemalloc>: test/unit/rb.c:43: Failed assertion: "pathp[-1].cmp < 0"
test/test.sh: line 22: 12926 Aborted
Test harness error
```
While starting to debug I saw the RB_COMPACT option and decided to check if
turning that off resolved the bug. It seems to have fixed it (`make check_unit`
passes and the segfaults under Chapel are gone) so it seems like on okay
work-around. I'd imagine this has performance implications for red-black trees
under pgi, but if they're not going to be used much anymore it's probably not a
big deal.
Revert 245ae6036c (Support --with-lg-page
values larger than actual page size.), because it could cause VM map
fragmentation if the kernel grows mmap()ed memory downward.
This resolves#391.
Short-circuit commonly called witness functions so that they only
execute in debug builds, and remove equivalent guards from mutex
functions. This avoids pointless code execution in
witness_assert_lockless(), which is typically called twice per
allocation/deallocation function invocation.
Inline commonly called witness functions so that optimized builds can
completely remove calls as dead code.
b2c0d6322d (Add witness, a simple online
locking validator.) caused a broad propagation of tsd throughout the
internal API, but tsd_fetch() was designed to fail prior to tsd
bootstrapping. Fix this by splitting tsd_t into non-nullable tsd_t and
nullable tsdn_t, and modifying all internal APIs that do not critically
rely on tsd to take nullable pointers. Furthermore, add the
tsd_booted_get() function so that tsdn_fetch() can probe whether tsd
bootstrapping is complete and return NULL if not. All dangerous
conversions of nullable pointers are tsdn_tsd() calls that assert-fail
on invalid conversion.
This is a broader application of optimizations to malloc() and free() in
f4a0f32d34 (Fast-path improvement:
reduce # of branches and unnecessary operations.).
This resolves#321.
Split arena_choose() into arena_[i]choose() and use arena_ichoose() for
arena lookup during internal allocation. This fixes huge_palloc() so
that it always succeeds during extent node allocation.
This regression was introduced by
66cd953514 (Do not allocate metadata via
non-auto arenas, nor tcaches.).
During over-allocation in preparation for creating aligned mappings,
allocate one more page than necessary if PAGE is the actual page size,
so that trimming still succeeds even if the system returns a mapping
that has less than PAGE alignment. This allows compiling with e.g. 64
KiB "pages" on systems that actually use 4 KiB pages.
Note that for e.g. --with-lg-page=21, it is also necessary to increase
the chunk size (e.g. --with-malloc-conf=lg_chunk:22) so that there are
at least two "pages" per chunk. In practice this isn't a particularly
compelling configuration because so much (unusable) virtual memory is
dedicated to chunk headers.
Refactor ph to support configurable comparison functions. Use a cpp
macro code generation form equivalent to the rb macros so that pairing
heaps can be used for both run heaps and chunk heaps.
Remove per node parent pointers, and instead use leftmost siblings' prev
pointers to track parents.
Fix multi-pass sibling merging to iterate over intermediate results
using a FIFO, rather than a LIFO. Use this fixed sibling merging
implementation for both merge phases of the auxiliary twopass algorithm
(first merging the aux list, then replacing the root with its merged
children). This fixes both degenerate merge behavior and the potential
for deep recursion.
This regression was introduced by
6bafa6678f (Pairing heap).
This resolves#371.
The arenas_extend() function was renamed to arenas_init() in commit
8bb3198f72, but its function declaration
was not removed from jemalloc_internal.h.in.
Initial implementation of a twopass pairing heap with aux list.
Research papers linked in comments.
Where search/nsearch/last aren't needed, this gives much faster first(),
delete(), and insert(). Insert is O(1), and first/delete don't have to
walk the whole tree.
Also tested rb_old with parent pointers - it was better than the current
rb.h for memory loads, but still much worse than a pairing heap.
An array-based heap would be much faster if everything fits in memory,
but on a cold cache it has many more memory loads for most operations.
Refactor the arenas array, which contains pointers to all extant arenas,
such that it starts out as a sparse array of maximum size, and use
double-checked atomics-based reads as the basis for fast and simple
arena_get(). Additionally, reduce arenas_lock's role such that it only
protects against arena initalization races. These changes remove the
possibility for arena lookups to trigger locking, which resolves at
least one known (fork-related) deadlock.
This resolves#315.
Use a single uint64_t in nstime_t to store nanoseconds rather than using
struct timespec. This reduces fragility around conversions between long
and uint64_t, especially missing casts that only cause problems on
32-bit platforms.
This is an alternative to the existing ratio-based unused dirty page
purging, and is intended to eventually become the sole purging
mechanism.
Add mallctls:
- opt.purge
- opt.decay_time
- arena.<i>.decay
- arena.<i>.decay_time
- arenas.decay_time
- stats.arenas.<i>.decay_time
This resolves#325.
- Combine multiple runtime branches into a single malloc_slow check.
- Avoid calling arena_choose / size2index / index2size on fast path.
- A few micro optimizations.
Fix xallocx(..., MALLOCX_ZERO to zero the last full trailing page of
large allocations that have been randomly assigned an offset of 0 when
--enable-cache-oblivious configure option is enabled. This addresses a
special case missed in d260f442ce (Fix
xallocx(..., MALLOCX_ZERO) bugs.).
Don't assume Bourne shell is in /bin/sh when running size_classes.sh .
Consider __sparcv9 a synonym for __sparc64__ when defining LG_QUANTUM.
This resolves#275.
Add the "arena.<i>.chunk_hooks" mallctl, which replaces and expands on
the "arena.<i>.chunk.{alloc,dalloc,purge}" mallctls. The chunk hooks
allow control over chunk allocation/deallocation, decommit/commit,
purging, and splitting/merging, such that the application can rely on
jemalloc's internal chunk caching and retaining functionality, yet
implement a variety of chunk management mechanisms and policies.
Merge the chunks_[sz]ad_{mmap,dss} red-black trees into
chunks_[sz]ad_retained. This slightly reduces how hard jemalloc tries
to honor the dss precedence setting; prior to this change the precedence
setting was also consulted when recycling chunks.
Fix chunk purging. Don't purge chunks in arena_purge_stashed(); instead
deallocate them in arena_unstash_purged(), so that the dirty memory
linkage remains valid until after the last time it is used.
This resolves#176 and #201.
Take large_pad into account when determining whether an aligned
allocation can be satisfied by a large size class.
This regression was introduced by
8a03cf039c (Implement cache index
randomization for large allocations.).
Fix size class overflow handling for malloc(), posix_memalign(),
memalign(), calloc(), and realloc() when profiling is enabled.
Remove an assertion that erroneously caused arena_sdalloc() to fail when
profiling was enabled.
This resolves#232.
Extract szad size quantization into {extent,run}_quantize(), and .
quantize szad run sizes to the union of valid small region run sizes and
large run sizes.
Refactor iteration in arena_run_first_fit() to use
run_quantize{,_first,_next(), and add support for padded large runs.
For large allocations that have no specified alignment constraints,
compute a pseudo-random offset from the beginning of the first backing
page that is a multiple of the cache line size. Under typical
configurations with 4-KiB pages and 64-byte cache lines this results in
a uniform distribution among 64 page boundary offsets.
Add the --disable-cache-oblivious option, primarily intended for
performance testing.
This resolves#13.
However, unlike before it was removed do not force --enable-ivsalloc
when Darwin zone allocator integration is enabled, since the zone
allocator code uses ivsalloc() regardless of whether
malloc_usable_size() and sallocx() do.
This resolves#211.
Extend per arena unused dirty page purging to manage unused dirty chunks
in aaddtion to unused dirty runs. Rather than immediately unmapping
deallocated chunks (or purging them in the --disable-munmap case), store
them in a separate set of trees, chunks_[sz]ad_dirty. Preferrentially
allocate dirty chunks. When excessive unused dirty pages accumulate,
purge runs and chunks in ingegrated LRU order (and unmap chunks in the
--enable-munmap case).
Refactor extent_node_t to provide accessor functions.
Migrate all centralized data structures related to huge allocations and
recyclable chunks into arena_t, so that each arena can manage huge
allocations and recyclable virtual memory completely independently of
other arenas.
Add chunk node caching to arenas, in order to avoid contention on the
base allocator.
Use chunks_rtree to look up huge allocations rather than a red-black
tree. Maintain a per arena unsorted list of huge allocations (which
will be needed to enumerate huge allocations during arena reset).
Remove the --enable-ivsalloc option, make ivsalloc() always available,
and use it for size queries if --enable-debug is enabled. The only
practical implications to this removal are that 1) ivsalloc() is now
always available during live debugging (and the underlying radix tree is
available during core-based debugging), and 2) size query validation can
no longer be enabled independent of --enable-debug.
Remove the stats.chunks.{current,total,high} mallctls, and replace their
underlying statistics with simpler atomically updated counters used
exclusively for gdump triggering. These statistics are no longer very
useful because each arena manages chunks independently, and per arena
statistics provide similar information.
Simplify chunk synchronization code, now that base chunk allocation
cannot cause recursive lock acquisition.
Add the MALLOCX_TCACHE() and MALLOCX_TCACHE_NONE macros, which can be
used in conjunction with the *allocx() API.
Add the tcache.create, tcache.flush, and tcache.destroy mallctls.
This resolves#145.
Fix arena_get() to refresh the cache as needed in the (!init_if_missing
&& refresh_if_missing) case.
This flaw was introduced by the initial arena_get() implementation,
which was part of 8bb3198f72 (Refactor/fix
arenas manipulation.).