Develop new data structure and code logic for holding profiling
related information stored in the extent that may be needed after the
extent is released, which in particular is the case for the
reallocation code path (e.g. in `rallocx()` and `xallocx()`). The
data structure is a generalization of `prof_tctx_t`: we previously
only copy out the `prof_tctx` before the extent is released, but we
may be in need of additional fields. Currently the only additional
field is the allocation time field, but there may be more fields in
the future.
The restructuring also resolved a bug: `prof_realloc()` mistakenly
passed the new `ptr` to `prof_free_sampled_object()`, but passing in
the `old_ptr` would crash because it's already been released. Now
the essential profiling information is collectively copied out early
and safely passed to `prof_free_sampled_object()` after the extent is
released.
This is part of a broader change to make header files better represent the
dependencies between one another (see
https://github.com/jemalloc/jemalloc/issues/533). It breaks up component headers
into smaller parts that can be made to have a simpler dependency graph.
For the autogenerated headers (smoothstep.h and size_classes.h), no splitting
was necessary, so I didn't add support to emit multiple headers.
Split purging into lazy and forced variants. Use the forced variant for
zeroing dss.
Add support for NULL function pointers as an opt-out mechanism for the
dalloc, commit, decommit, purge_lazy, purge_forced, split, and merge
fields of extent_hooks_t.
Add short-circuiting checks in large_ralloc_no_move_{shrink,expand}() so
that no attempt is made if splitting/merging is not supported.
This resolves#268.
Add extent serial numbers and use them where appropriate as a sort key
that is higher priority than address, so that the allocation policy
prefers older extents.
This resolves#147.
Fix extent_alloc_cache[_locked]() to support decommitted allocation, and
use this ability in arena_stash_dirty(), so that decommitted extents are
not needlessly committed during purging. In practice this does not
happen on any currently supported systems, because both extent merging
and decommit must be implemented; all supported systems implement one
xor the other.
Rather than protecting dss operations with a mutex, use atomic
operations. This has negligible impact on synchronization overhead
during typical dss allocation, but is a substantial improvement for
extent_in_dss() and the newly added extent_dss_mergeable(), which can be
called multiple times during extent deallocations.
This change also has the advantage of avoiding tsd in deallocation paths
associated with purging, which resolves potential deadlocks during
thread exit due to attempted tsd resurrection.
This resolves#425.
rtree-based extent lookups remain more expensive than chunk-based run
lookups, but with this optimization the fast path slowdown is ~3 CPU
cycles per metadata lookup (on Intel Core i7-4980HQ), versus ~11 cycles
prior. The path caching speedup tends to degrade gracefully unless
allocated memory is spread far apart (as is the case when using a
mixture of sbrk() and mmap()).
Cascade from decommit to purge when purging unused dirty pages, so that
it is possible to decommit cleaned memory rather than just purging. For
non-Windows debug builds, decommit runs rather than purging them, since
this causes access of deallocated runs to segfault.
This resolves#251.
Add the "arena.<i>.chunk_hooks" mallctl, which replaces and expands on
the "arena.<i>.chunk.{alloc,dalloc,purge}" mallctls. The chunk hooks
allow control over chunk allocation/deallocation, decommit/commit,
purging, and splitting/merging, such that the application can rely on
jemalloc's internal chunk caching and retaining functionality, yet
implement a variety of chunk management mechanisms and policies.
Merge the chunks_[sz]ad_{mmap,dss} red-black trees into
chunks_[sz]ad_retained. This slightly reduces how hard jemalloc tries
to honor the dss precedence setting; prior to this change the precedence
setting was also consulted when recycling chunks.
Fix chunk purging. Don't purge chunks in arena_purge_stashed(); instead
deallocate them in arena_unstash_purged(), so that the dirty memory
linkage remains valid until after the last time it is used.
This resolves#176 and #201.
Rename "dirty chunks" to "cached chunks", in order to avoid overloading
the term "dirty".
Fix the regression caused by 339c2b23b2
(Fix chunk_unmap() to propagate dirty state.), and actually address what
that change attempted, which is to only purge chunks once, and propagate
whether zeroed pages resulted into chunk_record().
Extend per arena unused dirty page purging to manage unused dirty chunks
in aaddtion to unused dirty runs. Rather than immediately unmapping
deallocated chunks (or purging them in the --disable-munmap case), store
them in a separate set of trees, chunks_[sz]ad_dirty. Preferrentially
allocate dirty chunks. When excessive unused dirty pages accumulate,
purge runs and chunks in ingegrated LRU order (and unmap chunks in the
--enable-munmap case).
Refactor extent_node_t to provide accessor functions.
Migrate all centralized data structures related to huge allocations and
recyclable chunks into arena_t, so that each arena can manage huge
allocations and recyclable virtual memory completely independently of
other arenas.
Add chunk node caching to arenas, in order to avoid contention on the
base allocator.
Use chunks_rtree to look up huge allocations rather than a red-black
tree. Maintain a per arena unsorted list of huge allocations (which
will be needed to enumerate huge allocations during arena reset).
Remove the --enable-ivsalloc option, make ivsalloc() always available,
and use it for size queries if --enable-debug is enabled. The only
practical implications to this removal are that 1) ivsalloc() is now
always available during live debugging (and the underlying radix tree is
available during core-based debugging), and 2) size query validation can
no longer be enabled independent of --enable-debug.
Remove the stats.chunks.{current,total,high} mallctls, and replace their
underlying statistics with simpler atomically updated counters used
exclusively for gdump triggering. These statistics are no longer very
useful because each arena manages chunks independently, and per arena
statistics provide similar information.
Simplify chunk synchronization code, now that base chunk allocation
cannot cause recursive lock acquisition.