Refactor most of the decay-related functions to take as parameters the
decay_t and associated extents_t structures to operate on. This
prepares for supporting both lazy and forced purging on different decay
schedules.
These were all size_ts, so we have atomics support for them on all platforms, so
the conversion is straightforward.
Left non-atomic is curlextents, which AFAICT is not used atomically anywhere.
I expect this to be the trickiest conversion we will see, since we want atomics
on 64-bit platforms, but are also always able to piggyback on some sort of
external synchronization on non-64 bit platforms.
The new feature, opt.percpu_arena, determines thread-arena association
dynamically based CPU id. Three modes are supported: "percpu", "phycpu"
and disabled.
"percpu" uses the current core id (with help from sched_getcpu())
directly as the arena index, while "phycpu" will assign threads on the
same physical CPU to the same arena. In other words, "percpu" means # of
arenas == # of CPUs, while "phycpu" has # of arenas == 1/2 * (# of
CPUs). Note that no runtime check on whether hyper threading is enabled
is added yet.
When enabled, threads will be migrated between arenas when a CPU change
is detected. In the current design, to reduce overhead from reading CPU
id, each arena tracks the thread accessed most recently. When a new
thread comes in, we will read CPU id and update arena if necessary.
When witness is enabled, lock rank order needs to be preserved during
prefork, not only for each arena, but also across arenas. This change
breaks arena_prefork into further stages to ensure valid rank order
across arenas. Also changed test/unit/fork to use a manual arena to
catch this case.
Rather than purging uncoalesced extents, perform just enough incremental
coalescing to purge only fully coalesced extents. In the absence of
cached extent reuse, the immediate versus delayed incremental purging
algorithms result in the same purge order.
This resolves#655.
Convert the nrequests field to be partially derived, and the curlextents
to be fully derived, in order to reduce the number of stats updates
needed during common operations.
This change affects ndalloc stats during arena reset, because it is no
longer possible to cancel out ndalloc effects (curlextents would become
negative).
Extent splitting and coalescing is a major component of large allocation
overhead, and disabling coalescing of cached extents provides a simple
and effective hysteresis mechanism. Once two-phase purging is
implemented, it will probably make sense to leave coalescing disabled
for the first phase, but coalesce during the second phase.
Refactor arena and extent locking protocols such that arena and
extent locks are never held when calling into the extent_*_wrapper()
API. This requires extra care during purging since the arena lock no
longer protects the inner purging logic. It also requires extra care to
protect extents from being merged with adjacent extents.
Convert extent_t's 'active' flag to an enumerated 'state', so that
retained extents are explicitly marked as such, rather than depending on
ring linkage state.
Refactor the extent collections (and their synchronization) for cached
and retained extents into extents_t. Incorporate LRU functionality to
support purging. Incorporate page count accounting, which replaces
arena->ndirty and arena->stats.retained.
Assert that no core locks are held when entering any internal
[de]allocation functions. This is in addition to existing assertions
that no locks are held when entering external [de]allocation functions.
Audit and document synchronization protocols for all arena_t fields.
This fixes a potential deadlock due to recursive allocation during
gdump, in a similar fashion to b49c649bc1
(Fix lock order reversal during gdump.), but with a necessarily much
broader code impact.
Add/rename related mallctls:
- Add stats.arenas.<i>.base .
- Rename stats.arenas.<i>.metadata to stats.arenas.<i>.internal .
- Add stats.arenas.<i>.resident .
Modify the arenas.extend mallctl to take an optional (extent_hooks_t *)
argument so that it is possible for all base allocations to be serviced
by the specified extent hooks.
This resolves#463.
If virtual memory is retained, allocate extents such that their sizes
form an exponentially growing series. This limits the number of
disjoint virtual memory ranges so that extent merging can be effective
even if multiple arenas' extent allocation requests are highly
interleaved.
This resolves#462.
Rewrite arena_slab_regind() to provide sufficient constant data for
the compiler to perform division strength reduction. This replaces
more general manual strength reduction that was implemented before
arena_bin_info was compile-time-constant. It would be possible to
slightly improve on the compiler-generated division code by taking
advantage of range limits that the compiler doesn't know about.
Add extent serial numbers and use them where appropriate as a sort key
that is higher priority than address, so that the allocation policy
prefers older extents.
This resolves#147.
Add an "over-size" extent heap in which to store extents which exceed
the maximum size class (plus cache-oblivious padding, if enabled).
Remove psz2ind_clamp() and use psz2ind() instead so that trying to
allocate the maximum size class can in principle succeed. In practice,
this allows assertions to hold so that OOM errors can be successfully
generated.
Fix extent_alloc_cache[_locked]() to support decommitted allocation, and
use this ability in arena_stash_dirty(), so that decommitted extents are
not needlessly committed during purging. In practice this does not
happen on any currently supported systems, because both extent merging
and decommit must be implemented; all supported systems implement one
xor the other.
Rather than protecting dss operations with a mutex, use atomic
operations. This has negligible impact on synchronization overhead
during typical dss allocation, but is a substantial improvement for
extent_in_dss() and the newly added extent_dss_mergeable(), which can be
called multiple times during extent deallocations.
This change also has the advantage of avoiding tsd in deallocation paths
associated with purging, which resolves potential deadlocks during
thread exit due to attempted tsd resurrection.
This resolves#425.
Simplify decay-based purging attempts to only be triggered when the
epoch is advanced, rather than every time purgeable memory increases.
In a correctly functioning system (not previously the case; see below),
this only causes a behavior difference if during subsequent purge
attempts the least recently used (LRU) purgeable memory extent is
initially too large to be purged, but that memory is reused between
attempts and one or more of the next LRU purgeable memory extents are
small enough to be purged. In practice this is an arbitrary behavior
change that is within the set of acceptable behaviors.
As for the purging fix, assure that arena->decay.ndirty is recorded
*after* the epoch advance and associated purging occurs. Prior to this
fix, it was possible for purging during epoch advance to cause a
substantially underrepresentative (arena->ndirty - arena->decay.ndirty),
i.e. the number of dirty pages attributed to the current epoch was too
low, and a series of unintended purges could result. This fix is also
relevant in the context of the simplification described above, but the
bug's impact would be limited to over-purging at epoch advances.
Instead, move the epoch backward in time. Additionally, add
nstime_monotonic() and use it in debug builds to assert that time only
goes backward if nstime_update() is using a non-monotonic time source.
Avoid calling s2u() on raw extent sizes in extent_recycle().
Clamp psz2ind() (implemented as psz2ind_clamp()) when inserting/removing
into/from size-segregated extent heaps.
When an allocation is large enough to trigger multiple dumps, use
modular math rather than subtraction to reset the interval counter.
Prior to this change, it was possible for a single allocation to cause
many subsequent allocations to all trigger profile dumps.
When updating usable size for a sampled object, try to cancel out
the difference between LARGE_MINCLASS and usable size from the interval
counter.
Look up chunk metadata via the radix tree, rather than using
CHUNK_ADDR2BASE().
Propagate pointer's containing extent.
Minimize extent lookups by doing a single lookup (e.g. in free()) and
propagating the pointer's extent into nearly all the functions that may
need it.
Use pszind_t size classes rather than szind_t size classes, and always
reserve space for NPSIZES elements. This removes unused heaps that are
not multiples of the page size, and adds (currently) unused heaps for
all huge size classes, with the immediate benefit that the size of
arena_t allocations is constant (no longer dependent on chunk size).
These compute size classes and indices similarly to size2index(),
index2size() and s2u(), respectively, but using the subset of size
classes that are multiples of the page size. Note that pszind_t and
szind_t are not interchangeable.
b2c0d6322d (Add witness, a simple online
locking validator.) caused a broad propagation of tsd throughout the
internal API, but tsd_fetch() was designed to fail prior to tsd
bootstrapping. Fix this by splitting tsd_t into non-nullable tsd_t and
nullable tsdn_t, and modifying all internal APIs that do not critically
rely on tsd to take nullable pointers. Furthermore, add the
tsd_booted_get() function so that tsdn_fetch() can probe whether tsd
bootstrapping is complete and return NULL if not. All dangerous
conversions of nullable pointers are tsdn_tsd() calls that assert-fail
on invalid conversion.
This is a broader application of optimizations to malloc() and free() in
f4a0f32d34 (Fast-path improvement:
reduce # of branches and unnecessary operations.).
This resolves#321.
Split arena_choose() into arena_[i]choose() and use arena_ichoose() for
arena lookup during internal allocation. This fixes huge_palloc() so
that it always succeeds during extent node allocation.
This regression was introduced by
66cd953514 (Do not allocate metadata via
non-auto arenas, nor tcaches.).