It appears like a simple typo means we're unconditionally overwriting
some fields in hpa_from_pai when asserts are enabled. From hpa_shard_init,
it looks like these fields have these values anyway, so this shouldn't
cause bugs, but if something is wrong it seems better to have these
asserts in place.
See issue #2412.
nstime module guarantees monotonic clock update within a single nstime_t. This
means, if two separate nstime_t variables are read and updated separately,
nstime_subtract between them may result in underflow. Fixed by switching to the
time since utility provided by nstime.
Currently used only for guarding purposes, the hint is used to determine
if the allocation is supposed to be frequently reused. For example, it
might urge the allocator to ensure the allocation is cached.
Some nstime_t operations require and assume the input nstime is initialized
(e.g. nstime_update) -- uninitialized input may cause silent failures which is
difficult to reproduce / debug. Add an explicit flag to track the state
(limited to debug build only).
Also fixed an use case in hpa (time of last_purge).
In order for nstime_update to handle non-monotonic clocks, it requires the input
nstime to be initialized -- when reading for the first time, zero init has to be
done. Otherwise random stack value may be seen as clocks and returned.
As the code evolves, some code paths that have previously assigned
deferred_work_generated may cease being reached. This would leave the value
uninitialized. This change initializes the value for safety.
Adding guarded extents, which are regular extents surrounded by guard pages
(mprotected). To reduce syscalls, small guarded extents are cached as a
separate eset in ecache, and decay through the dirty / muzzy / retained pipeline
as usual.
This change allows every allocator conforming to PAI communicate that it
deferred some work for the future. Without it if a background thread goes into
indefinite sleep, there is no way to notify it about upcoming deferred work.
Previously the calculation of sleep time between wakeups was implemented within
background_thread. This resulted in some parts of decay and hpa specific
logic mixing with background thread implementation. In this change, background
thread delegates this calculation to arena and it, in turn, delegates it to PAI.
The next step is to implement the actual calculation of time until deferred work
in HPA.
The edata_cache_small had a fill/flush heuristic. In retrospect, this was a
premature optimization; more testing indicates that an unbounded cache is
effectively fine here, and moreover we spend a nontrivial amount of time doing
unnecessary filling/flushing.
As the HPA takes on a larger and larger fraction of all allocations, any
theoretical differences in allocation patterns should shrink. The HPA is more
efficient with its metadata in general, so it still comes out ahead on metadata
usage anyways.
We wait a while after deciding a huge extent should get hugified to see if it
gets purged before long. This avoids hugifying extents that might shortly get
dehugified for purging.
Rename and use the hpa_dehugification_threshold option support code for this,
since it's now ignored.
This fixes two simple but significant typos in the HPA:
- The conf string parsing accidentally set a min value of PAGE for
hpa_sec_batch_fill_extra; i.e. allocating 4096 extra pages every time we
attempted to allocate a single page. This puts us over the SEC flush limit,
so we then immediately flush all but one of them (probably triggering
purging).
- The HPA was using the default PAI batch alloc implementation, which meant it
did not actually get any locking advantages.
This snuck by because I did all the performance testing without using the PAI
interface or config settings. When I cleaned it up and put everything behind
nice interfaces, I only did correctness checks, and didn't try any performance
ones.
Before this change, purge/hugify decisions had several sharp edges that could
lead to pathological behavior if tuning parameters weren't carefully chosen.
It's the first of a series; this introduces basic "make every hugepage with
dirty pages purgeable" functionality, and the next commit expands that
functionality to have a smarter policy for picking hugepages to purge.
Previously, the dehugify logic would *never* dehugify a hugepage unless it was
dirtier than the dehugification threshold. This can lead to situations in which
these pages (which themselves could never be purged) would push us above the
maximum allowed dirty pages in the shard. This forces immediate purging of any
pages deallocated in non-hugified hugepages, which in turn places nonobvious
practical limitations on the relationships between various config settings.
Instead, we make our preference not to dehugify to purge a soft one rather than
a hard one. We'll avoid purging them, but only so long as we can do so by
purging non-hugified pages. If we need to purge them to satisfy our dirty page
limits, or to hugify other, more worthy candidates, we'll still do so.
Currently that just means max_alloc, but we're about to add more. While we're
touching these lines anyways, tweak things to be more in line with testing.
This finishes the refactoring of the HPA/psset interactions the past few commits
have been building towards.
Rather than the HPA removing and then reinserting hpdatas, it simply begins
updates and ends them. These updates can set flags on the hpdata that prevent
it from being returned for certain types of requests. For example, it can call
hpdata_alloc_allowed_set(hpdata, false) during an update, at which point the
given hpdata will no longer be returned for psset_pick_alloc requests.
This has various of benefits:
- It maintains stats correctness during purges and hugifies.
- It allows simpler and more explicit concurrency control for the various
special cases (e.g. allocations are disallowed during purge, but not during
hugify).
- It lets allocations and deallocations avoid disturbing the purging and
hugification orderings. If an hpdata "loses its place" in one of the queues
just do to an alloc / dalloc, it can result in pathological edge cases where
very hot, very full hugepages never get hugified (and cold extents on the
same hugepage as hot ones never get purged).
The key benefit though is that tracking hpdatas to be purged / hugified in a
principled way will let us do delayed purging and hugification. Eventually this
will let us move these operations to background threads, but in the short term
the benefit is that it will let us have global purging policies (e.g. purge when
the entire arena has too many dirty pages, rather than any particular hugepage).
We're moving towards a world in which purging decisions are less rigidly
enforced at a single-hugepage level. In that world, it makes sense to keep
around some hpdatas which are not completely purged, in which case we'll need to
track them.
Really, this isn't a functional change, just a naming change. We start thinking
of pageslabs as being always in the psset. What we used to think of as removal
is now thought of as being in the psset, but in the process of being updated
(and therefore, unavalable for serving new allocations).
This is in preparation of subsequent changes to support deferred purging;
allocations will still be in the psset for the purposes of choosing when to
purge, but not for purposes of allocation/deallocation.
This is really only useful for human consumption. Correspondingly, emit it only
in the human-readable stats, and let everybody else compute from the hugepage
size and nactive.
Previously, we would purge a hugepage only when it's completely empty. With
this change, we can purge even when only partially empty. Although the
heuristic here is still fairly primitive, this infrastructure can scale to
become more advanced.
Using an edata_t both for hugepages and the allocations within those hugepages
was convenient at first, but has outlived its usefulness. Representing
hugepages explicitly, with their own data structure, will make future
development easier.