Determinitic number of CPUs is important for percpu arena to work
correctly, since it uses cpu index - sched_getcpu(), and if it will
greater then number of CPUs bad thing will happen, or assertion will be
failed in debug build:
<jemalloc>: ../contrib/jemalloc/src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"
Aborted (core dumped)
Number of CPUs can be obtained from the following places:
- sched_getaffinity()
- sysconf(_SC_NPROCESSORS_ONLN)
- sysconf(_SC_NPROCESSORS_CONF)
For the sched_getaffinity() you may simply use taskset(1) to run program
on a different cpu, and in case it will be not first, percpu will work
incorrectly, i.e.:
$ taskset --cpu-list $(( $(getconf _NPROCESSORS_ONLN)-1 )) <your_program>
_SC_NPROCESSORS_ONLN uses /sys/devices/system/cpu/online, LXD/LXC
virtualize /sys/devices/system/cpu/online file [1], and so when you run
container with limited limits.cpus it will bind randomly selected CPU to
it
[1]: https://github.com/lxc/lxcfs/issues/301
_SC_NPROCESSORS_CONF uses /sys/devices/system/cpu/cpu*, and AFAIK nobody
playing with dentries there.
So if all three of these are equal, percpu arenas should work correctly.
And a small note regardless _SC_NPROCESSORS_ONLN/_SC_NPROCESSORS_CONF,
musl uses sched_getaffinity() for both. So this will also increase the
entropy.
Also note, that you can check is percpu arena really applied using
abort_conf:true.
Refs: https://github.com/jemalloc/jemalloc/pull/1939
Refs: https://github.com/ClickHouse/ClickHouse/issues/32806
v2: move malloc_cpu_count_is_deterministic() into
malloc_init_hard_recursible() since _SC_NPROCESSORS_CONF does
allocations for readdir()
v3:
- mark cpu_count_is_deterministic static
- check only if percpu arena is enabled
- check narenas
Currently used only for guarding purposes, the hint is used to determine
if the allocation is supposed to be frequently reused. For example, it
might urge the allocator to ensure the allocation is cached.
Some nstime_t operations require and assume the input nstime is initialized
(e.g. nstime_update) -- uninitialized input may cause silent failures which is
difficult to reproduce / debug. Add an explicit flag to track the state
(limited to debug build only).
Also fixed an use case in hpa (time of last_purge).
In order for nstime_update to handle non-monotonic clocks, it requires the input
nstime to be initialized -- when reading for the first time, zero init has to be
done. Otherwise random stack value may be seen as clocks and returned.
The event counters maintain a relationship with the current bytes: last_event <=
current < next_event. When a reinit happens (e.g. reincarnated tsd), the last
event needs progressing because all events start fresh from the current bytes.
When opt_retain is on, slab extents remain guarded in all states, even
retained. This works well if arena is never destroyed, because we
anticipate those slabs will be eventually reused. But if the arena is
destroyed, the slabs must be unguarded to prevent leaking guard pages.
As the code evolves, some code paths that have previously assigned
deferred_work_generated may cease being reached. This would leave the value
uninitialized. This change initializes the value for safety.
Adding guarded extents, which are regular extents surrounded by guard pages
(mprotected). To reduce syscalls, small guarded extents are cached as a
separate eset in ecache, and decay through the dirty / muzzy / retained pipeline
as usual.
This mallctl accepts an arena_config_t structure which
can be used to customize the behavior of the arena.
Right now it contains extent_hooks and a new option,
metadata_use_hooks, which controls whether the extent
hooks are also used for metadata allocation.
The medata_use_hooks option has two main use cases:
1. In heterogeneous memory systems, to avoid metadata
being placed on potentially slower memory.
2. Avoiding virtual memory from being leaked as a result
of metadata allocation failure originating in an extent hook.
Existing backtrace implementations skip native stack frames from runtimes like
Python. The hook allows to augment the backtraces to attribute allocations to
native functions in heap profiles.
The prof initialization is done only when opt_prof is true. This change makes
sure the prof_* mallctls only have limited read access (i.e. no access to prof
internals) when opt_prof is false.
In addition, initialize the global prof mutexes even if opt_prof is false. This
makes sure the mutex stats are set properly.
This change allows every allocator conforming to PAI communicate that it
deferred some work for the future. Without it if a background thread goes into
indefinite sleep, there is no way to notify it about upcoming deferred work.
Previously the calculation of sleep time between wakeups was implemented within
background_thread. This resulted in some parts of decay and hpa specific
logic mixing with background thread implementation. In this change, background
thread delegates this calculation to arena and it, in turn, delegates it to PAI.
The next step is to implement the actual calculation of time until deferred work
in HPA.
Specifically, this change allows the default alloc hook to used during
arenas.create. One use case is to invoke the default alloc hook in a customized
hook arena, i.e. the default hooks can be read out of a default arena, then
create customized ones based on these hooks. Note that mixing the default with
customized hooks is not recommended, and should only be considered when the
customization is simple and straightforward.
The recent pairing heap optimizations flattened the lock hold time profile.
This was a win for raw cycle counts, but ended up causing us to "just miss"
acquiring the mutex before sleeping more often. Bump those counts.
By force-inlining everything that would otherwise be a macro, we get the same
effect (it's not clear in the first place that this is actually a good idea, but
it avoids making any changes to the existing performance profile).
This makes the code more maintainable (in anticipation of subsequent changes),
as well as making performance profiles and debug info more readable (we get
"real" line numbers, instead of making everything point to the macro definition
of all associated functions).
The edata_cache_small had a fill/flush heuristic. In retrospect, this was a
premature optimization; more testing indicates that an unbounded cache is
effectively fine here, and moreover we spend a nontrivial amount of time doing
unnecessary filling/flushing.
As the HPA takes on a larger and larger fraction of all allocations, any
theoretical differences in allocation patterns should shrink. The HPA is more
efficient with its metadata in general, so it still comes out ahead on metadata
usage anyways.
We wait a while after deciding a huge extent should get hugified to see if it
gets purged before long. This avoids hugifying extents that might shortly get
dehugified for purging.
Rename and use the hpa_dehugification_threshold option support code for this,
since it's now ignored.
This fixes two simple but significant typos in the HPA:
- The conf string parsing accidentally set a min value of PAGE for
hpa_sec_batch_fill_extra; i.e. allocating 4096 extra pages every time we
attempted to allocate a single page. This puts us over the SEC flush limit,
so we then immediately flush all but one of them (probably triggering
purging).
- The HPA was using the default PAI batch alloc implementation, which meant it
did not actually get any locking advantages.
This snuck by because I did all the performance testing without using the PAI
interface or config settings. When I cleaned it up and put everything behind
nice interfaces, I only did correctness checks, and didn't try any performance
ones.
Hold the ecache lock across extent_recycle_extract() and extent_recycle_split(),
so that the extent_deactivate after split can avoid re-take the ecache mutex.
Now that all merging go through try_acquire_edata_neighbor, the mergeablility
checks (including head state checking) are done before reaching the merge hook.
In other words, merge hook will never be called if the head state doesn't agree.
Instead of passing down the new_addr, pass down the active edata which allows us
to always use a neighbor-acquiring semantic. In other words, this tells us both
the original edata and neighbor address. With this change, only neighbors of a
"known" edata can be acquired, i.e. acquiring an edata based on an arbitrary
address isn't possible anymore.
This avoids the addr-based mutexes (i.e. the mutex_pool), and instead relies on
the metadata tracked in rtree leaf: the head state and extent_state. Before
trying to access the neighbor edata (e.g. for coalescing), the states will be
verified first -- only neighbor edatas from the same arena and with the same
state will be accessed.
When retain is on, when extent_grow_retained failed (e.g. due to split hook
failures), we'll try extent_alloc_wrapper as the last resort. Set the is_head
bit in that case to be consistent. The allocated extent in that case will be
retained properly, but not merged with other extents.
Before this change, purge/hugify decisions had several sharp edges that could
lead to pathological behavior if tuning parameters weren't carefully chosen.
It's the first of a series; this introduces basic "make every hugepage with
dirty pages purgeable" functionality, and the next commit expands that
functionality to have a smarter policy for picking hugepages to purge.
Previously, the dehugify logic would *never* dehugify a hugepage unless it was
dirtier than the dehugification threshold. This can lead to situations in which
these pages (which themselves could never be purged) would push us above the
maximum allowed dirty pages in the shard. This forces immediate purging of any
pages deallocated in non-hugified hugepages, which in turn places nonobvious
practical limitations on the relationships between various config settings.
Instead, we make our preference not to dehugify to purge a soft one rather than
a hard one. We'll avoid purging them, but only so long as we can do so by
purging non-hugified pages. If we need to purge them to satisfy our dirty page
limits, or to hugify other, more worthy candidates, we'll still do so.
It tracks pageslabs. Soon, we'll have another bitmap (to track dirty pages)
that we want to disambiguate.
While we're here, fix an out-of-date comment.
This change pulls the SEC options into a struct, which simplifies their handling
across various modules (e.g. PA needs to forward on SEC options from the
malloc_conf string, but it doesn't really need to know their names). While
we're here, make some of the fixed constants configurable, and unify naming from
the configuration options to the internals.
Currently that just means max_alloc, but we're about to add more. While we're
touching these lines anyways, tweak things to be more in line with testing.
This finishes the refactoring of the HPA/psset interactions the past few commits
have been building towards.
Rather than the HPA removing and then reinserting hpdatas, it simply begins
updates and ends them. These updates can set flags on the hpdata that prevent
it from being returned for certain types of requests. For example, it can call
hpdata_alloc_allowed_set(hpdata, false) during an update, at which point the
given hpdata will no longer be returned for psset_pick_alloc requests.
This has various of benefits:
- It maintains stats correctness during purges and hugifies.
- It allows simpler and more explicit concurrency control for the various
special cases (e.g. allocations are disallowed during purge, but not during
hugify).
- It lets allocations and deallocations avoid disturbing the purging and
hugification orderings. If an hpdata "loses its place" in one of the queues
just do to an alloc / dalloc, it can result in pathological edge cases where
very hot, very full hugepages never get hugified (and cold extents on the
same hugepage as hot ones never get purged).
The key benefit though is that tracking hpdatas to be purged / hugified in a
principled way will let us do delayed purging and hugification. Eventually this
will let us move these operations to background threads, but in the short term
the benefit is that it will let us have global purging policies (e.g. purge when
the entire arena has too many dirty pages, rather than any particular hugepage).