Due to a bug in sec initialization, the number of cached size classes
was equal to 198. The bug caused the creation of more than a hundred of
unused bins, although it didn't affect the caching logic.
Before this commit, in case FreeBSD libc jemalloc was overridden by another
jemalloc, proper thread shutdown callback was involved only for the overriding
jemalloc. A call to _malloc_thread_cleanup from libthr would be redirected to
user jemalloc, leaving data about dead threads hanging in system jemalloc. This
change tackles the issue in two ways. First, for current and old system
jemallocs, which we can not modify, the overriding jemalloc would locate and
invoke system cleanup routine. For upcoming jemalloc integrations, the cleanup
registering function will also be redirected to user jemalloc, which means that
system jemalloc's cleanup routine will be registered in user's jemalloc and a
single call to _malloc_thread_cleanup will be sufficient to invoke both
callbacks.
While calculating the number of stashed pointers, multiple variables
potentially modified by a concurrent thread were used for the
calculation. This led to some inconsistencies, correctly detected by
the assertions. The change eliminates some possible inconsistencies by
using unmodified variables and only once a concurrently modified one.
The assertions are omitted for the cases where we acknowledge potential
inconsistencies too.
The option makes the process to exit with error code 1 if a memory leak
is detected. This is useful for implementing automated tools that rely
on leak detection.
With DSS as primary, the default merge impl will (correctly) decline to merge
when one of the extent is non-dss. The error path should tolerate the
not-merged extent being in a merging state.
On deallocation, sampled pointers (specially aligned) get junked and stashed
into tcache (to prevent immediate reuse). The expected behavior is to have
read-after-free corrupted and stopped by the junk-filling, while
write-after-free is checked when flushing the stashed pointers.
Also refactor the handling of the non-deterministic case. Notably allow the
case with narenas set to proceed w/o warnings, to not affect existing valid use
cases.
nstime module guarantees monotonic clock update within a single nstime_t. This
means, if two separate nstime_t variables are read and updated separately,
nstime_subtract between them may result in underflow. Fixed by switching to the
time since utility provided by nstime.
Determinitic number of CPUs is important for percpu arena to work
correctly, since it uses cpu index - sched_getcpu(), and if it will
greater then number of CPUs bad thing will happen, or assertion will be
failed in debug build:
<jemalloc>: ../contrib/jemalloc/src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"
Aborted (core dumped)
Number of CPUs can be obtained from the following places:
- sched_getaffinity()
- sysconf(_SC_NPROCESSORS_ONLN)
- sysconf(_SC_NPROCESSORS_CONF)
For the sched_getaffinity() you may simply use taskset(1) to run program
on a different cpu, and in case it will be not first, percpu will work
incorrectly, i.e.:
$ taskset --cpu-list $(( $(getconf _NPROCESSORS_ONLN)-1 )) <your_program>
_SC_NPROCESSORS_ONLN uses /sys/devices/system/cpu/online, LXD/LXC
virtualize /sys/devices/system/cpu/online file [1], and so when you run
container with limited limits.cpus it will bind randomly selected CPU to
it
[1]: https://github.com/lxc/lxcfs/issues/301
_SC_NPROCESSORS_CONF uses /sys/devices/system/cpu/cpu*, and AFAIK nobody
playing with dentries there.
So if all three of these are equal, percpu arenas should work correctly.
And a small note regardless _SC_NPROCESSORS_ONLN/_SC_NPROCESSORS_CONF,
musl uses sched_getaffinity() for both. So this will also increase the
entropy.
Also note, that you can check is percpu arena really applied using
abort_conf:true.
Refs: https://github.com/jemalloc/jemalloc/pull/1939
Refs: https://github.com/ClickHouse/ClickHouse/issues/32806
v2: move malloc_cpu_count_is_deterministic() into
malloc_init_hard_recursible() since _SC_NPROCESSORS_CONF does
allocations for readdir()
v3:
- mark cpu_count_is_deterministic static
- check only if percpu arena is enabled
- check narenas
Currently used only for guarding purposes, the hint is used to determine
if the allocation is supposed to be frequently reused. For example, it
might urge the allocator to ensure the allocation is cached.
Some nstime_t operations require and assume the input nstime is initialized
(e.g. nstime_update) -- uninitialized input may cause silent failures which is
difficult to reproduce / debug. Add an explicit flag to track the state
(limited to debug build only).
Also fixed an use case in hpa (time of last_purge).
In order for nstime_update to handle non-monotonic clocks, it requires the input
nstime to be initialized -- when reading for the first time, zero init has to be
done. Otherwise random stack value may be seen as clocks and returned.
The event counters maintain a relationship with the current bytes: last_event <=
current < next_event. When a reinit happens (e.g. reincarnated tsd), the last
event needs progressing because all events start fresh from the current bytes.
When opt_retain is on, slab extents remain guarded in all states, even
retained. This works well if arena is never destroyed, because we
anticipate those slabs will be eventually reused. But if the arena is
destroyed, the slabs must be unguarded to prevent leaking guard pages.
As the code evolves, some code paths that have previously assigned
deferred_work_generated may cease being reached. This would leave the value
uninitialized. This change initializes the value for safety.
Adding guarded extents, which are regular extents surrounded by guard pages
(mprotected). To reduce syscalls, small guarded extents are cached as a
separate eset in ecache, and decay through the dirty / muzzy / retained pipeline
as usual.
This mallctl accepts an arena_config_t structure which
can be used to customize the behavior of the arena.
Right now it contains extent_hooks and a new option,
metadata_use_hooks, which controls whether the extent
hooks are also used for metadata allocation.
The medata_use_hooks option has two main use cases:
1. In heterogeneous memory systems, to avoid metadata
being placed on potentially slower memory.
2. Avoiding virtual memory from being leaked as a result
of metadata allocation failure originating in an extent hook.
Existing backtrace implementations skip native stack frames from runtimes like
Python. The hook allows to augment the backtraces to attribute allocations to
native functions in heap profiles.
The prof initialization is done only when opt_prof is true. This change makes
sure the prof_* mallctls only have limited read access (i.e. no access to prof
internals) when opt_prof is false.
In addition, initialize the global prof mutexes even if opt_prof is false. This
makes sure the mutex stats are set properly.
This change allows every allocator conforming to PAI communicate that it
deferred some work for the future. Without it if a background thread goes into
indefinite sleep, there is no way to notify it about upcoming deferred work.
Previously the calculation of sleep time between wakeups was implemented within
background_thread. This resulted in some parts of decay and hpa specific
logic mixing with background thread implementation. In this change, background
thread delegates this calculation to arena and it, in turn, delegates it to PAI.
The next step is to implement the actual calculation of time until deferred work
in HPA.
Specifically, this change allows the default alloc hook to used during
arenas.create. One use case is to invoke the default alloc hook in a customized
hook arena, i.e. the default hooks can be read out of a default arena, then
create customized ones based on these hooks. Note that mixing the default with
customized hooks is not recommended, and should only be considered when the
customization is simple and straightforward.
The recent pairing heap optimizations flattened the lock hold time profile.
This was a win for raw cycle counts, but ended up causing us to "just miss"
acquiring the mutex before sleeping more often. Bump those counts.
By force-inlining everything that would otherwise be a macro, we get the same
effect (it's not clear in the first place that this is actually a good idea, but
it avoids making any changes to the existing performance profile).
This makes the code more maintainable (in anticipation of subsequent changes),
as well as making performance profiles and debug info more readable (we get
"real" line numbers, instead of making everything point to the macro definition
of all associated functions).
The edata_cache_small had a fill/flush heuristic. In retrospect, this was a
premature optimization; more testing indicates that an unbounded cache is
effectively fine here, and moreover we spend a nontrivial amount of time doing
unnecessary filling/flushing.
As the HPA takes on a larger and larger fraction of all allocations, any
theoretical differences in allocation patterns should shrink. The HPA is more
efficient with its metadata in general, so it still comes out ahead on metadata
usage anyways.
We wait a while after deciding a huge extent should get hugified to see if it
gets purged before long. This avoids hugifying extents that might shortly get
dehugified for purging.
Rename and use the hpa_dehugification_threshold option support code for this,
since it's now ignored.
This fixes two simple but significant typos in the HPA:
- The conf string parsing accidentally set a min value of PAGE for
hpa_sec_batch_fill_extra; i.e. allocating 4096 extra pages every time we
attempted to allocate a single page. This puts us over the SEC flush limit,
so we then immediately flush all but one of them (probably triggering
purging).
- The HPA was using the default PAI batch alloc implementation, which meant it
did not actually get any locking advantages.
This snuck by because I did all the performance testing without using the PAI
interface or config settings. When I cleaned it up and put everything behind
nice interfaces, I only did correctness checks, and didn't try any performance
ones.
Hold the ecache lock across extent_recycle_extract() and extent_recycle_split(),
so that the extent_deactivate after split can avoid re-take the ecache mutex.
Now that all merging go through try_acquire_edata_neighbor, the mergeablility
checks (including head state checking) are done before reaching the merge hook.
In other words, merge hook will never be called if the head state doesn't agree.
Instead of passing down the new_addr, pass down the active edata which allows us
to always use a neighbor-acquiring semantic. In other words, this tells us both
the original edata and neighbor address. With this change, only neighbors of a
"known" edata can be acquired, i.e. acquiring an edata based on an arbitrary
address isn't possible anymore.
This avoids the addr-based mutexes (i.e. the mutex_pool), and instead relies on
the metadata tracked in rtree leaf: the head state and extent_state. Before
trying to access the neighbor edata (e.g. for coalescing), the states will be
verified first -- only neighbor edatas from the same arena and with the same
state will be accessed.
When retain is on, when extent_grow_retained failed (e.g. due to split hook
failures), we'll try extent_alloc_wrapper as the last resort. Set the is_head
bit in that case to be consistent. The allocated extent in that case will be
retained properly, but not merged with other extents.
Before this change, purge/hugify decisions had several sharp edges that could
lead to pathological behavior if tuning parameters weren't carefully chosen.
It's the first of a series; this introduces basic "make every hugepage with
dirty pages purgeable" functionality, and the next commit expands that
functionality to have a smarter policy for picking hugepages to purge.
Previously, the dehugify logic would *never* dehugify a hugepage unless it was
dirtier than the dehugification threshold. This can lead to situations in which
these pages (which themselves could never be purged) would push us above the
maximum allowed dirty pages in the shard. This forces immediate purging of any
pages deallocated in non-hugified hugepages, which in turn places nonobvious
practical limitations on the relationships between various config settings.
Instead, we make our preference not to dehugify to purge a soft one rather than
a hard one. We'll avoid purging them, but only so long as we can do so by
purging non-hugified pages. If we need to purge them to satisfy our dirty page
limits, or to hugify other, more worthy candidates, we'll still do so.
It tracks pageslabs. Soon, we'll have another bitmap (to track dirty pages)
that we want to disambiguate.
While we're here, fix an out-of-date comment.