This is a simple multi-producer, single-consumer queue. The intended use case
is in the HPA, as we begin supporting hpdatas that move between hpa_shards. We
take just a single CAS as the cost to send a message (or a batch of messages) in
the low-contention case, and lock-freedom lets us avoid some lock-ordering
issues.
Now that all merging go through try_acquire_edata_neighbor, the mergeablility
checks (including head state checking) are done before reaching the merge hook.
In other words, merge hook will never be called if the head state doesn't agree.
Instead of passing down the new_addr, pass down the active edata which allows us
to always use a neighbor-acquiring semantic. In other words, this tells us both
the original edata and neighbor address. With this change, only neighbors of a
"known" edata can be acquired, i.e. acquiring an edata based on an arbitrary
address isn't possible anymore.
This avoids the addr-based mutexes (i.e. the mutex_pool), and instead relies on
the metadata tracked in rtree leaf: the head state and extent_state. Before
trying to access the neighbor edata (e.g. for coalescing), the states will be
verified first -- only neighbor edatas from the same arena and with the same
state will be accessed.
When retain is on, when extent_grow_retained failed (e.g. due to split hook
failures), we'll try extent_alloc_wrapper as the last resort. Set the is_head
bit in that case to be consistent. The allocated extent in that case will be
retained properly, but not merged with other extents.
Before this change, purge/hugify decisions had several sharp edges that could
lead to pathological behavior if tuning parameters weren't carefully chosen.
It's the first of a series; this introduces basic "make every hugepage with
dirty pages purgeable" functionality, and the next commit expands that
functionality to have a smarter policy for picking hugepages to purge.
Previously, the dehugify logic would *never* dehugify a hugepage unless it was
dirtier than the dehugification threshold. This can lead to situations in which
these pages (which themselves could never be purged) would push us above the
maximum allowed dirty pages in the shard. This forces immediate purging of any
pages deallocated in non-hugified hugepages, which in turn places nonobvious
practical limitations on the relationships between various config settings.
Instead, we make our preference not to dehugify to purge a soft one rather than
a hard one. We'll avoid purging them, but only so long as we can do so by
purging non-hugified pages. If we need to purge them to satisfy our dirty page
limits, or to hugify other, more worthy candidates, we'll still do so.
It tracks pageslabs. Soon, we'll have another bitmap (to track dirty pages)
that we want to disambiguate.
While we're here, fix an out-of-date comment.
This change pulls the SEC options into a struct, which simplifies their handling
across various modules (e.g. PA needs to forward on SEC options from the
malloc_conf string, but it doesn't really need to know their names). While
we're here, make some of the fixed constants configurable, and unify naming from
the configuration options to the internals.
Currently that just means max_alloc, but we're about to add more. While we're
touching these lines anyways, tweak things to be more in line with testing.
This finishes the refactoring of the HPA/psset interactions the past few commits
have been building towards.
Rather than the HPA removing and then reinserting hpdatas, it simply begins
updates and ends them. These updates can set flags on the hpdata that prevent
it from being returned for certain types of requests. For example, it can call
hpdata_alloc_allowed_set(hpdata, false) during an update, at which point the
given hpdata will no longer be returned for psset_pick_alloc requests.
This has various of benefits:
- It maintains stats correctness during purges and hugifies.
- It allows simpler and more explicit concurrency control for the various
special cases (e.g. allocations are disallowed during purge, but not during
hugify).
- It lets allocations and deallocations avoid disturbing the purging and
hugification orderings. If an hpdata "loses its place" in one of the queues
just do to an alloc / dalloc, it can result in pathological edge cases where
very hot, very full hugepages never get hugified (and cold extents on the
same hugepage as hot ones never get purged).
The key benefit though is that tracking hpdatas to be purged / hugified in a
principled way will let us do delayed purging and hugification. Eventually this
will let us move these operations to background threads, but in the short term
the benefit is that it will let us have global purging policies (e.g. purge when
the entire arena has too many dirty pages, rather than any particular hugepage).
We're moving towards a world in which purging decisions are less rigidly
enforced at a single-hugepage level. In that world, it makes sense to keep
around some hpdatas which are not completely purged, in which case we'll need to
track them.
Really, this isn't a functional change, just a naming change. We start thinking
of pageslabs as being always in the psset. What we used to think of as removal
is now thought of as being in the psset, but in the process of being updated
(and therefore, unavalable for serving new allocations).
This is in preparation of subsequent changes to support deferred purging;
allocations will still be in the psset for the purposes of choosing when to
purge, but not for purposes of allocation/deallocation.
This is really only useful for human consumption. Correspondingly, emit it only
in the human-readable stats, and let everybody else compute from the hugepage
size and nactive.
Previously, we would purge a hugepage only when it's completely empty. With
this change, we can purge even when only partially empty. Although the
heuristic here is still fairly primitive, this infrastructure can scale to
become more advanced.
This saves us a cache miss when lookup up the arena bin offset in a remote
arena during tcache flush. All arenas share the base offset, and so we don't
need to look it up repeatedly for each arena. Secondarily, it shaves 288 bytes
off the arena on, e.g., x86-64.
The items we pick to flush matter a lot, but the order in which they get flushed
doesn't; just use forward scans. This simplifies the accessing code, both in
terms of the C and the generated assembly (i.e. this speeds up the flush
pathways).
By carefully force-inlining the division constants and the operation sum count,
we can eliminate redundant operations in the arena-level dalloc function. Do
so.
This frontloads more of the miss latency. It also moves it to a pathway where
we have not yet acquired any locks, so that it should (hopefully) reduce hold
times.
In practice, many rtree_leaf_elm accesses are cache misses. By restructuring,
we can make it more likely that these misses occur without blocking us from
starting later lookups, taking more of those misses in parallel.
qemu does not support this, yet [1], and you can get very tricky assert
if you will run program with jemalloc in use under qemu:
<jemalloc>: ../contrib/jemalloc/src/extent.c:1195: Failed assertion: "p[i] == 0"
[1]: https://patchwork.kernel.org/patch/10576637/
Here is a simple example that shows the problem [2]:
// Gist to check possible issues with MADV_DONTNEED
// For example it does not supported by qemu user
// There is a patch for this [1], but it hasn't been applied.
// [1]: https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg05422.html
#include <sys/mman.h>
#include <stdio.h>
#include <stddef.h>
#include <assert.h>
#include <string.h>
int main(int argc, char **argv)
{
void *addr = mmap(NULL, 1<<16, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
if (addr == MAP_FAILED) {
perror("mmap");
return 1;
}
memset(addr, 'A', 1<<16);
if (!madvise(addr, 1<<16, MADV_DONTNEED)) {
puts("MADV_DONTNEED does not return error. Check memory.");
for (int i = 0; i < 1<<16; ++i) {
assert(((unsigned char *)addr)[i] == 0);
}
} else {
perror("madvise");
}
if (munmap(addr, 1<<16)) {
perror("munmap");
return 1;
}
return 0;
}
### unpatched qemu
$ qemu-x86_64-static /tmp/test-MADV_DONTNEED
MADV_DONTNEED does not return error. Check memory.
test-MADV_DONTNEED: /tmp/test-MADV_DONTNEED.c:19: main: Assertion `((unsigned char *)addr)[i] == 0' failed.
qemu: uncaught target signal 6 (Aborted) - core dumped
Aborted (core dumped)
### patched qemu (by returning ENOSYS error)
$ qemu-x86_64 /tmp/test-MADV_DONTNEED
madvise: Success
### patch for qemu to return ENOSYS
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 897d20c076..5540792e0e 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -11775,7 +11775,7 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
turns private file-backed mappings into anonymous mappings.
This will break MADV_DONTNEED.
This is a hint, so ignoring and returning success is ok. */
- return 0;
+ return ENOSYS;
#endif
#ifdef TARGET_NR_fcntl64
case TARGET_NR_fcntl64:
[2]: https://gist.github.com/azat/12ba2c825b710653ece34dba7f926ece
v2:
- review fixes
- add opt_dont_trust_madvise
v3:
- review fixes
- rename opt_dont_trust_madvise to opt_trust_madvise
This fixes an incorrect debug-mode assert:
- T1 starts an arena stats update and reads stack_head from another thread's
cache bin, when that cache bin has 1 item in it.
- T2 allocates from that cache bin. The cache_bin's stack_head now points to a
NULL pointer, since the cache bin is empty.
- T1 Re-reads the cache_bin's stack_head to perform an assertion check (since it
previously saw that the bin was empty, whatever stack_head points to should be
non-NULL).
pthread_key_create on QNX triggers recursive allocation during tsd
bootstrapping. Using tsd_init_check_recursion to detect that.
Before pthread_key_create, the address of tsd_boot_wrapper is returned
from tsd_get_wrapper instead of using TLS to store the pointer.
tsd_set_wrapper becomes a no-op. After that, the address of
tsd_boot_wrapper is written to TLS and bootstrap continues as before.
Signed-off-by: Jin Qian <jqian@aurora.tech>
Now that we have flat bitmap bit counting functions, we can easily assert that
nfree is always correct. While we're tightening up this code, enforce
consistency on API boundaries as well.
This is no longer part of the "core" functionality; we only need the stub
implementations as an end-to-end test of hpdata + psset interactions when
metadata is being modified. Treat them accordingly.
Using an edata_t both for hugepages and the allocations within those hugepages
was convenient at first, but has outlived its usefulness. Representing
hugepages explicitly, with their own data structure, will make future
development easier.
This was promised in the review of the introduction of geom_grow, but would have
been painful to do there because of the series that introduced it. Now that
those are comitted, renaming is easier.
At least one libc (musl) defines pthread_setname_np without defining
pthread_getname_np. Detect the presence of each individually, rather than
inferring both must be defined if set is.
In previous designs, this was intended to be a sort of cache that couldn't fail.
In the current design, we want to use it just as a contention reduction
mechanism. Rewrite it with those goals in mind.
This (experimental, undocumented) functionality can be used by users to track
various statistics of interest at a finer level of granularity than the thread.
Previously all the small size classes were cached. However this has downsides
-- particularly when page size is greater than 4K (e.g. iOS), which will result
in much higher SMALL_MAXCLASS.
This change allows tcache_max to be set to lower values, to better control
resources taken by tcache.
This functions more like the serial number strategy of the ecache and
hpa_central_t. Longer-lived slabs are more likely to continue to live for
longer in the future.
For locality reasons, tcache bins are integrated in TSD. Allowing all size
classes to be cached has little benefit, but takes up much thread local storage.
In addition, it complicates the layout which we try hard to optimize.
This will be the centralized component of the coming hugepage allocator; the
source of larger chunks of memory from which smaller ones can be obtained.
These had no uses and complicated the API. As a rule we now expect to only use
thread-local randomization for contention-reduction reasons, so we only pay the
API costs and never get the functionality benefits.
This introduces a new sort of edata_t; a pageslab, and a set to manage them.
This is part of a series of a commits to implement a hugepage allocator; the
pageset will be per-arena, and track small page allocations requests within a
larger extent allocated from a centralized hugepage allocator.
Specify the maximum number of regions in a slab, which is
(<lg-page> - <lg-tiny-min>) by default. This increases the limit of slab sizes
specified by "slab_sizes" in malloc_conf. This should never be less than
the default value. The max value of this option is related to LG_BITMAP_MAXBITS
(see more in bitmap.h).
For example, on a 4k page size system, if we:
1) configure jemalloc with with --with-lg-slab-maxregs=12.
2) export MALLOC_CONF="slab_sizes:9-16:4"
The slab size of 16 bytes is set to 4 pages. Previously, the default
lg-slab-maxregs is 9 (i.e. 12 - 3). The max slab size of 16 bytes is 2 pages
(i.e. (1<<9) * 16 bytes). By increasing the value from 9 to 12, the max slab
size can be set by MALLOC_CONF is 16 pages (i.e. (1<<12) * 16 bytes).
The commit introducing size checks accidentally enabled them whenever any safety
checks were on. This ends up causing the regression that splitting up the
features was intended to avoid. Fix the issue.
The existing checks are good at finding such issues (on tcache flush), but not
so good at pinpointing them. Debug mode can find them, but sometimes debug mode
slows down a program so much that hard-to-hit bugs can take a long time to
crash.
This commit adds functionality to keep programs mostly on their fast paths,
while also checking every sized delete argument they get.