Really, this isn't a functional change, just a naming change. We start thinking
of pageslabs as being always in the psset. What we used to think of as removal
is now thought of as being in the psset, but in the process of being updated
(and therefore, unavalable for serving new allocations).
This is in preparation of subsequent changes to support deferred purging;
allocations will still be in the psset for the purposes of choosing when to
purge, but not for purposes of allocation/deallocation.
This is really only useful for human consumption. Correspondingly, emit it only
in the human-readable stats, and let everybody else compute from the hugepage
size and nactive.
Previously, we would purge a hugepage only when it's completely empty. With
this change, we can purge even when only partially empty. Although the
heuristic here is still fairly primitive, this infrastructure can scale to
become more advanced.
The items we pick to flush matter a lot, but the order in which they get flushed
doesn't; just use forward scans. This simplifies the accessing code, both in
terms of the C and the generated assembly (i.e. this speeds up the flush
pathways).
By carefully force-inlining the division constants and the operation sum count,
we can eliminate redundant operations in the arena-level dalloc function. Do
so.
qemu does not support this, yet [1], and you can get very tricky assert
if you will run program with jemalloc in use under qemu:
<jemalloc>: ../contrib/jemalloc/src/extent.c:1195: Failed assertion: "p[i] == 0"
[1]: https://patchwork.kernel.org/patch/10576637/
Here is a simple example that shows the problem [2]:
// Gist to check possible issues with MADV_DONTNEED
// For example it does not supported by qemu user
// There is a patch for this [1], but it hasn't been applied.
// [1]: https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg05422.html
#include <sys/mman.h>
#include <stdio.h>
#include <stddef.h>
#include <assert.h>
#include <string.h>
int main(int argc, char **argv)
{
void *addr = mmap(NULL, 1<<16, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
if (addr == MAP_FAILED) {
perror("mmap");
return 1;
}
memset(addr, 'A', 1<<16);
if (!madvise(addr, 1<<16, MADV_DONTNEED)) {
puts("MADV_DONTNEED does not return error. Check memory.");
for (int i = 0; i < 1<<16; ++i) {
assert(((unsigned char *)addr)[i] == 0);
}
} else {
perror("madvise");
}
if (munmap(addr, 1<<16)) {
perror("munmap");
return 1;
}
return 0;
}
### unpatched qemu
$ qemu-x86_64-static /tmp/test-MADV_DONTNEED
MADV_DONTNEED does not return error. Check memory.
test-MADV_DONTNEED: /tmp/test-MADV_DONTNEED.c:19: main: Assertion `((unsigned char *)addr)[i] == 0' failed.
qemu: uncaught target signal 6 (Aborted) - core dumped
Aborted (core dumped)
### patched qemu (by returning ENOSYS error)
$ qemu-x86_64 /tmp/test-MADV_DONTNEED
madvise: Success
### patch for qemu to return ENOSYS
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 897d20c076..5540792e0e 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -11775,7 +11775,7 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
turns private file-backed mappings into anonymous mappings.
This will break MADV_DONTNEED.
This is a hint, so ignoring and returning success is ok. */
- return 0;
+ return ENOSYS;
#endif
#ifdef TARGET_NR_fcntl64
case TARGET_NR_fcntl64:
[2]: https://gist.github.com/azat/12ba2c825b710653ece34dba7f926ece
v2:
- review fixes
- add opt_dont_trust_madvise
v3:
- review fixes
- rename opt_dont_trust_madvise to opt_trust_madvise
This fixes an incorrect debug-mode assert:
- T1 starts an arena stats update and reads stack_head from another thread's
cache bin, when that cache bin has 1 item in it.
- T2 allocates from that cache bin. The cache_bin's stack_head now points to a
NULL pointer, since the cache bin is empty.
- T1 Re-reads the cache_bin's stack_head to perform an assertion check (since it
previously saw that the bin was empty, whatever stack_head points to should be
non-NULL).
We do not fail on partial ctl path when the given `mib` array is
shorter than the given name, and we should keep the behavior the
same in the reverse case, which I feel is also the more natural way.
This is no longer part of the "core" functionality; we only need the stub
implementations as an end-to-end test of hpdata + psset interactions when
metadata is being modified. Treat them accordingly.
Using an edata_t both for hugepages and the allocations within those hugepages
was convenient at first, but has outlived its usefulness. Representing
hugepages explicitly, with their own data structure, will make future
development easier.
This was promised in the review of the introduction of geom_grow, but would have
been painful to do there because of the series that introduced it. Now that
those are comitted, renaming is easier.
In previous designs, this was intended to be a sort of cache that couldn't fail.
In the current design, we want to use it just as a contention reduction
mechanism. Rewrite it with those goals in mind.
This (experimental, undocumented) functionality can be used by users to track
various statistics of interest at a finer level of granularity than the thread.
Previously all the small size classes were cached. However this has downsides
-- particularly when page size is greater than 4K (e.g. iOS), which will result
in much higher SMALL_MAXCLASS.
This change allows tcache_max to be set to lower values, to better control
resources taken by tcache.
This functions more like the serial number strategy of the ecache and
hpa_central_t. Longer-lived slabs are more likely to continue to live for
longer in the future.
This will be the centralized component of the coming hugepage allocator; the
source of larger chunks of memory from which smaller ones can be obtained.
These had no uses and complicated the API. As a rule we now expect to only use
thread-local randomization for contention-reduction reasons, so we only pay the
API costs and never get the functionality benefits.
This introduces a new sort of edata_t; a pageslab, and a set to manage them.
This is part of a series of a commits to implement a hugepage allocator; the
pageset will be per-arena, and track small page allocations requests within a
larger extent allocated from a centralized hugepage allocator.
The existing checks are good at finding such issues (on tcache flush), but not
so good at pinpointing them. Debug mode can find them, but sometimes debug mode
slows down a program so much that hard-to-hit bugs can take a long time to
crash.
This commit adds functionality to keep programs mostly on their fast paths,
while also checking every sized delete argument they get.
These simplify a lot of the bit_util module, which had grown bits and pieces of
this functionality across a variety of places over the years.
While we're here, kill off BIT_UTIL_INLINE and don't do reentrancy testing for
bit_util.
For now, this is just a stub containing the ecaches, with no surrounding code
changed. Eventually all the core allocator bits will be moved in, in the
subsequent stack of commits.
The goal of `qr_meld()` is to change the following four fields
`(a->prev, a->prev->next, b->prev, b->prev->next)` from the values
`(a->prev, a, b->prev, b)` to `(b->prev, b, a->prev, a)`.
This commit changes
```
a->prev->next = b;
b->prev->next = a;
temp = a->prev;
a->prev = b->prev;
b->prev = temp;
```
to
```
temp = a->prev;
a->prev = b->prev;
b->prev = temp;
a->prev->next = a;
b->prev->next = b;
```
The benefit is that we can use `b->prev->next` for `temp`, and so
there's no need to pass in `a_type`.
The restriction is that `b` cannot be a `qr_next()` macro, so users
of `qr_meld()` must pay attention. (Before this change, neither `a`
nor `b` could be a `qr_next()` macro.)
Previously, large allocations in tcaches would have their sizes reduced during
stats estimation. Added a test, which fails before this change but passes now.
This fixes a bug introduced in 5934846612, which
was itself fixing a bug introduced in 9c0549007d.
This lets us put more allocations on an "almost as fast" path after a flush.
This results in around a 4% reduction in malloc cycles in prod workloads
(corresponding to about a 0.1% reduction in overall cycles).
Previously, we took an array of cache_bin_info_ts and an index, and dereferenced
ourselves. But infos for other cache_bins aren't relevant to any particular
cache bin, so that should be the caller's job.
This is debug only and we keep it off the fast path. Moving it here simplifies
the internal logic.
This never tries to junk on regions that were shrunk via xallocx. I think this
is fine for two reasons:
- The shrunk-with-xallocx case is rare.
- We don't always do that anyway before this diff (it depends on the opt
settings and extent hooks in effect).
Make the event module to accept two event types, and pass around the event
context. Use bytes-based events to trigger tcache GC on deallocation, and get
rid of the tcache ticker.
Add options stats_interval and stats_interval_opts to allow interval based stats
printing. This provides an easy way to collect stats without code changes,
because opt.stats_print may not work (some binaries never exit).