The hook module allows a low-reader-overhead way of finding hooks to invoke and
calling them.
For now, none of the allocation pathways are tied into the hooks; this will come
later.
"Hooks" is really the best name for the module that will contain the publicly
exposed hooks. So lets rename the current "hooks" module (that hook external
dependencies, for reentrancy testing) to "test_hooks".
When configured with --with-lg-page, it's possible for the configured page size
to be greater than the system page size, in which case the page address may only
be aligned with the system page size.
Previously, we would leak the extent and memory associated with a salvageable
portion of an extent that we were trying to split in three, in the case where
the first split attempt succeeded and the second failed.
Looking at the thread counts in our services, jemalloc's background thread
is useful, but mostly idle. Add a config option to tune down the number of threads.
preserve_lru feature adds lots of complication, for little value.
Removing it means merged extents are re-added to the lru list, and may
take longer to madvise away than they otherwise would.
Canaries after removal seem flat for several services (no change).
"always" marks all user mappings as MADV_HUGEPAGE; while "never" marks all
mappings as MADV_NOHUGEPAGE. The default setting "default" does not change any
settings. Note that all the madvise calls are part of the default extent hooks
by design, so that customized extent hooks have complete control over the
mappings including hugepage settings.
We have a buffer overrun that manifests in the case where arena indices higher
than the number of CPUs are accessed before arena indices lower than the number
of CPUs. This fixes the bug and adds a test.
On glibc and Android's bionic, strerror_r returns char* when
_GNU_SOURCE is defined.
Add a configure check for this rather than assume glibc is the
only libc that behaves this way.
We compute the max size required to satisfy an alignment. However this can be
quite pessimistic, especially with frequent reuse (and combined with state-based
fragmentation). This commit adds one more fit step specific to aligned
allocations, searching in all potential fit size classes.
The arena-associated stats are now all prefixed with arena_stats_, and live in
their own file. Likewise, malloc_bin_stats_t -> bin_stats_t, also in its own
file.
When purging, large allocations are usually the ones that cross the npages_limit
threshold, simply because they are "large". This means we often leave the large
extent around for a while, which has the downsides of: 1) high RSS and 2) more
chance of them getting fragmented. Given that they are not likely to be reused
very soon (LRU), let's over purge by 1 extent (which is often large and not
reused frequently).
Coalescing is a small price to pay for large allocations since they happen less
frequently. This reduces fragmentation while also potentially improving
locality.
When allocating from dirty extents (which we always prefer if available), large
active extents can get split even if the new allocation is much smaller, in
which case the introduced fragmentation causes high long term damage. This new
option controls the threshold to reuse and split an existing active extent. We
avoid using a large extent for much smaller sizes, in order to reduce
fragmentation. In some workload, adding the threshold improves virtual memory
usage by >10x.
While working on #852, I noticed the prng state is atomic. This is the only
atomic use of prng in all of jemalloc. Instead, use a threadlocal prng
state if possible to avoid unnecessary cache line contention.
Added an upper bound on how many pages we can decay during the current run.
Without this, decay could have unbounded increase in stashed, since other
threads could add new pages into the extents.
This option controls the max size when grow_retained. This is useful when we
have customized extent hooks reserving physical memory (e.g. 1G huge pages).
Without this feature, the default increasing sequence could result in fragmented
and wasted physical memory.
This attempts to use VM_OVERCOMMIT OID - newly introduced in -CURRENT
few days ago, specifically for this purpose - instead of querying the
sysctl by its string name. Due to how syctlbyname(3) works, this means
we do one syscall during binary startup instead of two.
Signed-off-by: Edward Tomasz Napierala <trasz@FreeBSD.org>
This avoids sysctl(2) syscall during binary startup, using the value
passed in the ELF aux vector instead.
Signed-off-by: Edward Tomasz Napierala <trasz@FreeBSD.org>
We observed that arena 0 can have much more metadata allocated comparing to
other arenas. Tune the auto mode to only switch to huge page on the 5th block
(instead of 3 previously) for a0.
Before this commit, extent_recycle_split intermingles the splitting of an extent
and the return of parts of that extent to a given extents_t. After it, that
logic is separated. This will enable splitting extents that don't live in any
extents_t (as the grow retained region soon will).
On x86 Linux, we define our own MADV_FREE if madvise(2) is available, but no
MADV_FREE is detected. This allows the feature to be built in and enabled with
runtime detection.
Since we allocate rtree nodes from a0's base, it's pushed to over 1 block on
initialization right away, which makes the auto thp mode less effective on a0.
We change a0 to make the switch on the 3rd block instead.
Quoting from https://github.com/jemalloc/jemalloc/issues/761 :
[...] reading the Power ISA documentation[1], the assembly in [the CPU_SPINWAIT
macro] isn't correct anyway (as @marxin points out): the setting of the
program-priority register is "sticky", and we never undo the lowering.
We could do something similar, but given that we don't have testing here in the
first place, I'm inclined to simply not try. I'll put something up reverting the
problematic commit tomorrow.
[1] Book II, chapter 3 of the 2.07B or 3.0B ISA documents.
There does not seem to be any overlap between usage of
extent_avail and extent_heap, so we can use the same hook.
The only remaining usage of rb trees is in the profiling code,
which has some 'interesting' iteration constraints.
Fixes#888
It's possible to build with lazy purge enabled but depoly to systems without
such support. In this case, rely on the boot time detection instead of keep
making unnecessary madvise calls (which all returns EINVAL).
If we guarantee no malloc activity in extent hooks, it's possible to make
customized hooks working on arena 0. Remove the non-a0 assertion to enable such
use cases.
To avoid the high RSS caused by THP + low usage arena (i.e. THP becomes a
significant percentage), added a new "auto" option which will only start using
THP after a base allocator used up the first THP region. Starting from the
second hugepage (in a single arena), "auto" behaves the same as "always",
i.e. madvise hugepage right away.
This eliminates the need for the arena stats code to "know" about tcaches; all
that it needs is a cache_bin_array_descriptor_t to tell it where to find
cache_bins whose stats it should aggregate.
This is the first step towards breaking up the tcache and arena (since they
interact primarily at the bin level). It should also make a future arena
caching implementation more straightforward.
As part of the metadata_thp support, We now have a separate swtich
(JEMALLOC_HAVE_MADVISE_HUGE) for MADV_HUGEPAGE availability. Use that instead
of JEMALLOC_THP (which doesn't guard pages_huge anymore) in tests.
The external linkage for spin_adaptive was not used, and the inline
declaration of spin_adaptive that was used caused a probem on FreeBSD
where CPU_SPINWAIT is implemented as a call to a static procedure for
x86 architectures.
If ptr is not page aligned, we know the allocation was not sampled. In this case
use the size passed into sdallocx directly w/o accessing rtree. This improve
sdallocx efficiency in the common case (not sampled && small allocation).
When retain is enabled, we should not attempt mmap for in-place expansion
(large_ralloc_no_move), because it's virtually impossible to succeed, and causes
unnecessary syscalls (which can cause lock contention under load).
Currently we have to log by writing something like:
static log_var_t log_a_b_c = LOG_VAR_INIT("a.b.c");
log (log_a_b_c, "msg");
This is sort of annoying. Let's just write:
log("a.b.c", "msg");
Currently, the log macro requires at least one argument after the format string,
because of the way the preprocessor handles varargs macros. We can hide some of
that irritation by pushing the extra arguments into a varargs function.
Older Linux systems don't have O_CLOEXEC. If that's the case, we fcntl
immediately after open, to minimize the length of the racy period in
which an
operation in another thread can leak a file descriptor to a child.
On OS X, we rely on the zone machinery to call our prefork and postfork
handlers.
In zone_force_unlock, we call jemalloc_postfork_child, reinitializing all our
mutexes regardless of state, since the mutex implementation will assert if the
tid of the unlocker is different from that of the locker. This has the effect
of unlocking the mutexes, but also fails to wake any threads waiting on them in
the parent.
To fix this, we track whether or not we're the parent or child after the fork,
and unlock or reinit as appropriate.
This resolves#895.
Passing is_background_thread down the decay path, so that background thread
itself won't attempt inactivity_check. This fixes an issue with background
thread doing trylock on a mutex it already owns.
We use the minimal_initilized tsd (which requires no cleanup) for free()
specifically, if tsd hasn't been initialized yet.
Any other activity will transit the state from minimal to normal. This is to
workaround the case where a thread has no malloc calls in its lifetime until
during thread termination, free() happens after tls destructors.
This issue caused the default extent alloc function to be incorrectly
used even when arena.<i>.extent_hooks is set. This bug was introduced
by 411697adcd (Use exponential series to
size extents.), which was first released in 5.0.0.
To avoid complications, avoid invoking pthread_create "internally", instead rely
on thread0 to launch new threads, and also terminating threads when asked.
Avoid holding arenas_lock and background_thread_lock when creating background
threads, because pthread_create may take internal locks, and potentially cause
deadlock with jemalloc internal locks.