Looking at the thread counts in our services, jemalloc's background thread
is useful, but mostly idle. Add a config option to tune down the number of threads.
The emitter can be used to produce structured json or tabular output. For now
it has no uses; in subsequent commits, I'll begin transitioning stats printing
code over.
"always" marks all user mappings as MADV_HUGEPAGE; while "never" marks all
mappings as MADV_NOHUGEPAGE. The default setting "default" does not change any
settings. Note that all the madvise calls are part of the default extent hooks
by design, so that customized extent hooks have complete control over the
mappings including hugepage settings.
We have a buffer overrun that manifests in the case where arena indices higher
than the number of CPUs are accessed before arena indices lower than the number
of CPUs. This fixes the bug and adds a test.
When allocating from dirty extents (which we always prefer if available), large
active extents can get split even if the new allocation is much smaller, in
which case the introduced fragmentation causes high long term damage. This new
option controls the threshold to reuse and split an existing active extent. We
avoid using a large extent for much smaller sizes, in order to reduce
fragmentation. In some workload, adding the threshold improves virtual memory
usage by >10x.
This option controls the max size when grow_retained. This is useful when we
have customized extent hooks reserving physical memory (e.g. 1G huge pages).
Without this feature, the default increasing sequence could result in fragmented
and wasted physical memory.
To avoid the high RSS caused by THP + low usage arena (i.e. THP becomes a
significant percentage), added a new "auto" option which will only start using
THP after a base allocator used up the first THP region. Starting from the
second hugepage (in a single arena), "auto" behaves the same as "always",
i.e. madvise hugepage right away.
As part of the metadata_thp support, We now have a separate swtich
(JEMALLOC_HAVE_MADVISE_HUGE) for MADV_HUGEPAGE availability. Use that instead
of JEMALLOC_THP (which doesn't guard pages_huge anymore) in tests.
Currently we have to log by writing something like:
static log_var_t log_a_b_c = LOG_VAR_INIT("a.b.c");
log (log_a_b_c, "msg");
This is sort of annoying. Let's just write:
log("a.b.c", "msg");
Currently, the log macro requires at least one argument after the format string,
because of the way the preprocessor handles varargs macros. We can hide some of
that irritation by pushing the extra arguments into a varargs function.
Forking a multithreaded process is dangerous but allowed, so long as the child
only executes async-signal-safe functions (e.g. exec). Add a test to ensure
that we don't break this behavior.
Add testing for background_thread:true, and condition a xallocx() -->
rallocx() escalation assertion to allow for spurious in-place rallocx()
following xallocx() failure.
Added opt.background_thread to enable background threads, which handles purging
currently. When enabled, decay ticks will not trigger purging (which will be
left to the background threads). We limit the max number of threads to NCPUs.
When percpu arena is enabled, set CPU affinity for the background threads as
well.
The sleep interval of background threads is dynamic and determined by computing
number of pages to purge in the future (based on backlog).
Instead of embedding a lock bit in rtree leaf elements, we associate extents
with a small set of mutexes. This gets us two things:
- We can use the system mutexes. This (hypothetically) protects us from
priority inversion, and lets us stop doing a backoff/sleep loop, instead
opting for precise wakeups from the mutex.
- Cuts down on the number of mutex acquisitions we have to do (from 4 in the
worst case to two).
We end up simplifying most of the rtree code (which no longer has to deal with
locking or concurrency at all), at the cost of additional complexity in the
extent code: since the mutex protecting the rtree leaf elements is determined by
reading the extent out of those elements, the initial read is racy, so that we
may acquire an out of date mutex. We re-check the extent in the leaf after
acquiring the mutex to protect us from this race.
Support millisecond resolution for decay times. Among other use cases
this makes it possible to specify a short initial dirty-->muzzy decay
phase, followed by a longer muzzy-->clean decay phase.
This resolves#812.
This removes the tsd macros (which are used only for tsd_t in real builds). We
break up the circular dependencies involving tsd.
We also move all tsd access through getters and setters. This allows us to
assert that we only touch data when tsd is in a valid state.
We simplify the usages of the x macro trick, removing all the customizability
(get/set, init, cleanup), moving the lifetime logic to tsd_init and tsd_cleanup.
This lets us make initialization order independent of order within tsd_t.
Add the extent_destroy_t extent destruction hook to extent_hooks_t, and
use it during arena destruction. This hook explicitly communicates to
the callee that the extent must be destroyed or tracked for later reuse,
lest it be permanently leaked. Prior to this change, retained extents
could unintentionally be leaked if extent retention was enabled.
This resolves#560.
Control use of munmap(2) via a run-time option rather than a
compile-time option (with the same per platform default). The old
behavior of --disable-munmap can be achieved with
--with-malloc-conf=munmap:false.
This partially resolves#580.