W/o retain, split and merge are disallowed on Windows. Avoid doing first-fit
which needs splitting almost always. Instead, try exact fit only and bail out
early.
Refactored core profiling codebase into two logical parts:
(a) `prof_data.c`: core internal data structure managing & dumping;
(b) `prof.c`: mutexes & outward-facing APIs.
Some internal functions had to be exposed out, but there are not
that many of them if the modularization is (hopefully) clean enough.
`prof.c` is growing too long, so trying to modularize it. There are
a few internal functions that had to be exposed but I think it is a
fair trade-off.
extent_register may only fail if the underlying extent and region got stolen /
coalesced before we lock. Avoid doing extent_leak (which purges the region)
since we don't really own the region.
This can only happen on Windows and with opt.retain disabled (which isn't the
default). The solution is suboptimal, however not a common case as retain is
the long term plan for all platforms anyway.
The VirtualAlloc and VirtualFree APIs are different because MEM_DECOMMIT cannot
be used across multiple VirtualAlloc regions. To properly support decommit,
only allow merge / split within the same region -- this is done by tracking the
"is_head" state of extents and not merging cross-region.
Add a new state is_head (only relevant for retain && !maps_coalesce), which is
true for the first extent in each VirtualAlloc region. Determine if two extents
can be merged based on the head state, and use serial numbers for sanity checks.
The original logic can be disastrous if `PROF_DUMP_BUFSIZE` is less
than `slen` -- `prof_dump_buf_end + slen <= PROF_DUMP_BUFSIZE` would
always be `false`, so `memcpy` would always try to copy
`PROF_DUMP_BUFSIZE - prof_dump_buf_end` chars, which can be
dangerous: in the last round of the `while` loop it would not only
illegally read the memory beyond `s` (which might not always be
disastrous), but it would also illegally overwrite the memory beyond
`prof_dump_buf` (which can be pretty disastrous). `slen` probably
has never gone beyond `PROF_DUMP_BUFSIZE` so we were just lucky.
`cbopaque` can now be overriden without overriding `write_cb` in
the first place. (Otherwise there would be no need to have the
`cbopaque` parameter in `malloc_message`.)
Background threads may run for a long time, especially when the # of dirty pages
is high. Avoid blocking stats calls because of this (which may cause latency
spikes).
The new experimental mallctl exposes the arena pactive counter to applications,
which allows fast read w/o going through the mallctl / epoch steps. This is
particularly useful when frequent balancing is required, e.g. when having
multiple manual arenas, and threads are multiplexed to them based on usage.
If the confirm_conf option is set, when the program starts, each of
the four malloc_conf strings will be printed, and each option will
be printed when being set.
GCC-9.1 reports following error when trying to compile file
src/malloc_io.c and with CFLAGS='-Werror' :
src/malloc_io.c: In function ‘malloc_vsnprintf’:
src/malloc_io.c:369:2: error: case label value exceeds maximum value for type [-Werror]
369 | case '?' | 0x80: \
| ^~~~
src/malloc_io.c:581:5: note: in expansion of macro ‘GET_ARG_NUMERIC’
581 | GET_ARG_NUMERIC(val, 'p');
| ^~~~~~~~~~~~~~~
...
<snip>
cc1: all warnings being treated as errors
make: *** [Makefile:388: src/malloc_io.sym.o] Error 1
The warning is reported as by default the type 'char' is 'signed char'
and or-ing 0x80 will turn the case label char negative which will be
beyond the printable ascii range (0 - 127).
The patch fixes this by explicitly casting the 'len' variable as
unsigned char' inside the 'switch' statement so that value of
expression " '?' | 0x80 " falls within the legal values of the
variable 'len'.
Small is added purely for convenience. Large flushes wasn't tracked before and
can be useful in analysis. Large fill simply reports nmalloc, since there is no
batch fill for large currently.
This option saves a few CPU cycles, but potentially adds a lot of
fragmentation - so much so that there are workarounds like
max_active. Instead, let's just drop it entirely. It only made
a difference in one service I tested (.3% cpu regression), while
many services saw a memory win (also small, less than 1% mem P99)
When config_stats is enabled track the size of bin->slabs_nonfull in
the new nonfull_slabs counter in bin_stats_t. This metric should be
useful for establishing an upper ceiling on the savings possible by
meshing.
Summary: sdallocx is checking a flag that will never be set (at least in the provided C++ destructor implementation). This branch will probably only rarely be mispredicted however it removes two instructions in sdallocx and one at the callsite (to zero out flags).
This is discovered and suggested by @jasone in #1468. When custom extent hooks
are in use, we should ensure page alignment on the extent alloc path, instead of
relying on the user hooks to do so.
The analytics tool is put under experimental.utilization namespace in
mallctl. Input is one pointer or an array of pointers and the output
is a list of memory utilization statistics.
Proposed fix for #1444 - ensure that `tls_callback` in the `#pragma comment(linker)`directive gets the same prefix added as it does i the C declaration.
This feature uses an dedicated arena to handle huge requests, which
significantly improves VM fragmentation. In production workload we tested it
often reduces VM size by >30%.
For low arena count settings, the huge threshold feature may trigger an unwanted
bg thd creation. Given that the huge arena does eager purging by default,
bypass bg thd creation when initializing the huge arena.
When custom extent_hooks or transparent huge pages are in use, the purging
semantics may change, which means we may not get zeroed pages on repopulating.
Fixing the issue by manually memset for such cases.
This makes it possible to have multiple set of bins in an arena, which improves
arena scalability because the bins (especially the small ones) are always the
limiting factor in production workload.
A bin shard is picked on allocation; each extent tracks the bin shard id for
deallocation. The shard size will be determined using runtime options.
If there are 3 or more threads spin-waiting on the same mutex,
there will be excessive exclusive cacheline contention because
pthread_trylock() immediately tries to CAS in a new value, instead
of first checking if the lock is locked.
This diff adds a 'locked' hint flag, and we will only spin wait
without trylock()ing while set. I don't know of any other portable
way to get the same behavior as pthread_mutex_lock().
This is pretty easy to test via ttest, e.g.
./ttest1 500 3 10000 1 100
Throughput is nearly 3x as fast.
This blames to the mutex profiling changes, however, we almost never
have 3 or more threads contending in properly configured production
workloads, but still worth fixing.
Refactor tcache_fill, introducing a new function arena_slab_reg_alloc_batch,
which will fill multiple pointers from a slab.
There should be no functional changes here, but allows future optimization
on reg_alloc_batch.
Add unsized and sized deallocation fastpaths. Similar to the malloc()
fastpath, this removes all frame manipulation for the majority of
free() calls. The performance advantages here are less than that
of the malloc() fastpath, but from prod tests seems to still be half
a percent or so of improvement.
Stats and sampling a both supported (sdallocx needs a sampling check,
for rtree lookups slab will only be set for unsampled objects).
We don't support flush, any flush requests go to the slowpath.
We eagerly coalesce large buffers when deallocating, however the previous logic
around this introduced extra lock overhead -- when coalescing we always lock the
neighbors even if they are active, while for active extents nothing can be done.
This commit checks if the neighbor extents are potentially active before
locking, and avoids locking if possible. This speeds up large_dalloc by ~20%.
It also fixes some undesired behavior: we could stop coalescing because a small
buffer was merged, while a large neighbor was ignored on the other side.
When retain is enabled, the default dalloc hook does nothing (since we avoid
munmap). But the overhead preparing the call is high, specifically the extent
de-register and re-register involve locking and extent / rtree modifications.
Bypass the call with retain in this diff.
This diff adds a fastpath that assumes size <= SC_LOOKUP_MAXCLASS, and
that we hit tcache. If either of these is false, we fall back to
the previous codepath (renamed 'malloc_default').
Crucially, we only tail call malloc_default, and with the same kind
and number of arguments, so that both clang and gcc tail-calling
will kick in - therefore malloc() gets treated as a leaf function,
and there are *no* caller-saved registers. Previously malloc() contained
5 caller saved registers on x64, resulting in at least 10 extra
memory-movement instructions.
In microbenchmarks this results in up to ~10% improvement in malloc()
fastpath. In real programs, this is a ~1% CPU and latency improvement
overall.
The experimental `smallocx` API is not exposed via header files,
requiring the users to peek at `jemalloc`'s source code to manually
add the external declarations to their own programs.
This should reinforce that `smallocx` is experimental, and that `jemalloc`
does not offer any kind of backwards compatiblity or ABI gurantees for it.
---
Motivation:
This new experimental memory-allocaction API returns a pointer to
the allocation as well as the usable size of the allocated memory
region.
The `s` in `smallocx` stands for `sized`-`mallocx`, attempting to
convey that this API returns the size of the allocated memory region.
It should allow C++ P0901r0 [0] and Rust Alloc::alloc_excess to make
use of it.
The main purpose of these APIs is to improve telemetry. It is more accurate
to register `smallocx(size, flags)` than `smallocx(nallocx(size), flags)`,
for example. The latter will always line up perfectly with the existing
size classes, causing a loss of telemetry information about the internal
fragmentation induced by potentially poor size-classes choices.
Instrumenting `nallocx` does not help much since user code can cache its
result and use it repeatedly.
---
Implementation:
The implementation adds a new `usize` option to `static_opts_s` and an `usize`
variable to `dynamic_opts_s`. These are then used to cache the result of
`sz_index2size` and similar functions in the code paths in which they are
unconditionally invoked. In the code-paths in which these functions are not
unconditionally invoked, `smallocx` calls, as opposed to `mallocx`, these
functions explicitly.
---
[0]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0901r0.html
generation of sub bytes_until_sample, usize; je; for x86 arch.
Subtraction is unconditional, and only flags are checked for the jump,
no extra compare is necessary. This also reduces register pressure.
If we assume SC_LARGE_MAXCLASS will always fit in a SSIZE_T, then we can
optimize some checks by unconditional subtraction, and then checking flags
only, without a compare statement in x86.
in case `malloc_read_fd` returns a negative error number, the result
would afterwards be casted to an unsigned size_t, and may have
theoretically caused an out-of-bounds memory access in the following
`strncmp` call.
This makes it directly use MAP_EXCL and MAP_ALIGNED() instead
of weird workarounds involving mapping at random places and then
unmapping parts of them.
- Make API more clear for using as standalone json emitter
- Support cases that weren't possible before, e.g.
- emitting primitive values in an array
- emitting nested arrays
In case of multithreaded fork, we want to leave the child in a reasonable state,
in which tsd_nominal_tsds is either empty or contains only the forking thread.
The global data is mostly only used at initialization, or for easy access to
values we could compute statically. Instead of consuming that space (and
risking TLB misses), we can just pass around a pointer to stack data during
bootstrapping.
The largest small class, smallest large class, and largest large class may all
be needed down fast paths; to avoid the risk of touching another cache line, we
can make them available as constants.
I.e., parse before booting the bin module or sz module. This lets us tweak size
class settings before committing to them by letting them leak into other
modules.
This commit does not actually do any tweaking of the size classes; it *just*
chanchanges bootstrapping order; this may help bisecting any bootstrapping
failures on poorly-tested architectures.
This is the last big step in making size classes a runtime computation rather
than a configure-time one.
The compile-time computation has been left in, for now, to allow assertion
checking that the results are identical.
This class removes almost all the dependencies on size_classes.h, accessing the
data there only via the new module sc.h, which does not depend on any
configuration options.
In a subsequent commit, we'll remove the configure-time size class computations,
doing them at boot time, instead.
Before this commit jemalloc produced many warnings when compiled with -Wextra
with both Clang and GCC. This commit fixes the issues raised by these warnings
or suppresses them if they were spurious at least for the Clang and GCC
versions covered by CI.
This commit:
* adds `JEMALLOC_DIAGNOSTIC` macros: `JEMALLOC_DIAGNOSTIC_{PUSH,POP}` are
used to modify the stack of enabled diagnostics. The
`JEMALLOC_DIAGNOSTIC_IGNORE_...` macros are used to ignore a concrete
diagnostic.
* adds `JEMALLOC_FALLTHROUGH` macro to explicitly state that falling
through `case` labels in a `switch` statement is intended
* Removes all UNUSED annotations on function parameters. The warning
-Wunused-parameter is now disabled globally in
`jemalloc_internal_macros.h` for all translation units that include
that header. It is never re-enabled since that header cannot be
included by users.
* locally suppresses some -Wextra diagnostics:
* `-Wmissing-field-initializer` is buggy in older Clang and GCC versions,
where it does not understanding that, in C, `= {0}` is a common C idiom
to initialize a struct to zero
* `-Wtype-bounds` is suppressed in a particular situation where a generic
macro, used in multiple different places, compares an unsigned integer for
smaller than zero, which is always true.
* `-Walloc-larger-than-size=` diagnostics warn when an allocation function is
called with a size that is too large (out-of-range). These are suppressed in
the parts of the tests where `jemalloc` explicitly does this to test that the
allocation functions fail properly.
* adds a new CI build bot that runs the log unit test on CI.
Closes#1196 .
The feature allows using a dedicated arena for huge allocations. We want the
addtional arena to separate huge allocation because: 1) mixing small extents
with huge ones causes fragmentation over the long run (this feature reduces VM
size significantly); 2) with many arenas, huge extents rarely get reused across
threads; and 3) huge allocations happen way less frequently, therefore no
concerns for lock contention.
Previously, we made the user deal with this themselves, but that's not good
enough; if hooks may allocate, we should test the allocation pathways down
hooks. If we're doing that, we might as well actually implement the protection
for the user.
The hook module allows a low-reader-overhead way of finding hooks to invoke and
calling them.
For now, none of the allocation pathways are tied into the hooks; this will come
later.
"Hooks" is really the best name for the module that will contain the publicly
exposed hooks. So lets rename the current "hooks" module (that hook external
dependencies, for reentrancy testing) to "test_hooks".
When configured with --with-lg-page, it's possible for the configured page size
to be greater than the system page size, in which case the page address may only
be aligned with the system page size.
Previously, we would leak the extent and memory associated with a salvageable
portion of an extent that we were trying to split in three, in the case where
the first split attempt succeeded and the second failed.
Looking at the thread counts in our services, jemalloc's background thread
is useful, but mostly idle. Add a config option to tune down the number of threads.
preserve_lru feature adds lots of complication, for little value.
Removing it means merged extents are re-added to the lru list, and may
take longer to madvise away than they otherwise would.
Canaries after removal seem flat for several services (no change).
"always" marks all user mappings as MADV_HUGEPAGE; while "never" marks all
mappings as MADV_NOHUGEPAGE. The default setting "default" does not change any
settings. Note that all the madvise calls are part of the default extent hooks
by design, so that customized extent hooks have complete control over the
mappings including hugepage settings.
We have a buffer overrun that manifests in the case where arena indices higher
than the number of CPUs are accessed before arena indices lower than the number
of CPUs. This fixes the bug and adds a test.
On glibc and Android's bionic, strerror_r returns char* when
_GNU_SOURCE is defined.
Add a configure check for this rather than assume glibc is the
only libc that behaves this way.
We compute the max size required to satisfy an alignment. However this can be
quite pessimistic, especially with frequent reuse (and combined with state-based
fragmentation). This commit adds one more fit step specific to aligned
allocations, searching in all potential fit size classes.
The arena-associated stats are now all prefixed with arena_stats_, and live in
their own file. Likewise, malloc_bin_stats_t -> bin_stats_t, also in its own
file.
When purging, large allocations are usually the ones that cross the npages_limit
threshold, simply because they are "large". This means we often leave the large
extent around for a while, which has the downsides of: 1) high RSS and 2) more
chance of them getting fragmented. Given that they are not likely to be reused
very soon (LRU), let's over purge by 1 extent (which is often large and not
reused frequently).
Coalescing is a small price to pay for large allocations since they happen less
frequently. This reduces fragmentation while also potentially improving
locality.
When allocating from dirty extents (which we always prefer if available), large
active extents can get split even if the new allocation is much smaller, in
which case the introduced fragmentation causes high long term damage. This new
option controls the threshold to reuse and split an existing active extent. We
avoid using a large extent for much smaller sizes, in order to reduce
fragmentation. In some workload, adding the threshold improves virtual memory
usage by >10x.