Background threads may run for a long time, especially when the number of dirty
pages is high. Avoid blocking stats calls on them, since that may cause latency
spikes.
The new experimental mallctl exposes the arena pactive counter to applications,
which allows fast reads without going through the usual mallctl / epoch steps.
This is particularly useful when frequent balancing is required, e.g. with
multiple manual arenas where threads are multiplexed to them based on usage.
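As a rough usage sketch (the mallctl name and the pointer-returning convention
below are assumptions based on the description above, not a documented
interface):

    #include <stdio.h>
    #include <jemalloc/jemalloc.h>

    /* Hypothetical mallctl name; the real experimental namespace entry may
     * differ -- the point is that it hands back a pointer to the live
     * counter. */
    static const size_t *
    get_pactive_ptr(unsigned arena_ind) {
        const size_t *pactivep = NULL;
        size_t sz = sizeof(pactivep);
        char cmd[64];
        snprintf(cmd, sizeof(cmd), "experimental.arenas.%u.pactivep",
            arena_ind);
        if (mallctl(cmd, (void *)&pactivep, &sz, NULL, 0) != 0) {
            return NULL;
        }
        return pactivep;
    }

    int
    main(void) {
        const size_t *pactivep = get_pactive_ptr(0);
        if (pactivep != NULL) {
            /* Cheap read on every balancing decision -- no epoch refresh. */
            printf("arena 0 active pages: %zu\n", *pactivep);
        }
        return 0;
    }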
If the confirm_conf option is set, when the program starts, each of
the four malloc_conf strings will be printed, and each option will
be printed as it is set.
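As a usage sketch, the option can be enabled through any of the conf sources,
e.g. the application-provided malloc_conf global (the narenas setting is
arbitrary, included only so the per-option echo has something to print):

    #include <stdlib.h>

    /* jemalloc reads this global as one of its malloc_conf strings; the same
     * options can also come via the MALLOC_CONF environment variable. */
    const char *malloc_conf = "confirm_conf:true,narenas:4";

    int
    main(void) {
        /* With confirm_conf set, each conf string and each option is echoed
         * while jemalloc bootstraps on the first allocation. */
        void *p = malloc(1);
        free(p);
        return 0;
    }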
GCC 9.1 reports the following error when compiling src/malloc_io.c with
CFLAGS='-Werror':
src/malloc_io.c: In function ‘malloc_vsnprintf’:
src/malloc_io.c:369:2: error: case label value exceeds maximum value for type [-Werror]
369 | case '?' | 0x80: \
| ^~~~
src/malloc_io.c:581:5: note: in expansion of macro ‘GET_ARG_NUMERIC’
581 | GET_ARG_NUMERIC(val, 'p');
| ^~~~~~~~~~~~~~~
...
<snip>
cc1: all warnings being treated as errors
make: *** [Makefile:388: src/malloc_io.sym.o] Error 1
The warning is reported because 'char' defaults to 'signed char', so or-ing
in 0x80 pushes the case label value beyond the range a signed char can hold
(its maximum, 127, is also the top of the printable ASCII range).
The patch fixes this by explicitly casting the 'len' variable to
'unsigned char' inside the 'switch' statement, so that the value of the
expression " '?' | 0x80 " falls within the legal range of 'len'.
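A minimal standalone sketch of the problem and the fix; it mirrors the shape
of the failing macro, not the actual malloc_io.c code:

    #include <stdio.h>

    static void
    classify(char len) {
        /* Without the cast, the controlling expression is a (signed) char,
         * and GCC 9.1 rejects the 'p' | 0x80 label under -Werror because
         * 0xF0 exceeds the maximum value a signed char can hold. */
        switch ((unsigned char)len) {
        case 'p':
            printf("plain 'p'\n");
            break;
        case 'p' | 0x80:
            printf("'p' with the high bit set\n");
            break;
        default:
            break;
        }
    }

    int
    main(void) {
        classify('p');
        classify((char)('p' | 0x80));
        return 0;
    }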
Small is added purely for convenience. Large flushes weren't tracked before and
can be useful in analysis. Large fill simply reports nmalloc, since there is
currently no batch fill for large.
This option saves a few CPU cycles, but potentially adds a lot of
fragmentation - so much so that there are workarounds like
max_active. Instead, let's just drop it entirely. It only made
a difference in one service I tested (0.3% CPU regression), while
many services saw a memory win (also small, less than 1% P99 memory).
When config_stats is enabled, track the size of bin->slabs_nonfull in
the new nonfull_slabs counter in bin_stats_t. This metric should be
useful for establishing an upper bound on the savings possible from
meshing.
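A rough sketch of the bookkeeping this implies (the helper names are
illustrative; the real update sites live in the bin slab management code):

    #include <stdbool.h>
    #include <stddef.h>

    /* Stand-in for jemalloc's compile-time stats flag. */
    static const bool config_stats = true;

    /* Sketch of the new counter; the real bin_stats_t has many more
     * fields. */
    typedef struct bin_stats_s {
        size_t nonfull_slabs;    /* Current length of bin->slabs_nonfull. */
    } bin_stats_t;

    /* Called when a slab is pushed onto / popped off the nonfull heap. */
    static inline void
    nonfull_slabs_inc(bin_stats_t *stats) {
        if (config_stats) {
            stats->nonfull_slabs++;
        }
    }

    static inline void
    nonfull_slabs_dec(bin_stats_t *stats) {
        if (config_stats) {
            stats->nonfull_slabs--;
        }
    }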
Summary: sdallocx is checking a flag that will never be set (at least in the provided C++ destructor implementation). This branch is probably only rarely mispredicted; however, removing it saves two instructions in sdallocx and one at the call site (to zero out flags).
This was discovered and suggested by @jasone in #1468. When custom extent hooks
are in use, we should ensure page alignment on the extent alloc path, instead of
relying on the user hooks to do so.
The analytics tool is put under the experimental.utilization namespace in
mallctl. The input is one pointer or an array of pointers, and the output
is a list of memory utilization statistics.
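A rough usage sketch, assuming the single-pointer query takes the pointer
through newp and returns a small block of counters through oldp; the mallctl
name and output layout here are assumptions to be checked against the manual:

    #include <stdio.h>
    #include <stdlib.h>
    #include <jemalloc/jemalloc.h>

    int
    main(void) {
        void *p = malloc(100);

        /* Buffer for the returned counters; the real output is a small
         * struct, a size_t array is used here purely as a placeholder. */
        size_t util[8];
        size_t util_sz = sizeof(util);

        /* Name and calling convention assumed from the description above. */
        if (mallctl("experimental.utilization.query", (void *)util, &util_sz,
            (void *)&p, sizeof(p)) == 0) {
            printf("got %zu bytes of utilization stats for %p\n", util_sz, p);
        }

        free(p);
        return 0;
    }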
Proposed fix for #1444 - ensure that `tls_callback` in the `#pragma comment(linker)` directive gets the same prefix added as it does in the C declaration.
This feature uses a dedicated arena to handle huge requests, which
significantly improves VM fragmentation. In the production workloads we tested,
it often reduces VM size by >30%.
For low arena count settings, the huge threshold feature may trigger unwanted
background thread creation. Given that the huge arena does eager purging by
default, bypass background thread creation when initializing the huge arena.
When custom extent_hooks or transparent huge pages are in use, the purging
semantics may change, which means we may not get zeroed pages when repopulating.
Fix the issue by memset-ting manually in such cases.
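A minimal sketch of the workaround, assuming the caller knows whether the
repopulated pages are guaranteed zeroed (names are illustrative):

    #include <stdbool.h>
    #include <string.h>

    /* Illustrative only: zero the range ourselves whenever the extent hooks /
     * THP setup cannot guarantee that purged pages come back zeroed. */
    static void
    repopulate_maybe_zero(void *addr, size_t size, bool zeroed_guaranteed) {
        if (!zeroed_guaranteed) {
            memset(addr, 0, size);
        }
    }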
This makes it possible to have multiple sets of bins in an arena, which improves
arena scalability because the bins (especially the small ones) are always the
limiting factor in production workloads.
A bin shard is picked on allocation; each extent tracks the bin shard id for
deallocation. The shard size will be determined using runtime options.
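A sketch of the allocation-side shard pick (the spreading function and the
recorded field are illustrative; the actual policy and option syntax are
defined elsewhere):

    #include <stdint.h>

    /* Illustrative sketch -- not the actual jemalloc data structures. */
    typedef struct extent_sketch_s {
        unsigned binshard;    /* Recorded at alloc; reused on dealloc. */
    } extent_sketch_t;

    /* Pick a shard for the allocating thread; any cheap spreading function
     * works, a multiplicative hash of a per-thread seed is used here. */
    static inline unsigned
    bin_shard_pick(uint64_t thread_seed, unsigned nshards) {
        return (unsigned)((thread_seed * 2654435761ULL) % nshards);
    }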
If there are 3 or more threads spin-waiting on the same mutex,
there will be excessive exclusive cacheline contention, because
pthread_mutex_trylock() immediately tries to CAS in a new value
instead of first checking whether the lock is already held.
This diff adds a 'locked' hint flag; while it is set, we only spin-wait
without trylock()ing. I don't know of any other portable
way to get the same behavior as pthread_mutex_lock().
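A simplified sketch of the idea using C11 atomics for the hint flag next to a
pthread mutex; field names and the spin limit are illustrative:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    /* Illustrative only -- a pthread mutex plus a "locked" hint flag. */
    typedef struct hint_mutex_s {
        pthread_mutex_t lock;
        atomic_bool locked;    /* Hint: true while some thread holds lock. */
    } hint_mutex_t;

    #define SPIN_LIMIT 1000

    static void
    hint_mutex_lock(hint_mutex_t *m) {
        for (int i = 0; i < SPIN_LIMIT; i++) {
            /* Spin on a plain read while the hint says "locked", instead of
             * hammering trylock (whose CAS bounces the cacheline). */
            if (!atomic_load_explicit(&m->locked, memory_order_relaxed)
                && pthread_mutex_trylock(&m->lock) == 0) {
                atomic_store(&m->locked, true);
                return;
            }
        }
        /* Fall back to a blocking lock after spinning. */
        pthread_mutex_lock(&m->lock);
        atomic_store(&m->locked, true);
    }

    static void
    hint_mutex_unlock(hint_mutex_t *m) {
        atomic_store(&m->locked, false);
        pthread_mutex_unlock(&m->lock);
    }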
This is pretty easy to test via ttest, e.g.
./ttest1 500 3 10000 1 100
Throughput is nearly 3x as fast.
This regression blames to the mutex profiling changes. However, we almost never
have 3 or more threads contending in properly configured production
workloads; it is still worth fixing.