`tcache->prof_accumbytes` should always be cleared after being
transferred to arena; otherwise the allocations would be double
counted, leading to excessive prof dumps.
JSON format is largely meant for machine-machine communication, so
adding the option to the emitter. According to local testing, the
savings in terms of bytes outputted is around 50% for stats printing
and around 25% for prof log printing.
Refactored core profiling codebase into two logical parts:
(a) `prof_data.c`: core internal data structure managing & dumping;
(b) `prof.c`: mutexes & outward-facing APIs.
Some internal functions had to be exposed out, but there are not
that many of them if the modularization is (hopefully) clean enough.
Prof logging is conceptually seperate from core profiling, so
split it out as a module of its own. There are a few internal
functions that had to be exposed but I think it is a fair trade-off.
Augmented the tsd layout graph so that the two recently added fields,
`offset_state` and `bytes_until_sample`, are properly reflected.
As is shown, the cache footprint is 16 bytes larger than before.
W/o retain, split and merge are disallowed on Windows. Avoid doing first-fit
which needs splitting almost always. Instead, try exact fit only and bail out
early.
Refactored core profiling codebase into two logical parts:
(a) `prof_data.c`: core internal data structure managing & dumping;
(b) `prof.c`: mutexes & outward-facing APIs.
Some internal functions had to be exposed out, but there are not
that many of them if the modularization is (hopefully) clean enough.
`prof.c` is growing too long, so trying to modularize it. There are
a few internal functions that had to be exposed but I think it is a
fair trade-off.
extent_register may only fail if the underlying extent and region got stolen /
coalesced before we lock. Avoid doing extent_leak (which purges the region)
since we don't really own the region.
This can only happen on Windows and with opt.retain disabled (which isn't the
default). The solution is suboptimal, however not a common case as retain is
the long term plan for all platforms anyway.
The VirtualAlloc and VirtualFree APIs are different because MEM_DECOMMIT cannot
be used across multiple VirtualAlloc regions. To properly support decommit,
only allow merge / split within the same region -- this is done by tracking the
"is_head" state of extents and not merging cross-region.
Add a new state is_head (only relevant for retain && !maps_coalesce), which is
true for the first extent in each VirtualAlloc region. Determine if two extents
can be merged based on the head state, and use serial numbers for sanity checks.
The original logic can be disastrous if `PROF_DUMP_BUFSIZE` is less
than `slen` -- `prof_dump_buf_end + slen <= PROF_DUMP_BUFSIZE` would
always be `false`, so `memcpy` would always try to copy
`PROF_DUMP_BUFSIZE - prof_dump_buf_end` chars, which can be
dangerous: in the last round of the `while` loop it would not only
illegally read the memory beyond `s` (which might not always be
disastrous), but it would also illegally overwrite the memory beyond
`prof_dump_buf` (which can be pretty disastrous). `slen` probably
has never gone beyond `PROF_DUMP_BUFSIZE` so we were just lucky.
`cbopaque` can now be overriden without overriding `write_cb` in
the first place. (Otherwise there would be no need to have the
`cbopaque` parameter in `malloc_message`.)
The file is included in the list of source files in Makefile.in,
but it is missing from the project files. This causes the
build to fail due to unresolved symbols.
Background threads may run for a long time, especially when the # of dirty pages
is high. Avoid blocking stats calls because of this (which may cause latency
spikes).
The new experimental mallctl exposes the arena pactive counter to applications,
which allows fast read w/o going through the mallctl / epoch steps. This is
particularly useful when frequent balancing is required, e.g. when having
multiple manual arenas, and threads are multiplexed to them based on usage.
If the confirm_conf option is set, when the program starts, each of
the four malloc_conf strings will be printed, and each option will
be printed when being set.
GCC-9.1 reports following error when trying to compile file
src/malloc_io.c and with CFLAGS='-Werror' :
src/malloc_io.c: In function ‘malloc_vsnprintf’:
src/malloc_io.c:369:2: error: case label value exceeds maximum value for type [-Werror]
369 | case '?' | 0x80: \
| ^~~~
src/malloc_io.c:581:5: note: in expansion of macro ‘GET_ARG_NUMERIC’
581 | GET_ARG_NUMERIC(val, 'p');
| ^~~~~~~~~~~~~~~
...
<snip>
cc1: all warnings being treated as errors
make: *** [Makefile:388: src/malloc_io.sym.o] Error 1
The warning is reported as by default the type 'char' is 'signed char'
and or-ing 0x80 will turn the case label char negative which will be
beyond the printable ascii range (0 - 127).
The patch fixes this by explicitly casting the 'len' variable as
unsigned char' inside the 'switch' statement so that value of
expression " '?' | 0x80 " falls within the legal values of the
variable 'len'.
Small is added purely for convenience. Large flushes wasn't tracked before and
can be useful in analysis. Large fill simply reports nmalloc, since there is no
batch fill for large currently.
This option saves a few CPU cycles, but potentially adds a lot of
fragmentation - so much so that there are workarounds like
max_active. Instead, let's just drop it entirely. It only made
a difference in one service I tested (.3% cpu regression), while
many services saw a memory win (also small, less than 1% mem P99)
When config_stats is enabled track the size of bin->slabs_nonfull in
the new nonfull_slabs counter in bin_stats_t. This metric should be
useful for establishing an upper ceiling on the savings possible by
meshing.