As pointed out in #2434, the thread_name in prof_tdata_t was changed in #2407.
This also requires an update for the prof_recent dump, specifically the emitter
expects a "char **" which is fixed in this commit.
Static analysis flagged this. `extent_record` was passing `NULL` as the
value for `coalesced` to `extent_try_coalesce`, which in turn passes
that argument to `extent_try_coalesce_impl`, where it is written to
without checking if it is `NULL`. I can confirm from reviewing the
fleetwide coredump data that this was in fact being hit in production.
The codebase is already very disciplined in making any function which
can be `static`, but there are a few that appear to have slipped through
the cracks.
`edata_cmp_summary_comp` is one of the very hottest functions, taking up
3% of all time spent inside Jemalloc. I noticed that all existing
callsites rely only on the sign of the value returned by this function,
so I came up with this equivalent branchless implementation which
preserves this property. After empirical measurement, I have found that
this implementation is 30% faster, therefore representing a 1% speed-up
to the allocator as a whole.
At @interwq's suggestion, I've applied the same optimization to
`edata_esnead_comp` in case this function becomes hotter in the future.
This codepath may generate deferred work when the HPA is enabled.
See also [@davidtgoldblatt's relevant comment on the PR which
introduced this](https://github.com/jemalloc/jemalloc/pull/2107#discussion_r699770967)
which prevented a similarly incorrect `assert` from being added elsewhere.
It appears like a simple typo means we're unconditionally overwriting
some fields in hpa_from_pai when asserts are enabled. From hpa_shard_init,
it looks like these fields have these values anyway, so this shouldn't
cause bugs, but if something is wrong it seems better to have these
asserts in place.
See issue #2412.
Decay should not be triggered during reentrant calls (may cause lock order
reversal / deadlocks). Added a delay_trigger flag to the tickers to bypass
decay when rentrancy_level is not zero.
This lowered the sizeof(prof_tdata_t) from 200 to 192 which is a round size
class. Afterwards the tdata_t size remain unchanged with the last commit, which
effectively inlined the storage of thread names for free.
The previous approach managed the thread name in a separate buffer, which causes
races because the thread name update (triggered by new samples) can happen at
the same time as prof dumping (which reads the thread names) -- these two
operations are under separate locks to avoid blocking each other. Implemented
the thread name storage as part of the tdata struct, which resolves the lifetime
issue and also avoids internal alloc / dalloc during prof_sample.
Also fixes what looks like an off by one error in the lazy aux list
merge part of the code that previously never touched the last node in
the aux list.
It turns out that the previous commit did not suffice since the
JEMALLOC_SYS_NOTHROW definition also causes the same exception specification
errors as JEMALLOC_USE_CXX_THROW did:
```
x86_64-pc-linux-musl-cc -std=gnu11 -Werror=unknown-warning-option -Wall -Wextra -Wshorten-64-to-32 -Wsign-compare -Wundef -Wno-format-zero-length -Wpointer-
arith -Wno-missing-braces -Wno-missing-field-initializers -pipe -g3 -fvisibility=hidden -Wimplicit-fallthrough -O3 -funroll-loops -march=native -O2 -pipe -c -march=native -O2 -pipe -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/background_thread.o src/background_thread.c
In file included from src/jemalloc_cpp.cpp:9:
In file included from include/jemalloc/internal/jemalloc_preamble.h:27:
include/jemalloc/internal/../jemalloc.h:254:32: error: exception specification in declaration does not match previous declaration
void JEMALLOC_SYS_NOTHROW *je_malloc(size_t size)
^
include/jemalloc/internal/../jemalloc.h:75:21: note: expanded from macro 'je_malloc'
^
/usr/x86_64-pc-linux-musl/include/stdlib.h:40:7: note: previous declaration is here
void *malloc (size_t);
^
```
On systems using the musl C library we have to omit the exception specification
on malloc function family like it's done for MacOS, FreeBSD and OpenBSD.
The current thread name reading path updates the name every time, which requires
both alloc and dalloc -- and the temporary NULL value in the middle causes races
where the prof dump read path gets NULLed in the middle.
Minimize the changes in this commit to isolate the bugfix testing; will also
refactor the whole thread name paths later.
The added hooks hooks.prof_sample and hooks.prof_sample_free are intended to
allow advanced users to track additional information, to enable new ways of
profiling on top of the jemalloc heap profile and sample features.
The sample hook is invoked after the allocation and backtracing, and forwards
the both the allocation and backtrace to the user hook; the sample_free hook
happens before the actual deallocation, and forwards only the ptr and usz to the
hook.
Summary:
Per issue #2356, some CXX compilers may optimize away the
new/delete operation in stress/cpp/microbench.cpp.
Thus, this commit (1) bumps the time interval to 1 if it is 0, and
(2) modifies the pointers in the microbench to volatile.
Allows the use of getenv() rather than secure_getenv() to read MALLOC_CONF.
This helps in situations where hosts are under full control, and setting
MALLOC_CONF is needed while also setuid. Disabled by default.
Previously if a thread does only allocations, it stays on the slow path /
minimal initialized state forever. However, dealloc-only is a valid pattern for
dedicated reclamation threads -- this means thread cache is disabled (no batched
flush) for them, which causes high overhead and contention.
Added the condition to fully initialize TSD when a fair amount of dealloc
activities are observed.
No currently-available version of Visual Studio C compiler supports
variable length arrays, even if it defines __STDC_VERSION__ >= C99.
As far as I know Microsoft has no plans to ever support VLAs in MSVC.
The C11 standard requires that the __STDC_NO_VLA__ macro be defined if
the compiler doesn't support VLAs, so fall back to alloca() if so.
With `--with-jemalloc-prefix=` and without `-fno-builtin` or `-O1` both clang and gcc may optimize out `malloc` calls
whose result is unused. Comparing result to NULL also doesn't necessarily count as being used.
This won't be a problem in most client programs as this only concerns really unused pointers, but in
tests it's important to actually execute allocations.
`-fno-builtin` should disable this optimization for both gcc and clang, and applying it only to tests code shouldn't hopefully be an issue.
Another alternative is to force "use" of result but that'd require more changes and may miss some other optimization-related issues.
This should resolve https://github.com/jemalloc/jemalloc/issues/2091