Summary:
Add support for C++17 over-aligned allocation:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0035r4.html
Supporting all 10 operators means we avoid thunking through libstdc++-v3/libsupc++ and instead call jemalloc directly.
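For reference, the ten aligned overloads that P0035R4 adds are:
```
void *operator new(std::size_t size, std::align_val_t al);
void *operator new[](std::size_t size, std::align_val_t al);
void *operator new(std::size_t size, std::align_val_t al, const std::nothrow_t &) noexcept;
void *operator new[](std::size_t size, std::align_val_t al, const std::nothrow_t &) noexcept;
void operator delete(void *ptr, std::align_val_t al) noexcept;
void operator delete[](void *ptr, std::align_val_t al) noexcept;
void operator delete(void *ptr, std::align_val_t al, const std::nothrow_t &) noexcept;
void operator delete[](void *ptr, std::align_val_t al, const std::nothrow_t &) noexcept;
void operator delete(void *ptr, std::size_t size, std::align_val_t al) noexcept;
void operator delete[](void *ptr, std::size_t size, std::align_val_t al) noexcept;
```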
Of particular note is the aligned *and sized* operator delete:
```
void operator delete(void* ptr, std::size_t size, std::align_val_t al) noexcept;
```
If jemalloc did not provide this, the default implementation would ignore the size parameter entirely:
https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/libsupc%2B%2B/del_opsa.cc#L30-L33
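Since jemalloc does provide it, both the size and the alignment can flow straight through. As a rough sketch of the idea, using jemalloc's public sdallocx()/MALLOCX_ALIGN() API (the actual implementation in this change differs in detail):
```
#include <jemalloc/jemalloc.h>
#include <cstddef>
#include <new>

// Sketch only: forwarding both size and alignment lets jemalloc resolve
// the extent without a size lookup.  MALLOCX_ALIGN() encodes the
// alignment into the flags argument.
void
operator delete(void *ptr, std::size_t size, std::align_val_t al) noexcept {
	if (ptr != nullptr) {
		sdallocx(ptr, size, MALLOCX_ALIGN(static_cast<std::size_t>(al)));
	}
}
```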
(This also requires updating ax_cxx_compile_stdcxx.m4 to a newer version with C++17 support.)
Test Plan:
Wrote a simple test that allocates and then deletes an over-aligned type:
```
struct alignas(32) Foo {};
Foo *f;
int main()
{
    f = new Foo;
    delete f;
}
```
Before this change, both new and delete go through the PLT, and we end up calling plain free:
```
(gdb) disassemble
Dump of assembler code for function main():
...
0x00000000004029b7 <+55>: call 0x4022d0 <_ZnwmSt11align_val_t@plt>
...
0x00000000004029d5 <+85>: call 0x4022e0 <_ZdlPvmSt11align_val_t@plt>
...
(gdb) s
free (ptr=0x7ffff6408020) at /home/engshare/third-party2/jemalloc/master/src/jemalloc.git-trunk/src/jemalloc.c:2842
2842 if (!free_fastpath(ptr, 0, false)) {
```
After this change, we call operator new/delete directly and ultimately call sdallocx:
```
(gdb) disassemble
Dump of assembler code for function main():
...
0x0000000000402b77 <+55>: call 0x496ca0 <operator new(unsigned long, std::align_val_t)>
...
0x0000000000402b95 <+85>: call 0x496e60 <operator delete(void*, unsigned long, std::align_val_t)>
...
(gdb) s
116 je_sdallocx_noflags(ptr, size);
```
Make the prof sample prng use the tsd prng_state. This allows us to properly
initialize the sample interval event without having to create tdata. As a
result, tdata will be created on demand (when a thread reaches the sample
interval in allocated bytes) instead of on the first allocation.
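For intuition, here is a minimal sketch of the idea, not jemalloc's actual code (jemalloc uses its internal prng_* helpers and a tsd-resident state; the names below are hypothetical):
```
#include <cmath>
#include <cstdint>

// Sketch: advance a thread-local PRNG state (xorshift here; must be
// seeded nonzero) and invert the geometric CDF to pick how many
// allocated bytes to skip before the next sample.  Because the state
// lives in TSD, no tdata is needed to initialize the threshold.
static uint64_t
next_sample_threshold(uint64_t *prng_state, uint64_t avg_interval) {
	uint64_t x = *prng_state;
	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*prng_state = x;
	/* Map the top 53 bits to a double in [0, 1). */
	double u = (double)(x >> 11) * (1.0 / 9007199254740992.0); /* 2^53 */
	/* Geometric distribution with mean ~avg_interval bytes. */
	return (uint64_t)(std::log(1.0 - u) /
	    std::log(1.0 - 1.0 / (double)avg_interval)) + 1;
}
```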
Since r369414 (clang-10), Clang can check -Wimplicit-fallthrough for
C code and understands the GNU C style attribute for denoting fallthrough.
Move the detection from header-only to autoconf. The previous test used
brittle version detection, which did not work for newer Clang versions
that support this feature.
The attribute has to be its own statement, hence the added `;`. It can
also only precede case statements, so the final cases should be
explicitly terminated with break statements.
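A minimal illustration of the resulting pattern (macro and helper names are illustrative; jemalloc detects the attribute via autoconf rather than in the preprocessor as done here):
```
#include <cstdio>

#ifndef __has_attribute
#  define __has_attribute(x) 0
#endif
#if __has_attribute(fallthrough)
#  define FALLTHROUGH __attribute__((__fallthrough__))
#else
#  define FALLTHROUGH
#endif

static void step_zero(void) { std::puts("zero"); }
static void step_one(void) { std::puts("one"); }

static void
dispatch(int n) {
	switch (n) {
	case 0:
		step_zero();
		FALLTHROUGH;	/* its own statement, directly before a case */
	case 1:
		step_one();
		break;		/* final case ends in an explicit break */
	}
}
```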
Fixes commit 3d29d11ac2 ("Clean compilation -Wextra")
Link: 1e0affb6e5
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
`tcache_bin_info` is not accessed on the malloc fast path, but the
compiler reserves a register for it, as well as an additional
register for `tcache_bin_info[ind].stack_size`. This optimization
removes the need for both registers.
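Roughly, the fix is to keep the bound the fast path compares against inside the bin itself. A sketch under assumed field names:
```
#include <cstdint>

// Sketch: with the empty-position bound stored per bin, the fast path
// no longer loads &tcache_bin_info or tcache_bin_info[ind].stack_size,
// so neither needs to stay live in a register.
typedef struct cache_bin_s {
	void **cur_ptr;			/* next item to pop */
	uint16_t low_bits_empty;	/* low bits of the empty position */
} cache_bin_t;

static inline void *
cache_bin_alloc_fast(cache_bin_t *bin) {
	if ((uint16_t)(uintptr_t)bin->cur_ptr == bin->low_bits_empty) {
		return nullptr;		/* empty: fall back to the slow path */
	}
	return *bin->cur_ptr++;
}
```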
This change suppresses tdata initialization and prof sample threshold
updates in interrupting malloc calls, which have no need for tdata.
Delaying tdata creation aligns better with our lazy tdata creation
principle; it also helps us regain control from interrupting calls more
quickly and reduces the risk of delegating tdata creation to an
interrupting call.
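A minimal sketch of the gating (names hypothetical): a call is treated as interrupting when it re-enters the allocator, e.g. from an allocation hook, and the prof path bails out early for it:
```
// Sketch only: skip all prof bookkeeping for re-entrant (interrupting)
// malloc calls; tdata creation and sample threshold updates happen only
// in normal, top-level calls.
static bool
prof_should_handle_sample_event(unsigned reentrancy_level) {
	if (reentrancy_level > 0) {
		return false;	/* interrupting call: leave prof state untouched */
	}
	return true;		/* normal call: may create tdata on demand */
}
```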
Specifically, the extent_arena_[g|s]et functions and the address randomization:
these are the only things that tie the extent struct itself to the arena code.
Added a new stats row that aggregates the maximum value of the mutex counters
across background threads. Given that the per-background-thread mutex is not
expected to be contended, this counter is mainly for sanity checking / debugging.
A low_water value of -1 indicates that the cache has been depleted and
refilled; track this status explicitly in the tcache struct instead.
This allows the fast path to check (cur_ptr > low_water) instead of >=,
which avoids taking the slow path when the last item is allocated.
With the cache bin metadata switched to pointers, ncached_max is usually
accessed after being multiplied by sizeof(ptr). Store the result in
tcache_bin_info for direct access, and add a helper function for retrieving
the ncached_max value.
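A sketch of the helper under assumed names, with stack_size holding the precomputed ncached_max * sizeof(void *):
```
#include <cstdint>

// Sketch: the byte size of the bin's stack is what the hot paths want,
// so store it directly; ncached_max is derived on the rare paths that
// need it.
typedef struct cache_bin_info_s {
	uint16_t stack_size;	/* ncached_max * sizeof(void *), precomputed */
} cache_bin_info_t;

static inline uint16_t
cache_bin_info_ncached_max(const cache_bin_info_t *info) {
	return (uint16_t)(info->stack_size / sizeof(void *));
}
```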
Implement the pointer-based metadata for tcache bins (see the sketch below) --
- 3 pointers are maintained to represent each bin;
- 2 of the pointers are compressed on 64-bit;
- is_full / is_empty are done through pointer comparison.
Compared to the previous counter-based design --
- ~15% fast-path speedup in benchmarks;
- direct pointer comparison and dereference;
- no need to access tcache_bin_info in the common case.
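A rough sketch of the layout and the resulting fast path, with assumed field names (on 64-bit, the full and empty positions are stored compressed as their low 16 bits, which suffices because a bin's stack spans far less than 64 KiB):
```
#include <cstdint>

typedef struct cache_bin_s {
	void **cur_ptr;			/* full pointer: next item to pop */
	uint16_t low_bits_full;		/* compressed position when the bin is full */
	uint16_t low_bits_empty;	/* compressed position when the bin is empty */
} cache_bin_t;

static inline bool
cache_bin_is_empty(const cache_bin_t *bin) {
	/* is_empty / is_full reduce to (compressed) pointer comparisons. */
	return (uint16_t)(uintptr_t)bin->cur_ptr == bin->low_bits_empty;
}

static inline bool
cache_bin_is_full(const cache_bin_t *bin) {
	return (uint16_t)(uintptr_t)bin->cur_ptr == bin->low_bits_full;
}

static inline void *
cache_bin_pop(cache_bin_t *bin) {
	/* Common case: one comparison and one dereference; no counter
	 * arithmetic and no tcache_bin_info access. */
	if (cache_bin_is_empty(bin)) {
		return nullptr;
	}
	return *bin->cur_ptr++;
}
```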
The JSON format is largely meant for machine-to-machine communication, so
add a more compact output option to the emitter. According to local testing,
the savings in output bytes are around 50% for stats printing and around 25%
for prof log printing.