This pulls out the various abstractions where some stats counter is sometimes an
atomic, sometimes a plain variable, sometimes always protected by a lock,
sometimes protected by a lock for reads but not writes, etc. With this change,
these cases are treated consistently, and their access patterns are tagged.
In the process, we fix a few missed-update bugs (where one caller assumes
"protected-by-a-lock" semantics and another does not).
This is logically at a higher level of the stack; extent should just allocate
things at the page level; it shouldn't care exactly why the caller wants a
given number of pages.
Previously, large allocations in tcaches would have their sizes reduced during
stats estimation. Added a test, which fails before this change but passes now.
This fixes a bug introduced in 5934846612, which
was itself fixing a bug introduced in 9c0549007d.
This lets us put more allocations on an "almost as fast" path after a flush.
This results in around a 4% reduction in malloc cycles in prod workloads
(corresponding to about a 0.1% reduction in overall cycles).
With this, we track all of the empty, full, and low water states together. This
simplifies a lot of the tracking logic, since we now don't need the
cache_bin_info_t for state queries (except for some debugging).
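A rough sketch of the resulting layout (the field names approximate, rather
than reproduce, the real cache-bin struct):
```c
#include <stdint.h>

/*
 * Hedged sketch: state is tracked as a few 16-bit snapshots of the stack
 * pointer's low bits, so empty/full/low-water queries become integer
 * comparisons and no longer need cache_bin_info_t.
 */
typedef struct {
	void **stack_head;           /* Next item to pop on allocation.        */
	uint16_t low_bits_low_water; /* Deepest the stack has gotten (for GC). */
	uint16_t low_bits_full;      /* Head's low bits when the bin is full.  */
	uint16_t low_bits_empty;     /* Head's low bits when the bin is empty. */
} cache_bin_sketch_t;

static inline int
cache_bin_sketch_is_empty(const cache_bin_sketch_t *bin) {
	return (uint16_t)(uintptr_t)bin->stack_head == bin->low_bits_empty;
}
```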
I.e. the tcache code just calls a cache-bin function to finish flush (and move
pointers around, etc.). It doesn't directly access the cache-bin's owned memory
any more.
Previously, we took an array of cache_bin_info_ts and an index, and did the
dereference ourselves. But infos for other cache_bins aren't relevant to any particular
cache bin, so that should be the caller's job.
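In signature terms the change looks roughly like this (hypothetical prototypes,
just to show the shape of the interface):
```c
typedef struct cache_bin_s cache_bin_t;
typedef struct cache_bin_info_s cache_bin_info_t;

/* Before (sketch): the bin code took the whole infos array plus an index
 * and did the dereference itself. */
void cache_bin_init_old(cache_bin_t *bin, cache_bin_info_t *infos, unsigned ind);

/* After (sketch): the caller dereferences and passes in only the relevant info. */
void cache_bin_init_new(cache_bin_t *bin, cache_bin_info_t *info);
```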
This is debug only and we keep it off the fast path. Moving it here simplifies
the internal logic.
With this change, we never junk-fill regions that were shrunk via xallocx. I think this
is fine for two reasons:
- The shrunk-with-xallocx case is rare.
- We didn't always do it before this diff anyway (it depends on the opt
  settings and the extent hooks in effect).
The small and large pathways share most of their logic, even if some of the
individual operations are different. We pull out the common logic into a
force-inlined function, and then specialize twice, once for each value of
"small".
The only time sharing an rtree context across extent operations isn't a
no-op is when tsd is unavailable. But this happens only in situations like
thread death or initialization, and we don't care about shaving off every
possible cycle in such scenarios.
Previously, tcache fill/flush (as well as small alloc/dalloc on the arena) could
drop the bin lock for slab_alloc and slab_dalloc. This commit
refactors the logic so that the slab calls happen in the same function / level
as the bin lock / unlock. The main purpose is to be able to use flat combining
without having to keep track of stack state.
In the meantime, this change reduces the locking, especially for slab_dalloc
calls, where nothing happens after the call.
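A simplified sketch of the "after" shape (hypothetical names, not jemalloc's
real bin and slab types): lock, unlock, and the slab call all sit at the same
level, and since nothing after slab_dalloc needs the bin lock, the lock is
released before the call instead of being dropped and retaken deeper in the
stack.
```c
#include <pthread.h>
#include <stddef.h>

typedef struct slab_s slab_t;
typedef struct { pthread_mutex_t lock; } bin_sketch_t;

/* Hypothetical stand-ins for the locked flush work and the slab call. */
static void flush_items_locked(bin_sketch_t *bin, slab_t **slab_to_free) {
	(void)bin;
	*slab_to_free = NULL;
}
static void slab_dalloc(slab_t *slab) { (void)slab; }

void
bin_flush_sketch(bin_sketch_t *bin) {
	slab_t *slab_to_free;
	pthread_mutex_lock(&bin->lock);
	flush_items_locked(bin, &slab_to_free);  /* everything that needs the lock */
	pthread_mutex_unlock(&bin->lock);
	if (slab_to_free != NULL) {
		slab_dalloc(slab_to_free);       /* no lock held; nothing follows */
	}
}
```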
Check the is_head state before merging two extents. Disallow the merge if it
would cross two separate mmap regions. This enforces first-fit (by not losing the
SN) at a very small cost.
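A hedged sketch of the guard (simplified extent representation, not the real
struct):
```c
#include <stdbool.h>

typedef struct {
	void *addr;
	bool is_head;  /* True if this extent starts its own mmap region. */
} extent_sketch_t;

/*
 * Disallow merging a lower extent with an upper neighbor that heads a
 * separate mapping; merges therefore never cross mmap regions, and the
 * serial number used for first-fit ordering isn't lost.
 */
bool
extent_can_merge_sketch(const extent_sketch_t *lower, const extent_sketch_t *upper) {
	(void)lower;
	return !upper->is_head;
}
```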
Make the event module accept two event types, and pass around the event
context. Use bytes-based events to trigger tcache GC on deallocation, and get
rid of the tcache ticker.
- NetBSD overcommits
- When mapping pages, use the maximum of the requested alignment and the
  compiled-in PAGE constant, which might be greater than the current kernel
  page size, since we compile binaries with the maximum page size supported
  by the architecture (so that they work with all kernels); see the sketch below.
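A sketch of the sizing rule, using a placeholder value for the compiled-in
constant:
```c
#include <stddef.h>

/* Placeholder for the compiled-in maximum page size (e.g. 64 KiB). */
#define PAGE ((size_t)(64 * 1024))

/* Map with the larger of the requested alignment and PAGE, so a binary
 * built for the architecture's largest page size still behaves correctly
 * on kernels running with a smaller page size. */
static inline size_t
mapping_alignment(size_t requested_alignment) {
	return requested_alignment > PAGE ? requested_alignment : PAGE;
}
```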
Add options stats_interval and stats_interval_opts to allow interval based stats
printing. This provides an easy way to collect stats without code changes,
because opt.stats_print may not work (some binaries never exit).
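For example (assuming the option name from this change and an interval
expressed in bytes of allocation activity), stats could be requested through
the application-supplied config string, or through the MALLOC_CONF environment
variable with no code change at all:
```c
/*
 * Hedged example: ask for stats roughly every 1 GiB of allocation
 * activity via the application-supplied config string.
 */
const char *malloc_conf = "stats_interval:1073741824";
```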
Eventually, we may fully break off the extent module; but not for some time. If
it's going to live on in a non-transitory state, it might as well have the nicer
name.
What we call an arena_ind is really the index associated with some particular
set of ehooks; the arena is just the user-visible portion of that. Making this
explicit, and reframing checks in terms of that, makes the code simpler and
cleaner, and helps us avoid passing the arena itself all throughout extent code.
This lets us put back an arena-specific assert.
Previously, it was really more like extents_alloc (it looks in an ecache for an
extent to reuse as its primary allocation pathway). Make that pathway more
explicitly like extents_alloc, and rename extent_alloc_wrapper_hard accordingly.
This will eventually completely wrap the eset, and handle concurrency,
allocation, and deallocation. For now, we only pull out the mutex from the
eset.
If there are custom extent hooks, pages_can_purge_lazy is not necessarily the
right guard. We could check ehooks_are_default too, but the case where
purge_lazy is unsupported is rare and getting rarer. Just checking the decay
interval captures most of the benefit.
When deferred initialization was added, initializing required copying
sizeof(extent_hooks_t) bytes after a pointer chase. Today, it's just a single
pointer loaded from the base_t. In subsequent diffs, we'll get rid of even that.
Explicitly define three setters (sketched below):
- `prof_tctx_reset()`: set `prof_tctx` to `1U`, if we don't know in
advance whether the allocation is large or not;
- `prof_tctx_reset_sampled()`: set `prof_tctx` to `1U`, if we already
know in advance that the allocation is large;
- `prof_info_set()`: set a real `prof_tctx`, and also set other
profiling info e.g. the allocation time.
In terms of code structure, the prof level is kept as a thin wrapper, the
large level only provides low level setter APIs, and the arena level
carries out the main logic.
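A hedged sketch of the three setters' roles (simplified prototypes, not the
exact internal signatures):
```c
#include <stdint.h>

typedef struct prof_tctx_s prof_tctx_t;
typedef uint64_t nstime_sketch_t;  /* stand-in for the timestamp type */

/* Store the not-yet-sampled sentinel (prof_tctx == 1U) when we don't yet
 * know whether the allocation is large. */
void prof_tctx_reset(const void *ptr);

/* Same sentinel store, on a path where the allocation is already known
 * to be large, so the size lookup can be skipped. */
void prof_tctx_reset_sampled(const void *ptr);

/* Record a real tctx together with the other profiling info, e.g. the
 * allocation time. */
void prof_info_set(const void *ptr, prof_tctx_t *tctx, nstime_sketch_t alloc_time);
```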
Fold the tsd_state check into the event threshold check. The fast threshold is
set to 0 when tsd switches to a non-nominal state.
The fast_threshold can also be reset by remote threads, to reflect a
non-nominal tsd state change.
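A sketch of the folded check (hypothetical field names): because the fast
threshold is zeroed whenever tsd is non-nominal, a single comparison covers
both conditions.
```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
	uint64_t allocated_bytes;  /* running thread allocation counter     */
	uint64_t next_event_fast;  /* set to 0 when tsd becomes non-nominal */
} thread_event_sketch_t;

/* One branch on the fast path: it fails both when the next event is due
 * and when the threshold was zeroed, locally or by a remote thread. */
static inline bool
event_fastpath_ok(const thread_event_sketch_t *te, uint64_t usize) {
	return te->allocated_bytes + usize < te->next_event_fast;
}
```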
Develop a new data structure and code logic for holding profiling-related
information stored in the extent that may be needed after the
extent is released, which in particular is the case for the
reallocation code path (e.g. in `rallocx()` and `xallocx()`). The
data structure is a generalization of `prof_tctx_t`: previously we copied out
only the `prof_tctx` before the extent was released, but we may need
additional fields. Currently the only additional
field is the allocation time field, but there may be more fields in
the future.
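A hedged sketch of the generalized record (simplified types, not the exact
definition):
```c
#include <stdint.h>

typedef struct prof_tctx_s prof_tctx_t;
typedef uint64_t nstime_sketch_t;  /* stand-in for the timestamp type */

/*
 * Everything profiling needs after the extent is gone gets copied into
 * one record up front. Today that's the tctx plus the allocation time;
 * new fields can be added later without touching the copy-out sites.
 */
typedef struct {
	prof_tctx_t *alloc_tctx;
	nstime_sketch_t alloc_time;
} prof_info_sketch_t;
```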
The restructuring also resolved a bug: `prof_realloc()` mistakenly
passed the new `ptr` to `prof_free_sampled_object()`; simply passing in the
`old_ptr` instead would crash because it has already been released. Now
the essential profiling information is collectively copied out early
and safely passed to `prof_free_sampled_object()` after the extent is
released.
Summary:
Add support for C++17 over-aligned allocation:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0035r4.html
Supporting all 10 operators means we avoid thunking through libstdc++-v3/libsupc++ and just call jemalloc directly.
It's also worth noting that there is now an aligned *and sized* operator delete:
```
void operator delete(void* ptr, std::size_t size, std::align_val_t al) noexcept;
```
If jemalloc did not provide this, the default implementation would ignore the size parameter entirely:
https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/libsupc%2B%2B/del_opsa.cc#L30-L33
(This also requires updating ax_cxx_compile_stdcxx.m4 to a newer version with C++17 support.)
Test Plan:
Wrote a simple test that allocates and then deletes an over-aligned type:
```
struct alignas(32) Foo {};

Foo *f;

int main() {
    f = new Foo;
    delete f;
}
```
Before this change, both new and delete go through the PLT, and we end up calling regular old free:
```
(gdb) disassemble
Dump of assembler code for function main():
...
0x00000000004029b7 <+55>: call 0x4022d0 <_ZnwmSt11align_val_t@plt>
...
0x00000000004029d5 <+85>: call 0x4022e0 <_ZdlPvmSt11align_val_t@plt>
...
(gdb) s
free (ptr=0x7ffff6408020) at /home/engshare/third-party2/jemalloc/master/src/jemalloc.git-trunk/src/jemalloc.c:2842
2842 if (!free_fastpath(ptr, 0, false)) {
```
After this change, we directly call new/delete and ultimately call sdallocx:
```
(gdb) disassemble
Dump of assembler code for function main():
...
0x0000000000402b77 <+55>: call 0x496ca0 <operator new(unsigned long, std::align_val_t)>
...
0x0000000000402b95 <+85>: call 0x496e60 <operator delete(void*, unsigned long, std::align_val_t)>
...
(gdb) s
116 je_sdallocx_noflags(ptr, size);
```