Previously all the small size classes were cached. However this has downsides
-- particularly when page size is greater than 4K (e.g. iOS), which will result
in much higher SMALL_MAXCLASS.
This change allows tcache_max to be set to lower values, to better control
resources taken by tcache.
For locality reasons, tcache bins are integrated in TSD. Allowing all size
classes to be cached has little benefit, but takes up much thread local storage.
In addition, it complicates the layout which we try hard to optimize.
Without a lock held continuously between checking tcaches_past and incrementing
it, it's possible for two threads to go down manual creation path
simultaneously. If the number of tcaches is one less than the maximum, it's
possible for both to create a tcache and increment tcaches_past, with the second
thread returning a value larger than TCACHES_MAX.
In making these settings configurable, 634afc4124
unintentially changed a tuning parameter (reducing the "goal" max by a factor of
4). This commit undoes that change.
This lets us put more allocations on an "almost as fast" path after a flush.
This results in around a 4% reduction in malloc cycles in prod workloads
(corresponding to about a 0.1% reduction in overall cycles).
I.e. the tcache code just calls a cache-bin function to finish flush (and move
pointers around, etc.). It doesn't directly access the cache-bin's owned memory
any more.
Previously, we took an array of cache_bin_info_ts and an index, and dereferenced
ourselves. But infos for other cache_bins aren't relevant to any particular
cache bin, so that should be the caller's job.
This is debug only and we keep it off the fast path. Moving it here simplifies
the internal logic.
This never tries to junk on regions that were shrunk via xallocx. I think this
is fine for two reasons:
- The shrunk-with-xallocx case is rare.
- We don't always do that anyway before this diff (it depends on the opt
settings and extent hooks in effect).
The small and large pathways share most of their logic, even if some of the
individual operations are different. We pull out the common logic into a
force-inlined function, and then specialize twice, once for each value of
"small".
Previously, tcache fill/flush (as well as small alloc/dalloc on the arena) may
potentially drop the bin lock for slab_alloc and slab_dalloc. This commit
refactors the logic so that the slab calls happen in the same function / level
as the bin lock / unlock. The main purpose is to be able to use flat combining
without having to keep track of stack state.
In the meantime, this change reduces the locking, especially for slab_dalloc
calls, where nothing happens after the call.
Make the event module to accept two event types, and pass around the event
context. Use bytes-based events to trigger tcache GC on deallocation, and get
rid of the tcache ticker.