This saves us a cache miss when looking up the arena bin offset in a remote
arena during tcache flush. All arenas share the base offset, and so we don't
need to look it up repeatedly for each arena. Secondarily, it shaves 288 bytes
off the arena on, e.g., x86-64.
The items we pick to flush matter a lot, but the order in which they get flushed
doesn't; just use forward scans. This simplifies the accessing code, both in
terms of the C and the generated assembly (i.e. this speeds up the flush
pathways).
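In code terms (a simplified sketch, not the actual flush routine), the access
pattern reduces to a plain forward loop over the already-chosen pointers:

    #include <stddef.h>

    /* Flush nflush already-selected pointers. Ordering is irrelevant, so
     * a forward scan is enough, and it indexes the array in the simplest
     * way the compiler can generate. */
    static void
    flush_ptrs(void **ptrs, size_t nflush, void (*flush_one)(void *)) {
        for (size_t i = 0; i < nflush; i++) {
            flush_one(ptrs[i]);
        }
    }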
By carefully force-inlining the division constants and the operation sum count,
we can eliminate redundant operations in the arena-level dalloc function. Do
so.
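One way to picture the division-constant part (a simplified sketch with
hypothetical names, not jemalloc's exact div code): the divisor for a size
class is precomputed into a magic multiplier, and force-inlining keeps that
constant visible at the call site, so the compiler emits a single
multiply-and-shift instead of reloading state and redoing work.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint32_t magic; } div_info_t;

    /* Precompute the constant for dividing by d; exact for multiples of d
     * below 2^32, assuming d >= 2 so the constant fits in 32 bits. */
    static void
    div_init(div_info_t *info, size_t d) {
        info->magic = (uint32_t)((((uint64_t)1 << 32) + d - 1) / d);
    }

    /* Force-inlined so a caller holding a known div_info_t sees just an
     * imul + shr, with no reload of the constant. */
    static inline __attribute__((always_inline)) size_t
    div_compute(const div_info_t *info, size_t n) {
        return (size_t)(((uint64_t)n * info->magic) >> 32);
    }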
This frontloads more of the miss latency. It also moves it to a pathway where
we have not yet acquired any locks, so that it should (hopefully) reduce hold
times.
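The shape of the change, as a sketch (hypothetical names): perform the metadata
lookup that is likely to miss before taking the lock, so the miss latency is
not spent inside the critical section.

    #include <pthread.h>

    typedef struct edata_s edata_t;

    extern pthread_mutex_t bin_lock;
    extern edata_t *metadata_lookup(const void *ptr);  /* likely cache miss */
    extern void dalloc_locked(edata_t *edata, void *ptr);

    static void
    dalloc(void *ptr) {
        /* Take the probable miss here, with no locks held... */
        edata_t *edata = metadata_lookup(ptr);

        /* ...so the critical section covers only the cheap part. */
        pthread_mutex_lock(&bin_lock);
        dalloc_locked(edata, ptr);
        pthread_mutex_unlock(&bin_lock);
    }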
In practice, many rtree_leaf_elm accesses are cache misses. By restructuring,
we can make it more likely that these misses don't block us from starting later
lookups, taking more of those misses in parallel.
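Sketch of the restructuring (simplified, with illustrative names): issue a
batch of independent lookups first and only then consume the results, so
several misses can be outstanding at once instead of serializing.

    #include <stddef.h>

    typedef struct rtree_leaf_elm_s rtree_leaf_elm_t;

    extern rtree_leaf_elm_t *rtree_lookup(const void *ptr);
    extern void process_elm(rtree_leaf_elm_t *elm, void *ptr);

    static void
    process_batch(void **ptrs, size_t n) {
        enum { BATCH = 8 };
        rtree_leaf_elm_t *elms[BATCH];
        for (size_t base = 0; base < n; base += BATCH) {
            size_t nbatch = (n - base < BATCH) ? n - base : BATCH;
            /* First pass: start all the lookups; their misses overlap. */
            for (size_t i = 0; i < nbatch; i++) {
                elms[i] = rtree_lookup(ptrs[base + i]);
            }
            /* Second pass: use the results once they have arrived. */
            for (size_t i = 0; i < nbatch; i++) {
                process_elm(elms[i], ptrs[base + i]);
            }
        }
    }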
This fixes an incorrect debug-mode assert:
- T1 starts an arena stats update and reads stack_head from another thread's
cache bin, when that cache bin has 1 item in it.
- T2 allocates from that cache bin. The cache_bin's stack_head now points to a
NULL pointer, since the cache bin is empty.
- T1 re-reads the cache_bin's stack_head to perform an assertion check (since it
previously saw that the bin was nonempty, whatever stack_head points to should be
non-NULL). The re-read observes the NULL that T2 left behind, and the assertion
fails spuriously.
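A simplified sketch of the racy shape (hypothetical code, not the actual
jemalloc assert): the bug is dereferencing a second, unsynchronized read of
stack_head inside the assertion itself.

    #include <assert.h>
    #include <stddef.h>

    /* Simplified model: cached pointers live in stack_head..stack_end-1,
     * and the slot at stack_end always holds a NULL sentinel, so an empty
     * bin's stack_head points at NULL. */
    typedef struct {
        void **stack_head;
        void **stack_end;
    } cache_bin_t;

    /* Run by a stats reader (T1) while the owning thread (T2) may be
     * allocating from the same bin concurrently. */
    static size_t
    racy_ncached(const cache_bin_t *bin) {
        size_t ncached = (size_t)(bin->stack_end - bin->stack_head);
        /* BUG: this dereferences a second, separate read of stack_head.
         * T2 may have emptied the bin between the two reads, so the slot
         * examined here can legitimately hold NULL even though ncached
         * was observed to be nonzero; asserting on the value from the
         * first read avoids the spurious failure. */
        assert(ncached == 0 || *bin->stack_head != NULL);
        return ncached;
    }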
Previously, all the small size classes were cached. However, this has downsides
-- particularly when the page size is greater than 4K (e.g. iOS), which results
in a much higher SMALL_MAXCLASS.
This change allows tcache_max to be set to lower values, to better control the
resources taken by the tcache.
For locality reasons, tcache bins are integrated into TSD. Allowing all size
classes to be cached has little benefit, but takes up a lot of thread-local
storage. In addition, it complicates the layout, which we try hard to optimize.
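As a usage sketch (assuming the tcache_max option name this change exposes, and
an unprefixed build where the compile-time option string is malloc_conf), an
application can cap caching at the 4 KiB size class:

    /* Cache only size classes up to 4096 bytes in each thread's tcache. */
    const char *malloc_conf = "tcache_max:4096";

The same value can also be supplied at run time through the MALLOC_CONF
environment variable.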
Without a lock held continuously between checking tcaches_past and incrementing
it, it's possible for two threads to go down the manual creation path
simultaneously. If the number of tcaches is one less than the maximum, it's
possible for both to create a tcache and increment tcaches_past, with the second
thread returning a value larger than TCACHES_MAX.
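The fix is the standard one for a check-then-act race: hold the lock across
both the capacity check and the increment. A minimal sketch (simplified; the
names and limit value are illustrative):

    #include <pthread.h>
    #include <stdbool.h>

    #define TCACHES_MAX 4096            /* illustrative value */

    static pthread_mutex_t tcaches_mtx = PTHREAD_MUTEX_INITIALIZER;
    static unsigned tcaches_past;

    /* Returns false and writes the reserved index on success, true at the
     * limit. Because the check and the increment sit under one lock
     * acquisition, two racing creators can't both pass the check when
     * only one slot remains. */
    static bool
    tcaches_reserve(unsigned *ind) {
        bool err = false;
        pthread_mutex_lock(&tcaches_mtx);
        if (tcaches_past >= TCACHES_MAX) {
            err = true;
        } else {
            *ind = tcaches_past++;
        }
        pthread_mutex_unlock(&tcaches_mtx);
        return err;
    }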
In making these settings configurable, 634afc4124
unintentionally changed a tuning parameter (reducing the "goal" max by a factor of
4). This commit undoes that change.
This lets us put more allocations on an "almost as fast" path after a flush.
This results in around a 4% reduction in malloc cycles in prod workloads
(corresponding to about a 0.1% reduction in overall cycles).
I.e., the tcache code just calls a cache-bin function to finish the flush (and
move pointers around, etc.). It no longer accesses the cache bin's owned memory
directly.
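The resulting boundary looks roughly like this (type and function names are
illustrative, not the exact API):

    #include <stddef.h>

    typedef struct cache_bin_s cache_bin_t;
    typedef struct cache_bin_ptr_array_s cache_bin_ptr_array_t;

    /* Owned by the cache-bin module: told how many pointers were actually
     * flushed, the bin compacts its own contents and updates its own
     * stack pointer. The tcache flush path calls this rather than doing a
     * memmove over bin-internal memory itself. */
    void cache_bin_finish_flush(cache_bin_t *bin, cache_bin_ptr_array_t *arr,
        size_t nflushed);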
Previously, we took an array of cache_bin_info_ts and an index, and did the
dereference ourselves. But the infos for other cache_bins aren't relevant to any
particular cache bin, so that should be the caller's job.
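Sketched as simplified prototypes (hypothetical, for illustration):

    #include <stddef.h>

    typedef struct cache_bin_s cache_bin_t;
    typedef struct cache_bin_info_s cache_bin_info_t;

    /* Before: each cache-bin function got the whole info array plus an
     * index it had no other use for. */
    void cache_bin_old(cache_bin_t *bin, const cache_bin_info_t *infos,
        size_t binind);

    /* After: the caller indexes infos[binind] once and passes only the
     * info this particular bin needs. */
    void cache_bin_new(cache_bin_t *bin, const cache_bin_info_t *info);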
This is debug only and we keep it off the fast path. Moving it here simplifies
the internal logic.
This never tries to junk-fill regions that were shrunk via xallocx. I think this
is fine for two reasons:
- The shrunk-with-xallocx case is rare.
- We didn't always do that before this diff anyway (it depends on the opt
settings and extent hooks in effect).
The small and large pathways share most of their logic, even if some of the
individual operations are different. We pull out the common logic into a
force-inlined function, and then specialize twice, once for each value of
"small".