Instead of embedding a lock bit in rtree leaf elements, we associate extents
with a small set of mutexes. This gets us two things:
- We can use the system mutexes. This (hypothetically) protects us from
priority inversion, and lets us stop doing a backoff/sleep loop, instead
opting for precise wakeups from the mutex.
- Cuts down on the number of mutex acquisitions we have to do (from 4 in the
worst case to two).
We end up simplifying most of the rtree code (which no longer has to deal with
locking or concurrency at all), at the cost of additional complexity in the
extent code: since the mutex protecting the rtree leaf elements is determined by
reading the extent out of those elements, the initial read is racy, so that we
may acquire an out of date mutex. We re-check the extent in the leaf after
acquiring the mutex to protect us from this race.
This removes the tsd macros (which are used only for tsd_t in real builds). We
break up the circular dependencies involving tsd.
We also move all tsd access through getters and setters. This allows us to
assert that we only touch data when tsd is in a valid state.
We simplify the usages of the x macro trick, removing all the customizability
(get/set, init, cleanup), moving the lifetime logic to tsd_init and tsd_cleanup.
This lets us make initialization order independent of order within tsd_t.
Two levels of rcache is implemented: a direct mapped cache as L1, combined with
a LRU cache as L2. The L1 cache offers low cost on cache hit, but could suffer
collision under circumstances. This is complemented by the L2 LRU cache, which
is slower on cache access (overhead from linear search + reordering), but solves
collison of L1 rather well.
1) Re-organize TSD so that frequently accessed fields are closer to the
beginning and more compact. Assuming 64-bit, the first 2.5 cachelines now
contains everything needed on tcache fast path, expect the tcache struct itself.
2) Re-organize tcache and tbins. Take lg_fill_div out of tbin, and reduce tbin
to 24 bytes (down from 32). Split tbins into tbins_small and tbins_large, and
place tbins_small close to the beginning.
If a single virtual adddress pointer has enough unused bits to pack
{szind_t, extent_t *, bool, bool}, use a single pointer-sized field in
each rtree leaf element, rather than using three separate fields. This
has little impact on access speed (fewer loads/stores, but more bit
twiddling), except that denser representation increases TLB
effectiveness.
Expand and restructure the rtree API such that all common operations can
be achieved with minimal work, regardless of whether the rtree leaf
fields are independent versus packed into a single atomic pointer.
This allows leaf elements to differ in size from internal node elements.
In principle it would be more correct to use a different type for each
level of the tree, but due to implementation details related to atomic
operations, we use casts anyway, thus counteracting the value of
additional type correctness. Furthermore, such a scheme would require
function code generation (via cpp macros), as well as either unwieldy
type names for leaves or type aliases, e.g.
typedef struct rtree_elm_d2_s rtree_leaf_elm_t;
This alternate strategy would be more correct, and with less code
duplication, but probably not worth the complexity.
In the process, I changed the implementation of rtree_elm_acquire so that it
won't even try to CAS if its initial read (getting the extent + lock bit)
indicates that the CAS is doomed to fail. This can significantly improve
performance under contention.
Rather than dynamically building a table to aid per level computations,
define a constant table at compile time. Omit both high and low
insignificant bits. Use one to three tree levels, depending on the
number of significant bits.
Rework rtree_ctx_t to encapsulate an rtree leaf LRU lookup cache rather
than a single-path element lookup cache. The replacement is logically
much simpler, as well as slightly faster in the fast path case and less
prone to degraded performance during non-trivial sequences of lookups.
This is part of a broader change to make header files better represent the
dependencies between one another (see
https://github.com/jemalloc/jemalloc/issues/533). It breaks up component headers
into smaller parts that can be made to have a simpler dependency graph.
For the autogenerated headers (smoothstep.h and size_classes.h), no splitting
was necessary, so I didn't add support to emit multiple headers.