1) Re-organize TSD so that frequently accessed fields are closer to the
beginning and more compact. Assuming 64-bit, the first 2.5 cachelines now
contains everything needed on tcache fast path, expect the tcache struct itself.
2) Re-organize tcache and tbins. Take lg_fill_div out of tbin, and reduce tbin
to 24 bytes (down from 32). Split tbins into tbins_small and tbins_large, and
place tbins_small close to the beginning.
The embedded tcache is initialized upon tsd initialization. The avail arrays
for the tbins will be allocated / deallocated accordingly during init / cleanup.
With this change, the pointer to the auto tcache will always be available, as
long as we have access to the TSD. tcache_available() (called in tcache_get())
is provided to check if we should use tcache.
This will facilitate embedding tcache into tsd, which will require proper
initialization cannot be done via the static initializer. Make tsd->rtree_ctx
to be initialized via rtree_ctx_data_init().
Compact extent_t to 128 bytes on 64-bit systems by moving
arena_slab_data_t's nfree into extent_t's e_bits.
Cacheline-align extent_t structures so that they always cross the
minimum number of cacheline boundaries.
Re-order extent_t fields such that all fields except the slab bitmap
(and overlaid heap profiling context pointer) are in the first
cacheline.
This resolves#461.
Remove tree-structured bitmap support, in order to reduce complexity and
ease maintenance. No bitmaps larger than 512 bits have been necessary
since before 4.0.0, and there is no current plan that would increase
maximum bitmap size. Although tree-structured bitmaps were used on
32-bit platforms prior to this change, the overall benefits were
questionable (higher metadata overhead, higher bitmap modification cost,
marginally lower search cost).
This fixes an extent searching regression on 32-bit systems, caused by
the initial bitmap_ffu() implementation in
c8021d01f6 (Implement bitmap_ffu(), which
finds the first unset bit.), as first used in
5d33233a5e (Use a bitmap in extents_t to
speed up search.).
A fixed max spin count is used -- with benchmark results showing it
solves almost all problems. As the benchmark used was rather intense,
the upper bound could be a little bit high. However it should offer a
good tradeoff between spinning and blocking.
Use tsd_rtree_ctx() rather than tsdn_rtree_ctx() when tcache is
non-NULL, in order to avoid an extra branch (and potentially extra stack
space) in the fast path.
If a single virtual adddress pointer has enough unused bits to pack
{szind_t, extent_t *, bool, bool}, use a single pointer-sized field in
each rtree leaf element, rather than using three separate fields. This
has little impact on access speed (fewer loads/stores, but more bit
twiddling), except that denser representation increases TLB
effectiveness.
Expand and restructure the rtree API such that all common operations can
be achieved with minimal work, regardless of whether the rtree leaf
fields are independent versus packed into a single atomic pointer.
This allows leaf elements to differ in size from internal node elements.
In principle it would be more correct to use a different type for each
level of the tree, but due to implementation details related to atomic
operations, we use casts anyway, thus counteracting the value of
additional type correctness. Furthermore, such a scheme would require
function code generation (via cpp macros), as well as either unwieldy
type names for leaves or type aliases, e.g.
typedef struct rtree_elm_d2_s rtree_leaf_elm_t;
This alternate strategy would be more correct, and with less code
duplication, but probably not worth the complexity.
Rather than storing usize only for large (and prof-promoted)
allocations, store the size class index for allocations that reside
within the extent, such that the size class index is valid for all
extents that contain extant allocations, and invalid otherwise (mainly
to make debugging simpler).
Split decay-based purging into two phases, the first of which uses lazy
purging to convert dirty pages to "muzzy", and the second of which uses
forced purging, decommit, or unmapping to convert pages to clean or
destroy them altogether. Not all operating systems support lazy
purging, yet the application may provide extent hooks that implement
lazy purging, so care must be taken to dynamically omit the first phase
when necessary.
The mallctl interfaces change as follows:
- opt.decay_time --> opt.{dirty,muzzy}_decay_time
- arena.<i>.decay_time --> arena.<i>.{dirty,muzzy}_decay_time
- arenas.decay_time --> arenas.{dirty,muzzy}_decay_time
- stats.arenas.<i>.pdirty --> stats.arenas.<i>.p{dirty,muzzy}
- stats.arenas.<i>.{npurge,nmadvise,purged} -->
stats.arenas.<i>.{dirty,muzzy}_{npurge,nmadvise,purged}
This resolves#521.
Refactor most of the decay-related functions to take as parameters the
decay_t and associated extents_t structures to operate on. This
prepares for supporting both lazy and forced purging on different decay
schedules.