In previous designs, this was intended to be a sort of cache that couldn't fail.
In the current design, we want to use it just as a contention reduction
mechanism. Rewrite it with those goals in mind.
This (experimental, undocumented) functionality can be used by users to track
various statistics of interest at a finer level of granularity than the thread.
Previously all the small size classes were cached. However this has downsides
-- particularly when page size is greater than 4K (e.g. iOS), which will result
in much higher SMALL_MAXCLASS.
This change allows tcache_max to be set to lower values, to better control
resources taken by tcache.
This functions more like the serial number strategy of the ecache and
hpa_central_t. Longer-lived slabs are more likely to continue to live for
longer in the future.
For locality reasons, tcache bins are integrated in TSD. Allowing all size
classes to be cached has little benefit, but takes up much thread local storage.
In addition, it complicates the layout which we try hard to optimize.
Without a lock held continuously between checking tcaches_past and incrementing
it, it's possible for two threads to go down manual creation path
simultaneously. If the number of tcaches is one less than the maximum, it's
possible for both to create a tcache and increment tcaches_past, with the second
thread returning a value larger than TCACHES_MAX.
This comes in handy when overriding earlier settings to test alternate ones. We
don't really include tests for this, but I claim that's OK here:
- It's fairly straightforward
- It's fairly hard to test well
- This entire code path is undocumented and mostly for our internal
experimentation in the first place.
- I tested manually.
This will be the centralized component of the coming hugepage allocator; the
source of larger chunks of memory from which smaller ones can be obtained.
This introduces a new sort of edata_t; a pageslab, and a set to manage them.
This is part of a series of a commits to implement a hugepage allocator; the
pageset will be per-arena, and track small page allocations requests within a
larger extent allocated from a centralized hugepage allocator.
This allows setting arenas per cpu dynamically, rather than forcing the user to
know the number of CPUs in advance if they want a particular CPU/space tradeoff.
The existing checks are good at finding such issues (on tcache flush), but not
so good at pinpointing them. Debug mode can find them, but sometimes debug mode
slows down a program so much that hard-to-hit bugs can take a long time to
crash.
This commit adds functionality to keep programs mostly on their fast paths,
while also checking every sized delete argument they get.
The sized dealloc checks called the generic safety_check_fail, and then called
abort. This means the failure case isn't mockable, hence not testable. Fix it
in anticipation of a coming diff.
This gives more accurate attribution of bytes and counts to stack traces,
without introducing backwards incompatibilities in heap-profile parsing tools.
We track the ideal reported (to the end user) number of bytes more carefully
inside core jemalloc. When dumping heap profiles, insteading of outputting our
counts directly, we output counts that will cause parsing tools to give a result
close to the value we want.
We retain the old version as an opt setting, to let users who are tracking
values on a per-component basis to keep their metrics stable until they decide
to switch.