Refactored core profiling codebase into two logical parts:
(a) `prof_data.c`: core internal data structure managing & dumping;
(b) `prof.c`: mutexes & outward-facing APIs.
Some internal functions had to be exposed out, but there are not
that many of them if the modularization is (hopefully) clean enough.
`prof.c` is growing too long, so trying to modularize it. There are
a few internal functions that had to be exposed but I think it is a
fair trade-off.
The VirtualAlloc and VirtualFree APIs are different because MEM_DECOMMIT cannot
be used across multiple VirtualAlloc regions. To properly support decommit,
only allow merge / split within the same region -- this is done by tracking the
"is_head" state of extents and not merging cross-region.
Add a new state is_head (only relevant for retain && !maps_coalesce), which is
true for the first extent in each VirtualAlloc region. Determine if two extents
can be merged based on the head state, and use serial numbers for sanity checks.
`cbopaque` can now be overriden without overriding `write_cb` in
the first place. (Otherwise there would be no need to have the
`cbopaque` parameter in `malloc_message`.)
If the confirm_conf option is set, when the program starts, each of
the four malloc_conf strings will be printed, and each option will
be printed when being set.
Small is added purely for convenience. Large flushes wasn't tracked before and
can be useful in analysis. Large fill simply reports nmalloc, since there is no
batch fill for large currently.
When config_stats is enabled track the size of bin->slabs_nonfull in
the new nonfull_slabs counter in bin_stats_t. This metric should be
useful for establishing an upper ceiling on the savings possible by
meshing.
Mainly fixing typos. The only non-trivial change is in the
computation for SC_NPSIZES, though the result wouldn't be any
different when SC_NGROUP = 4 as is always the case at the moment.
Summary: sdallocx is checking a flag that will never be set (at least in the provided C++ destructor implementation). This branch will probably only rarely be mispredicted however it removes two instructions in sdallocx and one at the callsite (to zero out flags).
The analytics tool is put under experimental.utilization namespace in
mallctl. Input is one pointer or an array of pointers and the output
is a list of memory utilization statistics.
This fixes a build failure when integrating with FreeBSD's libc. This
regression was introduced by d1e11d48d4
(Move tsd link and in_hook after tcache.).
This feature uses an dedicated arena to handle huge requests, which
significantly improves VM fragmentation. In production workload we tested it
often reduces VM size by >30%.
For low arena count settings, the huge threshold feature may trigger an unwanted
bg thd creation. Given that the huge arena does eager purging by default,
bypass bg thd creation when initializing the huge arena.
When custom extent_hooks or transparent huge pages are in use, the purging
semantics may change, which means we may not get zeroed pages on repopulating.
Fixing the issue by manually memset for such cases.
This makes it possible to have multiple set of bins in an arena, which improves
arena scalability because the bins (especially the small ones) are always the
limiting factor in production workload.
A bin shard is picked on allocation; each extent tracks the bin shard id for
deallocation. The shard size will be determined using runtime options.
If there are 3 or more threads spin-waiting on the same mutex,
there will be excessive exclusive cacheline contention because
pthread_trylock() immediately tries to CAS in a new value, instead
of first checking if the lock is locked.
This diff adds a 'locked' hint flag, and we will only spin wait
without trylock()ing while set. I don't know of any other portable
way to get the same behavior as pthread_mutex_lock().
This is pretty easy to test via ttest, e.g.
./ttest1 500 3 10000 1 100
Throughput is nearly 3x as fast.
This blames to the mutex profiling changes, however, we almost never
have 3 or more threads contending in properly configured production
workloads, but still worth fixing.
For a free fastpath, we want something that will not make additional
calls. Assume most free() calls will hit the L1 cache, and use
a custom rtree function for this.
Additionally, roll the ptr=NULL check in to the rtree cache check.
Nearly all 32-bit powerpc hardware treats lwsync as sync, and some cores
(Freescale e500) trap lwsync as an illegal instruction, which then gets
emulated in the kernel. To avoid unnecessary traps on the e500, use
sync on all 32-bit powerpc. This pessimizes 32-bit software running on
64-bit hardware, but those numbers should be slim.