Fix the logic in stats_print() such that if the "a" flag is passed in
without the "m" flag, merged statistics will be printed even if only one
arena is initialized.
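
The intended decision can be sketched as below. This is a minimal illustration, not the stats_print() source: should_print_merged() and its parameters are hypothetical names, and the sketch assumes "m" suppresses merged statistics while "a" suppresses per-arena statistics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a jemalloc function): decide whether merged
 * arena statistics should be printed for a given opts string.
 */
static bool
should_print_merged(const char *opts, unsigned ninitialized)
{
	bool merged = true;	/* Cleared by "m". */
	bool unmerged = true;	/* Cleared by "a". */
	size_t i;

	assert(opts != NULL);
	for (i = 0; opts[i] != '\0'; i++) {
		if (opts[i] == 'm')
			merged = false;
		else if (opts[i] == 'a')
			unmerged = false;
	}
	/*
	 * With a single arena, merged output would duplicate the per-arena
	 * output, so skip it unless per-arena output was suppressed.
	 */
	return (merged && (ninitialized > 1 || !unmerged));
}
```

With one initialized arena, opts "a" yields merged output, whereas an empty opts string does not.
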
Fix huge_ralloc() to remove the old memory region from the tree of huge
allocations *before* calling mremap(2), in order to make sure that no
other thread acquires the old memory region via mmap() and encounters
stale metadata in the tree.
Reported by: Rich Prohaska
Refactor the SO and REV definitions such that they are set via autoconf variables,
@so@ and @rev@. These variables are both needed by the jemalloc.sh
script, so this unifies their definitions.
Fix prof_lookup() to artificially raise curobjs for all paths through
the code that creates a new entry in the per thread bt2cnt hash table.
This fixes a race condition that could corrupt memory if prof_accum were
false and a non-default lg_prof_tcmax were used and/or threads were
destroyed.
Add a missing prof_malloc() call in allocm(). Before this fix, negative
object/byte counts could be observed in heap profiles for applications
that use allocm().
Rewrite prof_alloc_prep() as a cpp macro, PROF_ALLOC_PREP(), in order to
remove any doubt as to whether an additional stack frame is created.
Prior to this change, it was assumed that inlining would reduce the
total number of frames in the backtrace, but in practice behavior wasn't
completely predictable.
Create imemalign() and call it from posix_memalign(), memalign(), and
valloc(), so that all entry points require the same number of stack
frames to be ignored during backtracing.
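
The shape of this change can be sketched as below: three entry points that each call one shared worker directly, so the call depth from entry point to worker is identical. All names ending in _sketch are hypothetical, the page size is an assumed constant, and the over-allocating malloc() scheme is a stand-in for the real allocator. An optimizing compiler may still inline or tail-call these wrappers, so the sketch shows structure only.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE ((size_t)4096)	/* Assumed page size. */

/* Shared worker: over-allocate, align, and stash the raw pointer. */
static int
imemalign_sketch(void **memptr, size_t alignment, size_t size)
{
	void *raw;
	uintptr_t addr;

	assert(memptr != NULL);
	/* Require a power of two that is at least sizeof(void *). */
	if (alignment < sizeof(void *) || (alignment & (alignment - 1)) != 0)
		return (EINVAL);
	if (size > SIZE_MAX - alignment - sizeof(void *))
		return (ENOMEM);
	raw = malloc(size + alignment + sizeof(void *));
	if (raw == NULL)
		return (ENOMEM);
	addr = ((uintptr_t)raw + sizeof(void *) + alignment - 1) &
	    ~((uintptr_t)alignment - 1);
	((void **)addr)[-1] = raw;	/* Needed later to free the block. */
	*memptr = (void *)addr;
	return (0);
}

/* Every entry point below calls the worker directly. */
static int
posix_memalign_sketch(void **memptr, size_t alignment, size_t size)
{
	return (imemalign_sketch(memptr, alignment, size));
}

static void *
memalign_sketch(size_t alignment, size_t size)
{
	void *ret = NULL;

	if (imemalign_sketch(&ret, alignment, size) != 0)
		return (NULL);
	return (ret);
}

static void *
valloc_sketch(size_t size)
{
	void *ret = NULL;

	if (imemalign_sketch(&ret, SKETCH_PAGE, size) != 0)
		return (NULL);
	return (ret);
}

static void
aligned_free_sketch(void *ptr)
{
	if (ptr != NULL)
		free(((void **)ptr)[-1]);
}
```
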
Properly handle boundary conditions for sampled region promotion in
rallocm(). Prior to this fix, some combinations of 'size' and 'extra'
values could cause erroneous behavior. Additionally, size class
recording for promoted regions was incorrect.
Fix assertions in arena_purge() to accurately reflect the constraints in
arena_maybe_purge(). There were two bugs here, one of which merely
weakened the assertion, and the other of which referred to an
uninitialized variable (typo; used npurgatory instead of
arena->npurgatory).
Add inline assembly implementations of atomic_{add,sub}_uint{32,64}()
for x86/x64, in order to support compilers that are missing the relevant
gcc intrinsics.
sa2u() returns 0 on overflow, but the profiling code was blindly calling
sa2u() and allowing the error to silently propagate, ultimately ending
in a later assertion failure. Refactor all ipalloc() callers to call
sa2u(), check for overflow before calling ipalloc(), and pass usize
rather than size. This allows ipalloc() to avoid calling sa2u() in the
common case.
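
The caller-side pattern can be sketched as below. sa2u_sketch() is a deliberately simplified stand-in that only rounds up to a multiple of the alignment (the real sa2u() also accounts for size classes), and prep_usize_sketch() is a hypothetical caller. The sketch assumes a nonzero size, so a return value of 0 unambiguously means overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified stand-in for sa2u(): round size up to a multiple of alignment
 * (a power of two).  Returns 0 if the computation overflows size_t.
 */
static size_t
sa2u_sketch(size_t size, size_t alignment)
{
	size_t usize;

	assert(size != 0);
	assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
	usize = (size + alignment - 1) & ~(alignment - 1);
	if (usize < size)
		return (0);	/* size_t overflow. */
	return (usize);
}

/*
 * Hypothetical caller: compute usize once, reject overflow up front, and
 * hand usize (not size) to the allocation routine on success.
 */
static bool
prep_usize_sketch(size_t size, size_t alignment, size_t *usize)
{
	size_t u = sa2u_sketch(size, alignment);

	if (u == 0)
		return (false);	/* Fail early instead of asserting later. */
	*usize = u;
	return (true);
}
```
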
Fix a regression due to:
    Remove an arena_bin_run_size_calc() constraint.
    2a6f2af6e4
The removed constraint required that small run headers fit in one page,
which indirectly limited runs such that they would not cause overflow in
arena_run_regind(). Add an explicit constraint to
arena_bin_run_size_calc() based on the largest number of regions that
arena_run_regind() can handle (2^11 as currently configured).
Dynamically adjust tcache fill count (number of objects allocated per
tcache refill) such that if GC has to flush inactive objects, the fill
count gradually decreases. Conversely, if refills occur while the fill
count is depressed, the fill count gradually increases back to its
maximum value.
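
One way to picture the adjustment is sketched below. The type, the function names, and the choice to halve or double the fill count via a shift are assumptions made for illustration, as is treating a shift of 1 as the maximum fill count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-bin state for the sketch. */
typedef struct {
	unsigned ncached_max;	/* Upper bound on cached objects. */
	unsigned lg_fill_div;	/* Fill count is ncached_max >> lg_fill_div. */
} tbin_sketch_t;

/* Number of objects to allocate on the next refill. */
static unsigned
tbin_fill_count(const tbin_sketch_t *tbin)
{
	assert(tbin != NULL);
	return (tbin->ncached_max >> tbin->lg_fill_div);
}

/* GC had to flush inactive objects: halve the fill count, floor of 1. */
static void
tbin_gc_flushed(tbin_sketch_t *tbin)
{
	if ((tbin->ncached_max >> tbin->lg_fill_div) > 1)
		tbin->lg_fill_div++;
}

/* A refill occurred while depressed: double the fill count, up to max. */
static void
tbin_refilled(tbin_sketch_t *tbin)
{
	if (tbin->lg_fill_div > 1)
		tbin->lg_fill_div--;
}
```
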
pthread_mutex_lock() can call malloc() on OS X (!!!), which causes
deadlock. Work around this by using spinlocks built from lower-level
primitives.
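
For illustration only, a spinlock of this general kind can be sketched with C11 atomic_flag. This is a swapped-in technique that assumes a C11 compiler; it is not the primitive used in this change, and the spin_sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
	atomic_flag locked;
} spin_sketch_t;

#define SPIN_SKETCH_INITIALIZER { ATOMIC_FLAG_INIT }

/* Try once; returns true if the lock was acquired. */
static bool
spin_sketch_trylock(spin_sketch_t *l)
{
	return (!atomic_flag_test_and_set_explicit(&l->locked,
	    memory_order_acquire));
}

/* Busy-wait until the flag is acquired; this path never allocates. */
static void
spin_sketch_lock(spin_sketch_t *l)
{
	assert(l != NULL);
	while (atomic_flag_test_and_set_explicit(&l->locked,
	    memory_order_acquire)) {
		/* Spin. */
	}
}

static void
spin_sketch_unlock(spin_sketch_t *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```
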
Add the "stats.cactive" mallctl, which can be used to efficiently and
repeatedly query approximately how much active memory the application is
utilizing.
Rather than blindly assigning threads to arenas in round-robin fashion,
choose the lowest-numbered arena that currently has the smallest number
of threads assigned to it.
Add the "stats.arenas.<i>.nthreads" mallctl.
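
The selection policy described above can be sketched as follows. choose_arena_sketch() and its array-of-counts interface are hypothetical; the sketch ignores locking and lazy arena initialization.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the arena with the fewest assigned threads.
 * nthreads[i] is the number of threads currently assigned to arena i.
 */
static size_t
choose_arena_sketch(const unsigned *nthreads, size_t narenas)
{
	size_t i, choose = 0;

	assert(nthreads != NULL && narenas > 0);
	for (i = 1; i < narenas; i++) {
		/* Strict '<' keeps the lowest-numbered arena among ties. */
		if (nthreads[i] < nthreads[choose])
			choose = i;
	}
	return (choose);
}
```
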
The previous free list implementation, which embedded singly linked
lists in available regions, had the unfortunate side effect of causing
many cache misses during thread cache fills. Fix this in two places:
  - arena_run_t: Use a new bitmap implementation to track which regions
                 are available. Furthermore, revert to preferring the
                 lowest available region (as jemalloc did with its old
                 bitmap-based approach).
  - tcache_t: Move read-only tcache_bin_t metadata into
              tcache_bin_info_t, and add a contiguous array of pointers
              to tcache_t in order to track cached objects. This
              substantially increases the size of tcache_t, but results
              in much higher data locality for common tcache operations.
              As a side benefit, it is again possible to efficiently
              flush the least recently used cached objects, so this
              change switches flushing from MRU to LRU.
The new bitmap implementation uses a multi-level summary approach to
make finding the lowest available region very fast. In practice,
bitmaps only have one or two levels, though the implementation is
general enough to handle extremely large bitmaps, mainly so that large
page sizes can still be entertained.
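
The summary idea can be sketched with a fixed two-level bitmap. Everything here is an illustrative assumption: the bitmap_sketch_ names, the 64-bit group width, the cap of 4096 regions, and the convention that a set bit means "available". The real implementation is general in its number of levels.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GROUP_BITS	64
#define NGROUPS		64	/* Two levels cover 64 * 64 = 4096 regions. */

typedef struct {
	uint64_t summary;		/* Bit g set iff groups[g] != 0. */
	uint64_t groups[NGROUPS];	/* Bit set iff region is available. */
} bitmap_sketch_t;

/* Index of the lowest set bit; x must be nonzero. */
static unsigned
lowest_bit(uint64_t x)
{
	unsigned i = 0;

	assert(x != 0);
	while ((x & 1) == 0) {
		x >>= 1;
		i++;
	}
	return (i);
}

/* Mark region reg as available, updating the summary level. */
static void
bitmap_sketch_free(bitmap_sketch_t *b, unsigned reg)
{
	assert(reg < GROUP_BITS * NGROUPS);
	b->groups[reg / GROUP_BITS] |= (uint64_t)1 << (reg % GROUP_BITS);
	b->summary |= (uint64_t)1 << (reg / GROUP_BITS);
}

/* Mark the first nregs regions as available. */
static void
bitmap_sketch_init(bitmap_sketch_t *b, unsigned nregs)
{
	unsigned i;

	assert(nregs <= GROUP_BITS * NGROUPS);
	memset(b, 0, sizeof(*b));
	for (i = 0; i < nregs; i++)
		bitmap_sketch_free(b, i);
}

/* Claim and return the lowest available region, or -1 if none remain. */
static int
bitmap_sketch_alloc(bitmap_sketch_t *b)
{
	unsigned g, bit;

	if (b->summary == 0)
		return (-1);
	g = lowest_bit(b->summary);	/* First non-empty group. */
	bit = lowest_bit(b->groups[g]);	/* First available region in it. */
	b->groups[g] &= ~((uint64_t)1 << bit);
	if (b->groups[g] == 0)
		b->summary &= ~((uint64_t)1 << g);
	return ((int)(g * GROUP_BITS + bit));
}
```

A lookup touches one summary word and one group word regardless of how many regions are in use.
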
Fix tcache_bin_flush_large() to always flush statistics, in the same way
that tcache_bin_flush_small() was recently fixed.
Use JEMALLOC_DEBUG rather than NDEBUG.
Add dassert(), and use it for debug-only asserts.
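
A debug-only assertion macro of this kind might look like the sketch below. The macro body is an assumption rather than a copy of the source, libc assert() stands in for the allocator's own assertion macro, and region_index_sketch() is a hypothetical user.

```c
#include <assert.h>

/* Compile the check in only when JEMALLOC_DEBUG is defined. */
#ifdef JEMALLOC_DEBUG
#  define dassert(e) assert(e)
#else
#  define dassert(e)
#endif

/* Hypothetical user: the sanity check costs nothing in non-debug builds. */
static unsigned
region_index_sketch(unsigned offset, unsigned reg_size)
{
	dassert(reg_size != 0 && offset % reg_size == 0);
	return (offset / reg_size);
}
```
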
Clean up configuration for backtracing when profiling is enabled, and
document the configuration logic in INSTALL.
Disable libgcc-based backtracing except on x64 (where it is known to
work).
Add the --disable-prof-gcc option.