Add inline assembly implementations of atomic_{add,sub}_uint{32,64}()
for x86/x64, in order to support compilers that are missing the relevant
gcc intrinsics.
sa2u() returns 0 on overflow, but the profiling code was blindly calling
sa2u() and allowing the error to silently propagate, ultimately ending
in a later assertion failure. Refactor all ipalloc() callers to call
sa2u(), check for overflow before calling ipalloc(), and pass usize
rather than size. This allows ipalloc() to avoid calling sa2u() in the
common case.
Fix a regression due to:
Remove an arena_bin_run_size_calc() constraint.
2a6f2af6e4
The removed constraint required that small run headers fit in one page,
which indirectly limited runs such that they would not cause overflow in
arena_run_regind(). Add an explicit constraint to
arena_bin_run_size_calc() based on the largest number of regions that
arena_run_regind() can handle (2^11 as currently configured).
Dynamically adjust tcache fill count (number of objects allocated per
tcache refill) such that if GC has to flush inactive objects, the fill
count gradually decreases. Conversely, if refills occur while the fill
count is depressed, the fill count gradually increases back to its
maximum value.
pthread_mutex_lock() can call malloc() on OS X (!!!), which causes
deadlock. Work around this by using spinlocks that are built of more
primitive stuff.
Add the "stats.cactive" mallctl, which can be used to efficiently and
repeatedly query approximately how much active memory the application is
utilizing.
Rather than blindly assigning threads to arenas in round-robin fashion,
choose the lowest-numbered arena that currently has the smallest number
of threads assigned to it.
Add the "stats.arenas.<i>.nthreads" mallctl.
The previous free list implementation, which embedded singly linked
lists in available regions, had the unfortunate side effect of causing
many cache misses during thread cache fills. Fix this in two places:
- arena_run_t: Use a new bitmap implementation to track which regions
are available. Furthermore, revert to preferring the
lowest available region (as jemalloc did with its old
bitmap-based approach).
- tcache_t: Move read-only tcache_bin_t metadata into
tcache_bin_info_t, and add a contiguous array of pointers
to tcache_t in order to track cached objects. This
substantially increases the size of tcache_t, but results
in much higher data locality for common tcache operations.
As a side benefit, it is again possible to efficiently
flush the least recently used cached objects, so this
change changes flushing from MRU to LRU.
The new bitmap implementation uses a multi-level summary approach to
make finding the lowest available region very fast. In practice,
bitmaps only have one or two levels, though the implementation is
general enough to handle extremely large bitmaps, mainly so that large
page sizes can still be entertained.
Fix tcache_bin_flush_large() to always flush statistics, in the same way
that tcache_bin_flush_small() was recently fixed.
Use JEMALLOC_DEBUG rather than NDEBUG.
Add dassert(), and use it for debug-only asserts.
Clean up configuration for backtracing when profiling is enabled, and
document the configuration logic in INSTALL.
Disable libgcc-based backtracing except on x64 (where it is known to
work).
Add the --disable-prof-gcc option.
Fix the automatic header dependency generation to handle the .pic.o
suffix. This regression was due to:
Build both PIC and no PIC static libraries
af5d6987f8
Convert all direct small_size2bin[...] accesses to SMALL_SIZE2BIN(...)
macro calls, and use a couple of cheap math operations to allow
compacting the table by 4X or 8X, on 32- and 64-bit systems,
respectively.
Compile with -fvisibility=hidden rather than -fvisibility=internal, in
order to avoid PLT lookups for internal functions. Also fix a
regression that caused the -fvisibility flag to be omitted, due to:
Port to Mac OS X.
2dbecf1f62
When a thread cache flushes objects to their arenas due to an abundance
of cached objects, it merges the allocation request count for the
associated size class, and increments a flush counter. If none of the
flushed objects came from the thread's assigned arena, then the merging
wouldn't happen (though the counter would typically eventually be
merged), nor would the flush counter be incremented (a hard bug). Fix
this via extra conditional code just after the flush loop.
When jemalloc is linked into an executable (as opposed to a shared
library), compiling with -fno-pic can have significant advantages,
mainly because we don't have to go throught the GOT (global offset
table).
Users who want to link jemalloc into a shared library that could
be dlopened need to link with libjemalloc_pic.a or libjemalloc.so.
For the non-TLS case (as on OS X), if the "thread.{de,}allocatedp"
mallctl was called before any allocation occurred for that thread, the
TSD was still NULL, thus putting the application at risk of
dereferencing NULL. Fix this by refactoring the initialization code,
and making it part of the conditional logic for all per thread
allocation counter accesses.
Fix huge_ralloc() to call huge_palloc() only if alignment requires it.
This bug caused under-sized allocation for aligned huge reallocation
(via rallocm()) if the requested alignment was less than the chunk size
(4 MiB by default).
Fix ALLOCM_LG_ALIGN to take a parameter and use it. Apparently, an
editing error left ALLOCM_LG_ALIGN with the same definition as
ALLOCM_LG_ALIGN_MASK.
Restructure the ctx initialization code such that the ctx isn't locked
across portions of the initialization code where allocation could occur.
Instead artificially inflate the cnt_merged.curobjs field, just as is
done elsewhere to avoid similar races to the one that would otherwise be
created by the reduction in locking scope.
This bug affected interval- and growth-triggered heap dumping, but not
manual heap dumping.
When setting a new arena association for the calling thread, also update
the tcache's cached arena pointer, primarily so that
tcache_alloc_small_hard() uses the intended arena.
Remove the constraint that small run headers fit in one page. This
constraint was necessary to avoid dirty page purging issues for unused
pages within runs for medium size classes (which no longer exist).