Add inline assembly implementations of atomic_{add,sub}_uint{32,64}()
for x86/x64, in order to support compilers that are missing the relevant
gcc intrinsics.
sa2u() returns 0 on overflow, but the profiling code was blindly calling
sa2u() and allowing the error to silently propagate, ultimately ending
in a later assertion failure. Refactor all ipalloc() callers to call
sa2u(), check for overflow before calling ipalloc(), and pass usize
rather than size. This allows ipalloc() to avoid calling sa2u() in the
common case.
Fix a regression due to:
Remove an arena_bin_run_size_calc() constraint.
2a6f2af6e4
The removed constraint required that small run headers fit in one page,
which indirectly limited runs such that they would not cause overflow in
arena_run_regind(). Add an explicit constraint to
arena_bin_run_size_calc() based on the largest number of regions that
arena_run_regind() can handle (2^11 as currently configured).
Dynamically adjust tcache fill count (number of objects allocated per
tcache refill) such that if GC has to flush inactive objects, the fill
count gradually decreases. Conversely, if refills occur while the fill
count is depressed, the fill count gradually increases back to its
maximum value.
pthread_mutex_lock() can call malloc() on OS X (!!!), which causes
deadlock. Work around this by using spinlocks that are built of more
primitive stuff.
Add the "stats.cactive" mallctl, which can be used to efficiently and
repeatedly query approximately how much active memory the application is
utilizing.
Rather than blindly assigning threads to arenas in round-robin fashion,
choose the lowest-numbered arena that currently has the smallest number
of threads assigned to it.
Add the "stats.arenas.<i>.nthreads" mallctl.
The previous free list implementation, which embedded singly linked
lists in available regions, had the unfortunate side effect of causing
many cache misses during thread cache fills. Fix this in two places:
- arena_run_t: Use a new bitmap implementation to track which regions
are available. Furthermore, revert to preferring the
lowest available region (as jemalloc did with its old
bitmap-based approach).
- tcache_t: Move read-only tcache_bin_t metadata into
tcache_bin_info_t, and add a contiguous array of pointers
to tcache_t in order to track cached objects. This
substantially increases the size of tcache_t, but results
in much higher data locality for common tcache operations.
As a side benefit, it is again possible to efficiently
flush the least recently used cached objects, so this
change changes flushing from MRU to LRU.
The new bitmap implementation uses a multi-level summary approach to
make finding the lowest available region very fast. In practice,
bitmaps only have one or two levels, though the implementation is
general enough to handle extremely large bitmaps, mainly so that large
page sizes can still be entertained.
Fix tcache_bin_flush_large() to always flush statistics, in the same way
that tcache_bin_flush_small() was recently fixed.
Use JEMALLOC_DEBUG rather than NDEBUG.
Add dassert(), and use it for debug-only asserts.
Clean up configuration for backtracing when profiling is enabled, and
document the configuration logic in INSTALL.
Disable libgcc-based backtracing except on x64 (where it is known to
work).
Add the --disable-prof-gcc option.
Convert all direct small_size2bin[...] accesses to SMALL_SIZE2BIN(...)
macro calls, and use a couple of cheap math operations to allow
compacting the table by 4X or 8X, on 32- and 64-bit systems,
respectively.
When jemalloc is linked into an executable (as opposed to a shared
library), compiling with -fno-pic can have significant advantages,
mainly because we don't have to go throught the GOT (global offset
table).
Users who want to link jemalloc into a shared library that could
be dlopened need to link with libjemalloc_pic.a or libjemalloc.so.
For the non-TLS case (as on OS X), if the "thread.{de,}allocatedp"
mallctl was called before any allocation occurred for that thread, the
TSD was still NULL, thus putting the application at risk of
dereferencing NULL. Fix this by refactoring the initialization code,
and making it part of the conditional logic for all per thread
allocation counter accesses.
Fix ALLOCM_LG_ALIGN to take a parameter and use it. Apparently, an
editing error left ALLOCM_LG_ALIGN with the same definition as
ALLOCM_LG_ALIGN_MASK.
If mremap(2) is available and supports MREMAP_FIXED, use it for huge
realloc().
Initialize rtree later during bootstrapping, so that --enable-debug
--enable-dss works.
Fix a minor swap_avail stats bug.
Replace the single-character run-time flags with key/value pairs, which
can be set via the malloc_conf global, /etc/malloc.conf, and the
MALLOC_CONF environment variable.
Replace the JEMALLOC_PROF_PREFIX environment variable with the
"opt.prof_prefix" option.
Replace umax2s() with u2s().
Fix a regression due to the recent heap profiling accuracy improvements:
prof_{m,re}alloc() must set the object's profiling context regardless of
whether it is sampled.
Fix management of the CHUNK_MAP_CLASS chunk map bits, such that all
large object (re-)allocation paths correctly initialize the bits. Prior
to this fix, in-place realloc() cleared the bits, resulting in incorrect
reported object size from arena_salloc_demote(). After this fix the
non-demoted bit pattern is all zeros (instead of all ones), which makes
it easier to assure that the bits are properly set.
Inline the heap sampling code that is executed for every allocation
event (regardless of whether a sample is taken).
Combine all prof TLS data into a single data structure, in order to
reduce the TLS lookup volume.
Add the "thread.allocated" and "thread.deallocated" mallctls, which can
be used to query the total number of bytes ever allocated/deallocated by
the calling thread.
Add s2u() and sa2u(), which can be used to compute the usable size that
will result from an allocation request of a particular size/alignment.
Re-factor ipalloc() to use sa2u().
Enhance the heap profiler to trigger samples based on usable size,
rather than request size. This has a subtle, but important, impact on
the accuracy of heap sampling. For example, previous to this change,
16- and 17-byte objects were sampled at nearly the same rate, but
17-byte objects actually consume 32 bytes each. Therefore it was
possible for the sample to be somewhat skewed compared to actual memory
usage of the allocated objects.
In arena_ralloc_large_grow(), update the map element for the end of the
newly grown run, rather than the interior map element that was the
beginning of the appended run. This is a long-standing bug, and it had
the potential to cause massive corruption, but triggering it required
roughly the following sequence of events:
1) Large in-place growing realloc(), with left-over space in the run
that followed the large object.
2) Allocation of the remainder run left over from (1).
3) Deallocation of the remainder run *before* deallocation of the
large run, with unfortunate interior map state left over from
previous run allocation/deallocation activity, such that one or
more pages of allocated memory would be treated as part of the
remainder run during run coalescing.
In summary, this was a bad bug, but it was difficult to trigger.
In arena_bin_malloc_hard(), if another thread wins the race to allocate
a bin run, dispose of the spare run via arena_bin_lower_run() rather
than arena_run_dalloc(), since the run has already been prepared for use
as a bin run. This bug has existed since March 14, 2010:
e00572b384
mmap()/munmap() without arena->lock or bin->lock.
Fix bugs in arena_dalloc_bin_run(), arena_trim_head(),
arena_trim_tail(), and arena_ralloc_large_grow() that could cause the
CHUNK_MAP_UNZEROED map bit to become corrupted. These are all
long-standing bugs, but the chances of them actually causing problems
was much lower before the CHUNK_MAP_ZEROED --> CHUNK_MAP_UNZEROED
conversion.
Fix a large run statistics regression in arena_ralloc_large_grow() that
was introduced on September 17, 2010:
8e3c3c61b5
Add {,r,s,d}allocm().
Add debug code to validate that supposedly pre-zeroed memory really is.
Add the R option to control whether cumulative heap profile data
are maintained. Add the T option to control the size of per thread
backtrace caches, primarily because when the R option is specified,
backtraces that no longer have allocations associated with them are
discarded as soon as no thread caches refer to them.
Invert the chunk map bit that tracks whether a page is zeroed, so that
for zeroed arena chunks, the interior of the page map does not need to
be initialized (as it consists entirely of zero bytes).
It is common to have to specify something like JEMALLOC_OPTIONS=F31i,
because interval-based dumps are often unuseful or too expensive.
Therefore, disable interval-based dumps by default. To get the previous
default behavior it is now necessary to specify 31I as part of the
options.
Use INT_MAX instead of MAX_INT in ALLOCM_ALIGN(), and #include
<limits.h> in order to get its definition.
Modify prof code related to hash tables to avoid aliasing warnings from
gcc 4.1.2 (gcc 4.4.0 and 4.4.3 do not warn).
Add allocm(), rallocm(), sallocm(), and dallocm(), which are a
functional superset of malloc(), calloc(), posix_memalign(),
malloc_usable_size(), and free().
Use the size argument to tcache_dalloc_large() to control the number of
bytes set to 0x5a when junk filling is enabled, rather than accessing a
non-existent arena bin. This bug was capable of corrupting an
arbitrarily large memory region, depending on what followed the arena
data structure in memory (typically zeroed memory, another arena_t, or a
red-black tree node for a huge object).