If multiple threads race to initialize malloc, the loser(s) busy-wait
until initialization is complete. Add a missing mutex lock so that the
loser(s) properly release the initialization mutex. Under some
race conditions, this flaw could have caused one or more threads to
become permanently blocked.
Reported by Terrell Magee.
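A minimal sketch of the intended pattern, with illustrative names
(init_lock, malloc_initialized, and busy_wait_for_init() are assumptions,
not the actual identifiers): each losing thread repeatedly drops and
re-acquires the initialization mutex until the winner finishes, so the
final unlock is always balanced by a matching lock.

    #include <pthread.h>
    #include <sched.h>
    #include <stdbool.h>

    static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile bool malloc_initialized = false;

    /* Called by a losing thread with init_lock held. */
    static void
    busy_wait_for_init(void)
    {
        do {
            pthread_mutex_unlock(&init_lock);
            sched_yield();
            /* This re-acquisition is the previously missing lock. */
            pthread_mutex_lock(&init_lock);
        } while (malloc_initialized == false);
        pthread_mutex_unlock(&init_lock);
    }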
Fix the libunwind version of prof_backtrace() to set the backtrace depth
for all possible code paths. This fixes the zero-length backtrace
problem when using libunwind.
When heap profiling is enabled but deactivated, there is no need to call
isalloc(ptr) in prof_{malloc,realloc}(). Avoid these calls, so that
profiling overhead under such conditions is negligible.
If there is more than one arena, initialize next_arena so that the
first and second threads to allocate memory use arenas 0 and 1, rather
than both using arena 0.
Use the size argument to tcache_dalloc_large() to control the number of
bytes set to 0x5a when junk filling is enabled, rather than accessing a
non-existent arena bin. This bug was capable of corrupting an
arbitrarily large memory region, depending on what followed the arena
data structure in memory (typically zeroed memory, another arena_t, or a
red-black tree node for a huge object).
Properly maintain tcache_bin_t's avail pointer such that it is NULL if
no objects are cached. This only caused problems during thread cache
destruction, since cache flushing otherwise never occurs on an empty
bin.
Fix arena_chunk_dealloc() to put the new spare in a consistent state before
dropping the arena mutex to deallocate the previous spare.
Fix arena_run_dalloc() to insert a newly dirtied chunk into the
chunks_dirty list before potentially deallocating the chunk, so that dirty
page accounting is self-consistent.
Initialize bt2cnt_tsd so that cleanup at thread exit actually happens.
Associate (prof_ctx_t *) with allocated objects, rather than
(prof_thr_cnt_t *). Each thread must always operate on its own
(prof_thr_cnt_t *), and an object may outlive the thread that allocated it.
Now that JEMALLOC_OPTIONS=P isn't the only way to cause stats_print() to
be called, opt_stats_print must actually be checked when reporting the
state of the P/p option.
Don't build with -march=native by default, because the generated code
may perform especially poorly on ABI-compatible, but internally
different, systems.
Fix divide-by-zero error in pprof. It is possible for sample contexts
to currently have no associated objects, but the cumulative statistics
are still useful, depending on how the user invokes pprof. Since
jemalloc intentionally does not filter such contexts, take care not to
divide by 0 when re-scaling for v2 heap sampling.
Install pprof as part of 'make install'.
Update pprof documentation.
Add the E/e options to control whether the application starts with
sampling active/inactive (secondary control to F/f). Add the
prof.active mallctl so that the application can activate/deactivate
sampling on the fly.
Make it possible to disable interval-triggered profile dumping, even if
profiling is enabled. This is useful if the user only wants a single
dump at exit, or if the application manually triggers profile dumps.
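A manual dump can then be requested on demand; "prof.dump" and its
NULL-means-default-filename behavior are assumptions about the mallctl
namespace rather than guaranteed names.

    /* Trigger a single heap-profile dump using the default filename. */
    mallctl("prof.dump", NULL, NULL, NULL, 0);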
If the mean heap sampling interval is larger than one page, simulate
sampled small objects with large objects. This allows profiling context
pointers to be omitted for small objects. As a result, the memory
overhead for sampling decreases as the sampling interval is increased.
Fix a compilation error in the profiling code.
Properly set/clear CHUNK_MAP_ZEROED for all purged pages, according to
whether the pages are (potentially) file-backed or anonymous. This was
merely a performance pessimization for the anonymous mapping case, but
was a calloc()-related bug for the swap_enabled case.
Remove medium size classes, because concurrent dirty page purging is
no longer capable of purging inactive dirty pages inside active runs
(due to recent arena/bin locking changes).
Enhance tcache to support caching large objects, so that the same range
of size classes is still cached, despite the removal of medium size
class support.
Initialize the small run header before dropping arena->lock, since
arena_chunk_purge() relies on valid small run headers during run
iteration.
Add some assertions.
For bin-related allocation, protect data structures with bin locks
rather than arena locks. Arena locks remain for run
allocation/deallocation and other miscellaneous operations.
Restructure statistics counters to maintain per bin
allocated/nmalloc/ndalloc, but continue to provide arena-wide statistics
via aggregation in the ctl code.
Use chains of cached objects, rather than using arrays of pointers.
Since tcache_bin_t is no longer dynamically sized, convert tcache_t's
tbin to an array of structures, rather than an array of pointers. This
implicitly removes tcache_bin_{create,destroy}(), which further
simplifies the fast path for malloc/free.
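A minimal sketch of the chained-object idea, using hypothetical names
rather than jemalloc's actual tcache_bin_t layout: each cached object
stores the pointer to the next cached object in its own first word, so a
bin needs only a head pointer and a count.

    #include <stddef.h>

    typedef struct cached_obj_s cached_obj_t;   /* hypothetical type */
    struct cached_obj_s {
        cached_obj_t *next;   /* link stored inside the free object */
    };

    typedef struct {
        cached_obj_t *head;   /* NULL when no objects are cached */
        unsigned     ncached;
    } tbin_sketch_t;          /* illustrative stand-in for tcache_bin_t */

    static void
    tbin_push(tbin_sketch_t *tbin, void *obj)
    {
        cached_obj_t *o = (cached_obj_t *)obj;

        o->next = tbin->head;
        tbin->head = o;
        tbin->ncached++;
    }

    static void *
    tbin_pop(tbin_sketch_t *tbin)
    {
        cached_obj_t *o = tbin->head;

        if (o == NULL)
            return (NULL);
        tbin->head = o->next;
        tbin->ncached--;
        return (o);
    }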
Use cacheline alignment for tcache_t allocations.
Remove runtime configuration option for number of tcache bin slots, and
replace it with a boolean option for enabling/disabling tcache.
Limit the number of tcache objects to the lesser of TCACHE_NSLOTS_MAX
and 2X the number of regions per run for the size class.
For GC-triggered flush, discard 3/4 of the objects below the low water
mark, rather than 1/2.
Convert chunks_dirty from a red-black tree to a doubly linked list,
and use it to purge dirty pages from chunks in FIFO order.
Add a lock around the code that purges dirty pages via madvise(2), in
order to avoid kernel contention. If lock acquisition fails,
indefinitely postpone purging dirty pages.
Add a lower limit of one chunk worth of dirty pages per arena for
purging, in addition to the active:dirty ratio.
When purging, purge all dirty pages from at least one chunk, but rather
than purging enough pages to drop to half the purging threshold, merely
drop to the threshold.
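Taken together with the previous item, the purge trigger is roughly the
following; every identifier here is an illustrative assumption, not
jemalloc's actual variable.

    #include <stddef.h>

    /* Hypothetical helper that purges chunks from chunks_dirty in FIFO
     * order until the dirty-page count drops to the threshold. */
    extern void purge_down_to(size_t threshold);

    static void
    maybe_purge(size_t nactive, size_t ndirty, size_t chunk_npages,
        size_t dirty_ratio)
    {
        size_t threshold = nactive / dirty_ratio;  /* active:dirty ratio */

        /* Require at least one chunk's worth of dirty pages, in addition
         * to exceeding the ratio-derived threshold. */
        if (ndirty > chunk_npages && ndirty > threshold)
            purge_down_to(threshold);   /* not threshold / 2 */
    }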
Don't look for a shared libunwind if --with-static-libunwind is
specified.
Set SONAME when linking the shared libjemalloc.
Add DESTDIR support.
Add install_{include,lib/man} build targets.
Clean up compiler flag configuration.
Use left-leaning 2-3 red-black trees instead of left-leaning 2-3-4
red-black trees. This reduces maximum tree height from (3 lg n) to
(2 lg n).
Do lazy balance fixup, rather than transforming the tree during the down
pass. This improves insert/remove speed by ~30%.
Use callback-based iteration rather than macros.
Remove all functionality related to tracing. This functionality was
useful for understanding memory fragmentation during early algorithmic
design of jemalloc, but it had little utility for non-trivial
applications, due to the sheer volume of data written to disk.
Add the --disable-prof-libgcc configure option, and add backtracing
based on libgcc, which is used by default.
Fix a bug in hash().
Fix various configuration-dependent compilation errors.
If a custom small_size2bin table was required due to non-default size
class settings, memory allocation prior to initializing chunk parameters
would cause a crash due to division by 0. The fix re-orders the various
*_boot() function calls.
Now that the base allocator uses the chunk allocator directly,
bootstrapping is simpler than before, which allows arena_boot[01]() to be
combined.
Add error detection for pthread_atfork() and atexit() function calls.
Fix a type mismatch for "arenas.nlruns" mallctl access. This bug caused
a crash during statistics printing on 64-bit systems.
Fix the "stats.active" mallctl to include active memory in huge objects.
Report active bytes for the whole application, as well as per arena.
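For reference, the application-wide counter can be read as in the sketch
below; the header path is an assumption about the installed layout.

    #include <stddef.h>
    #include <stdio.h>
    #include <jemalloc/jemalloc.h>   /* header path is an assumption */

    /* Print the total number of active bytes via the "stats.active"
     * mallctl. */
    static void
    print_active_bytes(void)
    {
        size_t active;
        size_t sz = sizeof(active);

        if (mallctl("stats.active", &active, &sz, NULL, 0) == 0)
            printf("active bytes: %zu\n", active);
    }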
Remove several unused variables.
A missing 'else' in chunk_alloc_mmap() caused an extra chunk to be
allocated every time the optimistic alignment path was entered, since
the following block would always be executed immediately afterward.
This chunk leak caused no increase in physical memory usage, but virtual
memory could grow until resource exhaustion caused allocation failures.
Replace chunk stats code that was missing locking; this fixes a race
condition that could corrupt chunk statistics.
Convert malloc_stats_print() to use mallctl*().
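The public entry point is still called the same way; passing NULL
arguments selects the default write callback and the full statistics
output (a hedged sketch, since the callback signature has varied over
time).

    /* Print all allocator statistics to the default output. */
    malloc_stats_print(NULL, NULL, NULL);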
Add a missing semicolon in the DSS code.
Convert malloc_tcache_flush() to a mallctl.
Convert malloc_swap_enable() to a set of mallctl's.
Use optional zeroing in arena_chunk_alloc() to avoid needless zeroing of
chunks. This is particularly important in the context of swapfile and
DSS allocation, since a long-lived application may commonly recycle
chunks.
Clean up whitespace.
Lock access of swap_avail when printing stats.
Use inttypes.h for portable printf() format specifiers, specifically for
uint64_t (PRIu64).
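For example:

    #include <inttypes.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* Portable printing of a 64-bit counter, per the item above. */
        uint64_t nmalloc = 123456789;

        printf("nmalloc: %" PRIu64 "\n", nmalloc);
        return (0);
    }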
Change highchunks and curchunks stats from (unsigned long) to (size_t).
Fix a stats bug in large object curruns accounting.
Replace tcache_bin_fill() with arena_tcache_fill(), and fix a bug in an OOM
error path.
Fix API name mangling to coexist with __attribute__((malloc)).
Enhance bin run deallocation to avoid marking all pages as dirty, since the
dirty bits are already correct for all but the first page, due to the use of
arena_run_rc_{incr,decr}(). This tends to dramatically reduce the number of
pages that are marked dirty.
Modify arena_bin_run_size_calc() to ensure that bin run headers never exceed
one page. In practice, this can't happen unless hard-coded constants (related
to RUN_MAX_OVRHD) are modified, but the dirty page tracking code assumes bin
run headers never extend past the first page, so it seems worth making this a
universally valid assumption.
Use JEMALLOC_ATTR(tls_model("initial-exec")) instead of -ftls-model=initial-exec
so that libjemalloc_pic.a can be directly linked into another library without
needing linker options changes.
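As an illustration of the per-variable form (the variable name below is
only an example, and JEMALLOC_ATTR is shown with its usual expansion to a
GCC attribute):

    #define JEMALLOC_ATTR(a) __attribute__((a))

    /* The TLS model is applied per variable, so no -ftls-model flag needs
     * to leak into the build options of consumers of libjemalloc_pic.a. */
    static __thread void *mag_rack_example
        JEMALLOC_ATTR(tls_model("initial-exec"));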
Add attributes to malloc, calloc, and posix_memalign, for compatibility with
glibc's declarations.
Add function prototypes for the standard malloc(3) API.
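A sketch of what those declarations look like with the attributes from the
previous item applied; the exact attribute set is an assumption for
illustration.

    #include <stddef.h>

    #define JEMALLOC_ATTR(a) __attribute__((a))

    /* Standard malloc(3) entry points, annotated for compatibility with
     * glibc's declarations. */
    void *malloc(size_t size) JEMALLOC_ATTR(malloc);
    void *calloc(size_t num, size_t size) JEMALLOC_ATTR(malloc);
    void *realloc(void *ptr, size_t size);
    void free(void *ptr);
    int posix_memalign(void **memptr, size_t alignment, size_t size)
        JEMALLOC_ATTR(nonnull(1));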
Add the 'G'/'g' and 'H'/'h' MALLOC_OPTIONS flags.
Add the malloc_tcache_flush() function.
Disable thread-specific caching until the application goes multi-threaded.
Add the 'M' and 'm' MALLOC_OPTIONS flags, which control the maximum medium size
class.
Relax the cap on small/medium run size to arena_maxclass.
Reduce arena_run_reg_dalloc() integer division code complexity.
Increase the default chunk size from 1MiB to 4MiB.
The pthreads TSD implementation on some platforms calls free() after
calling TSD destructors. This was causing a crash during thread exit,
since the magazine rack was no longer valid for the thread. Fix this by
using a special mag_rack value to indicate that deallocation should bypass
the magazine machinery.
jemalloc is configured.
Modify arena_malloc() API to avoid unnecessary choose_arena() calls. Remove
unnecessary code from choose_arena().
Enable lazy-lock by default, now that choose_arena() is both faster and out of
the critical path.
Implement objdir support in the build system.
Implement minimal Makefile.
Make compile-time-optional jemalloc features controllable via configure
options (debug, stats, tiny, mag, balance, dss).
Conditionally exclude most of the opt_* run-time options, based on configure
options (fill, xmalloc, sysv).
Implement optional --enable-dynamic-page-shift.
Implement optional --enable-lazy-lock.
Re-order malloc_init_hard() and use the malloc_initializer variable to support
recursive allocation in malloc_ncpus().
Add mag_rack_tsd in order to receive notifications of thread termination.
Add jemalloc.h.