When jemalloc is linked into an executable (as opposed to a shared
library), compiling with -fno-pic can have significant advantages,
mainly because we don't have to go through the GOT (global offset
table).
Users who want to link jemalloc into a shared library that could
be dlopened need to link with libjemalloc_pic.a or libjemalloc.so.
For the non-TLS case (as on OS X), if the "thread.{de,}allocatedp"
mallctl was called before any allocation occurred for that thread, the
TSD was still NULL, thus putting the application at risk of
dereferencing NULL. Fix this by refactoring the initialization code,
and making it part of the conditional logic for all per-thread
allocation counter accesses.
Fix ALLOCM_LG_ALIGN to take a parameter and use it. Apparently, an
editing error left ALLOCM_LG_ALIGN with the same definition as
ALLOCM_LG_ALIGN_MASK.
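
The distinction between the two macros can be sketched as follows. All
names and bit values here are hypothetical stand-ins chosen for
illustration, not the actual definitions in the jemalloc headers: the
mask is a constant that selects the alignment bits of a flags word,
whereas the encoding macro must take a parameter and use it.

```c
#include <stddef.h>

/* Hypothetical names and values, for illustration only. */
#define SKETCH_LG_ALIGN_MASK 0x3f /* flag bits assumed to hold lg(alignment) */
#define SKETCH_LG_ALIGN(la)  (la) /* takes a parameter and uses it */
#define SKETCH_ZERO          0x40 /* an unrelated flag bit above the mask */

/* Decode the requested alignment from a flags word. */
static size_t sketch_flags_alignment(int flags)
{
    return (size_t)1 << (flags & SKETCH_LG_ALIGN_MASK);
}
```

If the encoding macro were defined identically to the mask, every
caller would end up requesting the same (maximal) lg alignment no
matter what it asked for.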
If mremap(2) is available and supports MREMAP_FIXED, use it for huge
realloc().
Initialize rtree later during bootstrapping, so that --enable-debug
--enable-dss works.
Fix a minor swap_avail stats bug.
Replace the single-character run-time flags with key/value pairs, which
can be set via the malloc_conf global, /etc/malloc.conf, and the
MALLOC_CONF environment variable.
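
A minimal sketch of key/value option lookup, assuming the pairs are
written as "key:value" and separated by commas. The function name
conf_lookup() is hypothetical, and this is not the parser jemalloc
itself uses.

```c
#include <string.h>

/*
 * Look up `key` in a configuration string of the assumed form
 * "key:value,key:value" and copy its value into `out`.
 * Returns 1 if the key was found and its value fits, 0 otherwise.
 */
static int conf_lookup(const char *conf, const char *key, char *out,
    size_t outlen)
{
    size_t keylen = strlen(key);
    const char *p = conf;

    while (*p != '\0') {
        const char *end = strchr(p, ',');
        size_t pairlen = (end != NULL) ? (size_t)(end - p) : strlen(p);
        const char *colon = memchr(p, ':', pairlen);

        if (colon != NULL && (size_t)(colon - p) == keylen &&
            strncmp(p, key, keylen) == 0) {
            size_t vallen = pairlen - keylen - 1;
            if (vallen >= outlen)
                return 0; /* value does not fit in the output buffer */
            memcpy(out, colon + 1, vallen);
            out[vallen] = '\0';
            return 1;
        }
        p += pairlen; /* skip this pair */
        if (*p == ',')
            p++;      /* and its separator */
    }
    return 0;
}
```

For example, looking up "tcache" in the illustrative string
"prof_prefix:/tmp/jeprof,tcache:false" yields "false".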
Replace the JEMALLOC_PROF_PREFIX environment variable with the
"opt.prof_prefix" option.
Replace umax2s() with u2s().
Fix a regression due to the recent heap profiling accuracy improvements:
prof_{m,re}alloc() must set the object's profiling context regardless of
whether it is sampled.
Fix management of the CHUNK_MAP_CLASS chunk map bits, such that all
large object (re-)allocation paths correctly initialize the bits. Prior
to this fix, in-place realloc() cleared the bits, resulting in incorrect
reported object size from arena_salloc_demote(). After this fix the
non-demoted bit pattern is all zeros (instead of all ones), which makes
it easier to ensure that the bits are properly set.
Inline the heap sampling code that is executed for every allocation
event (regardless of whether a sample is taken).
Combine all prof TLS data into a single data structure, in order to
reduce the TLS lookup volume.
Add the "thread.allocated" and "thread.deallocated" mallctls, which can
be used to query the total number of bytes ever allocated/deallocated by
the calling thread.
Add s2u() and sa2u(), which can be used to compute the usable size that
will result from an allocation request of a particular size/alignment.
Refactor ipalloc() to use sa2u().
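
The idea behind these helpers can be sketched under a simplifying
assumption: size classes spaced every 16 bytes. The real size classes
are more varied, and the sketch_ names below are hypothetical.

```c
#include <stddef.h>

#define SKETCH_QUANTUM ((size_t)16) /* assumed spacing; real classes differ */

/* Usable size for a request of `size` bytes under the assumed spacing. */
static size_t sketch_s2u(size_t size)
{
    if (size == 0)
        size = 1; /* treat a zero-byte request as the smallest class */
    return (size + SKETCH_QUANTUM - 1) & ~(SKETCH_QUANTUM - 1);
}

/* Usable size when an alignment (a power of two) is also requested. */
static size_t sketch_sa2u(size_t size, size_t alignment)
{
    size_t usize = sketch_s2u(size);
    return (usize + alignment - 1) & ~(alignment - 1);
}
```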
Enhance the heap profiler to trigger samples based on usable size,
rather than request size. This has a subtle, but important, impact on
the accuracy of heap sampling. For example, prior to this change,
16- and 17-byte objects were sampled at nearly the same rate, but
17-byte objects actually consume 32 bytes each. Therefore it was
possible for the sample to be somewhat skewed compared to actual memory
usage of the allocated objects.
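
The effect can be sketched with a deliberately simplified sampler that
takes a sample every time a fixed number of bytes has been counted.
The fixed threshold and the sketch_ names are assumptions made for
illustration; the real sampler is not claimed to work this way.

```c
#include <stddef.h>

typedef struct {
    size_t accum;     /* bytes counted since the last sample */
    size_t threshold; /* bytes between samples (fixed here, by assumption) */
} sketch_sampler_t;

/* Count an allocation by its usable size; return 1 if it is sampled. */
static int sketch_sample(sketch_sampler_t *s, size_t usable_size)
{
    s->accum += usable_size;
    if (s->accum >= s->threshold) {
        s->accum %= s->threshold; /* carry the remainder forward */
        return 1;
    }
    return 0;
}
```

With a 64-byte threshold, 17-byte requests (32 usable bytes) trigger a
sample on every second allocation, whereas counting request sizes
would have triggered one only on every fourth.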
In arena_ralloc_large_grow(), update the map element for the end of the
newly grown run, rather than the interior map element that was the
beginning of the appended run. This is a long-standing bug, and it had
the potential to cause massive corruption, but triggering it required
roughly the following sequence of events:
1) Large in-place growing realloc(), with left-over space in the run
that followed the large object.
2) Allocation of the remainder run left over from (1).
3) Deallocation of the remainder run *before* deallocation of the
large run, with unfortunate interior map state left over from
previous run allocation/deallocation activity, such that one or
more pages of allocated memory would be treated as part of the
remainder run during run coalescing.
In summary, this was a bad bug, but it was difficult to trigger.
In arena_bin_malloc_hard(), if another thread wins the race to allocate
a bin run, dispose of the spare run via arena_bin_lower_run() rather
than arena_run_dalloc(), since the run has already been prepared for use
as a bin run. This bug has existed since March 14, 2010:
e00572b384
mmap()/munmap() without arena->lock or bin->lock.
Fix bugs in arena_dalloc_bin_run(), arena_trim_head(),
arena_trim_tail(), and arena_ralloc_large_grow() that could cause the
CHUNK_MAP_UNZEROED map bit to become corrupted. These are all
long-standing bugs, but the chances of them actually causing problems
were much lower before the CHUNK_MAP_ZEROED --> CHUNK_MAP_UNZEROED
conversion.
Fix a large run statistics regression in arena_ralloc_large_grow() that
was introduced on September 17, 2010:
8e3c3c61b5
Add {,r,s,d}allocm().
Add debug code to validate that supposedly pre-zeroed memory really is
zeroed.
Add the R option to control whether cumulative heap profile data
are maintained. Add the T option to control the size of per-thread
backtrace caches, primarily because when the R option is specified,
backtraces that no longer have allocations associated with them are
discarded as soon as no thread caches refer to them.
Invert the chunk map bit that tracks whether a page is zeroed, so that
for zeroed arena chunks, the interior of the page map does not need to
be initialized (as it consists entirely of zero bytes).
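
Why the inversion helps can be sketched with a one-byte-per-page map.
The bit value and the sketch_ names are hypothetical; the real chunk
map is laid out differently.

```c
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAP_UNZEROED ((uint8_t)0x1) /* hypothetical bit value */

/* An all-zero map already says "every page is zeroed". */
static uint8_t *sketch_map_new(size_t npages)
{
    return calloc(npages, sizeof(uint8_t));
}

/* Record that a page may now contain non-zero data. */
static void sketch_page_dirtied(uint8_t *map, size_t page)
{
    map[page] |= SKETCH_MAP_UNZEROED;
}

/* A page is known to be zeroed exactly when the bit is clear. */
static int sketch_page_is_zeroed(const uint8_t *map, size_t page)
{
    return (map[page] & SKETCH_MAP_UNZEROED) == 0;
}
```

With the opposite polarity, a freshly zeroed map would have to be
walked once just to set the "zeroed" bit on every page.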
It is common to have to specify something like JEMALLOC_OPTIONS=F31i,
because interval-based dumps are often not useful or too expensive.
Therefore, disable interval-based dumps by default. To get the previous
default behavior it is now necessary to specify 31I as part of the
options.
Use INT_MAX instead of MAX_INT in ALLOCM_ALIGN(), and #include
<limits.h> in order to get its definition.
Modify prof code related to hash tables to avoid aliasing warnings from
gcc 4.1.2 (gcc 4.4.0 and 4.4.3 do not warn).
Add allocm(), rallocm(), sallocm(), and dallocm(), which are a
functional superset of malloc(), calloc(), posix_memalign(),
malloc_usable_size(), and free().
Use the size argument to tcache_dalloc_large() to control the number of
bytes set to 0x5a when junk filling is enabled, rather than accessing a
non-existent arena bin. This bug was capable of corrupting an
arbitrarily large memory region, depending on what followed the arena
data structure in memory (typically zeroed memory, another arena_t, or a
red-black tree node for a huge object).
Initialize bt2cnt_tsd so that cleanup at thread exit actually happens.
Associate (prof_ctx_t *) with allocated objects, rather than
(prof_thr_cnt_t *). Each thread must always operate on its own
(prof_thr_cnt_t *), and an object may outlive the thread that allocated it.
Add the E/e options to control whether the application starts with
sampling active/inactive (secondary control to F/f). Add the
prof.active mallctl so that the application can activate/deactivate
sampling on the fly.
Make it possible to disable interval-triggered profile dumping, even if
profiling is enabled. This is useful if the user only wants a single
dump at exit, or if the application manually triggers profile dumps.
If the mean heap sampling interval is larger than one page, simulate
sampled small objects with large objects. This allows profiling context
pointers to be omitted for small objects. As a result, the memory
overhead for sampling decreases as the sampling interval is increased.
Fix a compilation error in the profiling code.
Remove medium size classes, because concurrent dirty page purging is
no longer capable of purging inactive dirty pages inside active runs
(due to recent arena/bin locking changes).
Enhance tcache to support caching large objects, so that the same range
of size classes is still cached, despite the removal of medium size
class support.
For bin-related allocation, protect data structures with bin locks
rather than arena locks. Arena locks remain for run
allocation/deallocation and other miscellaneous operations.
Restructure statistics counters to maintain per-bin
allocated/nmalloc/ndalloc, but continue to provide arena-wide statistics
via aggregation in the ctl code.
Use chains of cached objects, rather than using arrays of pointers.
Since tcache_bin_t is no longer dynamically sized, convert tcache_t's
tbin to an array of structures, rather than an array of pointers. This
implicitly removes tcache_bin_{create,destroy}(), which further
simplifies the fast path for malloc/free.
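
A chain of cached objects can be sketched as an intrusive singly
linked list: the link is stored in the first word of each cached
object, so the bin needs no separate pointer array and has a fixed
size. The sketch_ names are hypothetical, and the sketch assumes every
cached object is at least pointer-sized and pointer-aligned.

```c
#include <stddef.h>

/* A fixed-size bin: the chain is threaded through the cached objects. */
typedef struct {
    void    *avail;   /* head of the chain, or NULL when empty */
    unsigned ncached; /* number of objects on the chain */
} sketch_tbin_t;

/* Push a freed object; its first word becomes the link to the old head. */
static void sketch_tbin_push(sketch_tbin_t *tbin, void *obj)
{
    *(void **)obj = tbin->avail;
    tbin->avail = obj;
    tbin->ncached++;
}

/* Pop the most recently cached object, or return NULL if the bin is empty. */
static void *sketch_tbin_pop(sketch_tbin_t *tbin)
{
    void *obj = tbin->avail;
    if (obj != NULL) {
        tbin->avail = *(void **)obj;
        tbin->ncached--;
    }
    return obj;
}
```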
Use cacheline alignment for tcache_t allocations.
Remove runtime configuration option for number of tcache bin slots, and
replace it with a boolean option for enabling/disabling tcache.
Limit the number of tcache objects to the lesser of TCACHE_NSLOTS_MAX
and 2X the number of regions per run for the size class.
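
The limit amounts to taking a minimum. A sketch, with a hypothetical
value standing in for TCACHE_NSLOTS_MAX:

```c
#define SKETCH_TCACHE_NSLOTS_MAX 200u /* hypothetical value */

/* Maximum number of cached objects for a size class. */
static unsigned sketch_ncached_max(unsigned nregs_per_run)
{
    unsigned twice = 2 * nregs_per_run;
    return (twice < SKETCH_TCACHE_NSLOTS_MAX) ? twice
                                              : SKETCH_TCACHE_NSLOTS_MAX;
}
```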
For GC-triggered flush, discard 3/4 of the objects below the low water
mark, rather than 1/2.
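
The arithmetic can be sketched as follows, assuming low_water is the
smallest number of cached objects seen since the previous GC pass (so
low_water <= ncached). The function name is hypothetical and the use
of a shift for the division is an assumption.

```c
/* Number of objects left in the bin after a GC-triggered flush. */
static unsigned sketch_gc_keep(unsigned ncached, unsigned low_water)
{
    /* Discard roughly 3/4 of the objects that went unused. */
    unsigned discard = low_water - (low_water >> 2);
    return ncached - discard;
}
```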
Convert chunks_dirty from a red-black tree to a doubly linked list,
and use it to purge dirty pages from chunks in FIFO order.
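
A minimal sketch of such a list, with hypothetical sketch_ names:
chunks are appended at the tail when they become dirty, purging starts
from the head (the oldest entry), and any chunk can be unlinked in
constant time.

```c
#include <stddef.h>

typedef struct sketch_chunk_s sketch_chunk_t;
struct sketch_chunk_s {
    sketch_chunk_t *prev;
    sketch_chunk_t *next;
    size_t ndirty; /* dirty pages in this chunk */
};

typedef struct {
    sketch_chunk_t *head; /* oldest dirty chunk: purged first */
    sketch_chunk_t *tail; /* newest dirty chunk */
} sketch_dirty_list_t;

/* Append a chunk that has just become dirty. */
static void sketch_dirty_insert(sketch_dirty_list_t *l, sketch_chunk_t *c)
{
    c->next = NULL;
    c->prev = l->tail;
    if (l->tail != NULL)
        l->tail->next = c;
    else
        l->head = c;
    l->tail = c;
}

/* Unlink a chunk in O(1), e.g. once it has no dirty pages left. */
static void sketch_dirty_remove(sketch_dirty_list_t *l, sketch_chunk_t *c)
{
    if (c->prev != NULL)
        c->prev->next = c->next;
    else
        l->head = c->next;
    if (c->next != NULL)
        c->next->prev = c->prev;
    else
        l->tail = c->prev;
    c->prev = c->next = NULL;
}
```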
Add a lock around the code that purges dirty pages via madvise(2), in
order to avoid kernel contention. If lock acquisition fails,
indefinitely postpone purging dirty pages.
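
The try-then-postpone pattern can be sketched as below. A C11
atomic_flag stands in for the actual lock so that the example needs no
threading library; the sketch_ names are hypothetical.

```c
#include <stdatomic.h>

static atomic_flag sketch_purge_lock = ATOMIC_FLAG_INIT;

/* Try to become the purging thread; return 1 on success, 0 if busy. */
static int sketch_purge_trylock(void)
{
    return !atomic_flag_test_and_set_explicit(&sketch_purge_lock,
        memory_order_acquire);
}

/* Release the lock after purging. */
static void sketch_purge_unlock(void)
{
    atomic_flag_clear_explicit(&sketch_purge_lock, memory_order_release);
}
```

A caller would purge only when sketch_purge_trylock() returns 1 and
otherwise carry on without purging, leaving the dirty pages for a
later attempt.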
Add a lower limit of one chunk worth of dirty pages per arena for
purging, in addition to the active:dirty ratio.
When purging, purge all dirty pages from at least one chunk, but rather
than purging enough pages to drop to half the purging threshold, merely
drop to the threshold.
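
The purge trigger and target can be sketched as follows. The parameter
names are hypothetical, the active:dirty ratio is assumed to be a
power of two expressed as lg_dirty_mult, and the "at least one whole
chunk" rule is not modeled.

```c
#include <stddef.h>

/* Number of dirty pages to purge, or 0 if no purge is due. */
static size_t sketch_npurge(size_t nactive, size_t ndirty,
    size_t chunk_npages, unsigned lg_dirty_mult)
{
    /* Dirty pages permitted by the active:dirty ratio. */
    size_t threshold = nactive >> lg_dirty_mult;

    /* Both limits must be exceeded before purging starts. */
    if (ndirty <= chunk_npages || ndirty <= threshold)
        return 0;

    /* Drop to the threshold, not to half of it. */
    return ndirty - threshold;
}
```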