Add the MALLOCX_TCACHE() and MALLOCX_TCACHE_NONE macros, which can be
used in conjunction with the *allocx() API.
Add the tcache.create, tcache.flush, and tcache.destroy mallctls.
This resolves #145.
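For illustration, a minimal sketch of how the new mallctls and flags fit
together (assumes an unprefixed jemalloc build; the helper name is
illustrative, and error handling is abbreviated):

    #include <jemalloc/jemalloc.h>
    #include <stddef.h>

    /* Create an explicit tcache, allocate through it, then flush and
     * destroy it. */
    int
    use_explicit_tcache(void)
    {
        unsigned tci;
        size_t sz = sizeof(tci);

        if (mallctl("tcache.create", &tci, &sz, NULL, 0) != 0)
            return (1);
        void *p = mallocx(64, MALLOCX_TCACHE(tci));  /* Cache in tci. */
        void *q = mallocx(64, MALLOCX_TCACHE_NONE);  /* Bypass caching. */
        if (p != NULL)
            dallocx(p, MALLOCX_TCACHE(tci));
        if (q != NULL)
            dallocx(q, MALLOCX_TCACHE_NONE);
        mallctl("tcache.flush", NULL, NULL, &tci, sizeof(tci));
        mallctl("tcache.destroy", NULL, NULL, &tci, sizeof(tci));
        return (0);
    }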
In addition to true/false, opt.junk can now be either "alloc" or "free",
giving applications the possibility of junking memory only on allocation
or deallocation.
This resolves #172.
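For example, a minimal sketch assuming an unprefixed build (the
MALLOC_CONF environment variable works equally well):

    #include <jemalloc/jemalloc.h>

    /* Junk-fill on allocation only; "junk:free" selects the converse. */
    const char *malloc_conf = "junk:alloc";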
Add:
--with-lg-page
--with-lg-page-sizes
--with-lg-size-class-group
--with-lg-quantum
Get rid of STATIC_PAGE_SHIFT, in favor of directly setting LG_PAGE.
Fix various edge conditions exposed by the configure options.
Abstract arenas access to use arena_get() (or a0get() where appropriate)
rather than directly reading e.g. arenas[ind]. Prior to the addition of
the arenas.extend mallctl, the worst possible outcome of directly
accessing arenas was a stale read, but arenas.extend may allocate and
assign a new array to arenas.
Add a tsd-based arenas_cache, which amortizes arenas reads. This
introduces some subtle bootstrapping issues, with tsd_boot() now being
split into tsd_boot[01]() to support tsd wrapper allocation
bootstrapping, as well as an arenas_cache_bypass tsd variable which
dynamically terminates allocation of arenas_cache itself.
Promote a0malloc(), a0calloc(), and a0free() to be generally useful for
internal allocation, and use them in several places (more may be
appropriate).
Abstract arena->nthreads management and fix a missing decrement during
thread destruction (recent tsd refactoring left arenas_cleanup()
unused).
Change arena_choose() to propagate OOM, and handle OOM in all callers.
This is important for providing consistent allocation behavior when the
MALLOCX_ARENA() flag is being used. Prior to this fix, an OOM could
cause an allocation to be silently satisfied by a different arena than
the one specified.
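A minimal sketch of the intended usage (assumes an unprefixed build; the
helper name is illustrative):

    #include <jemalloc/jemalloc.h>
    #include <stddef.h>

    /* Create a fresh arena and allocate from it explicitly.  With OOM
     * propagation, failure surfaces as NULL rather than as a silent
     * allocation from some other arena. */
    void *
    alloc_from_new_arena(size_t size)
    {
        unsigned arena_ind;
        size_t sz = sizeof(arena_ind);

        if (mallctl("arenas.extend", &arena_ind, &sz, NULL, 0) != 0)
            return (NULL);
        return (mallocx(size, MALLOCX_ARENA(arena_ind)));
    }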
Normalize size classes to use the same number of size classes per size
doubling (currently hard coded to 4), across the entire range of size
classes. Small size classes already used this spacing, but in order to
support this change, additional small size classes now fill [4 KiB .. 16
KiB). Large size classes range from [16 KiB .. 4 MiB). Huge size
classes now support non-multiples of the chunk size in order to fill (4
MiB .. 16 MiB).
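As a concrete illustration (a sketch assuming default 4 KiB pages and 4
classes per doubling), nallocx() reports the class a request rounds up
to:

    #include <jemalloc/jemalloc.h>
    #include <stddef.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* Requests just above 4 KiB land in the 5, 6, 7, and 8 KiB
         * classes; expected output: 5120, 6144, 7168, 8192. */
        size_t reqs[] = {4097, 5121, 6145, 7169};
        for (size_t i = 0; i < sizeof(reqs) / sizeof(reqs[0]); i++)
            printf("%zu -> %zu\n", reqs[i], nallocx(reqs[i], 0));
        return (0);
    }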
Don't disable tcache when lazy-lock is configured. There already exists
a mechanism to disable tcache, but doing so automatically due to
lazy-lock causes surprising performance behavior.
Fix tsd cleanup regressions that were introduced in
5460aa6f66 (Convert all tsd variables to
reside in a single tsd structure.). These regressions were twofold:
1) tsd_tryget() should never (and need never) return NULL. Rename it to
tsd_fetch() and simplify all callers.
2) tsd_*_set() must only be called when tsd is in the nominal state,
because cleanup happens during the nominal-->purgatory transition,
and re-initialization must not happen while in the purgatory state.
Add tsd_nominal() and use it as needed. Note that tsd_*{p,}_get()
can still be used as long as no re-initialization that would require
cleanup occurs. This means that e.g. the thread_allocated counter
can be updated unconditionally.
Mark the following conditions as unlikely:
* assertion failure
* malloc_init failure
* malloc not already initialized (in malloc_init)
* running in valgrind
* thread cache disabled at runtime
Clang and GCC already consider a comparison with NULL or -1 to be cold,
so many branches (e.g. out-of-memory checks) are already correctly
treated as cold, and marking them explicitly is not important.
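A hedged sketch of the idiom (the identifiers are illustrative, not
jemalloc's exact internal names):

    #include <stdbool.h>

    #ifdef __GNUC__
    #  define unlikely(x)   __builtin_expect(!!(x), 0)
    #else
    #  define unlikely(x)   (x)
    #endif

    /* Stand-in for e.g. "running in valgrind" or "thread cache disabled
     * at runtime". */
    static bool rare_condition;

    int
    fast_path(int x)
    {
        if (unlikely(rare_condition)) {
            /* Cold path; the hint keeps it out of the hot layout. */
            return (-1);
        }
        return (x + 1);
    }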
Forcefully disable tcache if running inside Valgrind, and remove
Valgrind calls in tcache-specific code.
Restructure Valgrind-related code to move most Valgrind calls out of the
fast path functions.
Take advantage of static knowledge to elide some branches in
JEMALLOC_VALGRIND_REALLOC().
Make promotion of sampled small objects to large objects mandatory, so
that profiling metadata can always be stored in the chunk map, rather
than requiring one pointer per small region in each small-region page
run. In practice the non-prof-promote code was only useful when using
jemalloc to track all objects and report them as leaks at program exit.
However, Valgrind is at least as good a tool for this particular use
case.
Furthermore, the non-prof-promote code is getting in the way of
some optimizations that will make heap profiling much cheaper for the
predominant use case (sampling a small representative proportion of all
allocations).
Don't junk fill reallocations for which the request size is less than
the current usable size, but not enough smaller to cause a size class
change. Unlike malloc()/calloc()/realloc(), *allocx() contractually
treats the full usize as the allocation, so a caller can ask for zeroed
memory via mallocx() and a series of rallocx() calls that all specify
MALLOCX_ZERO, and be assured that all newly allocated bytes will be
zeroed and made available to the application without danger of allocator
mutation until the size class decreases enough to cause usize reduction.
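A minimal sketch of the contract (assumes an unprefixed build; the
helper name is illustrative):

    #include <jemalloc/jemalloc.h>
    #include <stddef.h>

    /* Grow a zeroed allocation; bytes beyond the old usable size are
     * zeroed by rallocx(), so no explicit re-zeroing is needed. */
    void *
    grow_zeroed(size_t initial, size_t larger)
    {
        void *p = mallocx(initial, MALLOCX_ZERO);
        if (p == NULL)
            return (NULL);
        void *q = rallocx(p, larger, MALLOCX_ZERO);
        if (q == NULL) {
            dallocx(p, 0);
            return (NULL);
        }
        return (q);
    }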
Fix a Valgrind integration flaw that caused Valgrind warnings about
reads of uninitialized memory in internal zero-initialized data
structures (relevant to tcache and prof code).
Tighten Valgrind integration such that immediately after memory is
validated or zeroed, Valgrind is told to forget the memory's 'defined'
state. The only place newly allocated memory should be left marked as
'defined' is in the public functions (e.g. calloc() and realloc()).
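A hedged sketch of the pattern, using Valgrind's client request macros
(the helper is illustrative, not jemalloc's internal code):

    #include <string.h>
    #include <valgrind/memcheck.h>

    /* The allocator zeroes memory for its own purposes, then tells
     * Valgrind to forget the 'defined' state; only public entry points
     * such as calloc() leave their result marked defined. */
    static void
    internal_zero(void *ptr, size_t size)
    {
        memset(ptr, 0, size);
        VALGRIND_MAKE_MEM_UNDEFINED(ptr, size);
    }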
Embed the bin index for small page runs into the chunk page map, in
order to omit the bracketed loads in the following dependent load
sequence: ptr-->mapelm-->[run-->bin-->]bin_info
Move various non-critical code out of the inlined function chain into
helper functions (tcache_event_hard(), arena_dalloc_small(), and
locking).
Implement Valgrind support, as well as the redzone and quarantine
features, which help Valgrind detect memory errors. Redzones are only
implemented for small objects because the changes necessary to support
redzones around large and huge objects are complicated by in-place
reallocation, to the point that it isn't clear that the maintenance
burden is worth the incremental improvement to Valgrind support.
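For example, a hedged sketch of enabling these features via the options
interface (assumes an unprefixed build; the quarantine size shown is
arbitrary):

    #include <jemalloc/jemalloc.h>

    /* Redzones around small objects plus a 4 MiB per-thread quarantine;
     * the same string can be supplied via the MALLOC_CONF environment
     * variable. */
    const char *malloc_conf = "redzone:true,quarantine:4194304";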
Merge arena_salloc() and arena_salloc_demote().
Refactor i[v]salloc() to expose the 'demote' option.
s/PAGE_SHIFT/LG_PAGE/g and s/PAGE_SIZE/PAGE/g.
Remove remnants of the dynamic-page-shift code.
Rename the "arenas.pagesize" mallctl to "arenas.page".
Remove the "arenas.chunksize" mallctl, which is redundant with
"opt.lg_chunk".
glibc uses memalign()/free() to allocate/deallocate TLS, which means
that it is unsafe to set TLS variables as a side effect of free() --
they may already be deallocated. Work around this by avoiding
tcache_create() within free().
Reported by Mike Hommey.
Implement tsd, which is a TLS/TSD abstraction that uses one or both
internally. Modify bootstrapping such that no tsd's are utilized until
allocation is safe.
Remove malloc_[v]tprintf(), and use malloc_snprintf() instead.
Fix %p argument size handling in malloc_vsnprintf().
Fix a long-standing statistics-related bug in the "thread.arena"
mallctl that could cause crashes due to linked list corruption.
Remove the lg_tcache_gc_sweep option, because it is no longer
very useful. Prior to the addition of dynamic adjustment of tcache fill
count, it was possible for fill/flush overhead to be a problem, but this
problem no longer occurs.
Program-generate small size class tables for all valid combinations of
LG_TINY_MIN, LG_QUANTUM, and PAGE_SHIFT. Use the appropriate table to generate
all relevant data structures, and remove the distinction between
tiny/quantum/cacheline/subpage bins.
Remove --enable-dynamic-page-shift. This option didn't prove useful in
practice, and it prevented optimizations.
Add Tilera architecture support.
tcache_get() is inlined, so do the config_tcache check inside
tcache_get() and simplify its callers.
Make arena_malloc() an inline function, since it is part of the malloc()
fast path.
Remove conditional logic that caused build issues if --disable-tcache was
specified.
Remove structure magic, because 1) it is no longer conditional, and 2)
it stopped being very effective at detecting memory corruption several
years ago.
Convert configuration-related cpp conditional logic to use static
constant variables, e.g.:
    #ifdef JEMALLOC_DEBUG
        [...]
    #endif
becomes:
    if (config_debug) {
        [...]
    }
The advantage is clearer, more concise code. The main disadvantage is
that data structures no longer have conditionally defined fields, so
they pay the cost of all fields regardless of whether they are used. In
practice, this is only a minor concern; config_stats will go away in an
upcoming change, and config_prof is the only other major feature that
depends on more than a few special-purpose fields.
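A sketch of how such a constant can be derived from the same cpp symbol
(this mirrors the pattern used in the internal headers, though details
may differ):

    #include <stdbool.h>

    static const bool config_debug =
    #ifdef JEMALLOC_DEBUG
        true
    #else
        false
    #endif
        ;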