Split decay-based purging into two phases, the first of which uses lazy
purging to convert dirty pages to "muzzy", and the second of which uses
forced purging, decommit, or unmapping to convert pages to clean or
destroy them altogether. Not all operating systems support lazy
purging, yet the application may provide extent hooks that implement
lazy purging, so care must be taken to dynamically omit the first phase
when necessary.
The mallctl interfaces change as follows:
- opt.decay_time --> opt.{dirty,muzzy}_decay_time
- arena.<i>.decay_time --> arena.<i>.{dirty,muzzy}_decay_time
- arenas.decay_time --> arenas.{dirty,muzzy}_decay_time
- stats.arenas.<i>.pdirty --> stats.arenas.<i>.p{dirty,muzzy}
- stats.arenas.<i>.{npurge,nmadvise,purged} -->
stats.arenas.<i>.{dirty,muzzy}_{npurge,nmadvise,purged}
This resolves#521.
The new feature, opt.percpu_arena, determines thread-arena association
dynamically based CPU id. Three modes are supported: "percpu", "phycpu"
and disabled.
"percpu" uses the current core id (with help from sched_getcpu())
directly as the arena index, while "phycpu" will assign threads on the
same physical CPU to the same arena. In other words, "percpu" means # of
arenas == # of CPUs, while "phycpu" has # of arenas == 1/2 * (# of
CPUs). Note that no runtime check on whether hyper threading is enabled
is added yet.
When enabled, threads will be migrated between arenas when a CPU change
is detected. In the current design, to reduce overhead from reading CPU
id, each arena tracks the thread accessed most recently. When a new
thread comes in, we will read CPU id and update arena if necessary.
When witness is enabled, lock rank order needs to be preserved during
prefork, not only for each arena, but also across arenas. This change
breaks arena_prefork into further stages to ensure valid rank order
across arenas. Also changed test/unit/fork to use a manual arena to
catch this case.
This fixes tcache_flush for manual tcaches, which wasn't able to find
the correct arena it associated with. Also changed the decay test to
cover this case (by using manually created arenas).
Fix compute_size_with_overflow() to use a high_bits mask that has the
high bits set, rather than the low bits. This regression was introduced
by 5154ff32ee (Unify the allocation
paths).
Synchronize tcaches with tcaches_mtx rather than ctl_mtx. Add missing
synchronization for tcache flushing. This bug was introduced by
1cb181ed63 (Implement explicit tcache
support.), which was first released in 4.0.0.
In the refactoring that unified the allocation paths, usize was substituted for
size. This worked fine under the default test configuration, but triggered
asserts when we started beefing up our CI testing.
This change fixes the issue, and clarifies the comment describing the argument
selection that it got wrong.
Avoid the name secure_getenv to avoid redeclaring secure_getenv when
secure_getenv is present but its use is manually disabled via
ac_cv_func_secure_getenv=no.
This unifies the allocation paths for malloc, posix_memalign, aligned_alloc,
calloc, memalign, valloc, and mallocx, so that they all share common code where
they can.
There's more work that could be done here, but I think this is the smallest
discrete change in this direction.
Add/rename related mallctls:
- Add stats.arenas.<i>.base .
- Rename stats.arenas.<i>.metadata to stats.arenas.<i>.internal .
- Add stats.arenas.<i>.resident .
Modify the arenas.extend mallctl to take an optional (extent_hooks_t *)
argument so that it is possible for all base allocations to be serviced
by the specified extent hooks.
This resolves#463.
Some versions of Android provide a pthreads library without providing
pthread_atfork(), so in practice a separate feature test is necessary
for the latter.
Add an "over-size" extent heap in which to store extents which exceed
the maximum size class (plus cache-oblivious padding, if enabled).
Remove psz2ind_clamp() and use psz2ind() instead so that trying to
allocate the maximum size class can in principle succeed. In practice,
this allows assertions to hold so that OOM errors can be successfully
generated.
This works around malloc_conf not being properly initialized by at least
the cygwin toolchain. Prior build system changes to use
-Wl,--[no-]whole-archive may be necessary for malloc_conf resolution to
work properly as a non-weak symbol (not tested).
This is generally correct (no need for weak symbols since no jemalloc
library is involved in the link phase), and avoids linking problems
(apparently unininitialized non-NULL malloc_conf) when using cygwin with
gcc.
glibc defines its malloc implementation with several weak and strong
symbols:
strong_alias (__libc_calloc, __calloc) weak_alias (__libc_calloc, calloc)
strong_alias (__libc_free, __cfree) weak_alias (__libc_free, cfree)
strong_alias (__libc_free, __free) strong_alias (__libc_free, free)
strong_alias (__libc_malloc, __malloc) strong_alias (__libc_malloc, malloc)
The issue is not with the weak symbols, but that other parts of glibc
depend on __libc_malloc explicitly. Defining them in terms of jemalloc
API's allows the linker to drop glibc's malloc.o completely from the link,
and static linking no longer results in symbol collisions.
Another wrinkle: jemalloc during initialization calls sysconf to
get the number of CPU's. GLIBC allocates for the first time before
setting up isspace (and other related) tables, which are used by
sysconf. Instead, use the pthread API to get the number of
CPUs with GLIBC, which seems to work.
This resolves#442.
Rather than protecting dss operations with a mutex, use atomic
operations. This has negligible impact on synchronization overhead
during typical dss allocation, but is a substantial improvement for
extent_in_dss() and the newly added extent_dss_mergeable(), which can be
called multiple times during extent deallocations.
This change also has the advantage of avoiding tsd in deallocation paths
associated with purging, which resolves potential deadlocks during
thread exit due to attempted tsd resurrection.
This resolves#425.
Add spin_t and spin_{init,adaptive}(), which provide a simple
abstraction for adaptive spinning.
Adaptively spin during busy waits in bootstrapping and rtree node
initialization.
rallocx() for an alignment-constrained request may end up with a
smaller-than-worst-case size if in-place reallocation succeeds due to
serendipitous alignment. In such cases, sampling may not happen.
Look up chunk metadata via the radix tree, rather than using
CHUNK_ADDR2BASE().
Propagate pointer's containing extent.
Minimize extent lookups by doing a single lookup (e.g. in free()) and
propagating the pointer's extent into nearly all the functions that may
need it.
Use pszind_t size classes rather than szind_t size classes, and always
reserve space for NPSIZES elements. This removes unused heaps that are
not multiples of the page size, and adds (currently) unused heaps for
all huge size classes, with the immediate benefit that the size of
arena_t allocations is constant (no longer dependent on chunk size).
These compute size classes and indices similarly to size2index(),
index2size() and s2u(), respectively, but using the subset of size
classes that are multiples of the page size. Note that pszind_t and
szind_t are not interchangeable.