Reduce rtree memory usage by storing booleans (1 byte each) rather than
pointers. The rtree code is only used to record whether jemalloc manages
a chunk of memory, so there's no need to store pointers in the rtree.
Increase rtree node size to 64 KiB in order to reduce tree depth from 13
to 3 on 64-bit systems. The conversion to more compact leaf nodes was
enough by itself to make the rtree depth 1 on 32-bit systems; because root
nodes are shrunk below the specified node size when possible, the node size
change has no impact on 32-bit systems (assuming the default chunk size).
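A rough sketch of the depth arithmetic, assuming the default 4 MiB chunks
(lg_chunk = 22), 8-byte interior-node pointers, and the new 1-byte leaf
entries (the exact constants live in the rtree code):
    interior level: lg(64 KiB / 8 bytes) = 13 key bits
    leaf level:     lg(64 KiB / 1 byte)  = 16 key bits
    64-bit: 64 - 22 = 42 key bits = 13 + 13 + 16  ->  depth 3
    32-bit: 32 - 22 = 10 key bits fit in one (shrunken) leaf  ->  depth 1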
Verify that freed regions are quarantined, and that redzone corruption
is detected.
Introduce a testing idiom for intercepting/replacing internal functions.
In this case the replaced function is ordinarily a static function, but
the idiom should work similarly for library-private functions.
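A minimal sketch of the idiom with hypothetical names (the real test
replaces a redzone-corruption handler and uses jemalloc's test harness):
    #include <stdbool.h>
    #include <stddef.h>

    /* Library side: the ordinarily-static function becomes the default
     * implementation behind a library-private function pointer. */
    static void
    redzone_error_impl(void *ptr, size_t offset)
    {
        (void)ptr; (void)offset;
        /* default behavior: report the corruption in the real allocator */
    }
    void (*redzone_error_hook)(void *ptr, size_t offset) = redzone_error_impl;

    /* Test side: swap in a recording replacement around the scenario. */
    static bool saw_redzone_error;
    static void
    record_redzone_error(void *ptr, size_t offset)
    {
        (void)ptr; (void)offset;
        saw_redzone_error = true;
    }

    static void
    test_redzone(void)
    {
        void (*orig)(void *, size_t) = redzone_error_hook;
        redzone_error_hook = record_redzone_error;
        /* ... corrupt a redzone, free the region, check saw_redzone_error ... */
        redzone_error_hook = orig;
    }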
Don't junk-fill reallocations whose requested size is smaller than the
current usable size, but not small enough to cause a size class change.
Unlike malloc()/calloc()/realloc(), the *allocx() functions contractually
treat the full usize as the allocation, so a caller can ask for zeroed
memory via mallocx() and a series of rallocx() calls that all specify
MALLOCX_ZERO, and be assured that all newly allocated bytes will be zeroed
and handed to the application without risk of allocator mutation, until
the size class decreases enough to reduce usize.
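For example, assuming 120 and 128 map to the same size class (a hedged
illustration of the contract):
    void *p = mallocx(128, MALLOCX_ZERO);  /* usize == 128, fully zeroed   */
    p = rallocx(p, 120, MALLOCX_ZERO);     /* same size class: bytes beyond
                                              120 are still part of usize,
                                              so they must stay zeroed and
                                              must not be junk-filled       */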
Refactor such that arena_prof_ctx_set() receives usize as an argument,
and use it to determine whether to handle ptr as a small region, rather
than reading the chunk page map.
Implement the *allocx() API, which is a successor to the *allocm() API.
The *allocx() functions are slightly simpler to use because they have
fewer parameters, they directly return the results of primary interest,
and mallocx()/rallocx() avoid the strict aliasing pitfall that
allocm()/rallocm() share with posix_memalign(). The following code
violates strict aliasing rules:
    foo_t *foo;
    allocm((void **)&foo, NULL, 42, 0);
whereas the following is safe:
    foo_t *foo;
    void *p;
    allocm(&p, NULL, 42, 0);
    foo = (foo_t *)p;
mallocx() does not have this problem:
    foo_t *foo = (foo_t *)mallocx(42, 0);
Add JEMALLOC_INLINE_C and use it instead of JEMALLOC_INLINE in .c files,
so that the annotated functions are always static.
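A minimal sketch of the intent (hedged; the real definitions live in the
internal headers and carry additional attribute decoration):
    #ifdef JEMALLOC_DEBUG
    /* Header "inline" functions become real functions in debug builds,
     * but functions annotated in .c files must remain static either way. */
    #  define JEMALLOC_INLINE_C static
    #else
    #  define JEMALLOC_INLINE_C static inline
    #endif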
Remove SFMT's inline-related macros and use jemalloc's instead, so that
there's no danger of interactions with jemalloc's definitions that
disable inlining for debug builds.
Refactor tests to use explicit testing assertions, rather than diff'ing
test output. This makes the test code a bit shorter, more explicitly
encodes testing intent, and makes test failure diagnosis more
straightforward.
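A short example of the resulting style (hedged; macro names approximate
the test harness):
    TEST_BEGIN(test_basic)
    {
        void *p = mallocx(42, 0);
        assert_ptr_not_null(p, "Unexpected mallocx() error");
        assert_zu_ge(sallocx(p, 0), 42, "Real size smaller than expected");
        dallocx(p, 0);
    }
    TEST_END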
Fix malloc_tsd_dalloc() to bypass tcache when dallocating, so that there
is no danger of causing tcache reincarnation (an infinite loop) during
thread exit. Whether this infinite loop occurs depends on the pthreads TSD
implementation; it is known to occur on Solaris.
Submitted by Markus Eberspächer.
When using LinuxThreads, pthread_setspecific() triggers recursive
allocation on all threads. Work around this by creating a global linked
list of in-progress tsd initializations.
This modifies the _tsd_get_wrapper macro-generated function. When it has
to initialize a TSD object it will push the item onto the linked list
first. If this causes a recursive allocation then the _get_wrapper
request is satisfied from the list. When pthread_setspecific returns the
item is removed from the list.
This effectively adds a very poor substitute for real TLS used only
during pthread_setspecific allocation recursion.
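A hedged, simplified sketch of the mechanism (the real code keeps the list
behind a mutex and uses jemalloc's internal list macros and wrapper types):
    #include <pthread.h>
    #include <stdlib.h>

    typedef struct init_block_s {
        struct init_block_s *next;
        pthread_t            thread;
        void                *data;
    } init_block_t;

    static init_block_t *init_list;   /* in-progress tsd initializations */
    static pthread_key_t tsd_key;

    static void *
    tsd_get_wrapper(void)
    {
        void *wrapper = pthread_getspecific(tsd_key);
        if (wrapper != NULL)
            return wrapper;
        /* If pthread_setspecific() below allocated and re-entered us,
         * satisfy the recursive request from the in-progress list. */
        for (init_block_t *b = init_list; b != NULL; b = b->next) {
            if (pthread_equal(b->thread, pthread_self()))
                return b->data;
        }
        /* malloc(64) stands in for the real wrapper allocation. */
        init_block_t block = { init_list, pthread_self(), malloc(64) };
        init_list = &block;                 /* push before setspecific()  */
        pthread_setspecific(tsd_key, block.data);
        init_list = block.next;             /* pop (a locked unlink really) */
        return block.data;
    }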
Signed-off-by: Crestez Dan Leonard <lcrestez@ixiacom.com>
Add a missing mutex unlock in a malloc_init_hard() error path (failed
mutex initialization). In practice this bug was very unlikely to ever
trigger, but if it did, application deadlock would likely result.
Reported by Pat Lynch.
Fix a compiler warning in chunk_record() that was due to reading node
rather than xnode. In practice this did not cause any correctness
issue, but dataflow analysis in some compilers cannot tell that node and
xnode are always equal in the cases where the read is reached.
Fix a race condition in the "arenas.extend" mallctl that could lead to
internal data structure corruption. The race could be hit if one
thread called the "arenas.extend" mallctl while another thread
concurrently triggered initialization of one of the lazily created
arenas.
Fix a Valgrind integration flaw that caused Valgrind warnings about
reads of uninitialized memory in internal zero-initialized data
structures (relevant to tcache and prof code).
Add the JEMALLOC_ALWAYS_INLINE_C macro and use it for always-inlined
functions declared in .c files. This fixes a function attribute
inconsistency for debug builds that resulted in (harmless) compiler
warnings about functions not being inlinable.
Reported by Ricardo Nabinger Sanchez.
Fix chunk_record() to unlock chunks_mtx before deallocating a base
node, in order to avoid potential deadlock. This fix addresses the
second of two similar bugs.
Fix a chunk recycling bug that could cause the allocator to lose track
of whether a chunk was zeroed. On FreeBSD, NetBSD, and OS X, it could
cause corruption if allocating via sbrk(2) (unlikely unless running with
the "dss:primary" option specified). This was completely harmless on
Linux unless using mlockall(2) (and unlikely even then, unless the
--disable-munmap configure option or the "dss:primary" option was
specified). This regression was introduced in 3.1.0 by the
mlockall(2)/madvise(2) interaction fix.
Internal reallocation of the quarantined object array leaked the old
array, and a reallocation failure for that array (very unlikely) resulted
in memory corruption.
Avoid writing to uninitialized TLS as a side effect of deallocation.
Initializing TLS during deallocation is unsafe because it is possible
that a thread never did any allocation, and that TLS has already been
deallocated by the threads library, resulting in write-after-free
corruption. These fixes affect prof_tdata and quarantine; all other
uses of TLS are already safe, whether intentionally (as for tcache) or
unintentionally (as for arenas).
Revert refactoring of opt_abort and opt_junk declarations. clang
accepts the config_*-based declarations (and generates correct code),
but gcc complains with:
    error: initializer element is not constant
Update hash from MurmurHash2 to MurmurHash3, primarily because the
latter generates 128 bits in a single call for no extra cost, which
simplifies integration with cuckoo hashing.
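For reference, the upstream interface yields both 64-bit halves in one
call (hedged example using the reference smhasher prototype rather than
jemalloc's internal wrapper):
    #include <stdint.h>
    #include <string.h>

    void MurmurHash3_x64_128(const void *key, int len, uint32_t seed, void *out);

    static void
    hash_both(const char *key, uint64_t *h1, uint64_t *h2)
    {
        uint64_t out[2];
        MurmurHash3_x64_128(key, (int)strlen(key), 0x5eed, out);
        *h1 = out[0];  /* e.g. index into the first cuckoo table  */
        *h2 = out[1];  /* e.g. index into the second cuckoo table */
    }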
Tighten Valgrind integration such that immediately after memory is
validated or zeroed, Valgrind is told to forget the memory's 'defined'
state. The only place newly allocated memory should be left marked as
'defined' is in the public functions (e.g. calloc() and realloc()).
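The pattern, sketched with Valgrind's client-request macros (hedged; the
real call sites are spread throughout the arena and chunk code):
    #include <stddef.h>
    #include <valgrind/memcheck.h>

    /* Right after internal code validates or zeroes a region, drop the
     * "defined" state again so later missing initialization is caught. */
    static void
    forget_defined(void *addr, size_t size)
    {
        VALGRIND_MAKE_MEM_UNDEFINED(addr, size);
    }

    /* Only public entry points such as calloc()/realloc() leave memory
     * marked defined, via VALGRIND_MAKE_MEM_DEFINED(addr, size). */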
Move validation of supposedly zeroed pages from chunk_alloc() to
chunk_recycle(). There is little point in validating newly mapped
memory returned by chunk_alloc_mmap(), and memory that comes from sbrk()
is explicitly zeroed, so there is little risk to assuming that
chunk_alloc_dss() actually does the zeroing properly.
This relaxation of validation can make a big difference to application
startup time and overall system usage on platforms that use jemalloc as
the system allocator (namely FreeBSD).
Submitted by Ian Lepore <ian@FreeBSD.org>.
This preserves POLA (the principle of least astonishment) on FreeBSD (at
least), as free(3) is generally assumed not to fiddle with errno.
Signed-off-by: Garrett Cooper <yanegomi@gmail.com>
Modify processing of the lg_chunk option so that it clips an
out-of-range input to the edge of the valid range. This makes it
possible to request the minimum possible chunk size without intimate
knowledge of allocator internals.
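For example (hedged; the exact lower bound remains an allocator internal):
    /* Compiled into the application: request an out-of-range chunk size
     * and rely on jemalloc clipping it to the smallest supported value. */
    const char *malloc_conf = "lg_chunk:0";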
Submitted by Ian Lepore (see FreeBSD PR bin/174641).
Fix chunk_recycle() to unconditionally inform Valgrind that returned
memory is undefined. This fixes Valgrind warnings that would result
from a huge allocation being freed, then recycled for use as an arena
chunk. The arena code would write metadata to the chunk header, and
Valgrind would report these writes as invalid.
Purge unused dirty pages in an order that first performs clean/dirty run
defragmentation, in order to mitigate available run fragmentation.
Remove the limitation that prevented purging unless at least one chunk
worth of dirty pages had accumulated in an arena. This limitation was
intended to avoid excessive purging for small applications, but the
threshold was arbitrary, and the effect was of questionable utility.
Relax opt_lg_dirty_mult from 5 to 3. This compensates for increased
likelihood of allocating clean runs, given the same ratio of clean:dirty
runs, and reduces the potential for repeated purging in pathological
large malloc/free loops that push the active:dirty page ratio just over
the purge threshold.
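Rough arithmetic behind the change (hedged):
    lg_dirty_mult = 5:  purge once dirty pages exceed active/32
    lg_dirty_mult = 3:  purge once dirty pages exceed active/8
A loop that repeatedly pushes the dirty count just past the old 1/32
threshold must now accumulate four times as many dirty pages before
purging is triggered again.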
Add the "arenas.extend" mallctl, so that it is possible to create new
arenas that are outside the set that jemalloc automatically multiplexes
threads onto.
Add the ALLOCM_ARENA() flag for {,r,d}allocm(), so that it is possible
to explicitly allocate from a particular arena.
Add the "opt.dss" mallctl, which controls the default precedence of dss
allocation relative to mmap allocation.
Add the "arena.<i>.dss" mallctl, which makes it possible to set the
default dss precedence on a per arena or global basis.
Add the "arena.<i>.purge" mallctl, which obsoletes "arenas.purge".
Add the "stats.arenas.<i>.dss" mallctl.
The Mozilla build hides everything by default using a visibility pragma and
unhides only symbols from explicitly listed headers. But this doesn't work
on FreeBSD
because _pthread_mutex_init_calloc_cb is neither documented nor exposed
via any header.
Fix mutex acquisition order inversion for the chunks rtree and the base
mutex. Chunks rtree acquisition was introduced by the previous commit,
so this bug was short-lived.
Add a library constructor for jemalloc that initializes the allocator.
This fixes a race that could occur if threads were created by the main
thread prior to any memory allocation, followed by fork(2), and then
memory allocation in the child process.
Fix the prefork/postfork functions to acquire/release the ctl, prof, and
rtree mutexes. This fixes various fork() child process deadlocks, but
one possible deadlock remains (intentionally) unaddressed: prof
backtracing can acquire runtime library mutexes, so deadlock is still
possible if heap profiling is enabled during fork(). This deadlock is
known to be a real issue in at least the case of libgcc-based
backtracing.
Reported by tfengjun.
mlockall(2) can cause purging via madvise(2) to fail. Fix purging code
to check whether madvise() succeeded, and base zeroed page metadata on
the result.
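A hedged sketch of the check (Linux flavor; other platforms purge with
different advice and never report the pages as zeroed):
    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /* Under mlockall(2), madvise() can fail; only report the pages as
     * zeroed when the purge actually succeeded. */
    static bool
    purge_and_report_zeroed(void *addr, size_t length)
    {
        return (madvise(addr, length, MADV_DONTNEED) == 0);
    }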
Reported by Olivier Lecomte.
Refactor code such that arena_mapbits_{large,small}_set() always
preserves the unzeroed flag, and manually manipulate the unzeroed flag
in the one case where it actually gets reset (in arena_chunk_purge()).
This fixes unzeroed preservation bugs in arena_run_split() and
arena_ralloc_large_grow(). These bugs caused large calloc() to return
non-zeroed memory under some circumstances.
Add the --enable-mremap option, and disable the use of mremap(2) by
default, for the same reason that freeing chunks via munmap(2) is
disabled by default on Linux: semi-permanent VM map fragmentation.
Fix chunk_recycle() to correctly compute trailsize and re-insert
trailing chunks. This fixes a major virtual memory leak.
Simplify chunk_record() to avoid dropping/re-acquiring chunks_mtx.
Simplify chunk_alloc_mmap() to no longer attempt map extension. The
extra complexity isn't warranted, because although in the success case
it saves one system call as compared to immediately falling back to
chunk_alloc_mmap_slow(), it also makes the failure case even more
expensive. This simplification removes two bugs:
- For Windows platforms, pages_unmap() wasn't being called for unaligned
mappings prior to falling back to chunk_alloc_mmap_slow(). This
caused permanent virtual memory leaks.
- For non-Windows platforms, alignment greater than chunksize caused
pages_map() to be called with size 0 when attempting map extension.
This always resulted in an mmap() error, and subsequent fallback to
chunk_alloc_mmap_slow().
If an application wants to override je_malloc_message, it is better to define
the symbol locally than to change its value in main(), which might be too late
for various reasons.
Because je_malloc_message was initialized in util.c, statically linking
jemalloc with an application that defines je_malloc_message failed with a
"multiple definition of" error for the symbol.
Defining it without a value (like je_malloc_conf) makes it more easily
overridable.
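With the definition left uninitialized, an application can now provide its
own at link time (hedged example):
    #include <string.h>
    #include <unistd.h>

    /* Route jemalloc's diagnostic output to stderr. */
    static void
    my_malloc_message(void *cbopaque, const char *s)
    {
        (void)cbopaque;
        (void)write(STDERR_FILENO, s, strlen(s));
    }
    void (*je_malloc_message)(void *cbopaque, const char *s) = my_malloc_message;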
Further optimize arena_salloc() to only look at the binind chunk map
bits in the common case.
Add more sanity checks to arena_salloc() that detect chunk map
inconsistencies for large allocations (whether due to allocator bugs or
application bugs).
Embed the bin index for small page runs into the chunk page map, in
order to omit [...] in the following dependent load sequence:
    ptr-->mapelm-->[run-->bin-->]bin_info
Move various non-critical code out of the inlined function chain into
helper functions (tcache_event_hard(), arena_dalloc_small(), and
locking).
These newly added macros will be used to implement the equivalent under
MSVC. Also, move the definitions to headers, where they make more sense,
and where some of them (e.g. malloc) are even more useful.
Using errno on win32 doesn't quite work, because the value set in a shared
library can't be read from e.g. an executable calling the function setting
errno.
At the same time, since buferror() always reads errno/GetLastError()
itself, stop passing the error value to it.
MSVC doesn't support C99, and building as C++ to be able to use them is
dangerous, as C++ and C99 are incompatible.
Introduce a VARIABLE_ARRAY macro that either uses VLA when supported,
or alloca() otherwise. Note that using alloca() inside loops doesn't
quite work like VLAs, thus the use of VARIABLE_ARRAY there is discouraged.
It might be worth investigating ways to check whether VARIABLE_ARRAY is
used in such a context at runtime in debug builds, and bail out if that
happens.
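A minimal sketch of the macro (hedged; the real feature test belongs in
configure rather than the hypothetical JEMALLOC_HAS_VLA shown here):
    #include <stddef.h>
    #ifdef _MSC_VER
    #  include <malloc.h>   /* alloca() */
    #else
    #  include <alloca.h>
    #endif

    #ifdef JEMALLOC_HAS_VLA
    #  define VARIABLE_ARRAY(type, name, count) type name[(count)]
    #else
    #  define VARIABLE_ARRAY(type, name, count) \
         type *name = alloca(sizeof(type) * (count))
    #endif

    /* An alloca()-based "array" is only reclaimed on function return,
     * which is why using VARIABLE_ARRAY inside loops is discouraged. */
    static void
    fill(size_t n)
    {
        VARIABLE_ARRAY(int, tmp, n);
        for (size_t i = 0; i < n; i++)
            tmp[i] = 0;
    }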