In the process, we can do some strength reduction, changing the fetch-adds and
fetch-subs to be simple loads followed by stores, since the modifications all
occur while holding the mutex.
The C11 atomics backport removed this #define, which degraded atomic 64-bit
reads to require a lock even on platforms that support them. This commit fixes
that.
This fixes tcache_flush for manual tcaches, which wasn't able to find
the correct arena it associated with. Also changed the decay test to
cover this case (by using manually created arenas).
This simplifies what would be pairing heap operations to the equivalent
of LIFO queue operations. This is a complementary optimization in the
context of delayed coalescing for cached extents.
These functions select the easiest-to-remove element in the heap, which
is either the most recently inserted aux list element or the root. If
no calls are made to first() or remove_first(), the behavior (and time
complexity) is the same as for a LIFO queue.
Rather than purging uncoalesced extents, perform just enough incremental
coalescing to purge only fully coalesced extents. In the absence of
cached extent reuse, the immediate versus delayed incremental purging
algorithms result in the same purge order.
This resolves#655.
Fix the test_decay_ticker test to carefully control slab
creation/destruction such that the decay backlog reliably reaches zero.
Use an isolated arena so that no extraneous allocation can confuse the
situation. Speed up time during the latter part of the test so that the
entire decay time can expire in a reasonable amount of wall time.
In the C11 atomics backport, we couldn't use not_reached() in
atomic_enum_to_builtin (in atomic_gcc_atomic.h), since atomic.h was hermetic and
assert.h wasn't; there was a dependency issue. assert.h is hermetic now, so we
can include it.
This is the first header refactoring diff, #533. It splits the assert and util
components into separate, hermetic, header files. In the process, it splits out
two of the large sub-components of util (the stdio.h replacement, and bit
manipulation routines) into their own components (malloc_io.h and bit_util.h).
This is mostly to break up cyclic dependencies, but it also breaks off a good
chunk of the catch-all-ness of util, which is nice.
Convert the nrequests field to be partially derived, and the curlextents
to be fully derived, in order to reduce the number of stats updates
needed during common operations.
This change affects ndalloc stats during arena reset, because it is no
longer possible to cancel out ndalloc effects (curlextents would become
negative).
This introduces a backport of C11 atomics. It has four implementations; ranked
in order of preference, they are:
- GCC/Clang __atomic builtins
- GCC/Clang __sync builtins
- MSVC _Interlocked builtins
- C11 atomics, from <stdatomic.h>
The primary advantages are:
- Close adherence to the standard API gives us a defined memory model.
- Type safety: atomic objects are now separate types from non-atomic ones, so
that it's impossible to mix up atomic and non-atomic updates (which is
undefined behavior that compilers are starting to take advantage of).
- Efficiency: we can specify ordering for operations, avoiding fences and
atomic operations on strongly ordered architectures (example:
`atomic_write_u32(ptr, val);` involves a CAS loop, whereas
`atomic_store(ptr, val, ATOMIC_RELEASE);` is a plain store.
This diff leaves in the current atomics API (implementing them in terms of the
backport). This lets us transition uses over piecemeal.
Testing:
This is by nature hard to test. I've manually tested the first three options on
Linux on gcc by futzing with the #defines manually, on freebsd with gcc and
clang, on MSVC, and on OS X with clang. All of these were x86 machines though,
and we don't have any test infrastructure set up for non-x86 platforms.
In the long term, we'll transition to C99-style inline semantics. In the
short-term, this will allow both styles to coexist without breaking one another.
This avoids signed/unsigned comparison warnings when specifying integer
constants as inputs.
Clean up whitespace and add clarifying parentheses for
CONF_HANDLE_SIZE_T(opt_lg_chunk, ...).
Detect whether chunks start off as THP-capable by default (according to
the state of /sys/kernel/mm/transparent_hugepage/enabled), and use this
as the basis for whether to call pages_nohuge() once per chunk during
first purge of any of the chunk's page runs.
Add the --disable-thp configure option, as well as the the opt.thp
mallctl.
This resolves#541.
Convert CFLAGS to be a concatenation:
CFLAGS := CONFIGURE_CFLAGS SPECIFIED_CFLAGS EXTRA_CFLAGS
This ordering makes it possible to override the flags set by the
configure script both during and after configuration, with CFLAGS and
EXTRA_CFLAGS, respectively.
This resolves#619.
When multiple threads calling stats_print, race could happen as we read the
counters in separate mallctl calls; and the removed assertion could fail when
other operations happened in between the mallctl calls. For simplicity, output
"race" in the utilization field in this case.
This resolves#616.
Fix lg_chunk clamping to take into account cache-oblivious large
allocation. This regression only resulted in incorrect behavior if
!config_fill (false unless --disable-fill specified) and
config_cache_oblivious (true unless --disable-cache-oblivious
specified).
This regression was introduced by
8a03cf039c (Implement cache index
randomization for large allocations.), which was first released in
4.0.0.
This resolves#555.
Remove obsolete unit test scaffolding for extent quantization. Remove
redundant assertions. Add an assertion to
extents_first_best_fit_locked() that should help prevent aligned
allocation regressions.
Implement and test a JSON validation parser. Use the parser to validate
JSON output from malloc_stats_print(), with a significant subset of
supported output options.
This resolves#583.
Fix chunk_alloc_dss() to account for bytes that are not a multiple of
the chunk size. This regression was introduced by
e2bcf037d4 (Make dss operations
lockless.), which was first released in 4.3.0.
We don't touch witness at all when config_debug == false. Let's only pay the
memory cost in malloc_mutex_s when needed. Note that when !config_debug, we keep
the field in a union so that we don't have to do #ifdefs in multiple places.
In some cases the prof machinery allocates (in order to modify the
bt2gctx hash table), and such operations are synchronized via
bt2gctx_mtx. Rather than asserting that no locks are held on entry
into functions that may call prof_gdump(), make the weaker assertion
that no "core" locks are held. The prof machinery enqueues dumps
triggered by prof_gdump() calls when bt2gctx_mtx is held, so this
weakened assertion avoids false failures in such cases.
This fixes interactions with witness_assert_depth[_to_rank](), which was
added in dad74bd3c8 (Convert
witness_assert_lockless() to witness_assert_lock_depth().).
malloc_conf does not reliably work with MSVC, which complains of
"inconsistent dll linkage", i.e. its inability to support the
application overriding malloc_conf when dynamically linking/loading.
Work around this limitation by adding test harness support for per test
shell script sourcing, and converting all tests to use MALLOC_CONF
instead of malloc_conf.
Synchronize tcaches with tcaches_mtx rather than ctl_mtx. Add missing
synchronization for tcache flushing. This bug was introduced by
1cb181ed63 (Implement explicit tcache
support.), which was first released in 4.0.0.
malloc_conf does not reliably work with MSVC, which complains of
"inconsistent dll linkage", i.e. its inability to support the
application overriding malloc_conf when dynamically linking/loading.
Work around this limitation by adding test harness support for per test
shell script sourcing, and converting all tests to use MALLOC_CONF
instead of malloc_conf.