The raw clock variant is slow (even relative to plain CLOCK_MONOTONIC),
whereas the coarse clock variant is faster than CLOCK_MONOTONIC, but
still has resolution (~1ms) that is adequate for our purposes.
This resolves#479.
glibc defines its malloc implementation with several weak and strong
symbols:
strong_alias (__libc_calloc, __calloc) weak_alias (__libc_calloc, calloc)
strong_alias (__libc_free, __cfree) weak_alias (__libc_free, cfree)
strong_alias (__libc_free, __free) strong_alias (__libc_free, free)
strong_alias (__libc_malloc, __malloc) strong_alias (__libc_malloc, malloc)
The issue is not with the weak symbols, but that other parts of glibc
depend on __libc_malloc explicitly. Defining them in terms of jemalloc
API's allows the linker to drop glibc's malloc.o completely from the link,
and static linking no longer results in symbol collisions.
Another wrinkle: jemalloc during initialization calls sysconf to
get the number of CPU's. GLIBC allocates for the first time before
setting up isspace (and other related) tables, which are used by
sysconf. Instead, use the pthread API to get the number of
CPUs with GLIBC, which seems to work.
This resolves#442.
Rather than protecting dss operations with a mutex, use atomic
operations. This has negligible impact on synchronization overhead
during typical dss allocation, but is a substantial improvement for
chunk_in_dss() and the newly added chunk_dss_mergeable(), which can be
called multiple times during chunk deallocations.
This change also has the advantage of avoiding tsd in deallocation paths
associated with purging, which resolves potential deadlocks during
thread exit due to attempted tsd resurrection.
This resolves#425.
Add spin_t and spin_{init,adaptive}(), which provide a simple
abstraction for adaptive spinning.
Adaptively spin during busy waits in bootstrapping and rtree node
initialization.
Simplify decay-based purging attempts to only be triggered when the
epoch is advanced, rather than every time purgeable memory increases.
In a correctly functioning system (not previously the case; see below),
this only causes a behavior difference if during subsequent purge
attempts the least recently used (LRU) purgeable memory extent is
initially too large to be purged, but that memory is reused between
attempts and one or more of the next LRU purgeable memory extents are
small enough to be purged. In practice this is an arbitrary behavior
change that is within the set of acceptable behaviors.
As for the purging fix, assure that arena->decay.ndirty is recorded
*after* the epoch advance and associated purging occurs. Prior to this
fix, it was possible for purging during epoch advance to cause a
substantially underrepresentative (arena->ndirty - arena->decay.ndirty),
i.e. the number of dirty pages attributed to the current epoch was too
low, and a series of unintended purges could result. This fix is also
relevant in the context of the simplification described above, but the
bug's impact would be limited to over-purging at epoch advances.
Instead, move the epoch backward in time. Additionally, add
nstime_monotonic() and use it in debug builds to assert that time only
goes backward if nstime_update() is using a non-monotonic time source.
Add missing #include <time.h>. The critical time facilities appear to
have been transitively included via unistd.h and sys/time.h, but in
principle this omission was capable of having caused
clock_gettime(CLOCK_MONOTONIC, ...) to have been overlooked in favor of
gettimeofday(), which in turn could cause spurious non-monotonic time
updates.
Refactor nstime_get() out of nstime_update() and add configure tests for
all variants.
Add CLOCK_MONOTONIC_RAW support (Linux-specific) and
mach_absolute_time() support (OS X-specific).
Do not fall back to clock_gettime(CLOCK_REALTIME, ...). This was a
fragile Linux-specific workaround, which we're unlikely to use at all
now that clock_gettime(CLOCK_MONOTONIC_RAW, ...) is supported, and if we
have no choice besides non-monotonic clocks, gettimeofday() is only
incrementally worse.
Use pszind_t size classes rather than szind_t size classes, and always
reserve space for NPSIZES elements. This removes unused heaps that are
not multiples of the page size, and adds (currently) unused heaps for
all huge size classes, with the immediate benefit that the size of
arena_t allocations is constant (no longer dependent on chunk size).
These compute size classes and indices similarly to size2index(),
index2size() and s2u(), respectively, but using the subset of size
classes that are multiples of the page size. Note that pszind_t and
szind_t are not interchangeable.
GCC 4.9.3 cross-compiled for sparc64 defines __sparc_v9__, not
__sparc64__ nor __sparcv9. This prevents LG_QUANTUM from being defined
properly. Adding this new value to the check solves the issue.
Some bug (either in the red-black tree code, or in the pgi compiler) seems to
cause red-black trees to become unbalanced. This issue seems to go away if we
don't use compact red-black trees. Since red-black trees don't seem to be used
much anymore, I opted for what seems to be an easy fix here instead of digging
in and trying to find the root cause of the bug.
Some context in case it's helpful:
I experienced a ton of segfaults while using pgi as Chapel's target compiler
with jemalloc 4.0.4. The little bit of debugging I did pointed me somewhere
deep in red-black tree manipulation, but I didn't get a chance to investigate
further. It looks like 4.2.0 replaced most uses of red-black trees with
pairing-heaps, which seems to avoid whatever bug I was hitting.
However, `make check_unit` was still failing on the rb test, so I figured the
core issue was just being masked. Here's the `make check_unit` failure:
```sh
=== test/unit/rb ===
test_rb_empty: pass
tree_recurse:test/unit/rb.c:90: Failed assertion: (((_Bool) (((uintptr_t) (left_node)->link.rbn_right_red) & ((size_t)1)))) == (false) --> true != false: Node should be black
test_rb_random:test/unit/rb.c:274: Failed assertion: (imbalances) == (0) --> 1 != 0: Tree is unbalanced
tree_recurse:test/unit/rb.c:90: Failed assertion: (((_Bool) (((uintptr_t) (left_node)->link.rbn_right_red) & ((size_t)1)))) == (false) --> true != false: Node should be black
test_rb_random:test/unit/rb.c:274: Failed assertion: (imbalances) == (0) --> 1 != 0: Tree is unbalanced
node_remove:test/unit/rb.c:190: Failed assertion: (imbalances) == (0) --> 2 != 0: Tree is unbalanced
<jemalloc>: test/unit/rb.c:43: Failed assertion: "pathp[-1].cmp < 0"
test/test.sh: line 22: 12926 Aborted
Test harness error
```
While starting to debug I saw the RB_COMPACT option and decided to check if
turning that off resolved the bug. It seems to have fixed it (`make check_unit`
passes and the segfaults under Chapel are gone) so it seems like on okay
work-around. I'd imagine this has performance implications for red-black trees
under pgi, but if they're not going to be used much anymore it's probably not a
big deal.
Add a configure check for __builtin_unreachable instead of basing its
availability on the __GNUC__ version. On OS X using gcc (a real gcc, not the
bundled version that's just a gcc front-end) leads to a linker assertion:
https://github.com/jemalloc/jemalloc/issues/266
It turns out that this is caused by a gcc bug resulting from the use of
__builtin_unreachable():
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57438
To work around this bug, check that __builtin_unreachable() actually works at
configure time, and if it doesn't use abort() instead. The check is based on
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57438#c21.
With this `make check` passes with a homebrew installed gcc-5 and gcc-6.
In the case where prof_alloc_prep() is called with an over-estimate of
allocation size, and sampling doesn't end up being triggered, the tctx
must be discarded.
Revert 245ae6036c (Support --with-lg-page
values larger than actual page size.), because it could cause VM map
fragmentation if the kernel grows mmap()ed memory downward.
This resolves#391.
Short-circuit commonly called witness functions so that they only
execute in debug builds, and remove equivalent guards from mutex
functions. This avoids pointless code execution in
witness_assert_lockless(), which is typically called twice per
allocation/deallocation function invocation.
Inline commonly called witness functions so that optimized builds can
completely remove calls as dead code.
b2c0d6322d (Add witness, a simple online
locking validator.) caused a broad propagation of tsd throughout the
internal API, but tsd_fetch() was designed to fail prior to tsd
bootstrapping. Fix this by splitting tsd_t into non-nullable tsd_t and
nullable tsdn_t, and modifying all internal APIs that do not critically
rely on tsd to take nullable pointers. Furthermore, add the
tsd_booted_get() function so that tsdn_fetch() can probe whether tsd
bootstrapping is complete and return NULL if not. All dangerous
conversions of nullable pointers are tsdn_tsd() calls that assert-fail
on invalid conversion.
This is a broader application of optimizations to malloc() and free() in
f4a0f32d34 (Fast-path improvement:
reduce # of branches and unnecessary operations.).
This resolves#321.
If the OS overcommits:
- Commit all mappings in pages_map() regardless of whether the caller
requested committed memory.
- Linux-specific: Specify MAP_NORESERVE to avoid
unfortunate interactions with heuristic overcommit mode during
fork(2).
This resolves#193.
Split arena_choose() into arena_[i]choose() and use arena_ichoose() for
arena lookup during internal allocation. This fixes huge_palloc() so
that it always succeeds during extent node allocation.
This regression was introduced by
66cd953514 (Do not allocate metadata via
non-auto arenas, nor tcaches.).
Change test-related mangling to simplify symbol filtering.
The following commands can be used to detect missing/obsolete symbol
mangling, with the caveat that the full set of symbols is based on the
union of symbols generated by all configurations, some of which are
platform-specific:
./autogen.sh --enable-debug --enable-prof --enable-lazy-lock
make all tests
nm -a lib/libjemalloc.a src/*.jet.o \
|grep " [TDBCR] " \
|awk '{print $3}' \
|sed -e 's/^\(je_\|jet_\(n_\)\?\)\([a-zA-Z0-9_]*\)/\3/g' \
|LC_COLLATE=C sort -u \
|grep -v \
-e '^\(malloc\|calloc\|posix_memalign\|aligned_alloc\|realloc\|free\)$' \
-e '^\(m\|r\|x\|s\|d\|sd\|n\)allocx$' \
-e '^mallctl\(\|nametomib\|bymib\)$' \
-e '^malloc_\(stats_print\|usable_size\|message\)$' \
-e '^\(memalign\|valloc\)$' \
-e '^__\(malloc\|memalign\|realloc\|free\)_hook$' \
-e '^pthread_create$' \
> /tmp/private_symbols.txt
Fix a compilation error that occurs if Valgrind is not enabled. This
regression was caused by b2c0d6322d (Add
witness, a simple online locking validator.).
During over-allocation in preparation for creating aligned mappings,
allocate one more page than necessary if PAGE is the actual page size,
so that trimming still succeeds even if the system returns a mapping
that has less than PAGE alignment. This allows compiling with e.g. 64
KiB "pages" on systems that actually use 4 KiB pages.
Note that for e.g. --with-lg-page=21, it is also necessary to increase
the chunk size (e.g. --with-malloc-conf=lg_chunk:22) so that there are
at least two "pages" per chunk. In practice this isn't a particularly
compelling configuration because so much (unusable) virtual memory is
dedicated to chunk headers.
Refactor ph to support configurable comparison functions. Use a cpp
macro code generation form equivalent to the rb macros so that pairing
heaps can be used for both run heaps and chunk heaps.
Remove per node parent pointers, and instead use leftmost siblings' prev
pointers to track parents.
Fix multi-pass sibling merging to iterate over intermediate results
using a FIFO, rather than a LIFO. Use this fixed sibling merging
implementation for both merge phases of the auxiliary twopass algorithm
(first merging the aux list, then replacing the root with its merged
children). This fixes both degenerate merge behavior and the potential
for deep recursion.
This regression was introduced by
6bafa6678f (Pairing heap).
This resolves#371.
Fix bitmap_sfu() to shift by LG_BITMAP_GROUP_NBITS rather than
hard-coded 6 when using linear (non-USE_TREE) bitmap search. In
practice this affects only 64-bit systems for which sizeof(long) is not
8 (i.e. Windows), since USE_TREE is defined for 32-bit systems.
This regression was caused by b8823ab026
(Use linear scan for small bitmaps).
This resolves#368.
Move chunk_dalloc_arena()'s implementation into chunk_dalloc_wrapper(),
so that if the dalloc hook fails, proper decommit/purge/retain cascading
occurs. This fixes three potential chunk leaks on OOM paths, one during
dss-based chunk allocation, one during chunk header commit (currently
relevant only on Windows), and one during rtree write (e.g. if rtree
node allocation fails).
Merge chunk_purge_arena() into chunk_purge_default() (refactor, no
change to functionality).
The arenas_extend() function was renamed to arenas_init() in commit
8bb3198f72, but its function declaration
was not removed from jemalloc_internal.h.in.