Adds cpp bindings for jemalloc, along with necessary autoconf settings.
This is mostly to add sized deallocation support, which can't be added
from C directly. Sized deallocation is ~10% microbench improvement.
* Import ax_cxx_compile_stdcxx.m4 from the autoconf repo, seems like the
easiest way to get c++14 detection.
* Adds various other changes, like CXXFLAGS, to configure.ac.
* Adds new rules to Makefile.in for src/jemalloc-cpp.cpp, and a basic
unittest.
* Both new and delete are overridden, to ensure jemalloc is used for
both.
* TODO future enhancement of avoiding extra PLT thunks for new and
delete - sdallocx and malloc are publicly exported jemalloc symbols,
using an alias would link them directly. Unfortunately, was having
trouble getting it to play nice with jemalloc's namespace support.
Testing:
Tested gcc 4.8, gcc 5, gcc 5.2, clang 4.0. Only gcc >= 5 has sized
deallocation support, verified that the rest build correctly.
Tested mac osx and Centos.
Tested --with-jemalloc-prefix and --without-export.
This resolves#202.
Fix a fundamental extent_split_wrapper() bug in an error path.
Fix extent_recycle() to deregister unsplittable extents before leaking
them.
Relax xallocx() test assertions so that unsplittable extents don't cause
test failures.
With the removal of subchunk size class infrastructure, there are no
large size classes that are guaranteed to be re-expandable in place
unless munmap() is disabled. Work around these legitimate failures with
rallocx() fallback calls. If there were no test configuration for which
the xallocx() calls succeeded, it would be important to override the
extent hooks for testing purposes, but by default these tests don't use
the rallocx() fallbacks on Linux, so test coverage is still sufficient.
Depending on virtual memory resource limits, it is necessary to attempt
allocating three maximally sized objects to trigger OOM rather than just
two, since the maximum supported size is slightly less than half the
total virtual memory address space.
This fixes a test failure that was introduced by
0c516a00c4 (Make *allocx() size class
overflow behavior defined.).
This resolves#379.
Add (size_t) casts to MALLOCX_ALIGN() macros so that passing the integer
constant 0x80000000 does not cause a compiler warning about invalid
shift amount.
This resolves#354.
Modify xallocx() tests that expect to expand in place to use a separate
arena. This avoids the potential for interposed internal allocations
from e.g. heap profile sampling to disrupt the tests.
This resolves#286.
Zero all trailing bytes of large allocations when
--enable-cache-oblivious configure option is enabled. This regression
was introduced by 8a03cf039c (Implement
cache index randomization for large allocations.).
Zero trailing bytes of huge allocations when resizing from/to a size
class that is not a multiple of the chunk size.
When junk filling is enabled, shrinking an allocation fills the bytes
that were previously allocated but now aren't. Purging the chunk before
doing that is just a waste of time.
This resolves#260.
Cascade from decommit to purge when purging unused dirty pages, so that
it is possible to decommit cleaned memory rather than just purging. For
non-Windows debug builds, decommit runs rather than purging them, since
this causes access of deallocated runs to segfault.
This resolves#251.
Add the "arena.<i>.chunk_hooks" mallctl, which replaces and expands on
the "arena.<i>.chunk.{alloc,dalloc,purge}" mallctls. The chunk hooks
allow control over chunk allocation/deallocation, decommit/commit,
purging, and splitting/merging, such that the application can rely on
jemalloc's internal chunk caching and retaining functionality, yet
implement a variety of chunk management mechanisms and policies.
Merge the chunks_[sz]ad_{mmap,dss} red-black trees into
chunks_[sz]ad_retained. This slightly reduces how hard jemalloc tries
to honor the dss precedence setting; prior to this change the precedence
setting was also consulted when recycling chunks.
Fix chunk purging. Don't purge chunks in arena_purge_stashed(); instead
deallocate them in arena_unstash_purged(), so that the dirty memory
linkage remains valid until after the last time it is used.
This resolves#176 and #201.
Fix huge_ralloc_no_move() to succeed if an allocation request results in
the same usable size as the existing allocation, even if the request
size is smaller than the usable size. This bug did not cause
correctness issues, but it could cause unnecessary moves during
reallocation.
Fix size class overflow handling for malloc(), posix_memalign(),
memalign(), calloc(), and realloc() when profiling is enabled.
Remove an assertion that erroneously caused arena_sdalloc() to fail when
profiling was enabled.
This resolves#232.
Add mallctls:
- arenas.lg_dirty_mult is initialized via opt.lg_dirty_mult, and can be
modified to change the initial lg_dirty_mult setting for newly created
arenas.
- arena.<i>.lg_dirty_mult controls an individual arena's dirty page
purging threshold, and synchronously triggers any purging that may be
necessary to maintain the constraint.
- arena.<i>.chunk.purge allows the per arena dirty page purging function
to be replaced.
This resolves#93.
This regression was introduced by
88fef7ceda (Refactor huge_*() calls into
arena internals.), and went undetected because of the --enable-debug
regression.
This adds support for expanding huge allocations in-place by requesting
memory at a specific address from the chunk allocator.
It's currently only implemented for the chunk recycling path, although
in theory it could also be done by optimistically allocating new chunks.
On Linux, it could attempt an in-place mremap. However, that won't work
in practice since the heap is grown downwards and memory is not unmapped
(in a normal build, at least).
Repeated vector reallocation micro-benchmark:
#include <string.h>
#include <stdlib.h>
int main(void) {
for (size_t i = 0; i < 100; i++) {
void *ptr = NULL;
size_t old_size = 0;
for (size_t size = 4; size < (1 << 30); size *= 2) {
ptr = realloc(ptr, size);
if (!ptr) return 1;
memset(ptr + old_size, 0xff, size - old_size);
old_size = size;
}
free(ptr);
}
}
The glibc allocator fails to do any in-place reallocations on this
benchmark once it passes the M_MMAP_THRESHOLD (default 128k) but it
elides the cost of copies via mremap, which is currently not something
that jemalloc can use.
With this improvement, jemalloc still fails to do any in-place huge
reallocations for the first outer loop, but then succeeds 100% of the
time for the remaining 99 iterations. The time spent doing allocations
and copies drops down to under 5%, with nearly all of it spent doing
purging + faulting (when huge pages are disabled) and the array memset.
An improved mremap API (MREMAP_RETAIN - #138) would be far more general
but this is a portable optimization and would still be useful on Linux
for xallocx.
Numbers with transparent huge pages enabled:
glibc (copies elided via MREMAP_MAYMOVE): 8.471s
jemalloc: 17.816s
jemalloc + no-op madvise: 13.236s
jemalloc + this commit: 6.787s
jemalloc + this commit + no-op madvise: 6.144s
Numbers with transparent huge pages disabled:
glibc (copies elided via MREMAP_MAYMOVE): 15.403s
jemalloc: 39.456s
jemalloc + no-op madvise: 12.768s
jemalloc + this commit: 15.534s
jemalloc + this commit + no-op madvise: 6.354s
Closes#137
This adds a new `sdallocx` function to the external API, allowing the
size to be passed by the caller. It avoids some extra reads in the
thread cache fast path. In the case where stats are enabled, this
avoids the work of calculating the size from the pointer.
An assertion validates the size that's passed in, so enabling debugging
will allow users of the API to debug cases where an incorrect size is
passed in.
The performance win for a contrived microbenchmark doing an allocation
and immediately freeing it is ~10%. It may have a different impact on a
real workload.
Closes#28
Refactor huge allocation to be managed by arenas (though the global
red-black tree of huge allocations remains for lookup during
deallocation). This is the logical conclusion of recent changes that 1)
made per arena dss precedence apply to huge allocation, and 2) made it
possible to replace the per arena chunk allocation/deallocation
functions.
Remove the top level huge stats, and replace them with per arena huge
stats.
Normalize function names and types to *dalloc* (some were *dealloc*).
Remove the --enable-mremap option. As jemalloc currently operates, this
is a performace regression for some applications, but planned work to
logarithmically space huge size classes should provide similar amortized
performance. The motivation for this change was that mremap-based huge
reallocation forced leaky abstractions that prevented refactoring.