Added opt.background_thread to enable background threads, which handles purging
currently. When enabled, decay ticks will not trigger purging (which will be
left to the background threads). We limit the max number of threads to NCPUs.
When percpu arena is enabled, set CPU affinity for the background threads as
well.
The sleep interval of background threads is dynamic and determined by computing
number of pages to purge in the future (based on backlog).
Rather than using a manually maintained list of internal symbols to
drive name mangling, add a compilation phase to automatically extract
the list of internal symbols.
This resolves#677.
Add the extent_destroy_t extent destruction hook to extent_hooks_t, and
use it during arena destruction. This hook explicitly communicates to
the callee that the extent must be destroyed or tracked for later reuse,
lest it be permanently leaked. Prior to this change, retained extents
could unintentionally be leaked if extent retention was enabled.
This resolves#560.
Control use of munmap(2) via a run-time option rather than a
compile-time option (with the same per platform default). The old
behavior of --disable-munmap can be achieved with
--with-malloc-conf=munmap:false.
This partially resolves#580.
The explicit compiler warning suppression controlled by this option is
universally desirable, so remove the ability to disable suppression.
This partially resolves#580.
Four size classes per size doubling has proven to be a universally good
choice for the entire 4.x release series, so there's little point to
preserving this configurability.
This partially resolves#580.
This can catch bugs in which one header defines a numeric constant, and another
uses it without including the defining header. Undefined preprocessor symbols
expand to '0', so that this will compile fine, silently doing the math wrong.
Continue to use ivsalloc() when --enable-debug is specified (and add
assertions to guard against 0 size), but stop providing a documented
explicit semantics-changing band-aid to dodge undefined behavior in
sallocx() and malloc_usable_size(). ivsalloc() remains compiled in,
unlike when #211 restored --enable-ivsalloc, and if
JEMALLOC_FORCE_IVSALLOC is defined during compilation, sallocx() and
malloc_usable_size() will still use ivsalloc().
This partially resolves#580.
Simplify configuration by removing the --disable-tcache option, but
replace the testing for that configuration with
--with-malloc-conf=tcache:false.
Fix the thread.arena and thread.tcache.flush mallctls to work correctly
if tcache is disabled.
This partially resolves#580.
This is a biggy. jemalloc_internal.h has been doing multiple jobs for a while
now:
- The source of system-wide definitions.
- The catch-all include file.
- The module header file for jemalloc.c
This commit splits up this functionality. The system-wide definitions
responsibility has moved to jemalloc_preamble.h. The catch-all include file is
now jemalloc_internal_includes.h. The module headers for jemalloc.c are now in
jemalloc_internal_[externs|inlines|types].h, just as they are for the other
modules.
Hyper-threaded CPUs may need a special instruction inside spin loops in
order to yield to another virtual CPU. The 'pause' instruction that is
available for x86 is not supported on Power.
Apparently the extended mnemonics like yield, mdoio, and mdoom are not
actually implemented on POWER8, although mentioned in the ISA 2.07
document. The recommended magic bits are an 'or 31,31,31'.
The new feature, opt.percpu_arena, determines thread-arena association
dynamically based CPU id. Three modes are supported: "percpu", "phycpu"
and disabled.
"percpu" uses the current core id (with help from sched_getcpu())
directly as the arena index, while "phycpu" will assign threads on the
same physical CPU to the same arena. In other words, "percpu" means # of
arenas == # of CPUs, while "phycpu" has # of arenas == 1/2 * (# of
CPUs). Note that no runtime check on whether hyper threading is enabled
is added yet.
When enabled, threads will be migrated between arenas when a CPU change
is detected. In the current design, to reduce overhead from reading CPU
id, each arena tracks the thread accessed most recently. When a new
thread comes in, we will read CPU id and update arena if necessary.
This introduces a backport of C11 atomics. It has four implementations; ranked
in order of preference, they are:
- GCC/Clang __atomic builtins
- GCC/Clang __sync builtins
- MSVC _Interlocked builtins
- C11 atomics, from <stdatomic.h>
The primary advantages are:
- Close adherence to the standard API gives us a defined memory model.
- Type safety: atomic objects are now separate types from non-atomic ones, so
that it's impossible to mix up atomic and non-atomic updates (which is
undefined behavior that compilers are starting to take advantage of).
- Efficiency: we can specify ordering for operations, avoiding fences and
atomic operations on strongly ordered architectures (example:
`atomic_write_u32(ptr, val);` involves a CAS loop, whereas
`atomic_store(ptr, val, ATOMIC_RELEASE);` is a plain store.
This diff leaves in the current atomics API (implementing them in terms of the
backport). This lets us transition uses over piecemeal.
Testing:
This is by nature hard to test. I've manually tested the first three options on
Linux on gcc by futzing with the #defines manually, on freebsd with gcc and
clang, on MSVC, and on OS X with clang. All of these were x86 machines though,
and we don't have any test infrastructure set up for non-x86 platforms.
Rather than dynamically building a table to aid per level computations,
define a constant table at compile time. Omit both high and low
insignificant bits. Use one to three tree levels, depending on the
number of significant bits.
The SDK jemalloc is built against might be not be the latest for various
reasons, but the resulting binary ought to work on newer versions of
OSX.
In order to ensure this, we need the fullest definitions possible, so
copy what we need from the latest version of malloc/malloc.h available
on opensource.apple.com.
Add the --with-lg-hugepage configure option, but automatically configure
LG_HUGEPAGE even if it isn't specified.
Add the pages_[no]huge() functions, which toggle huge page state via
madvise(..., MADV_[NO]HUGEPAGE) calls.
Convert CFLAGS/CXXFLAGS to be concatenations:
CFLAGS := CONFIGURE_CFLAGS SPECIFIED_CFLAGS EXTRA_CFLAGS
CXXFLAGS := CONFIGURE_CXXFLAGS SPECIFIED_CXXFLAGS EXTRA_CXXFLAGS
This ordering makes it possible to override the flags set by the
configure script both during and after configuration, with
CFLAGS/CXXFLAGS and EXTRA_CFLAGS/EXTRA_CXXFLAGS, respectively.
This resolves#504.
Adds cpp bindings for jemalloc, along with necessary autoconf settings.
This is mostly to add sized deallocation support, which can't be added
from C directly. Sized deallocation is ~10% microbench improvement.
* Import ax_cxx_compile_stdcxx.m4 from the autoconf repo, seems like the
easiest way to get c++14 detection.
* Adds various other changes, like CXXFLAGS, to configure.ac.
* Adds new rules to Makefile.in for src/jemalloc-cpp.cpp, and a basic
unittest.
* Both new and delete are overridden, to ensure jemalloc is used for
both.
* TODO future enhancement of avoiding extra PLT thunks for new and
delete - sdallocx and malloc are publicly exported jemalloc symbols,
using an alias would link them directly. Unfortunately, was having
trouble getting it to play nice with jemalloc's namespace support.
Testing:
Tested gcc 4.8, gcc 5, gcc 5.2, clang 4.0. Only gcc >= 5 has sized
deallocation support, verified that the rest build correctly.
Tested mac osx and Centos.
Tested --with-jemalloc-prefix and --without-export.
This resolves#202.
The core issue here is the weak linking of the symbol, and in certain
environments--for instance, using the latest Xcode (8.1) with the latest
SDK (10.12)--os_unfair_lock may resolve even though you're compiling on
a host that doesn't support it (10.11).
We can use the availability macros to circumvent this problem, and
detect that we're not compiling for a target that is going to support
them and error out at compile time. The other alternative is to do a
runtime check, but that presents issues for cross-compiling.
Some versions of Android provide a pthreads library without providing
pthread_atfork(), so in practice a separate feature test is necessary
for the latter.
Add feature tests for the MADV_FREE and MADV_DONTNEED flags to
madvise(2), so that MADV_FREE is detected and used for Linux kernel
versions 4.5 and newer. Refactor pages_purge() so that on systems which
support both flags, MADV_FREE is preferred over MADV_DONTNEED.
This resolves#387.
The raw clock variant is slow (even relative to plain CLOCK_MONOTONIC),
whereas the coarse clock variant is faster than CLOCK_MONOTONIC, but
still has resolution (~1ms) that is adequate for our purposes.
This resolves#479.
Conditionalize use of --whole-archive on the platform plus compiler,
rather than on the ABI. This fixes a regression caused by
7b24c6e557 (Use --whole-archive when
linking integration tests on MinGW.).
This reverts 13473c7c66, which was
intended to work around bootstrapping issues when linking statically.
However, this actually causes problems in various other configurations,
so this reversion may force a future fix for the underlying problem, if
it still exists.
Add missing #include <time.h>. The critical time facilities appear to
have been transitively included via unistd.h and sys/time.h, but in
principle this omission was capable of having caused
clock_gettime(CLOCK_MONOTONIC, ...) to have been overlooked in favor of
gettimeofday(), which in turn could cause spurious non-monotonic time
updates.
Refactor nstime_get() out of nstime_update() and add configure tests for
all variants.
Add CLOCK_MONOTONIC_RAW support (Linux-specific) and
mach_absolute_time() support (OS X-specific).
Do not fall back to clock_gettime(CLOCK_REALTIME, ...). This was a
fragile Linux-specific workaround, which we're unlikely to use at all
now that clock_gettime(CLOCK_MONOTONIC_RAW, ...) is supported, and if we
have no choice besides non-monotonic clocks, gettimeofday() is only
incrementally worse.
In 1167e9e, I accidentally tested je_cv_gcc_builtin_ffsl instead of
je_cv_gcc_builtin_unreachable (copy-paste error), which meant that
JEMALLOC_INTERNAL_UNREACHABLE was always getting defined as abort even if
__builtin_unreachable support was detected.
Cray is pretty warning-happy, so disable ones that aren't helpful. Each warning
has a numeric value instead of having named flags to disable specific warnings.
Disable warnings 128 and 1357.
128: Ignore unreachable code warning. Cray warns about `not_reached()` not
being reachable in a couple of places because it detects that some loops
will never terminate.
1357: Ignore warning about redefinition of malloc and friends
With this patch, Cray 8.4.0 and 8.5.1 build cleanly and pass `make check`
Cray uses -herror_on_warning instead of -Werror. Use it everywhere -Werror is
currently used for __attribute__ checks so configure actually detects they're
not supported.
Cray only supports `-M` for generating dependency files. It does not support
`-MM` or `-MT`, so don't try to use them. I just reused the existing mechanism
for turning auto-dependency generation off (`CC_MM=`), but it might be more
principled to add a configure test to check if the compiler supports `-MM` and
`-MT`, instead of manually tracking which compilers don't support those flags.
Get jemalloc building and passing `make check_unit` with cray 8.4. An inlining
bug in 8.4 results in internal errors while trying to build jemalloc. This has
already been reported and fixed for the 8.5 release.
In order to work around the inlining bug, disable gnu compatibility and limit
ipa optimizations.
I copied the msvc compiler check for cray, but note that we perform the test
even if we think we're using gcc because cray pretends to be gcc if `-hgnu`
(which is enabled by default) is used. I couldn't come up with a principled way
to check for the inlining bug, so instead I just checked compiler versions.
The build had lots of warnings I need to address and cray doesn't support -MM
or -MT for dependency tracking, so I had to do `make CC_MM=`.
Add a configure check for __builtin_unreachable instead of basing its
availability on the __GNUC__ version. On OS X using gcc (a real gcc, not the
bundled version that's just a gcc front-end) leads to a linker assertion:
https://github.com/jemalloc/jemalloc/issues/266
It turns out that this is caused by a gcc bug resulting from the use of
__builtin_unreachable():
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57438
To work around this bug, check that __builtin_unreachable() actually works at
configure time, and if it doesn't use abort() instead. The check is based on
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57438#c21.
With this `make check` passes with a homebrew installed gcc-5 and gcc-6.
The Cray compiler wrappers will often add `-lrt` to the base compiler with
`-static` linking (the default at most sites.) However, `-lrt` isn't
automatically added with `-dynamic`. This means that if jemalloc was built with
`-static`, but then used in a program with `-dynamic` jemalloc won't have
detected that librt is a dependency.
The integration and stress tests use -dynamic, which is causing undefined
references to clock_gettime().
This just adds an extra check for librt (ignoring the autoconf cache) with
`-dynamic` thrown. It also stops filtering librt from the integration tests.
With this `make check` passes for:
- PrgEnv-gnu
- PrgEnv-intel
- PrgEnv-pgi
PrgEnv-cray still needs more work (will be in a separate patch.)
Cray systems come with compiler wrappers to simplify building parallel
applications. CC is the C++ wrapper, and cc is the C wrapper.
The wrappers call the base {Cray, Intel, PGI, or GNU} compiler with vendor
specific flags. The "Programming Environment" (prgenv) that's currently loaded
determines the base compiler. e.g. compiling with gnu looks something like:
module load PrgEnv-gnu
cc hello.c
On most systems the wrappers defaults to `-static` mode, which causes them to
only look for static libraries, and not for any dynamic ones (even if the
dynamic version was explicitly listed.)
The integration and stress tests expect to be using the .so, so we have to run
the with -dynamic so that wrapper will find/use the .so.
If the OS overcommits:
- Commit all mappings in pages_map() regardless of whether the caller
requested committed memory.
- Linux-specific: Specify MAP_NORESERVE to avoid
unfortunate interactions with heuristic overcommit mode during
fork(2).
This resolves#193.
Don't assume Bourne shell is in /bin/sh when running size_classes.sh .
Consider __sparcv9 a synonym for __sparc64__ when defining LG_QUANTUM.
This resolves#275.
Fix TLS configuration such that it is enabled by default for platforms
on which it works correctly. This regression was introduced by
ac5db02034 (Make --enable-tls and
--enable-lazy-lock take precedence over configure.ac-hardcoded
defaults).
Replace JEMALLOC_ATTR(format(printf, ...). with
JEMALLOC_FORMAT_PRINTF(), so that configuration feature tests can
omit the attribute if it would cause extraneous compilation warnings.
Add various function attributes to the exported functions to give the
compiler more information to work with during optimization, and also
specify throw() when compiling with C++ on Linux, in order to adequately
match what __THROW does in glibc.
This resolves#237.
Fix size class overflow handling for malloc(), posix_memalign(),
memalign(), calloc(), and realloc() when profiling is enabled.
Remove an assertion that erroneously caused arena_sdalloc() to fail when
profiling was enabled.
This resolves#232.
Extract szad size quantization into {extent,run}_quantize(), and .
quantize szad run sizes to the union of valid small region run sizes and
large run sizes.
Refactor iteration in arena_run_first_fit() to use
run_quantize{,_first,_next(), and add support for padded large runs.
For large allocations that have no specified alignment constraints,
compute a pseudo-random offset from the beginning of the first backing
page that is a multiple of the cache line size. Under typical
configurations with 4-KiB pages and 64-byte cache lines this results in
a uniform distribution among 64 page boundary offsets.
Add the --disable-cache-oblivious option, primarily intended for
performance testing.
This resolves#13.
This rename avoids installation collisions with the upstream gperftools.
Additionally, jemalloc's per thread heap profile functionality
introduced an incompatible file format, so it's now worthwhile to
clearly distinguish jemalloc's version of this script from the upstream
version.
This resolves#229.
under some compiler (gcc 4.8.4 in particular), the auto-detection of TLS
don't work properly.
force tls to be disabled.
the testsuite pass under gcc (4.8.4) and gcc (4.2.1)
However, unlike before it was removed do not force --enable-ivsalloc
when Darwin zone allocator integration is enabled, since the zone
allocator code uses ivsalloc() regardless of whether
malloc_usable_size() and sallocx() do.
This resolves#211.
Migrate all centralized data structures related to huge allocations and
recyclable chunks into arena_t, so that each arena can manage huge
allocations and recyclable virtual memory completely independently of
other arenas.
Add chunk node caching to arenas, in order to avoid contention on the
base allocator.
Use chunks_rtree to look up huge allocations rather than a red-black
tree. Maintain a per arena unsorted list of huge allocations (which
will be needed to enumerate huge allocations during arena reset).
Remove the --enable-ivsalloc option, make ivsalloc() always available,
and use it for size queries if --enable-debug is enabled. The only
practical implications to this removal are that 1) ivsalloc() is now
always available during live debugging (and the underlying radix tree is
available during core-based debugging), and 2) size query validation can
no longer be enabled independent of --enable-debug.
Remove the stats.chunks.{current,total,high} mallctls, and replace their
underlying statistics with simpler atomically updated counters used
exclusively for gdump triggering. These statistics are no longer very
useful because each arena manages chunks independently, and per arena
statistics provide similar information.
Simplify chunk synchronization code, now that base chunk allocation
cannot cause recursive lock acquisition.
It often happens that code changes introduce mixed declarations, that then
break building with Visual Studio. Since the code style is to not use
mixed declarations anyways, we might as well enforce it with -Werror.
Add:
--with-lg-page
--with-lg-page-sizes
--with-lg-size-class-group
--with-lg-quantum
Get rid of STATIC_PAGE_SHIFT, in favor of directly setting LG_PAGE.
Fix various edge conditions exposed by the configure options.
Revert 6716aa8352 (Force use of TLS if
heap profiling is enabled.). No existing tests indicate that this is
necessary, nor does code inspection uncover any potential issues. Most
likely the original commit covered up a bug related to tsd-internal
allocation that has since been fixed.
It has an unused variable, so it was always failing (at least with gcc
4.9.1). Alternatively, the `-Werror` flag could be removed if it isn't
strictly necessary.
This adds a new `sdallocx` function to the external API, allowing the
size to be passed by the caller. It avoids some extra reads in the
thread cache fast path. In the case where stats are enabled, this
avoids the work of calculating the size from the pointer.
An assertion validates the size that's passed in, so enabling debugging
will allow users of the API to debug cases where an incorrect size is
passed in.
The performance win for a contrived microbenchmark doing an allocation
and immediately freeing it is ~10%. It may have a different impact on a
real workload.
Closes#28
Relax the "are we in a git repo?" check to succeed even if the top level
jemalloc directory is not at the top level of the git repo.
Add git tag filtering so that only version triplets match when
generating VERSION.
Add fallback bogus VERSION creation, so that in the worst case, rather
than generating empty values for e.g. JEMALLOC_VERSION_MAJOR,
configuration ends up generating useless constants.
__*_hook() is glibc, but on at least one glibc platform (homebrew),
the __GLIBC__ define isn't set correctly and we miss being able to
use these hooks.
Do a feature test for it during configuration so that we enable it
anywhere the hooks are actually available.
On Windows, srcroot may start with "drive:", which confuses autoconf's
AC_CONFIG_* macros. The macros works equally well without ${srcroot},
provided some adjustment to Makefile.in.
Now this code is more portable and now people can use faster shells than
Bash such as Dash.
To use a faster shell with autoconf set the CONFIG_SHELL environment
variable to the shell and run the configure script with the shell.
When building with -O0, GCC doesn't use builtins for ffs and ffsl calls,
and uses library function calls instead. But the Android NDK doesn't have
those functions exported from any library, leading to build failure.
However, using __builtin_ffs* uses the builtin inlines.
Some platforms, such as Google's Portable Native Client, use Newlib and
thus lack access to madvise(2). In those instances, pages_purge() is
transformed into a no-op.
Some platforms (like those using Newlib) don't have ffs/ffsl. This
commit adds a check to configure.ac for __builtin_ffsl if ffsl isn't
found. __builtin_ffsl performs the same function as ffsl, and has the
added benefit of being available on any platform utilizing
Gcc-compatible compiler.
This change does not address the used of ffs in the MALLOCX_ARENA()
macro.
Add size class computation capability, currently used only as validation
of the size class lookup tables. Generalize the size class spacing used
for bins, for eventual use throughout the full range of allocation
sizes.
Refactor huge allocation to be managed by arenas (though the global
red-black tree of huge allocations remains for lookup during
deallocation). This is the logical conclusion of recent changes that 1)
made per arena dss precedence apply to huge allocation, and 2) made it
possible to replace the per arena chunk allocation/deallocation
functions.
Remove the top level huge stats, and replace them with per arena huge
stats.
Normalize function names and types to *dalloc* (some were *dealloc*).
Remove the --enable-mremap option. As jemalloc currently operates, this
is a performace regression for some applications, but planned work to
logarithmically space huge size classes should provide similar amortized
performance. The motivation for this change was that mremap-based huge
reallocation forced leaky abstractions that prevented refactoring.
Make dss non-optional on all platforms which support sbrk(2).
Fix the "arena.<i>.dss" mallctl to return an error if "primary" or
"secondary" precedence is specified, but sbrk(2) is not supported.
Remove autoconf code that explicitly disabled libgcc-based backtracing
on i[3456]86. There is no mention of which platforms/compilers
exhibited problems when this code was added, and chances are good that
any gcc toolchain issues have long since been fixed.
Specify -fno-omit-frame-pointer when using __builtin_frame_address() and
__builtin_return_address() for backtracing. This fixes backtracing
failures on e.g. i686 for optimized builds.
The hash code, which has MurmurHash3 at its core, generates different
output depending on system endianness, so adapt the expected output on
big-endian systems. MurmurHash3 code also makes the assumption that
unaligned access is okay (not true on all systems), but jemalloc only
hashes data structures that have sufficient alignment to dodge this
limitation.
Make sure that emmintrin.h can be #include'd without causing a
compilation error, rather than blindly defining HAVE_SSE2 based on
architecture. Attempts to force SSE2 compilation on a 32-bit Ubuntu
13.10 system running as a VMware guest resulted in a no-win choice
without any obvious explanation besides toolchain misconfiguration/bug:
- Suffer compilation failure due to __MMX__, __SSE__, and __SSE2__ not
being defined, even if -mmmx, -msse, and -msse2 are manually
specified (note that they appear to be enabled by default).
- Manually define __MMX__, __SSE__, and __SSE2__, and suffer compiler
warnings that they are already automatically defined. This results in
successful compilation and execution, but the noise is intolerable.