Add HUGE_MAXCLASS overflow checks that are specific to heap profiling
code paths. This fixes test failures that were introduced by
0c516a00c4 (Make *allocx() size class
overflow behavior defined.).
Refactor the arenas array, which contains pointers to all extant arenas,
such that it starts out as a sparse array of maximum size, and use
double-checked atomics-based reads as the basis for fast and simple
arena_get(). Additionally, reduce arenas_lock's role such that it only
protects against arena initalization races. These changes remove the
possibility for arena lookups to trigger locking, which resolves at
least one known (fork-related) deadlock.
This resolves#315.
Fix arena_size arena_new() computation to incorporate
runs_avail_nclasses elements for runs_avail, rather than
(runs_avail_nclasses - 1) elements. Since offsetof(arena_t, runs_avail)
is used rather than sizeof(arena_t) for the first term of the
computation, all of the runs_avail elements must be added into the
second term.
This bug was introduced (by Jason Evans) while merging pull request #330
as 3417a304cc (Separate arena_avail
trees).
Merge of 3417a304cc looks like a small
bug: first_best_fit doesn't scan through all the classes, since ind is
offset from runs_avail_nclasses by run_avail_bias.
Attempt mmap-based in-place huge reallocation by plumbing new_addr into
chunk_alloc_mmap(). This can dramatically speed up incremental huge
reallocation.
This resolves#335.
Separate run trees by index, replacing the previous quantize logic.
Quantization by index is now performed only on insertion / removal from
the tree, and not on node comparison, saving some cpu. This also means
we don't have to dereference the miscelm* pointers, saving half of the
memory loads from miscelms/mapbits that have fallen out of cache. A
linear scan of the indicies appears to be fast enough.
The only cost of this is an extra tree array in each arena.
In practice this bug had limited impact (and then only by increasing
chunk fragmentation) because run_quantize_ceil() returned correct
results except for inputs that could only arise from aligned allocation
requests that required more than page alignment.
This bug existed in the original run quantization implementation, which
was introduced by 8a03cf039c (Implement
cache index randomization for large allocations.).
Use a single uint64_t in nstime_t to store nanoseconds rather than using
struct timespec. This reduces fragility around conversions between long
and uint64_t, especially missing casts that only cause problems on
32-bit platforms.
This is an alternative to the existing ratio-based unused dirty page
purging, and is intended to eventually become the sole purging
mechanism.
Add mallctls:
- opt.purge
- opt.decay_time
- arena.<i>.decay
- arena.<i>.decay_time
- arenas.decay_time
- stats.arenas.<i>.decay_time
This resolves#325.
When using LinuxThreads, malloc bootstrapping deadlocks, since
malloc_tsd_boot0() ends up calling pthread_setspecific(), which causes
recursive allocation. Fix it by moving the malloc_tsd_boot0() call to
malloc_init_hard_recursible().
The deadlock was introduced by 8bb3198f72
(Refactor/fix arenas manipulation.), when tsd_boot() was split and the
top half, tsd_boot0(), got an extra tsd_wrapper_set() call.
- Combine multiple runtime branches into a single malloc_slow check.
- Avoid calling arena_choose / size2index / index2size on fast path.
- A few micro optimizations.
Fix xallocx(..., MALLOCX_ZERO to zero the last full trailing page of
large allocations that have been randomly assigned an offset of 0 when
--enable-cache-oblivious configure option is enabled. This addresses a
special case missed in d260f442ce (Fix
xallocx(..., MALLOCX_ZERO) bugs.).
Zero all trailing bytes of large allocations when
--enable-cache-oblivious configure option is enabled. This regression
was introduced by 8a03cf039c (Implement
cache index randomization for large allocations.).
Zero trailing bytes of huge allocations when resizing from/to a size
class that is not a multiple of the chunk size.
Fix prof_tctx_dump_iter() to filter out nodes that were created after
heap profile dumping started. Prior to this fix, spurious entries with
arbitrary object/byte counts could appear in heap profiles, which
resulted in jeprof inaccuracies or failures.
Simplify imallocx_prof_sample() to always operate on usize rather than
sometimes using size. This avoids redundant usize computations and
more closely fits the style adopted by i[rx]allocx_prof_sample() to fix
sampling bugs.
Fix ixallocx_prof_sample() to never modify nor create sampled small
allocations. xallocx() is in general incapable of moving small
allocations, so this fix removes buggy code without loss of generality.
Add arena_prof_tctx_reset() and use it instead of arena_prof_tctx_set()
when resetting the tctx pointer during reallocation, which happens
whenever an originally sampled reallocated object is not sampled during
reallocation.
This regression was introduced by
594c759f37 (Optimize
arena_prof_tctx_set().)
Make one call to prof_active_get_unlocked() per allocation event, and
use the result throughout the relevant functions that handle an
allocation event. Also add a missing check in prof_realloc(). These
fixes protect allocation events against concurrent prof_active changes.
Fix heap profiling to distinguish among otherwise identical sample sites
with interposed resets (triggered via the "prof.reset" mallctl). This
bug could cause data structure corruption that would most likely result
in a segfault.
When junk filling is enabled, shrinking an allocation fills the bytes
that were previously allocated but now aren't. Purging the chunk before
doing that is just a waste of time.
This resolves#260.
Fix chunk purge hook calls for in-place huge shrinking reallocation to
specify the old chunk size rather than the new chunk size. This bug
caused no correctness issues for the default chunk purge function, but
was visible to custom functions set via the "arena.<i>.chunk_hooks"
mallctl.
This resolves#264.
Fix arenas_cache_cleanup() and arena_get_hard() to handle
allocation/deallocation within the application's thread-specific data
cleanup functions even after arenas_cache is torn down.
This is a more general fix that complements
45e9f66c28 (Fix arenas_cache_cleanup().).
Fix arenas_cache_cleanup() to handle allocation/deallocation within the
application's thread-specific data cleanup functions even after
arenas_cache is torn down.
Don't bitshift by negative amounts when encoding/decoding run sizes in
chunk header maps. This affected systems with page sizes greater than 8
KiB.
Reported by Ingvar Hagelund <ingvar@redpill-linpro.com>.
Only set the unzeroed flag when initializing the entire mapbits entry,
rather than mutating just the unzeroed bit. This simplifies the
possible mapbits state transitions.
Cascade from decommit to purge when purging unused dirty pages, so that
it is possible to decommit cleaned memory rather than just purging. For
non-Windows debug builds, decommit runs rather than purging them, since
this causes access of deallocated runs to segfault.
This resolves#251.
Fix arena_ralloc_large_grow() to properly account for large_pad, so that
in-place large reallocation succeeds when possible, rather than always
failing. This regression was introduced by
8a03cf039c (Implement cache index
randomization for large allocations.)
- Decorate public function with __declspec(allocator) and __declspec(restrict), just like MSVC 1900
- Support JEMALLOC_HAS_RESTRICT by defining the restrict keyword
- Move __declspec(nothrow) between 'void' and '*' so it compiles once more
Add the "arena.<i>.chunk_hooks" mallctl, which replaces and expands on
the "arena.<i>.chunk.{alloc,dalloc,purge}" mallctls. The chunk hooks
allow control over chunk allocation/deallocation, decommit/commit,
purging, and splitting/merging, such that the application can rely on
jemalloc's internal chunk caching and retaining functionality, yet
implement a variety of chunk management mechanisms and policies.
Merge the chunks_[sz]ad_{mmap,dss} red-black trees into
chunks_[sz]ad_retained. This slightly reduces how hard jemalloc tries
to honor the dss precedence setting; prior to this change the precedence
setting was also consulted when recycling chunks.
Fix chunk purging. Don't purge chunks in arena_purge_stashed(); instead
deallocate them in arena_unstash_purged(), so that the dirty memory
linkage remains valid until after the last time it is used.
This resolves#176 and #201.
Fix huge_ralloc_no_move() to succeed if an allocation request results in
the same usable size as the existing allocation, even if the request
size is smaller than the usable size. This bug did not cause
correctness issues, but it could cause unnecessary moves during
reallocation.
huge_ralloc() passes a size that may not be precisely a size class, so
make huge_palloc() handle the more general case of a size input rather
than usize.
This regression appears to have been introduced by the addition of
in-place huge reallocation; as such it was never incorporated into a
release.
Create and use FMT* macros that are equivalent to the PRI* macros that
inttypes.h defines. This allows uniform use of the Unix-specific format
specifiers, e.g. "%zu", as well as avoiding Windows-specific definitions
of e.g. PRIu64.
Add ffs()/ffsl() support for compiling with gcc.
Extract compatibility definitions of ENOENT, EINVAL, EAGAIN, EPERM,
ENOMEM, and ENORANGE into include/msvc_compat/windows_extra.h and
use the file for tests as well as for core jemalloc code.
Replace JEMALLOC_ATTR(format(printf, ...). with
JEMALLOC_FORMAT_PRINTF(), so that configuration feature tests can
omit the attribute if it would cause extraneous compilation warnings.
As per gcc documentation:
The alloc_size attribute is used to tell the compiler that the function
return value points to memory (...)
This resolves#245.
This effectively reverts 97c04a9383 (Use
first-fit rather than first-best-fit run/chunk allocation.). In some
pathological cases, first-fit search dominates allocation time, and it
also tends not to converge as readily on a steady state of memory
layout, since precise allocation order has a bigger effect than for
first-best-fit.
Add various function attributes to the exported functions to give the
compiler more information to work with during optimization, and also
specify throw() when compiling with C++ on Linux, in order to adequately
match what __THROW does in glibc.
This resolves#237.
Conditionally define ENOENT, EINVAL, etc. (was unconditional).
Add/use PRIzu, PRIzd, and PRIzx for use in malloc_printf() calls. gcc issued
(harmless) warnings since e.g. "%zu" should be "%Iu" on Windows, and the
alternative to this workaround would have been to disable the function
attributes which cause gcc to look for type mismatches in formatted printing
function calls.
- Set opt_lg_chunk based on run-time OS setting
- Verify LG_PAGE is compatible with run-time OS setting
- When targeting Windows Vista or newer, use SRWLOCK instead of CRITICAL_SECTION
- When targeting Windows Vista or newer, statically initialize init_lock
Fix size class overflow handling for malloc(), posix_memalign(),
memalign(), calloc(), and realloc() when profiling is enabled.
Remove an assertion that erroneously caused arena_sdalloc() to fail when
profiling was enabled.
This resolves#232.
This avoids the potential surprise of deallocating an object with one
tcache specified, and having the object cached in a different tcache
once it drains from the quarantine.
Now that small allocation runs have fewer regions due to run metadata
residing in chunk headers, an explicit minimum tcache count is needed to
make sure that tcache adequately amortizes synchronization overhead.
Extract szad size quantization into {extent,run}_quantize(), and .
quantize szad run sizes to the union of valid small region run sizes and
large run sizes.
Refactor iteration in arena_run_first_fit() to use
run_quantize{,_first,_next(), and add support for padded large runs.
For large allocations that have no specified alignment constraints,
compute a pseudo-random offset from the beginning of the first backing
page that is a multiple of the cache line size. Under typical
configurations with 4-KiB pages and 64-byte cache lines this results in
a uniform distribution among 64 page boundary offsets.
Add the --disable-cache-oblivious option, primarily intended for
performance testing.
This resolves#13.
This rename avoids installation collisions with the upstream gperftools.
Additionally, jemalloc's per thread heap profile functionality
introduced an incompatible file format, so it's now worthwhile to
clearly distinguish jemalloc's version of this script from the upstream
version.
This resolves#229.
Fix the shrinking case of huge_ralloc_no_move_similar() to purge the
correct number of pages, at the correct offset. This regression was
introduced by 8d6a3e8321 (Implement
dynamic per arena control over dirty page purging.).
Fix huge_ralloc_no_move_shrink() to purge the correct number of pages.
This bug was introduced by 9673983443
(Purge/zero sub-chunk huge allocations as necessary.).
Fix arena_get() calls that specify refresh_if_missing=false. In
ctl_refresh() and ctl.c's arena_purge(), these calls attempted to only
refresh once, but did so in an unreliable way.
arena_i_lg_dirty_mult_ctl() was simply wrong to pass
refresh_if_missing=false.
However, unlike before it was removed do not force --enable-ivsalloc
when Darwin zone allocator integration is enabled, since the zone
allocator code uses ivsalloc() regardless of whether
malloc_usable_size() and sallocx() do.
This resolves#211.
Add mallctls:
- arenas.lg_dirty_mult is initialized via opt.lg_dirty_mult, and can be
modified to change the initial lg_dirty_mult setting for newly created
arenas.
- arena.<i>.lg_dirty_mult controls an individual arena's dirty page
purging threshold, and synchronously triggers any purging that may be
necessary to maintain the constraint.
- arena.<i>.chunk.purge allows the per arena dirty page purging function
to be replaced.
This resolves#93.
Remove the prof_tctx_state_destroying transitory state and instead add
the tctx_uid field, so that the tuple <thr_uid, tctx_uid> uniquely
identifies a tctx. This assures that tctx's are well ordered even when
more than two with the same thr_uid coexist. A previous attempted fix
based on prof_tctx_state_destroying was only sufficient for protecting
against two coexisting tctx's, but it also introduced a new dumping
race.
These regressions were introduced by
602c8e0971 (Implement per thread heap
profiling.) and 764b00023f (Fix a heap
profiling regression.).
Add the prof_tctx_state_destroying transitionary state to fix a race
between a thread destroying a tctx and another thread creating a new
equivalent tctx.
This regression was introduced by
602c8e0971 (Implement per thread heap
profiling.).
a14bce85 made buferror not take an error code, and make the Windows
code path for buferror use GetLastError, while the alternative code
paths used errno. Then 2a83ed02 made buferror take an error code
again, and while it changed the non-Windows code paths to use that
error code, the Windows code path was not changed accordingly.
Fix prof_tctx_comp() to incorporate tctx state into the comparison.
During a dump it is possible for both a purgatory tctx and an otherwise
equivalent nominal tctx to reside in the tree at the same time.
This regression was introduced by
602c8e0971 (Implement per thread heap
profiling.).
This tends to more effectively pack active memory toward low addresses.
However, additional tree searches are required in many cases, so whether
this change stands the test of time will depend on real-world
benchmarks.
Treat sizes that round down to the same size class as size-equivalent
in trees that are used to search for first best fit, so that there are
only as many "firsts" as there are size classes. This comes closer to
the ideal of first fit.
Rename "dirty chunks" to "cached chunks", in order to avoid overloading
the term "dirty".
Fix the regression caused by 339c2b23b2
(Fix chunk_unmap() to propagate dirty state.), and actually address what
that change attempted, which is to only purge chunks once, and propagate
whether zeroed pages resulted into chunk_record().
Fix chunk_unmap() to propagate whether a chunk is dirty, and modify
dirty chunk purging to record this information so it can be passed to
chunk_unmap(). Since the broken version of chunk_unmap() claimed that
all chunks were clean, this resulted in potential memory corruption for
purging implementations that do not zero (e.g. MADV_FREE).
This regression was introduced by
ee41ad409a (Integrate whole chunks into
unused dirty page purging machinery.).
Extend per arena unused dirty page purging to manage unused dirty chunks
in aaddtion to unused dirty runs. Rather than immediately unmapping
deallocated chunks (or purging them in the --disable-munmap case), store
them in a separate set of trees, chunks_[sz]ad_dirty. Preferrentially
allocate dirty chunks. When excessive unused dirty pages accumulate,
purge runs and chunks in ingegrated LRU order (and unmap chunks in the
--enable-munmap case).
Refactor extent_node_t to provide accessor functions.
8ddc93293c switched this to over using the
address tree in order to avoid false negatives, so it now needs to check
that the size of the free extent is large enough to satisfy the request.
Migrate all centralized data structures related to huge allocations and
recyclable chunks into arena_t, so that each arena can manage huge
allocations and recyclable virtual memory completely independently of
other arenas.
Add chunk node caching to arenas, in order to avoid contention on the
base allocator.
Use chunks_rtree to look up huge allocations rather than a red-black
tree. Maintain a per arena unsorted list of huge allocations (which
will be needed to enumerate huge allocations during arena reset).
Remove the --enable-ivsalloc option, make ivsalloc() always available,
and use it for size queries if --enable-debug is enabled. The only
practical implications to this removal are that 1) ivsalloc() is now
always available during live debugging (and the underlying radix tree is
available during core-based debugging), and 2) size query validation can
no longer be enabled independent of --enable-debug.
Remove the stats.chunks.{current,total,high} mallctls, and replace their
underlying statistics with simpler atomically updated counters used
exclusively for gdump triggering. These statistics are no longer very
useful because each arena manages chunks independently, and per arena
statistics provide similar information.
Simplify chunk synchronization code, now that base chunk allocation
cannot cause recursive lock acquisition.
Add the MALLOCX_TCACHE() and MALLOCX_TCACHE_NONE macros, which can be
used in conjunction with the *allocx() API.
Add the tcache.create, tcache.flush, and tcache.destroy mallctls.
This resolves#145.
Recent huge allocation refactoring associates huge allocations with
arenas, but it remains necessary to quickly look up huge allocation
metadata during reallocation/deallocation. A global radix tree remains
a good solution to this problem, but locking would have become the
primary bottleneck after (upcoming) migration of chunk management from
global to per arena data structures.
This lock-free implementation uses double-checked reads to traverse the
tree, so that in the steady state, each read or write requires only a
single atomic operation.
This implementation also assures that no more than two tree levels
actually exist, through a combination of careful virtual memory
allocation which makes large sparse nodes cheap, and skipping the root
node on x64 (possible because the top 16 bits are all 0 in practice).
Refactor base_alloc() to guarantee that allocations are carved from
demand-zeroed virtual memory. This supports sparse data structures such
as multi-page radix tree nodes.
Enhance base_alloc() to keep track of fragments which were too small to
support previous allocation requests, and try to consume them during
subsequent requests. This becomes important when request sizes commonly
approach or exceed the chunk size (as could radix tree node
allocations).
Fix chunk_recycle()'s new_addr functionality to search by address rather
than just size if new_addr is specified. The functionality added by
a95018ee81 (Attempt to expand huge
allocations in-place.) only worked if the two search orders happened to
return the same results (e.g. in simple test cases).
The documentation for opt.lg_dirty_mult says:
Per-arena minimum ratio (log base 2) of active to dirty
pages. Some dirty unused pages may be allowed to accumulate,
within the limit set by the ratio (or one chunk worth of dirty
pages, whichever is greater) (...)
The restriction in parentheses currently doesn't happen. This makes
jemalloc aggressively madvise(), which in turns increases the amount
of page faults significantly.
For instance, this resulted in several(!) hundred(!) milliseconds
startup regression on Firefox for Android.
This may require further tweaking, but starting with actually doing
what the documentation says is a good start.
This feature makes it possible to toggle the gdump feature on/off during
program execution, whereas the the opt.prof_dump mallctl value can only
be set during program startup.
This resolves#72.
Avoid calling chunk_recycle() for mmap()ed chunks if config_munmap is
disabled, in which case there are never any recyclable chunks.
This resolves#164.
There are three categories of metadata:
- Base allocations are used for bootstrap-sensitive internal allocator
data structures.
- Arena chunk headers comprise pages which track the states of the
non-metadata pages.
- Internal allocations differ from application-originated allocations
in that they are for internal use, and that they are omitted from heap
profiles.
The metadata statistics comprise the metadata categories as follows:
- stats.metadata: All metadata -- base + arena chunk headers + internal
allocations.
- stats.arenas.<i>.metadata.mapped: Arena chunk headers.
- stats.arenas.<i>.metadata.allocated: Internal allocations. This is
reported separately from the other metadata statistics because it
overlaps with the allocated and active statistics, whereas the other
metadata statistics do not.
Base allocations are not reported separately, though their magnitude can
be computed by subtracting the arena-specific metadata.
This resolves#163.
Refactor bootstrapping to delay tsd initialization, primarily to support
integration with FreeBSD's libc.
Refactor a0*() for internal-only use, and add the
bootstrap_{malloc,calloc,free}() API for use by FreeBSD's libc. This
separation limits use of the a0*() functions to metadata allocation,
which doesn't require malloc/calloc/free API compatibility.
This resolves#170.
In addition to true/false, opt.junk can now be either "alloc" or "free",
giving applications the possibility of junking memory only on allocation
or deallocation.
This resolves#172.
Fix OOM cleanup in huge_palloc() to call idalloct() rather than
base_node_dalloc(). This bug is a result of incomplete refactoring, and
has no impact other than leaking memory during OOM.
This provides in-place expansion of huge allocations when the end of the
allocation is at the end of the sbrk heap. There's already the ability
to extend in-place via recycled chunks but this handles the initial
growth of the heap via repeated vector / string reallocations.
A possible future extension could allow realloc to go from the following:
| huge allocation | recycled chunks |
^ dss_end
To a larger allocation built from recycled *and* new chunks:
| huge allocation |
^ dss_end
Doing that would involve teaching the chunk recycling code to request
new chunks to satisfy the request. The chunk_dss code wouldn't require
any further changes.
#include <stdlib.h>
int main(void) {
size_t chunk = 4 * 1024 * 1024;
void *ptr = NULL;
for (size_t size = chunk; size < chunk * 128; size *= 2) {
ptr = realloc(ptr, size);
if (!ptr) return 1;
}
}
dss:secondary: 0.083s
dss:primary: 0.083s
After:
dss:secondary: 0.083s
dss:primary: 0.003s
The dss heap grows in the upwards direction, so the oldest chunks are at
the low addresses and they are used first. Linux prefers to grow the
mmap heap downwards, so the trick will not work in the *current* mmap
chunk allocator as a huge allocation will only be at the top of the heap
in a contrived case.
Fix quarantine to actually update tsd when expanding, and to avoid
double initialization (leaking the first quarantine) due to recursive
initialization.
This resolves#161.
* use sized deallocation in iralloct_realign
* iralloc and ixalloc always need the old size, so pass it in from the
caller where it's often already calculated
The size of the source allocation is known at this point, so reading the
chunk header can be avoided for the small size class fast path. This is
not very useful right now, but it provides a significant performance
boost with an alternate ralloc entry point taking the old size.
Purge trailing pages during shrinking huge reallocation when resulting
size is not a multiple of the chunk size. Similarly, zero pages if
necessary during growing huge reallocation when the resulting size is
not a multiple of the chunk size.
Add the 'util' column, which reports the proportion of available regions
that are currently in use for each small size class. Small run
utilization is the complement of external fragmentation. For example,
utilization of 0.75 indicates that 25% of small run memory is consumed
by external fragmentation, in other (more obtuse) words, 33% external
fragmentation overhead.
This resolves#27.
Add per size class huge allocation statistics, and normalize various
stats:
- Change the arenas.nlruns type from size_t to unsigned.
- Add the arenas.nhchunks and arenas.hchunks.<i>.size mallctl's.
- Replace the stats.arenas.<i>.bins.<j>.allocated mallctl with
stats.arenas.<i>.bins.<j>.curregs .
- Add the stats.arenas.<i>.hchunks.<j>.nmalloc,
stats.arenas.<i>.hchunks.<j>.ndalloc,
stats.arenas.<i>.hchunks.<j>.nrequests, and
stats.arenas.<i>.hchunks.<j>.curhchunks mallctl's.
Fix a prof_tctx_t/prof_tdata_t cleanup race by storing a copy of thr_uid
in prof_tctx_t, so that the associated tdata need not be present during
tctx teardown.
Remove code in arena_dalloc_bin_run() that preserved the "clean" state
of trailing clean pages by splitting them into a separate run during
deallocation. This was a useful mechanism for reducing dirty page
churn when bin runs comprised many pages, but bin runs are now quite
small.
Remove the nextind field from arena_run_t now that it is no longer
needed, and change arena_run_t's bin field (arena_bin_t *) to binind
(index_t). These two changes remove 8 bytes of chunk header overhead
per page, which saves 1/512 of all arena chunk memory.
Add:
--with-lg-page
--with-lg-page-sizes
--with-lg-size-class-group
--with-lg-quantum
Get rid of STATIC_PAGE_SHIFT, in favor of directly setting LG_PAGE.
Fix various edge conditions exposed by the configure options.
atexit(3) can deadlock internally during its own initialization if
jemalloc calls atexit() during jemalloc initialization. Mitigate the
impact by restructuring prof initialization to avoid calling atexit()
unless the registered function will actually dump a final heap profile.
Additionally, disable prof_final by default so that this land mine is
opt-in rather than opt-out.
This resolves#144.