Currently pprof will print output for all threads if a single thread is not
specified, but this doesn't play well with many output formats (e.g., any of
the dot-based formats). Instead, default to printing just the overall profile
when no specific thread is requested.
This resolves#157.
In addition to true/false, opt.junk can now be either "alloc" or "free",
giving applications the possibility of junking memory only on allocation
or deallocation.
This resolves#172.
Fix OOM cleanup in huge_palloc() to call idalloct() rather than
base_node_dalloc(). This bug is a result of incomplete refactoring, and
has no impact other than leaking memory during OOM.
This provides in-place expansion of huge allocations when the end of the
allocation is at the end of the sbrk heap. There's already the ability
to extend in-place via recycled chunks but this handles the initial
growth of the heap via repeated vector / string reallocations.
A possible future extension could allow realloc to go from the following:
| huge allocation | recycled chunks |
^ dss_end
To a larger allocation built from recycled *and* new chunks:
| huge allocation |
^ dss_end
Doing that would involve teaching the chunk recycling code to request
new chunks to satisfy the request. The chunk_dss code wouldn't require
any further changes.
#include <stdlib.h>
int main(void) {
size_t chunk = 4 * 1024 * 1024;
void *ptr = NULL;
for (size_t size = chunk; size < chunk * 128; size *= 2) {
ptr = realloc(ptr, size);
if (!ptr) return 1;
}
}
dss:secondary: 0.083s
dss:primary: 0.083s
After:
dss:secondary: 0.083s
dss:primary: 0.003s
The dss heap grows in the upwards direction, so the oldest chunks are at
the low addresses and they are used first. Linux prefers to grow the
mmap heap downwards, so the trick will not work in the *current* mmap
chunk allocator as a huge allocation will only be at the top of the heap
in a contrived case.
Fix quarantine to actually update tsd when expanding, and to avoid
double initialization (leaking the first quarantine) due to recursive
initialization.
This resolves#161.
* use sized deallocation in iralloct_realign
* iralloc and ixalloc always need the old size, so pass it in from the
caller where it's often already calculated
The size of the source allocation is known at this point, so reading the
chunk header can be avoided for the small size class fast path. This is
not very useful right now, but it provides a significant performance
boost with an alternate ralloc entry point taking the old size.
Purge trailing pages during shrinking huge reallocation when resulting
size is not a multiple of the chunk size. Similarly, zero pages if
necessary during growing huge reallocation when the resulting size is
not a multiple of the chunk size.
Add the 'util' column, which reports the proportion of available regions
that are currently in use for each small size class. Small run
utilization is the complement of external fragmentation. For example,
utilization of 0.75 indicates that 25% of small run memory is consumed
by external fragmentation, in other (more obtuse) words, 33% external
fragmentation overhead.
This resolves#27.
Add per size class huge allocation statistics, and normalize various
stats:
- Change the arenas.nlruns type from size_t to unsigned.
- Add the arenas.nhchunks and arenas.hchunks.<i>.size mallctl's.
- Replace the stats.arenas.<i>.bins.<j>.allocated mallctl with
stats.arenas.<i>.bins.<j>.curregs .
- Add the stats.arenas.<i>.hchunks.<j>.nmalloc,
stats.arenas.<i>.hchunks.<j>.ndalloc,
stats.arenas.<i>.hchunks.<j>.nrequests, and
stats.arenas.<i>.hchunks.<j>.curhchunks mallctl's.
Fix a prof_tctx_t/prof_tdata_t cleanup race by storing a copy of thr_uid
in prof_tctx_t, so that the associated tdata need not be present during
tctx teardown.
Remove code in arena_dalloc_bin_run() that preserved the "clean" state
of trailing clean pages by splitting them into a separate run during
deallocation. This was a useful mechanism for reducing dirty page
churn when bin runs comprised many pages, but bin runs are now quite
small.
Remove the nextind field from arena_run_t now that it is no longer
needed, and change arena_run_t's bin field (arena_bin_t *) to binind
(index_t). These two changes remove 8 bytes of chunk header overhead
per page, which saves 1/512 of all arena chunk memory.
Add:
--with-lg-page
--with-lg-page-sizes
--with-lg-size-class-group
--with-lg-quantum
Get rid of STATIC_PAGE_SHIFT, in favor of directly setting LG_PAGE.
Fix various edge conditions exposed by the configure options.
atexit(3) can deadlock internally during its own initialization if
jemalloc calls atexit() during jemalloc initialization. Mitigate the
impact by restructuring prof initialization to avoid calling atexit()
unless the registered function will actually dump a final heap profile.
Additionally, disable prof_final by default so that this land mine is
opt-in rather than opt-out.
This resolves#144.
This avoids grabbing the base mutex, as a step towards fine-grained
locking for huge allocations. The thread cache also provides a tiny
(~3%) improvement for serial huge allocations.
Abstract arenas access to use arena_get() (or a0get() where appropriate)
rather than directly reading e.g. arenas[ind]. Prior to the addition of
the arenas.extend mallctl, the worst possible outcome of directly
accessing arenas was a stale read, but arenas.extend may allocate and
assign a new array to arenas.
Add a tsd-based arenas_cache, which amortizes arenas reads. This
introduces some subtle bootstrapping issues, with tsd_boot() now being
split into tsd_boot[01]() to support tsd wrapper allocation
bootstrapping, as well as an arenas_cache_bypass tsd variable which
dynamically terminates allocation of arenas_cache itself.
Promote a0malloc(), a0calloc(), and a0free() to be generally useful for
internal allocation, and use them in several places (more may be
appropriate).
Abstract arena->nthreads management and fix a missing decrement during
thread destruction (recent tsd refactoring left arenas_cleanup()
unused).
Change arena_choose() to propagate OOM, and handle OOM in all callers.
This is important for providing consistent allocation behavior when the
MALLOCX_ARENA() flag is being used. Prior to this fix, it was possible
for an OOM to result in allocation silently allocating from a different
arena than the one specified.
Normalize size classes to use the same number of size classes per size
doubling (currently hard coded to 4), across the intire range of size
classes. Small size classes already used this spacing, but in order to
support this change, additional small size classes now fill [4 KiB .. 16
KiB). Large size classes range from [16 KiB .. 4 MiB). Huge size
classes now support non-multiples of the chunk size in order to fill (4
MiB .. 16 MiB).