Commit Graph

1271 Commits

Author SHA1 Message Date
Qi Wang
73713fbb27 Drop high rank locks when creating threads.
Avoid holding arenas_lock and background_thread_lock when creating background
threads, because pthread_create may take internal locks, and potentially cause
deadlock with jemalloc internal locks.
2017-06-08 10:02:18 -07:00
Qi Wang
00869e39a3 Make tsd no-cleanup during tsd reincarnation.
Since tsd cleanup isn't guaranteed when reincarnated, we set up tsd in a way
that needs no cleanup, by making it going through slow path instead.
2017-06-07 11:03:49 -07:00
Qi Wang
3a813946fb Take background thread lock when setting extent hooks. 2017-06-05 10:56:25 -07:00
Qi Wang
340071f0cf Set isthreaded when enabling background_thread. 2017-06-01 17:34:49 -07:00
Jason Evans
b511232fcd Refactor/fix background_thread/percpu_arena bootstrapping.
Refactor bootstrapping such that dlsym() is called during the
bootstrapping phase that can tolerate reentrant allocation.
2017-06-01 08:55:27 -07:00
David Goldblatt
8261e581be Header refactoring: Pull size helpers out of jemalloc module. 2017-05-31 13:08:45 -07:00
David Goldblatt
041e041e1f Header refactoring: unify and de-catchall mutex_pool. 2017-05-31 13:08:45 -07:00
David Goldblatt
98774e64a4 Header refactoring: unify and de-catchall extent_mmap module. 2017-05-31 13:08:45 -07:00
David Goldblatt
93284bb53d Header refactoring: unify and de-catchall extent_dss. 2017-05-31 13:08:45 -07:00
David Goldblatt
44f9bd147a Header refactoring: unify and de-catchall rtree module. 2017-05-31 13:08:45 -07:00
Jason Evans
c606a87d2a Add the --disable-thp option to support cross compiling.
This resolves #669.
2017-05-30 11:30:54 -07:00
Jason Evans
168793a1c1 Fix extent_grow_next management.
Fix management of extent_grow_next to serialize operations that may grow
retained memory.  This assures that the sizes of the newly allocated
extents correspond to the size classes in the intended growth sequence.

Fix management of extent_grow_next to skip size classes if a request is
too large to be satisfied by the next size in the growth sequence.  This
avoids the potential for an arbitrary number of requests to bypass
triggering extent_grow_next increases.

This resolves #858.
2017-05-29 17:27:18 -07:00
Qi Wang
d5ef5ae934 Add opt.stats_print_opts.
The value is passed to atexit(3)-triggered malloc_stats_print() calls.
2017-05-29 11:54:00 -07:00
Qi Wang
b86d271cbf Added opt_abort_conf: abort on invalid config options. 2017-05-26 21:14:28 -07:00
Qi Wang
927239b910 Cleanup smoothstep.sh / .h.
h_step_sum was used to compute moving sum.  Not in use anymore.
2017-05-25 16:52:10 -07:00
David Goldblatt
18ecbfa89e Header refactoring: unify and de-catchall mutex module 2017-05-24 15:27:30 -07:00
David Goldblatt
9f822a1fd7 Header refactoring: unify and de-catchall witness code. 2017-05-24 15:27:30 -07:00
Jason Evans
36195c8f4d Disable percpu_arena by default. 2017-05-23 15:32:50 -07:00
Qi Wang
eeefdf3ce8 Fix # of unpurged pages in decay algorithm.
When # of dirty pages move below npages_limit (e.g. they are reused), we should
not lower number of unpurged pages because that would cause the reused pages to
be double counted in the backlog (as a result, decay happen slower than it
should).  Instead, set number of unpurged to the greater of current npages and
npages_limit.

Added an assertion: the ceiling # of pages should be greater than npages_limit.
2017-05-23 13:48:30 -07:00
Qi Wang
0eae838b0d Check for background thread inactivity on extents_dalloc.
To avoid background threads sleeping forever with idle arenas, we eagerly check
background threads' sleep time after extents_dalloc, and signal the thread if
necessary.
2017-05-23 12:26:20 -07:00
Qi Wang
5f5ed2198e Add profiling for the background thread mutex. 2017-05-23 12:26:20 -07:00
Qi Wang
2bee0c6251 Add background thread related stats. 2017-05-23 12:26:20 -07:00
Qi Wang
b693c7868e Implementing opt.background_thread.
Added opt.background_thread to enable background threads, which handles purging
currently.  When enabled, decay ticks will not trigger purging (which will be
left to the background threads).  We limit the max number of threads to NCPUs.
When percpu arena is enabled, set CPU affinity for the background threads as
well.

The sleep interval of background threads is dynamic and determined by computing
number of pages to purge in the future (based on backlog).
2017-05-23 12:26:20 -07:00
David Goldblatt
3f685e8824 Protect the rtree/extent interactions with a mutex pool.
Instead of embedding a lock bit in rtree leaf elements, we associate extents
with a small set of mutexes.  This gets us two things:

- We can use the system mutexes.  This (hypothetically) protects us from
  priority inversion, and lets us stop doing a backoff/sleep loop, instead
  opting for precise wakeups from the mutex.
- Cuts down on the number of mutex acquisitions we have to do (from 4 in the
  worst case to two).

We end up simplifying most of the rtree code (which no longer has to deal with
locking or concurrency at all), at the cost of additional complexity in the
extent code: since the mutex protecting the rtree leaf elements is determined by
reading the extent out of those elements, the initial read is racy, so that we
may acquire an out of date mutex.  We re-check the extent in the leaf after
acquiring the mutex to protect us from this race.
2017-05-19 14:21:27 -07:00
David Goldblatt
26c792e61a Allow mutexes to take a lock ordering enum at construction.
This lets us specify whether and how mutexes of the same rank are allowed to be
acquired.  Currently, we only allow two polices (only a single mutex at a given
rank at a time, and mutexes acquired in ascending order), but we can plausibly
allow more (e.g. the "release uncontended mutexes before blocking").
2017-05-19 14:21:27 -07:00
Jason Evans
6e62c62862 Refactor *decay_time into *decay_ms.
Support millisecond resolution for decay times.  Among other use cases
this makes it possible to specify a short initial dirty-->muzzy decay
phase, followed by a longer muzzy-->clean decay phase.

This resolves #812.
2017-05-18 11:33:45 -07:00
Qi Wang
baf3e294e0 Add stats: arena uptime. 2017-05-18 10:04:28 -07:00
Jason Evans
18a83681cf Refactor (MALLOCX_ARENA_MAX + 1) to be MALLOCX_ARENA_LIMIT.
This resolves #673.
2017-05-14 10:14:23 -07:00
Jason Evans
909f0482e4 Automatically generate private symbol name mangling macros.
Rather than using a manually maintained list of internal symbols to
drive name mangling, add a compilation phase to automatically extract
the list of internal symbols.

This resolves #677.
2017-05-11 23:06:54 -07:00
Jason Evans
a4ae9707da Remove unused private_unnamespace infrastructure. 2017-05-11 23:06:54 -07:00
Jason Evans
a268af5085 Stop depending on JEMALLOC_N() for function interception during testing.
Instead, always define function pointers for interceptable functions,
but mark them const unless testing, so that the compiler can optimize
out the pointer dereferences.
2017-05-11 23:06:54 -07:00
Jason Evans
81ef365622 Avoid compiler warnings on Windows. 2017-05-11 18:06:20 -07:00
Jason Evans
11d2f39d96 Remove mutex_prof_data_t redeclaration.
Redeclaration causes compilations failures with e.g. gcc 4.2.1 on
FreeBSD.  This regression was introduced by
89e2d3c12b (Header refactoring: ctl -
unify and remove from catchall.).
2017-05-11 10:49:43 -07:00
Jason Evans
0798fe6e70 Fix rtree_leaf_elm_szind_slab_update().
Re-read the leaf element when atomic CAS fails due to a race with
another thread that has locked the leaf element, since
atomic_compare_exchange_strong_p() overwrites the expected value with
the actual value on failure.  This regression was introduced by
0ee0e0c155 (Implement compact rtree leaf
element representation.).

This resolves #798.
2017-05-03 08:52:33 -07:00
Jason Evans
344dd342dd rtree_leaf_elm_extent_write() --> rtree_leaf_elm_extent_lock_write()
Refactor rtree_leaf_elm_extent_write() as
rtree_leaf_elm_extent_lock_write(), so that whether the leaf element is
currently acquired is separate from what lock state to write.  This
allows for a relaxed atomic read when releasing the lock.
2017-05-03 08:52:33 -07:00
Qi Wang
fc1aaf13fe Revert "Use trylock in tcache_bin_flush when possible."
This reverts commit 8584adc451.  Production
results not favorable.  Will investigate separately.
2017-05-01 14:49:42 -07:00
David Goldblatt
209f2926b8 Header refactoring: tsd - cleanup and dependency breaking.
This removes the tsd macros (which are used only for tsd_t in real builds).  We
break up the circular dependencies involving tsd.

We also move all tsd access through getters and setters.  This allows us to
assert that we only touch data when tsd is in a valid state.

We simplify the usages of the x macro trick, removing all the customizability
(get/set, init, cleanup), moving the lifetime logic to tsd_init and tsd_cleanup.
This lets us make initialization order independent of order within tsd_t.
2017-05-01 10:49:56 -07:00
Jason Evans
c86c8f4ffb Add extent_destroy_t and use it during arena destruction.
Add the extent_destroy_t extent destruction hook to extent_hooks_t, and
use it during arena destruction.  This hook explicitly communicates to
the callee that the extent must be destroyed or tracked for later reuse,
lest it be permanently leaked.  Prior to this change, retained extents
could unintentionally be leaked if extent retention was enabled.

This resolves #560.
2017-04-29 09:24:12 -07:00
Jason Evans
b9ab04a191 Refactor !opt.munmap to opt.retain. 2017-04-29 09:24:12 -07:00
Qi Wang
d901a37775 Revert "Use try_flush first in tcache_dalloc."
This reverts commit b0c2a28280.  Production
benchmark shows this caused significant regression in both CPU and memory
consumption.  Will investigate separately later on.
2017-04-28 10:59:04 -07:00
Qi Wang
b0c2a28280 Use try_flush first in tcache_dalloc.
Only do must_flush if try_flush didn't manage to free anything.
2017-04-25 17:21:33 -07:00
Qi Wang
8584adc451 Use trylock in tcache_bin_flush when possible.
During tcache gc, use tcache_bin_try_flush_small / _large so that we can skip
items with their bins locked already.
2017-04-25 17:21:33 -07:00
Qi Wang
05775a3736 Avoid prof_dump during reentrancy. 2017-04-25 12:54:36 -07:00
David Goldblatt
268843ac68 Header refactoring: pages.h - unify and remove from catchall. 2017-04-25 09:51:38 -07:00
David Goldblatt
dab4beb277 Header refactoring: hash - unify and remove from catchall. 2017-04-25 09:51:38 -07:00
David Goldblatt
89e2d3c12b Header refactoring: ctl - unify and remove from catchall.
In order to do this, we introduce the mutex_prof module, which breaks a circular
dependency between ctl and prof.
2017-04-25 09:51:38 -07:00
Jason Evans
c67c3e4a63 Replace --disable-munmap with opt.munmap.
Control use of munmap(2) via a run-time option rather than a
compile-time option (with the same per platform default).  The old
behavior of --disable-munmap can be achieved with
--with-malloc-conf=munmap:false.

This partially resolves #580.
2017-04-24 20:37:16 -07:00
Jason Evans
e2cc6280ed Remove --enable-code-coverage.
This option hasn't been particularly useful since the original pre-3.0.0
push to broaden test coverage.

This partially resolves #580.
2017-04-24 16:33:04 -07:00
Jason Evans
0f63396b23 Remove --disable-cc-silence.
The explicit compiler warning suppression controlled by this option is
universally desirable, so remove the ability to disable suppression.

This partially resolves #580.
2017-04-24 15:02:45 -07:00
Qi Wang
f970c497dc Implement malloc_mutex_trylock() w/ proper stats update. 2017-04-24 13:23:55 -07:00
Jason Evans
af76f0e5d2 Remove --with-lg-tiny-min.
This option isn't useful in practice.

This partially resolves #580.
2017-04-24 11:48:28 -07:00
David Goldblatt
120c7a747f Header refactoring: bitmap - unify and remove from catchall. 2017-04-24 10:33:21 -07:00
David Goldblatt
d6b5c7e0f6 Header refactoring: stats - unify and remove from catchall 2017-04-24 10:33:21 -07:00
David Goldblatt
36abf78aa9 Header refactoring: move smoothstep.h out of the catchall. 2017-04-24 10:33:21 -07:00
David Goldblatt
31b43219db Header refactoring: size_classes module - remove from the catchall 2017-04-24 10:33:21 -07:00
David Goldblatt
68da2361d2 Header refactoring: ckh module - remove from the catchall and unify. 2017-04-24 10:33:21 -07:00
David Goldblatt
bf2dc7e678 Header refactoring: ticker module - remove from the catchall and unify. 2017-04-24 10:33:21 -07:00
David Goldblatt
fa3ad730c4 Header refactoring: prng module - remove from the catchall and unify. 2017-04-24 10:33:21 -07:00
David Goldblatt
4d2e4bf5eb Get rid of most of the various inline macros. 2017-04-24 10:33:21 -07:00
David Goldblatt
425253e2cd Enable -Wundef, when supported.
This can catch bugs in which one header defines a numeric constant, and another
uses it without including the defining header. Undefined preprocessor symbols
expand to '0', so that this will compile fine, silently doing the math wrong.
2017-04-21 17:03:56 -07:00
Jason Evans
3823effe12 Remove --enable-ivsalloc.
Continue to use ivsalloc() when --enable-debug is specified (and add
assertions to guard against 0 size), but stop providing a documented
explicit semantics-changing band-aid to dodge undefined behavior in
sallocx() and malloc_usable_size().  ivsalloc() remains compiled in,
unlike when #211 restored --enable-ivsalloc, and if
JEMALLOC_FORCE_IVSALLOC is defined during compilation, sallocx() and
malloc_usable_size() will still use ivsalloc().

This partially resolves #580.
2017-04-21 14:34:35 -07:00
Jim Chen
ae248a2160 Use openat syscall if available
Some architectures like AArch64 may not have the open syscall because it
was superseded by the openat syscall, so check and use SYS_openat if
SYS_open is not available.

Additionally, Android headers for AArch64 define SYS_open to __NR_open,
even though __NR_open is undefined. Undefine SYS_open in that case so
SYS_openat is used.
2017-04-21 10:58:42 -07:00
Jason Evans
4403c9ab44 Remove --disable-tcache.
Simplify configuration by removing the --disable-tcache option, but
replace the testing for that configuration with
--with-malloc-conf=tcache:false.

Fix the thread.arena and thread.tcache.flush mallctls to work correctly
if tcache is disabled.

This partially resolves #580.
2017-04-21 10:06:12 -07:00
Qi Wang
5aa46f027d Bypass extent tracking for auto arenas.
Tracking extents is required by arena_reset.  To support this, the extent
linkage was used for tracking 1) large allocations, and 2) full slabs.  However
modifying the extent linkage could be an expensive operation as it likely incurs
cache misses.  Since we forbid arena_reset on auto arenas, let's bypass the
linkage operations for auto arenas.
2017-04-21 00:29:18 -07:00
Jason Evans
da4cff0279 Support --with-lg-page values larger than system page size.
All mappings continue to be PAGE-aligned, even if the system page size
is smaller.  This change is primarily intended to provide a mechanism
for supporting multiple page sizes with the same binary; smaller page
sizes work better in conjunction with jemalloc's design.

This resolves #467.
2017-04-18 19:01:04 -07:00
Jason Evans
45f087eb03 Revert "Remove BITMAP_USE_TREE."
Some systems use a native 64 KiB page size, which means that the bitmap
for the smallest size class can be 8192 bits, not just 512 bits as when
the page size is 4 KiB.  Linear search in bitmap_{sfu,ffu}() is
unacceptably slow for such large bitmaps.

This reverts commit 7c00f04ff4.
2017-04-18 19:01:04 -07:00
David Goldblatt
38e847c1c5 Header refactoring: unify spin.h and move it out of the catch-all. 2017-04-18 18:35:03 -07:00
David Goldblatt
418d96a86c Header refactoring: unify nstime.h and move it out of the catch-all 2017-04-18 18:35:03 -07:00
David Goldblatt
7ebc83894f Header refactoring: move jemalloc_internal_types.h out of the catch-all 2017-04-18 18:35:03 -07:00
David Goldblatt
d9ec36e22d Header refactoring: move assert.h out of the catch-all 2017-04-18 18:35:03 -07:00
David Goldblatt
f692e6c214 Header refactoring: move util.h out of the catchall 2017-04-18 18:35:03 -07:00
David Goldblatt
54373be084 Header refactoring: move malloc_io.h out of the catchall 2017-04-18 18:35:03 -07:00
David Goldblatt
0b00ffe55f Header refactoring: move bit_util.h out of the catchall 2017-04-18 18:35:03 -07:00
David Goldblatt
22366518b7 Move CPP_PROLOGUE and CPP_EPILOGUE to the .cpp
This lets us avoid having to specify them in every C file.
2017-04-18 18:35:03 -07:00
Jason Evans
881fbf762f Prefer old/low extent_t structures during reuse.
Rather than using a LIFO queue to track available extent_t structures,
use a red-black tree, and always choose the oldest/lowest available
during reuse.
2017-04-17 14:47:45 -07:00
Jason Evans
76b35f4b2f Track extent structure serial number (esn) in extent_t.
This enables stable sorting of extent_t structures.
2017-04-17 14:47:45 -07:00
Jason Evans
69aa552809 Allocate increasingly large base blocks.
Limit the total number of base block by leveraging the exponential
size class sequence, similarly to extent_grow_retained().
2017-04-17 14:47:45 -07:00
Qi Wang
3c9c41edb2 Improve rtree cache with a two-level cache design.
Two levels of rcache is implemented: a direct mapped cache as L1, combined with
a LRU cache as L2.  The L1 cache offers low cost on cache hit, but could suffer
collision under circumstances.  This is complemented by the L2 LRU cache, which
is slower on cache access (overhead from linear search + reordering), but solves
collison of L1 rather well.
2017-04-17 12:05:23 -07:00
Qi Wang
d16f1e53df Skip percpu arena when choosing iarena. 2017-04-16 21:34:44 -07:00
Qi Wang
c2fcf9c2cf Switch to fine-grained reentrancy support.
Previously we had a general detection and support of reentrancy, at the cost of
having branches and inc / dec operations on fast paths.  To avoid taxing fast
paths, we move the reentrancy operations onto tsd slow state, and only modify
reentrancy level around external calls (that might trigger reentrancy).
2017-04-14 19:48:06 -07:00
Qi Wang
b348ba29bb Bundle 3 branches on fast path into tsd_state.
Added tsd_state_nominal_slow, which on fast path malloc() incorporates
tcache_enabled check, and on fast path free() bundles both malloc_slow and
tcache_enabled branches.
2017-04-14 16:58:08 -07:00
Qi Wang
ccfe68a916 Pass alloc_ctx down profiling path.
With this change, when profiling is enabled, we avoid doing redundant rtree
lookups. Also changed dalloc_atx_t to alloc_atx_t, as it's now used on
allocation path as well (to speed up profiling).
2017-04-12 13:55:39 -07:00
Qi Wang
f35213bae4 Pass dalloc_ctx down the sdalloc path.
This avoids redundant rtree lookups.
2017-04-12 13:55:39 -07:00
David Goldblatt
e709fae1d7 Header refactoring: move atomic.h out of the catch-all 2017-04-11 11:52:30 -07:00
David Goldblatt
743d940dc3 Header refactoring: Split up jemalloc_internal.h
This is a biggy.  jemalloc_internal.h has been doing multiple jobs for a while
now:
- The source of system-wide definitions.
- The catch-all include file.
- The module header file for jemalloc.c

This commit splits up this functionality.  The system-wide definitions
responsibility has moved to jemalloc_preamble.h.  The catch-all include file is
now jemalloc_internal_includes.h.  The module headers for jemalloc.c are now in
jemalloc_internal_[externs|inlines|types].h, just as they are for the other
modules.
2017-04-11 11:52:30 -07:00
David Goldblatt
0237870c60 Header refactoring: break out ql.h dependencies 2017-04-11 11:52:30 -07:00
David Goldblatt
610cb83419 Header refactoring: break out qr.h dependencies 2017-04-11 11:52:30 -07:00
David Goldblatt
63a5cd4cc2 Header refactoring: break out rb.h dependencies 2017-04-11 11:52:30 -07:00
David Goldblatt
2f00ce4da7 Header refactoring: break out ph.h dependencies 2017-04-11 11:52:30 -07:00
David Goldblatt
57e36e1a12 Header refactoring: Add CPP_PROLOGUE and CPP_EPILOGUE macros 2017-04-11 11:52:30 -07:00
Qi Wang
bfa530b75b Pass dealloc_ctx down free() fast path.
This gets rid of the redundent rtree lookup down fast path.
2017-04-11 09:58:12 -07:00
Qi Wang
04ef218d87 Move reentrancy_level to the beginning of TSD. 2017-04-07 16:25:43 -07:00
David Goldblatt
b407a65401 Add basic reentrancy-checking support, and allow arena_new to reenter.
This checks whether or not we're reentrant using thread-local data, and, if we
are, moves certain internal allocations to use arena 0 (which should be properly
initialized after bootstrapping).

The immediate thing this allows is spinning up threads in arena_new, which will
enable spinning up background threads there.
2017-04-07 14:10:27 -07:00
David Goldblatt
0a0fcd3e6a Add hooking functionality
This allows us to hook chosen functions and do interesting things there (in
particular: reentrancy checking).
2017-04-07 14:10:27 -07:00
Qi Wang
36bd90b962 Optimizing TSD and thread cache layout.
1) Re-organize TSD so that frequently accessed fields are closer to the
beginning and more compact.  Assuming 64-bit, the first 2.5 cachelines now
contains everything needed on tcache fast path, expect the tcache struct itself.

2) Re-organize tcache and tbins.  Take lg_fill_div out of tbin, and reduce tbin
to 24 bytes (down from 32). Split tbins into tbins_small and tbins_large, and
place tbins_small close to the beginning.
2017-04-07 14:06:17 -07:00
Qi Wang
0fba57e579 Get rid of tcache_enabled_t as we have runtime init support. 2017-04-07 10:42:29 -07:00
Qi Wang
fde3e20cc0 Integrate auto tcache into TSD.
The embedded tcache is initialized upon tsd initialization.  The avail arrays
for the tbins will be allocated / deallocated accordingly during init / cleanup.

With this change, the pointer to the auto tcache will always be available, as
long as we have access to the TSD.  tcache_available() (called in tcache_get())
is provided to check if we should use tcache.
2017-04-07 09:55:14 -07:00
David Goldblatt
eeabdd2466 Remove the pre-C11-atomics API, which is now unused 2017-04-05 16:25:37 -07:00
David Goldblatt
5dcc13b342 Make the mutex n_waiting_thds field a C11-style atomic 2017-04-05 16:25:37 -07:00
David Goldblatt
30d74db08e Convert accumbytes in prof_accum_t to C11 atomics, when possible 2017-04-05 16:25:37 -07:00
David Goldblatt
92aafb0efe Make base_t's extent_hooks field C11-atomic 2017-04-05 16:25:37 -07:00
David Goldblatt
56b72c7b17 Transition arena struct fields to C11 atomics 2017-04-05 16:25:37 -07:00
David Goldblatt
bc32ec3503 Move arena-tracking atomics in jemalloc.c to C11-style 2017-04-05 16:25:37 -07:00
David Goldblatt
864adb7f42 Transition e_prof_tctx in struct extent to C11 atomics 2017-04-04 16:46:04 -07:00
David Goldblatt
7da04a6b09 Convert prng module to use C11-style atomics 2017-04-04 16:45:52 -07:00
Qi Wang
492e9f301e Make the tsd member init functions to take tsd_t * type. 2017-04-04 14:06:07 -07:00
Qi Wang
d3cda3423c Do proper cleanup for tsd_state_reincarnated.
Also enable arena_bind under non-nominal state, as the cleanup will be handled
correctly now.
2017-04-04 00:34:49 -07:00
Qi Wang
51d3682950 Remove the leafkey NULL check in leaf_elm_lookup. 2017-04-04 00:27:35 -07:00
Qi Wang
9ed84b0d45 Add init function support to tsd members.
This will facilitate embedding tcache into tsd, which will require proper
initialization cannot be done via the static initializer.  Make tsd->rtree_ctx
to be initialized via rtree_ctx_data_init().
2017-04-04 00:19:21 -07:00
Jason Evans
07f4f93434 Move arena_slab_data_t's nfree into extent_t's e_bits.
Compact extent_t to 128 bytes on 64-bit systems by moving
arena_slab_data_t's nfree into extent_t's e_bits.

Cacheline-align extent_t structures so that they always cross the
minimum number of cacheline boundaries.

Re-order extent_t fields such that all fields except the slab bitmap
(and overlaid heap profiling context pointer) are in the first
cacheline.

This resolves #461.
2017-03-27 22:43:39 -07:00
Qi Wang
af3d737a9a Simplify rtree cache replacement policy.
To avoid memmove on free() fast path, simplify the cache replacement policy to
only bubble up the cache hit element by 1.
2017-03-27 13:42:31 -07:00
Jason Evans
c6d1819e48 Simplify rtree_clear() to avoid locking. 2017-03-27 13:22:52 -07:00
Jason Evans
4020523f67 Fix a race in rtree_szind_slab_update() for RTREE_LEAF_COMPACT. 2017-03-27 13:22:36 -07:00
Jason Evans
7c00f04ff4 Remove BITMAP_USE_TREE.
Remove tree-structured bitmap support, in order to reduce complexity and
ease maintenance.  No bitmaps larger than 512 bits have been necessary
since before 4.0.0, and there is no current plan that would increase
maximum bitmap size.  Although tree-structured bitmaps were used on
32-bit platforms prior to this change, the overall benefits were
questionable (higher metadata overhead, higher bitmap modification cost,
marginally lower search cost).
2017-03-27 12:18:40 -07:00
Jason Evans
6258176c87 Fix bitmap_ffu() to work with 3+ levels. 2017-03-27 12:18:40 -07:00
Jason Evans
735ad8210c Pack various extent_t fields into a bitfield.
This reduces sizeof(extent_t) from 160 to 136 on x64.
2017-03-25 23:30:13 -07:00
Jason Evans
0591c204b4 Store arena index rather than (arena_t *) in extent_t. 2017-03-25 23:30:13 -07:00
Jason Evans
5e12223925 Fix BITMAP_USE_TREE version of bitmap_ffu().
This fixes an extent searching regression on 32-bit systems, caused by
the initial bitmap_ffu() implementation in
c8021d01f6 (Implement bitmap_ffu(), which
finds the first unset bit.), as first used in
5d33233a5e (Use a bitmap in extents_t to
speed up search.).
2017-03-25 23:29:32 -07:00
Jason Evans
5d33233a5e Use a bitmap in extents_t to speed up search.
Rather than iteratively checking all sufficiently large heaps during
search, maintain and use a bitmap in order to skip empty heaps.
2017-03-24 17:52:46 -07:00
Jason Evans
57e353163f Implement BITMAP_GROUPS(). 2017-03-24 17:52:46 -07:00
Jason Evans
c8021d01f6 Implement bitmap_ffu(), which finds the first unset bit. 2017-03-24 17:52:46 -07:00
Qi Wang
362e356675 Profile per arena base mutex, instead of just a0. 2017-03-23 00:03:28 -07:00
Qi Wang
d3fde1c124 Refactor mutex profiling code with x-macros. 2017-03-23 00:03:28 -07:00
Qi Wang
f6698ec1e6 Switch to nstime_t for the time related fields in mutex profiling. 2017-03-23 00:03:28 -07:00
Qi Wang
74f78cafda Added custom mutex spin.
A fixed max spin count is used -- with benchmark results showing it
solves almost all problems. As the benchmark used was rather intense,
the upper bound could be a little bit high. However it should offer a
good tradeoff between spinning and blocking.
2017-03-23 00:03:28 -07:00
Qi Wang
20b8c70e9f Added extents_dirty / _muzzy mutexes, as well as decay_dirty / _muzzy. 2017-03-23 00:03:28 -07:00
Qi Wang
64c5f5c174 Added "stats.mutexes.reset" mallctl to reset all mutex stats.
Also switched from the term "lock" to "mutex".
2017-03-23 00:03:28 -07:00
Qi Wang
ca9074deff Added lock profiling and output for global locks (ctl, prof and base). 2017-03-23 00:03:28 -07:00
Qi Wang
0fb5c0e853 Add arena lock stats output. 2017-03-23 00:03:28 -07:00
Qi Wang
a4f176af57 Output bin lock profiling results to malloc_stats.
Two counters are included for the small bins: lock contention rate, and
max lock waiting time.
2017-03-23 00:03:28 -07:00
Qi Wang
6309df628f First stage of mutex profiling.
Switched to trylock and update counters based on state.
2017-03-23 00:03:28 -07:00
Jason Evans
32e7cf51cd Further specialize arena_[s]dalloc() tcache fast path.
Use tsd_rtree_ctx() rather than tsdn_rtree_ctx() when tcache is
non-NULL, in order to avoid an extra branch (and potentially extra stack
space) in the fast path.
2017-03-22 18:33:32 -07:00
Jason Evans
5e67fbc367 Push down iealloc() calls.
Call iealloc() as deep into call chains as possible without causing
redundant calls.
2017-03-22 18:33:32 -07:00
Jason Evans
51a2ec92a1 Remove extent dereferences from the deallocation fast paths. 2017-03-22 18:33:32 -07:00
Jason Evans
4f341412e5 Remove extent arg from isalloc() and arena_salloc(). 2017-03-22 18:33:32 -07:00
Jason Evans
0ee0e0c155 Implement compact rtree leaf element representation.
If a single virtual adddress pointer has enough unused bits to pack
{szind_t, extent_t *, bool, bool}, use a single pointer-sized field in
each rtree leaf element, rather than using three separate fields.  This
has little impact on access speed (fewer loads/stores, but more bit
twiddling), except that denser representation increases TLB
effectiveness.
2017-03-22 18:33:32 -07:00
Jason Evans
ce41ab0c57 Embed root node into rtree_t.
This avoids one atomic operation per tree access.
2017-03-22 18:33:32 -07:00
Jason Evans
99d68445ef Incorporate szind/slab into rtree leaves.
Expand and restructure the rtree API such that all common operations can
be achieved with minimal work, regardless of whether the rtree leaf
fields are independent versus packed into a single atomic pointer.
2017-03-22 18:33:32 -07:00
Jason Evans
944c8a3383 Split rtree_elm_t into rtree_{node,leaf}_elm_t.
This allows leaf elements to differ in size from internal node elements.

In principle it would be more correct to use a different type for each
level of the tree, but due to implementation details related to atomic
operations, we use casts anyway, thus counteracting the value of
additional type correctness.  Furthermore, such a scheme would require
function code generation (via cpp macros), as well as either unwieldy
type names for leaves or type aliases, e.g.

  typedef struct rtree_elm_d2_s rtree_leaf_elm_t;

This alternate strategy would be more correct, and with less code
duplication, but probably not worth the complexity.
2017-03-22 18:33:32 -07:00
Jason Evans
f50d6009fe Remove binind field from arena_slab_data_t.
binind is now redundant; the containing extent_t's szind field always
provides the same value.
2017-03-22 18:33:32 -07:00
Jason Evans
e8921cf2eb Convert extent_t's usize to szind.
Rather than storing usize only for large (and prof-promoted)
allocations, store the size class index for allocations that reside
within the extent, such that the size class index is valid for all
extents that contain extant allocations, and invalid otherwise (mainly
to make debugging simpler).
2017-03-22 18:33:32 -07:00
Jason Evans
64e458f5cd Implement two-phase decay-based purging.
Split decay-based purging into two phases, the first of which uses lazy
purging to convert dirty pages to "muzzy", and the second of which uses
forced purging, decommit, or unmapping to convert pages to clean or
destroy them altogether.  Not all operating systems support lazy
purging, yet the application may provide extent hooks that implement
lazy purging, so care must be taken to dynamically omit the first phase
when necessary.

The mallctl interfaces change as follows:
- opt.decay_time --> opt.{dirty,muzzy}_decay_time
- arena.<i>.decay_time --> arena.<i>.{dirty,muzzy}_decay_time
- arenas.decay_time --> arenas.{dirty,muzzy}_decay_time
- stats.arenas.<i>.pdirty --> stats.arenas.<i>.p{dirty,muzzy}
- stats.arenas.<i>.{npurge,nmadvise,purged} -->
  stats.arenas.<i>.{dirty,muzzy}_{npurge,nmadvise,purged}

This resolves #521.
2017-03-15 13:13:47 -07:00
Jason Evans
38a5bfc816 Move arena_t's purging field into arena_decay_t. 2017-03-15 13:13:47 -07:00
Jason Evans
765edd67b4 Refactor decay-related function parametrization.
Refactor most of the decay-related functions to take as parameters the
decay_t and associated extents_t structures to operate on.  This
prepares for supporting both lazy and forced purging on different decay
schedules.
2017-03-15 13:13:47 -07:00
David Goldblatt
ee202efc79 Convert remaining arena_stats_t fields to atomics
These were all size_ts, so we have atomics support for them on all platforms, so
the conversion is straightforward.

Left non-atomic is curlextents, which AFAICT is not used atomically anywhere.
2017-03-13 18:22:33 -07:00
David Goldblatt
4fc2acf5ae Switch atomic uint64_ts in arena_stats_t to C11 atomics
I expect this to be the trickiest conversion we will see, since we want atomics
on 64-bit platforms, but are also always able to piggyback on some sort of
external synchronization on non-64 bit platforms.
2017-03-13 18:22:33 -07:00
Jason Evans
7cbcd2e2b7 Fix pages_purge_forced() to discard pages on non-Linux systems.
madvise(..., MADV_DONTNEED) only causes demand-zeroing on Linux, so fall
back to overlaying a new mapping.
2017-03-13 18:19:57 -07:00
David Goldblatt
21a68e2d22 Convert rtree code to use C11 atomics
In the process, I changed the implementation of rtree_elm_acquire so that it
won't even try to CAS if its initial read (getting the extent + lock bit)
indicates that the CAS is doomed to fail.  This can significantly improve
performance under contention.
2017-03-13 12:05:27 -07:00
Jason Evans
3a2b183d5f Convert arena_t's purging field to non-atomic bool.
The decay mutex already protects all accesses.
2017-03-10 10:14:30 -08:00
Jason Evans
75fddc786c Fix ATOMIC_{ACQUIRE,RELEASE,ACQ_REL} definitions. 2017-03-09 00:57:37 -08:00
Qi Wang
ec532e2c5c Implement per-CPU arena.
The new feature, opt.percpu_arena, determines thread-arena association
dynamically based CPU id. Three modes are supported: "percpu", "phycpu"
and disabled.

"percpu" uses the current core id (with help from sched_getcpu())
directly as the arena index, while "phycpu" will assign threads on the
same physical CPU to the same arena. In other words, "percpu" means # of
arenas == # of CPUs, while "phycpu" has # of arenas == 1/2 * (# of
CPUs). Note that no runtime check on whether hyper threading is enabled
is added yet.

When enabled, threads will be migrated between arenas when a CPU change
is detected. In the current design, to reduce overhead from reading CPU
id, each arena tracks the thread accessed most recently. When a new
thread comes in, we will read CPU id and update arena if necessary.
2017-03-08 23:19:01 -08:00
Qi Wang
8721e19c04 Fix arena_prefork lock rank order for witness.
When witness is enabled, lock rank order needs to be preserved during
prefork, not only for each arena, but also across arenas. This change
breaks arena_prefork into further stages to ensure valid rank order
across arenas. Also changed test/unit/fork to use a manual arena to
catch this case.
2017-03-08 23:07:27 -08:00
David Goldblatt
8adab26972 Convert extents_t's npages field to use C11-style atomics
In the process, we can do some strength reduction, changing the fetch-adds and
fetch-subs to be simple loads followed by stores, since the modifications all
occur while holding the mutex.
2017-03-08 21:27:09 -08:00
David Goldblatt
dafadce622 Reintroduce JEMALLOC_ATOMIC_U64
The C11 atomics backport removed this #define, which degraded atomic 64-bit
reads to require a lock even on platforms that support them.  This commit fixes
that.
2017-03-08 21:26:37 -08:00
Qi Wang
01f47f11a6 Store associated arena in tcache.
This fixes tcache_flush for manual tcaches, which wasn't able to find
the correct arena it associated with. Also changed the decay test to
cover this case (by using manually created arenas).
2017-03-07 12:58:11 -08:00
Jason Evans
cc75c35db5 Add any() and remove_any() to ph.
These functions select the easiest-to-remove element in the heap, which
is either the most recently inserted aux list element or the root.  If
no calls are made to first() or remove_first(), the behavior (and time
complexity) is the same as for a LIFO queue.
2017-03-07 10:25:33 -08:00
Jason Evans
e201e24904 Perform delayed coalescing prior to purging.
Rather than purging uncoalesced extents, perform just enough incremental
coalescing to purge only fully coalesced extents.  In the absence of
cached extent reuse, the immediate versus delayed incremental purging
algorithms result in the same purge order.

This resolves #655.
2017-03-07 10:25:12 -08:00
David Goldblatt
4f1e94658a Change arena to use the atomic functions for ssize_t instead of the union strategy 2017-03-06 18:49:19 -08:00
David Goldblatt
438efede78 Add atomic types for ssize_t 2017-03-06 18:49:19 -08:00
David Goldblatt
424e3428b1 Make type abbreviations consistent: ssize_t is zd everywhere 2017-03-06 18:49:19 -08:00
David Goldblatt
84326c566a Insert not_reached after an exhaustive switch
In the C11 atomics backport, we couldn't use not_reached() in
atomic_enum_to_builtin (in atomic_gcc_atomic.h), since atomic.h was hermetic and
assert.h wasn't; there was a dependency issue.  assert.h is hermetic now, so we
can include it.
2017-03-06 15:08:43 -08:00
David Goldblatt
e9852b5776 Disentangle assert and util
This is the first header refactoring diff, #533.  It splits the assert and util
components into separate, hermetic, header files.  In the process, it splits out
two of the large sub-components of util (the stdio.h replacement, and bit
manipulation routines) into their own components (malloc_io.h and bit_util.h).
This is mostly to break up cyclic dependencies, but it also breaks off a good
chunk of the catch-all-ness of util, which is nice.
2017-03-06 15:08:43 -08:00
Jason Evans
04d8fcb745 Optimize malloc_large_stats_t maintenance.
Convert the nrequests field to be partially derived, and the curlextents
to be fully derived, in order to reduce the number of stats updates
needed during common operations.

This change affects ndalloc stats during arena reset, because it is no
longer possible to cancel out ndalloc effects (curlextents would become
negative).
2017-03-04 08:18:31 -08:00
David Goldblatt
d4ac7582f3 Introduce a backport of C11 atomics
This introduces a backport of C11 atomics.  It has four implementations; ranked
in order of preference, they are:
- GCC/Clang __atomic builtins
- GCC/Clang __sync builtins
- MSVC _Interlocked builtins
- C11 atomics, from <stdatomic.h>

The primary advantages are:
- Close adherence to the standard API gives us a defined memory model.
- Type safety: atomic objects are now separate types from non-atomic ones, so
  that it's impossible to mix up atomic and non-atomic updates (which is
  undefined behavior that compilers are starting to take advantage of).
- Efficiency: we can specify ordering for operations, avoiding fences and
  atomic operations on strongly ordered architectures (example:
  `atomic_write_u32(ptr, val);` involves a CAS loop, whereas
  `atomic_store(ptr, val, ATOMIC_RELEASE);` is a plain store.

This diff leaves in the current atomics API (implementing them in terms of the
backport).  This lets us transition uses over piecemeal.

Testing:
This is by nature hard to test. I've manually tested the first three options on
Linux on gcc by futzing with the #defines manually, on freebsd with gcc and
clang, on MSVC, and on OS X with clang.  All of these were x86 machines though,
and we don't have any test infrastructure set up for non-x86 platforms.
2017-03-03 13:40:59 -08:00
David Goldblatt
957b8c5f21 Stop #define-ining away 'inline'
In the long term, we'll transition to C99-style inline semantics.  In the
short-term, this will allow both styles to coexist without breaking one another.
2017-03-03 13:40:59 -08:00
Jason Evans
fd058f572b Immediately purge cached extents if decay_time is 0.
This fixes a regression caused by
54269dc0ed (Remove obsolete
arena_maybe_purge() call.), as well as providing a general fix.

This resolves #665.
2017-03-02 19:43:06 -08:00
Jason Evans
d61a5f76b2 Convert arena_decay_t's time to be atomically synchronized. 2017-03-02 19:43:06 -08:00
Jason Evans
472fef2e12 Fix {allocated,nmalloc,ndalloc,nrequests}_large stats regression.
This fixes a regression introduced by
d433471f58 (Derive
{allocated,nmalloc,ndalloc,nrequests}_large stats.).
2017-02-27 11:18:07 -08:00
Jason Evans
079b8bee37 Tidy up extent quantization.
Remove obsolete unit test scaffolding for extent quantization.  Remove
redundant assertions.  Add an assertion to
extents_first_best_fit_locked() that should help prevent aligned
allocation regressions.
2017-02-27 11:17:47 -08:00
Jason Evans
d727596bcb Update a comment. 2017-02-26 11:05:27 -08:00
Qi Wang
c2323e13a5 Get rid of witness in malloc_mutex_t when !(configured w/ debug).
We don't touch witness at all when config_debug == false.  Let's only pay the
memory cost in malloc_mutex_s when needed. Note that when !config_debug, we keep
the field in a union so that we don't have to do #ifdefs in multiple places.
2017-02-24 09:41:29 -08:00
Jason Evans
8ac7937eb5 Remove remainder of mb (memory barrier).
This complements 94c5d22a4d (Remove mb.h,
which is unused).
2017-02-22 00:24:14 -08:00
Jason Evans
003ca8717f Move arena_basic_stats_merge() prototype (hygienic cleanup). 2017-02-21 12:46:20 -08:00
Jason Evans
2dfc5b5aac Disable coalescing of cached extents.
Extent splitting and coalescing is a major component of large allocation
overhead, and disabling coalescing of cached extents provides a simple
and effective hysteresis mechanism.  Once two-phase purging is
implemented, it will probably make sense to leave coalescing disabled
for the first phase, but coalesce during the second phase.
2017-02-16 20:11:50 -08:00
Jason Evans
b0654b95ed Fix arena->stats.mapped accounting.
Mapped memory increases when extent_alloc_wrapper() succeeds, and
decreases when extent_dalloc_wrapper() is called (during purging).
2017-02-16 15:52:11 -08:00
Jason Evans
f8fee6908d Synchronize arena->decay with arena->decay.mtx.
This removes the last use of arena->lock.
2017-02-16 09:39:46 -08:00
Jason Evans
d433471f58 Derive {allocated,nmalloc,ndalloc,nrequests}_large stats.
This mildly reduces stats update overhead during normal operation.
2017-02-16 09:39:46 -08:00
Jason Evans
ab25d3c987 Synchronize arena->tcache_ql with arena->tcache_ql_mtx.
This replaces arena->lock synchronization.
2017-02-16 09:39:46 -08:00
Jason Evans
6b5cba4191 Convert arena->stats synchronization to atomics. 2017-02-16 09:39:46 -08:00
Jason Evans
fa2d64c94b Convert arena->prof_accumbytes synchronization to atomics. 2017-02-16 09:39:46 -08:00
Jason Evans
b779522b9b Convert arena->dss_prec synchronization to atomics. 2017-02-16 09:39:46 -08:00
Jason Evans
0721b895ff Do not generate unused tsd_*_[gs]et() functions.
This avoids a gcc diagnostic note:
    note: The ABI for passing parameters with 64-byte alignment has
    changed in GCC 4.6
This note related to the cacheline alignment of rtree_ctx_t, which was
introduced by 4a346f5593 (Replace rtree
path cache with LRU cache.).
2017-02-13 10:47:16 -08:00
Jason Evans
6b8ef771a9 Fix rtree_subkey() regression.
Fix rtree_subkey() to use uintptr_t rather than unsigned for key
bitmasking.  This regression was introduced by
4a346f5593 (Replace rtree path cache with
LRU cache.).
2017-02-10 09:05:02 -08:00
Jason Evans
7f55dbef9b Enable mutex witnesses even when !isthreaded.
This fixes interactions with witness_assert_depth[_to_rank](), which was
added in d0e93ada51 (Add
witness_assert_depth[_to_rank]().).
2017-02-09 17:05:47 -08:00
Jason Evans
db7da56359 Spin adaptively in rtree_elm_acquire(). 2017-02-08 18:50:03 -08:00
Jason Evans
de8a68e853 Enhance spin_adaptive() to yield after several iterations.
This avoids worst case behavior if e.g. another thread is preempted
while owning the resource the spinning thread is waiting for.
2017-02-08 18:50:03 -08:00
Jason Evans
5f11830754 Replace spin_init() with SPIN_INITIALIZER. 2017-02-08 18:50:03 -08:00
Jason Evans
650c070e10 Remove rtree support for 0 (NULL) keys.
NULL can never actually be inserted in practice, and removing support
allows a branch to be removed from the fast path.
2017-02-08 18:50:03 -08:00
Jason Evans
f5cf9b19c8 Determine rtree levels at compile time.
Rather than dynamically building a table to aid per level computations,
define a constant table at compile time.  Omit both high and low
insignificant bits.  Use one to three tree levels, depending on the
number of significant bits.
2017-02-08 18:50:03 -08:00
Jason Evans
ff4db5014e Remove rtree leading 0 bit optimization.
A subsequent change instead ignores insignificant high bits.
2017-02-08 18:50:03 -08:00
Jason Evans
cdc240d501 Make non-essential inline rtree functions static functions. 2017-02-08 18:50:03 -08:00
Jason Evans
c511a44e99 Split rtree_elm_lookup_hard() out of rtree_elm_lookup().
Anything but a hit in the first element of the lookup cache is
expensive enough to negate the benefits of inlining.
2017-02-08 18:50:03 -08:00
Jason Evans
4a346f5593 Replace rtree path cache with LRU cache.
Rework rtree_ctx_t to encapsulate an rtree leaf LRU lookup cache rather
than a single-path element lookup cache.  The replacement is logically
much simpler, as well as slightly faster in the fast path case and less
prone to degraded performance during non-trivial sequences of lookups.
2017-02-08 18:50:03 -08:00
Jason Evans
0ecf692726 Optimize a branch out of rtree_read() if !dependent. 2017-02-08 18:50:03 -08:00
Jason Evans
d27f29b468 Disentangle arena and extent locking.
Refactor arena and extent locking protocols such that arena and
extent locks are never held when calling into the extent_*_wrapper()
API.  This requires extra care during purging since the arena lock no
longer protects the inner purging logic.  It also requires extra care to
protect extents from being merged with adjacent extents.

Convert extent_t's 'active' flag to an enumerated 'state', so that
retained extents are explicitly marked as such, rather than depending on
ring linkage state.

Refactor the extent collections (and their synchronization) for cached
and retained extents into extents_t.  Incorporate LRU functionality to
support purging.  Incorporate page count accounting, which replaces
arena->ndirty and arena->stats.retained.

Assert that no core locks are held when entering any internal
[de]allocation functions.  This is in addition to existing assertions
that no locks are held when entering external [de]allocation functions.

Audit and document synchronization protocols for all arena_t fields.

This fixes a potential deadlock due to recursive allocation during
gdump, in a similar fashion to b49c649bc1
(Fix lock order reversal during gdump.), but with a necessarily much
broader code impact.
2017-02-01 16:43:46 -08:00
Jason Evans
1b6e43507e Fix/refactor tcaches synchronization.
Synchronize tcaches with tcaches_mtx rather than ctl_mtx.  Add missing
synchronization for tcache flushing.  This bug was introduced by
1cb181ed63 (Implement explicit tcache
support.), which was first released in 4.0.0.
2017-02-01 16:43:46 -08:00
Jason Evans
d0e93ada51 Add witness_assert_depth[_to_rank]().
This makes it possible to make lock state assertions about precisely
which locks are held.
2017-02-01 16:43:46 -08:00
Jason Evans
c0cc5db871 Replace tabs following #define with spaces.
This resolves #564.
2017-01-20 21:45:53 -08:00
Jason Evans
f408643a4c Remove extraneous parens around return arguments.
This resolves #540.
2017-01-20 21:43:07 -08:00
Jason Evans
c4c2592c83 Update brace style.
Add braces around single-line blocks, and remove line breaks before
function-opening braces.

This resolves #537.
2017-01-20 21:43:07 -08:00
Jason Evans
9eb1b1c881 Fix --disable-stats support.
Fix numerous regressions that were exposed by --disable-stats, both in
the core library and in the tests.
2017-01-19 18:31:07 -08:00
Qi Wang
58424e679d Added stats about number of bytes cached in tcache currently. 2017-01-18 10:55:21 -08:00
Mike Hommey
0f7376eb62 Don't rely on OSX SDK malloc/malloc.h for malloc_zone struct definitions
The SDK jemalloc is built against might be not be the latest for various
reasons, but the resulting binary ought to work on newer versions of
OSX.

In order to ensure this, we need the fullest definitions possible, so
copy what we need from the latest version of malloc/malloc.h available
on opensource.apple.com.
2017-01-17 20:13:28 -08:00
Jason Evans
1ff09534b5 Fix prof_realloc() regression.
Mostly revert the prof_realloc() changes in
498856f44a (Move slabs out of chunks.) so
that prof_free_sampled_object() is called when appropriate.  Leave the
prof_tctx_[re]set() optimization in place, but add an assertion to
verify that all eight cases are correctly handled.  Add a comment to
make clear the code ordering, so that the regression originally fixed by
ea8d97b897 (Fix
prof_{malloc,free}_sample_object() call order in prof_realloc().) is not
repeated.

This resolves #499.
2017-01-17 15:16:37 -08:00
Jason Evans
ffbb7dac3d Remove leading blank lines from function bodies.
This resolves #535.
2017-01-13 14:49:24 -08:00
David Goldblatt
77cccac8cd Break up headers into constituent parts
This is part of a broader change to make header files better represent the
dependencies between one another (see
https://github.com/jemalloc/jemalloc/issues/533). It breaks up component headers
into smaller parts that can be made to have a simpler dependency graph.

For the autogenerated headers (smoothstep.h and size_classes.h), no splitting
was necessary, so I didn't add support to emit multiple headers.
2017-01-12 15:43:51 -08:00
David Goldblatt
94c5d22a4d Remove mb.h, which is unused 2017-01-11 13:24:30 -08:00
John Paul Adrian Glaubitz
77de5f27d8 Use better pre-processor defines for sparc64
Currently, jemalloc detects sparc64 targets by checking whether
__sparc64__ is defined. However, this definition is used on BSD
targets only. Linux targets define both __sparc__ and __arch64__
for sparc64. Since this also works on BSD, rather use __sparc__
and __arch64__ instead of __sparc64__ to detect sparc64 targets.
2017-01-10 17:39:54 -08:00
Jason Evans
edf1bafb2b Implement arena.<i>.destroy .
Add MALLCTL_ARENAS_DESTROYED for accessing destroyed arena stats as an
analogue to MALLCTL_ARENAS_ALL.

This resolves #382.
2017-01-06 18:58:46 -08:00
Jason Evans
6edbedd916 Range-check mib[1] --> arena_ind casts. 2017-01-06 18:58:46 -08:00
Jason Evans
c0a05e6aba Move static ctl_epoch variable into ctl_stats_t (as epoch). 2017-01-06 18:58:45 -08:00
Jason Evans
d778dd2afc Refactor ctl_stats_t.
Refactor ctl_stats_t to be a demand-zeroed non-growing data structure.
To keep the size from being onerous (~60 MiB) on 32-bit systems, convert
the arenas field to contain pointers rather than directly embedded
ctl_arena_stats_t elements.
2017-01-06 18:58:45 -08:00
Jason Evans
0f04bb1d6f Rename the arenas.extend mallctl to arenas.create. 2017-01-06 18:58:45 -08:00
Jason Evans
3dc4e83ccb Add MALLCTL_ARENAS_ALL.
Add the MALLCTL_ARENAS_ALL cpp macro as a fixed index for use
in accessing the arena.<i>.{purge,decay,dss} and stats.arenas.<i>.*
mallctls, and deprecate access via the arenas.narenas index (to be
removed in 6.0.0).
2017-01-06 18:58:45 -08:00
Jason Evans
027ace8519 Reindent. 2017-01-06 18:58:45 -08:00
Jason Evans
a0dd3a4483 Implement per arena base allocators.
Add/rename related mallctls:
- Add stats.arenas.<i>.base .
- Rename stats.arenas.<i>.metadata to stats.arenas.<i>.internal .
- Add stats.arenas.<i>.resident .

Modify the arenas.extend mallctl to take an optional (extent_hooks_t *)
argument so that it is possible for all base allocations to be serviced
by the specified extent hooks.

This resolves #463.
2016-12-26 18:08:28 -08:00
Jason Evans
a6e86810d8 Refactor purging and splitting/merging.
Split purging into lazy and forced variants.  Use the forced variant for
zeroing dss.

Add support for NULL function pointers as an opt-out mechanism for the
dalloc, commit, decommit, purge_lazy, purge_forced, split, and merge
fields of extent_hooks_t.

Add short-circuiting checks in large_ralloc_no_move_{shrink,expand}() so
that no attempt is made if splitting/merging is not supported.

This resolves #268.
2016-12-26 18:08:16 -08:00
Jason Evans
884fa22b8c Rename arena_decay_t's ndirty to nunpurged. 2016-12-26 17:59:43 -08:00
Jason Evans
411697adcd Use exponential series to size extents.
If virtual memory is retained, allocate extents such that their sizes
form an exponentially growing series.  This limits the number of
disjoint virtual memory ranges so that extent merging can be effective
even if multiple arenas' extent allocation requests are highly
interleaved.

This resolves #462.
2016-12-26 17:59:42 -08:00
Jason Evans
c1baa0a9b7 Add huge page configuration and pages_[no}huge().
Add the --with-lg-hugepage configure option, but automatically configure
LG_HUGEPAGE even if it isn't specified.

Add the pages_[no]huge() functions, which toggle huge page state via
madvise(..., MADV_[NO]HUGEPAGE) calls.
2016-12-26 17:59:34 -08:00
Jason Evans
bacb6afc6c Simplify arena_slab_regind().
Rewrite arena_slab_regind() to provide sufficient constant data for
the compiler to perform division strength reduction.  This replaces
more general manual strength reduction that was implemented before
arena_bin_info was compile-time-constant.  It would be possible to
slightly improve on the compiler-generated division code by taking
advantage of range limits that the compiler doesn't know about.
2016-12-23 10:34:34 -08:00
Jason Evans
69c26cdb01 Add some missing explicit casts. 2016-12-13 13:38:11 -08:00
Dave Watson
2319152d9f jemalloc cpp new/delete bindings
Adds cpp bindings for jemalloc, along with necessary autoconf settings.
This is mostly to add sized deallocation support, which can't be added
from C directly.  Sized deallocation is ~10% microbench improvement.

* Import ax_cxx_compile_stdcxx.m4 from the autoconf repo, seems like the
  easiest way to get c++14 detection.
* Adds various other changes, like CXXFLAGS, to configure.ac.
* Adds new rules to Makefile.in for src/jemalloc-cpp.cpp, and a basic
  unittest.
* Both new and delete are overridden, to ensure jemalloc is used for
  both.
* TODO future enhancement of avoiding extra PLT thunks for new and
  delete - sdallocx and malloc are publicly exported jemalloc symbols,
  using an alias would link them directly.  Unfortunately, was having
  trouble getting it to play nice with jemalloc's namespace support.

Testing:
Tested gcc 4.8, gcc 5, gcc 5.2, clang 4.0.  Only gcc >= 5 has sized
deallocation support, verified that the rest build correctly.

Tested mac osx and Centos.

Tested --with-jemalloc-prefix and --without-export.

This resolves #202.
2016-12-12 18:36:06 -08:00
Jason Evans
d4c5aceb7c Add a_type parameter to qr_{meld,split}(). 2016-12-12 18:16:51 -08:00
Jason Evans
acb7b1f53e Add --disable-syscall.
This resolves #517.
2016-12-03 16:50:58 -08:00
Jason Evans
32127949a3 Enable overriding JEMALLOC_{ALLOC,FREE}_JUNK.
This resolves #509.
2016-11-22 10:58:58 -08:00
Jason Evans
c3b85f2585 Style fixes. 2016-11-22 10:58:23 -08:00
Jason Evans
5234be2133 Add pthread_atfork(3) feature test.
Some versions of Android provide a pthreads library without providing
pthread_atfork(), so in practice a separate feature test is necessary
for the latter.
2016-11-17 15:14:57 -08:00
Jason Evans
fda60be799 Update a comment. 2016-11-17 11:50:52 -08:00
Jason Evans
a64123ce13 Refactor madvise(2) configuration.
Add feature tests for the MADV_FREE and MADV_DONTNEED flags to
madvise(2), so that MADV_FREE is detected and used for Linux kernel
versions 4.5 and newer.  Refactor pages_purge() so that on systems which
support both flags, MADV_FREE is preferred over MADV_DONTNEED.

This resolves #387.
2016-11-17 10:31:57 -08:00
Jason Evans
a38acf716e Add extent serial numbers.
Add extent serial numbers and use them where appropriate as a sort key
that is higher priority than address, so that the allocation policy
prefers older extents.

This resolves #147.
2016-11-15 13:08:33 -08:00
Jason Evans
cda59f9970 Rename atomic_*_{uint32,uint64,u}() to atomic_*_{u32,u64,zu}().
This change conforms to naming conventions throughout the codebase.
2016-11-07 11:27:48 -08:00
Jason Evans
2e46b13ad5 Revert "Define 64-bits atomics unconditionally"
This reverts commit c2942e2c0e.

This resolves #495.
2016-11-07 10:53:35 -08:00
Jason Evans
04b463546e Refactor prng to not use 64-bit atomics on 32-bit platforms.
This resolves #495.
2016-11-07 10:52:44 -08:00
Jason Evans
ea9961acdb Fix psz/pind edge cases.
Add an "over-size" extent heap in which to store extents which exceed
the maximum size class (plus cache-oblivious padding, if enabled).
Remove psz2ind_clamp() and use psz2ind() instead so that trying to
allocate the maximum size class can in principle succeed.  In practice,
this allows assertions to hold so that OOM errors can be successfully
generated.
2016-11-03 22:33:34 -07:00
Jason Evans
8dd5ea87ca Fix extent_alloc_cache[_locked]() to support decommitted allocation.
Fix extent_alloc_cache[_locked]() to support decommitted allocation, and
use this ability in arena_stash_dirty(), so that decommitted extents are
not needlessly committed during purging.  In practice this does not
happen on any currently supported systems, because both extent merging
and decommit must be implemented; all supported systems implement one
xor the other.
2016-11-03 22:33:23 -07:00
Jason Evans
4f7d8c2dee Update symbol mangling. 2016-11-03 15:00:02 -07:00
Dave Watson
25f7bbcf28 Fix long spinning in rtree_node_init
rtree_node_init spinlocks the node, allocates, and then sets the node.
This is under heavy contention at the top of the tree if many threads
start to allocate at the same time.

Instead, take a per-rtree sleeping mutex to reduce spinning.  Tested
both pthreads and osx OSSpinLock, and both reduce spinning adequately

Previous benchmark time:
./ttest1 500 100
~15s

New benchmark time:
./ttest1 500 100
.57s
2016-11-02 20:30:53 -07:00
Jason Evans
d82f2b3473 Do not use syscall(2) on OS X 10.12 (deprecated). 2016-11-02 19:18:33 -07:00
Jason Evans
795f6689de Add os_unfair_lock support.
OS X 10.12 deprecated OSSpinLock; os_unfair_lock is the recommended
replacement.
2016-11-02 18:09:45 -07:00
Jason Evans
d9f7b2a430 Fix/refactor zone allocator integration code.
Fix zone_force_unlock() to reinitialize, rather than unlocking mutexes,
since OS X 10.12 cannot tolerate a child unlocking mutexes that were
locked by its parent.

Refactor; this was a side effect of experimenting with zone
{de,re}registration during fork(2).
2016-11-02 18:06:40 -07:00
Jason Evans
90b60eeae4 Add an assertion in witness_owner(). 2016-10-31 15:28:22 -07:00
Jason Evans
6a834d94bb Refactor witness_unlock() to fix undefined test behavior.
This resolves #396.
2016-10-31 11:49:12 -07:00
Jason Evans
6c80321aed Use CLOCK_MONOTONIC_COARSE rather than COARSE_MONOTONIC_RAW.
The raw clock variant is slow (even relative to plain CLOCK_MONOTONIC),
whereas the coarse clock variant is faster than CLOCK_MONOTONIC, but
still has resolution (~1ms) that is adequate for our purposes.

This resolves #479.
2016-10-29 22:58:18 -07:00
Dave Watson
8309388408 Support static linking of jemalloc with glibc
glibc defines its malloc implementation with several weak and strong
symbols:

strong_alias (__libc_calloc, __calloc) weak_alias (__libc_calloc, calloc)
strong_alias (__libc_free, __cfree) weak_alias (__libc_free, cfree)
strong_alias (__libc_free, __free) strong_alias (__libc_free, free)
strong_alias (__libc_malloc, __malloc) strong_alias (__libc_malloc, malloc)

The issue is not with the weak symbols, but that other parts of glibc
depend on __libc_malloc explicitly.  Defining them in terms of jemalloc
API's allows the linker to drop glibc's malloc.o completely from the link,
and static linking no longer results in symbol collisions.

Another wrinkle: jemalloc during initialization calls sysconf to
get the number of CPU's.  GLIBC allocates for the first time before
setting up isspace (and other related) tables, which are used by
sysconf.  Instead, use the pthread API to get the number of
CPUs with GLIBC, which seems to work.

This resolves #442.
2016-10-28 15:08:19 -07:00
Jason Evans
48d4adfbeb Avoid negation of unsigned numbers.
Rather than relying on two's complement negation for alignment mask
generation, use bitwise not and addition.  This dodges warnings from
MSVC, and should be strength-reduced by compiler optimization anyway.
2016-10-27 21:26:33 -07:00
Jason Evans
b54d160dc4 Do not (recursively) allocate within tsd_fetch().
Refactor tsd so that tsdn_fetch() does not trigger allocation, since
allocation could cause infinite recursion.

This resolves #458.
2016-10-20 23:59:12 -07:00
Jason Evans
577d4572b0 Make dss operations lockless.
Rather than protecting dss operations with a mutex, use atomic
operations.  This has negligible impact on synchronization overhead
during typical dss allocation, but is a substantial improvement for
extent_in_dss() and the newly added extent_dss_mergeable(), which can be
called multiple times during extent deallocations.

This change also has the advantage of avoiding tsd in deallocation paths
associated with purging, which resolves potential deadlocks during
thread exit due to attempted tsd resurrection.

This resolves #425.
2016-10-13 15:37:00 -07:00
Jason Evans
e5effef428 Add/use adaptive spinning.
Add spin_t and spin_{init,adaptive}(), which provide a simple
abstraction for adaptive spinning.

Adaptively spin during busy waits in bootstrapping and rtree node
initialization.
2016-10-13 14:55:39 -07:00
Jason Evans
9acd5cf178 Remove all vestiges of chunks.
Remove mallctls:
- opt.lg_chunk
- stats.cactive

This resolves #464.
2016-10-12 11:55:43 -07:00
Jason Evans
63b5657aa5 Remove ratio-based purging.
Make decay-based purging the default (and only) mode.

Remove associated mallctls:
- opt.purge
- opt.lg_dirty_mult
- arena.<i>.lg_dirty_mult
- arenas.lg_dirty_mult
- stats.arenas.<i>.lg_dirty_mult

This resolves #385.
2016-10-12 10:40:27 -07:00
Jason Evans
b4b4a77848 Fix and simplify decay-based purging.
Simplify decay-based purging attempts to only be triggered when the
epoch is advanced, rather than every time purgeable memory increases.
In a correctly functioning system (not previously the case; see below),
this only causes a behavior difference if during subsequent purge
attempts the least recently used (LRU) purgeable memory extent is
initially too large to be purged, but that memory is reused between
attempts and one or more of the next LRU purgeable memory extents are
small enough to be purged.  In practice this is an arbitrary behavior
change that is within the set of acceptable behaviors.

As for the purging fix, assure that arena->decay.ndirty is recorded
*after* the epoch advance and associated purging occurs.  Prior to this
fix, it was possible for purging during epoch advance to cause a
substantially underrepresentative (arena->ndirty - arena->decay.ndirty),
i.e. the number of dirty pages attributed to the current epoch was too
low, and a series of unintended purges could result.  This fix is also
relevant in the context of the simplification described above, but the
bug's impact would be limited to over-purging at epoch advances.
2016-10-11 15:30:01 -07:00
Jason Evans
5f11fb7d43 Do not advance decay epoch when time goes backwards.
Instead, move the epoch backward in time.  Additionally, add
nstime_monotonic() and use it in debug builds to assert that time only
goes backward if nstime_update() is using a non-monotonic time source.
2016-10-10 22:15:10 -07:00
Jason Evans
ee0c74b77a Refactor arena->decay_* into arena->decay.* (arena_decay_t). 2016-10-10 20:32:19 -07:00
Jason Evans
e0164bc63c Refine nstime_update().
Add missing #include <time.h>.  The critical time facilities appear to
have been transitively included via unistd.h and sys/time.h, but in
principle this omission was capable of having caused
clock_gettime(CLOCK_MONOTONIC, ...) to have been overlooked in favor of
gettimeofday(), which in turn could cause spurious non-monotonic time
updates.

Refactor nstime_get() out of nstime_update() and add configure tests for
all variants.

Add CLOCK_MONOTONIC_RAW support (Linux-specific) and
mach_absolute_time() support (OS X-specific).

Do not fall back to clock_gettime(CLOCK_REALTIME, ...).  This was a
fragile Linux-specific workaround, which we're unlikely to use at all
now that clock_gettime(CLOCK_MONOTONIC_RAW, ...) is supported, and if we
have no choice besides non-monotonic clocks, gettimeofday() is only
incrementally worse.
2016-10-10 10:33:59 -07:00
Jason Evans
871a9498e1 Fix size class overflow bugs.
Avoid calling s2u() on raw extent sizes in extent_recycle().

Clamp psz2ind() (implemented as psz2ind_clamp()) when inserting/removing
into/from size-segregated extent heaps.
2016-10-03 14:18:55 -07:00
Jason Evans
42e79c58a0 Update extent hook function prototype comments. 2016-09-29 09:49:19 -07:00
Eric Le Bihan
df0d273a07 Fix LG_QUANTUM definition for sparc64
GCC 4.9.3 cross-compiled for sparc64 defines __sparc_v9__, not
__sparc64__ nor __sparcv9. This prevents LG_QUANTUM from being defined
properly. Adding this new value to the check solves the issue.
2016-09-26 15:13:07 -07:00
Jason Evans
61f467e16a Avoid self assignment in tsd_set(). 2016-09-23 12:21:34 -07:00
Jason Evans
0222fb41d1 Add various mutex ownership assertions. 2016-09-23 12:21:34 -07:00
Jason Evans
73868b60f2 Fix extent_{before,last,past}() to return page-aligned results. 2016-09-23 12:21:34 -07:00
Jason Evans
f6d01ff4b7 Protect extents_dirty access with extents_mtx.
This fixes race conditions during purging.
2016-09-22 11:57:28 -07:00
Josh Gao
17c4b8de5f Fix -Wundef in _MSC_VER check. 2016-09-15 14:33:28 -07:00
Elliot Ronaghan
1167e9eff3 Check for __builtin_unreachable at configure time
Add a configure check for __builtin_unreachable instead of basing its
availability on the __GNUC__ version. On OS X using gcc (a real gcc, not the
bundled version that's just a gcc front-end) leads to a linker assertion:

    https://github.com/jemalloc/jemalloc/issues/266

It turns out that this is caused by a gcc bug resulting from the use of
__builtin_unreachable():

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57438

To work around this bug, check that __builtin_unreachable() actually works at
configure time, and if it doesn't use abort() instead. The check is based on
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57438#c21.

With this `make check` passes with a homebrew installed gcc-5 and gcc-6.
2016-07-07 13:28:44 -07:00
Mike Hommey
c2942e2c0e Define 64-bits atomics unconditionally
They are used on all platforms in prng.h.
2016-06-09 23:17:39 +09:00
Mike Hommey
0dad5b7719 Fix extent_*_get to build with MSVC 2016-06-09 22:00:18 +09:00
Elliot Ronaghan
8a1a794b0c Don't use compact red-black trees with the pgi compiler
Some bug (either in the red-black tree code, or in the pgi compiler) seems to
cause red-black trees to become unbalanced. This issue seems to go away if we
don't use compact red-black trees. Since red-black trees don't seem to be used
much anymore, I opted for what seems to be an easy fix here instead of digging
in and trying to find the root cause of the bug.

Some context in case it's helpful:

I experienced a ton of segfaults while using pgi as Chapel's target compiler
with jemalloc 4.0.4. The little bit of debugging I did pointed me somewhere
deep in red-black tree manipulation, but I didn't get a chance to investigate
further. It looks like 4.2.0 replaced most uses of red-black trees with
pairing-heaps, which seems to avoid whatever bug I was hitting.

However, `make check_unit` was still failing on the rb test, so I figured the
core issue was just being masked. Here's the `make check_unit` failure:

```sh
=== test/unit/rb ===
test_rb_empty: pass
tree_recurse:test/unit/rb.c:90: Failed assertion: (((_Bool) (((uintptr_t) (left_node)->link.rbn_right_red) & ((size_t)1)))) == (false) --> true != false: Node should be black
test_rb_random:test/unit/rb.c:274: Failed assertion: (imbalances) == (0) --> 1 != 0: Tree is unbalanced
tree_recurse:test/unit/rb.c:90: Failed assertion: (((_Bool) (((uintptr_t) (left_node)->link.rbn_right_red) & ((size_t)1)))) == (false) --> true != false: Node should be black
test_rb_random:test/unit/rb.c:274: Failed assertion: (imbalances) == (0) --> 1 != 0: Tree is unbalanced
node_remove:test/unit/rb.c:190: Failed assertion: (imbalances) == (0) --> 2 != 0: Tree is unbalanced
<jemalloc>: test/unit/rb.c:43: Failed assertion: "pathp[-1].cmp < 0"
test/test.sh: line 22: 12926 Aborted
Test harness error
```

While starting to debug I saw the RB_COMPACT option and decided to check if
turning that off resolved the bug. It seems to have fixed it (`make check_unit`
passes and the segfaults under Chapel are gone) so it seems like on okay
work-around. I'd imagine this has performance implications for red-black trees
under pgi, but if they're not going to be used much anymore it's probably not a
big deal.
2016-06-08 14:48:55 -07:00
Jason Evans
dd752c1ffd Fix potential VM map fragmentation regression.
Revert 245ae6036c (Support --with-lg-page
values larger than actual page size.), because it could cause VM map
fragmentation if the kernel grows mmap()ed memory downward.

This resolves #391.
2016-06-07 14:15:49 -07:00
Jason Evans
4e910fc958 Fix extent_alloc_dss() regressions.
Page-align the gap, if any, and add/use extent_dalloc_gap(), which
registers the gap extent before deallocation.
2016-06-05 21:00:02 -07:00
Jason Evans
04942c3d90 Remove a stray memset(), and fix a junk filling test regression. 2016-06-05 21:00:02 -07:00
Jason Evans
f8f0542194 Modify extent hook functions to take an (extent_t *) argument.
This facilitates the application accessing its own extent allocator
metadata during hook invocations.

This resolves #259.
2016-06-05 21:00:02 -07:00
Jason Evans
6f29a83924 Add rtree lookup path caching.
rtree-based extent lookups remain more expensive than chunk-based run
lookups, but with this optimization the fast path slowdown is ~3 CPU
cycles per metadata lookup (on Intel Core i7-4980HQ), versus ~11 cycles
prior.  The path caching speedup tends to degrade gracefully unless
allocated memory is spread far apart (as is the case when using a
mixture of sbrk() and mmap()).
2016-06-05 20:59:57 -07:00
Jason Evans
7be2ebc23f Make tsd cleanup functions optional, remove noop cleanup functions. 2016-06-05 20:42:24 -07:00
Jason Evans
b14fdaaca0 Add a missing prof_alloc_rollback() call.
In the case where prof_alloc_prep() is called with an over-estimate of
allocation size, and sampling doesn't end up being triggered, the tctx
must be discarded.
2016-06-05 20:42:24 -07:00
Jason Evans
c8c3cbdf47 Miscellaneous s/chunk/extent/ updates. 2016-06-05 20:42:24 -07:00
Jason Evans
a43db1c608 Relax NBINS constraint (max 255 --> max 256). 2016-06-05 20:42:24 -07:00
Jason Evans
751f2c332d Remove obsolete stats.arenas.<i>.metadata.mapped mallctl.
Rename stats.arenas.<i>.metadata.allocated mallctl to
stats.arenas.<i>.metadata .
2016-06-05 20:42:24 -07:00
Jason Evans
03eea4fb8b Better document --enable-ivsalloc. 2016-06-05 20:42:24 -07:00
Jason Evans
22588dda6e Rename most remaining *chunk* APIs to *extent*. 2016-06-05 20:42:23 -07:00
Jason Evans
0c4932eb1e s/chunk_lookup/extent_lookup/g, s/chunks_rtree/extents_rtree/g 2016-06-05 20:42:23 -07:00
Jason Evans
4a55daa363 s/CHUNK_HOOKS_INITIALIZER/EXTENT_HOOKS_INITIALIZER/g 2016-06-05 20:42:23 -07:00
Jason Evans
c9a76481d8 Rename chunks_{cached,retained,mtx} to extents_{cached,retained,mtx}. 2016-06-05 20:42:23 -07:00
Jason Evans
127026ad98 Rename chunk_*_t hooks to extent_*_t. 2016-06-05 20:42:23 -07:00
Jason Evans
9c305c9e5c s/chunk_hook/extent_hook/g 2016-06-05 20:42:23 -07:00
Jason Evans
7d63fed0fd Rename huge to large. 2016-06-05 20:42:23 -07:00
Jason Evans
714d1640f3 Update private symbols. 2016-06-05 20:42:23 -07:00
Jason Evans
498856f44a Move slabs out of chunks. 2016-06-05 20:42:23 -07:00
Jason Evans
d28e5a6696 Improve interval-based profile dump triggering.
When an allocation is large enough to trigger multiple dumps, use
modular math rather than subtraction to reset the interval counter.
Prior to this change, it was possible for a single allocation to cause
many subsequent allocations to all trigger profile dumps.

When updating usable size for a sampled object, try to cancel out
the difference between LARGE_MINCLASS and usable size from the interval
counter.
2016-06-05 20:42:23 -07:00
Jason Evans
ed2c2427a7 Use huge size class infrastructure for large size classes. 2016-06-05 20:42:18 -07:00
Jason Evans
b46261d58b Implement cache-oblivious support for huge size classes. 2016-06-03 12:27:41 -07:00
Jason Evans
4731cd47f7 Allow chunks to not be naturally aligned.
Precisely size extents for huge size classes that aren't multiples of
chunksize.
2016-06-03 12:27:41 -07:00
Jason Evans
741967e79d Remove CHUNK_ADDR2BASE() and CHUNK_ADDR2OFFSET(). 2016-06-03 12:27:41 -07:00
Jason Evans
23c52c895f Make extent_prof_tctx_[gs]et() atomic. 2016-06-03 12:27:41 -07:00
Jason Evans
760bf11b23 Add extent_dirty_[gs]et(). 2016-06-03 12:27:41 -07:00
Jason Evans
47613afc34 Convert rtree from per chunk to per page.
Refactor [de]registration to maintain interior rtree entries for slabs.
2016-06-03 12:27:41 -07:00
Jason Evans
5c6be2bdd3 Refactor chunk_purge_wrapper() to take extent argument. 2016-06-03 12:27:41 -07:00
Jason Evans
0eb6f08959 Refactor chunk_[de]commit_wrapper() to take extent arguments. 2016-06-03 12:27:41 -07:00
Jason Evans
6c94470822 Refactor chunk_dalloc_{cache,wrapper}() to take extent arguments.
Rename arena_extent_[d]alloc() to extent_[d]alloc().

Move all chunk [de]registration responsibility into chunk.c.
2016-06-03 12:27:41 -07:00
Jason Evans
de0305a7f3 Add/use chunk_split_wrapper().
Remove redundant ptr/oldsize args from huge_*().

Refactor huge/chunk/arena code boundaries.
2016-06-03 12:27:41 -07:00
Jason Evans
1ad060584f Add/use chunk_merge_wrapper(). 2016-06-03 12:27:41 -07:00
Jason Evans
384e88f451 Add/use chunk_commit_wrapper(). 2016-06-03 12:27:41 -07:00
Jason Evans
56e0031d7d Add/use chunk_decommit_wrapper(). 2016-06-03 12:27:41 -07:00
Jason Evans
4d2d9cec5a Merge chunk_alloc_base() into its only caller. 2016-06-03 12:27:41 -07:00
Jason Evans
fc0372a15e Replace extent_tree_szad_* with extent_heap_*. 2016-06-03 12:27:41 -07:00
Jason Evans
ffa45a5331 Use rtree rather than [sz]ad trees for chunk split/coalesce operations. 2016-06-03 12:27:41 -07:00
Jason Evans
93e79c5c3f Remove redundant chunk argument from chunk_{,de,re}register(). 2016-06-03 12:27:41 -07:00
Jason Evans
9aea58d9a2 Add extent_past_get(). 2016-06-03 12:27:41 -07:00
Jason Evans
d78846c989 Replace extent_achunk_[gs]et() with extent_slab_[gs]et(). 2016-06-03 12:27:41 -07:00
Jason Evans
fae8344098 Add extent_active_[gs]et().
Always initialize extents' runs_dirty and chunks_cache linkage.
2016-06-03 12:27:41 -07:00
Jason Evans
6f71844659 Move *PAGE* definitions to pages.h. 2016-06-03 12:27:41 -07:00
Jason Evans
e75e9be130 Add rtree element witnesses. 2016-06-03 12:27:41 -07:00
Jason Evans
8c9be3e837 Refactor rtree to always use base_alloc() for node allocation. 2016-06-03 12:27:41 -07:00
Jason Evans
db72272bef Use rtree-based chunk lookups rather than pointer bit twiddling.
Look up chunk metadata via the radix tree, rather than using
CHUNK_ADDR2BASE().

Propagate pointer's containing extent.

Minimize extent lookups by doing a single lookup (e.g. in free()) and
propagating the pointer's extent into nearly all the functions that may
need it.
2016-06-03 12:27:41 -07:00
Jason Evans
2d2b4e98c9 Add element acquire/release capabilities to rtree.
This makes it possible to acquire short-term "ownership" of rtree
elements so that it is possible to read an extent pointer *and* read the
extent's contents with a guarantee that the element will not be modified
until the ownership is released.  This is intended as a mechanism for
resolving rtree read/write races rather than as a way to lock extents.
2016-06-03 12:27:33 -07:00
Jason Evans
a7a6f5bc96 Rename extent_node_t to extent_t. 2016-05-16 12:21:28 -07:00
Jason Evans
3aea827f5e Simplify run quantization. 2016-05-16 12:21:27 -07:00
Jason Evans
7bb00ae9d6 Refactor runs_avail.
Use pszind_t size classes rather than szind_t size classes, and always
reserve space for NPSIZES elements.  This removes unused heaps that are
not multiples of the page size, and adds (currently) unused heaps for
all huge size classes, with the immediate benefit that the size of
arena_t allocations is constant (no longer dependent on chunk size).
2016-05-16 12:21:21 -07:00
Jason Evans
226c446979 Implement pz2ind(), pind2sz(), and psz2u().
These compute size classes and indices similarly to size2index(),
index2size() and s2u(), respectively, but using the subset of size
classes that are multiples of the page size.  Note that pszind_t and
szind_t are not interchangeable.
2016-05-13 10:31:54 -07:00
Jason Evans
627372b459 Initialize arena_bin_info at compile time rather than at boot time.
This resolves #370.
2016-05-13 10:31:30 -07:00
Jason Evans
b683734b43 Implement BITMAP_INFO_INITIALIZER(nbits).
This allows static initialization of bitmap_info_t structures.
2016-05-13 10:27:48 -07:00
Jason Evans
17c021c177 Remove redzone support.
This resolves #369.
2016-05-13 10:27:33 -07:00
Jason Evans
ba5c709517 Remove quarantine support. 2016-05-13 10:25:05 -07:00
Jason Evans
9a8add1510 Remove Valgrind support. 2016-05-13 09:56:18 -07:00
Jason Evans
a397045323 Use TSDN_NULL rather than NULL as appropriate. 2016-05-12 21:07:08 -07:00
Jason Evans
73d3d58dc2 Optimize witness fast path.
Short-circuit commonly called witness functions so that they only
execute in debug builds, and remove equivalent guards from mutex
functions.  This avoids pointless code execution in
witness_assert_lockless(), which is typically called twice per
allocation/deallocation function invocation.

Inline commonly called witness functions so that optimized builds can
completely remove calls as dead code.
2016-05-11 15:38:06 -07:00
Jason Evans
c1e00ef2a6 Resolve bootstrapping issues when embedded in FreeBSD libc.
b2c0d6322d (Add witness, a simple online
locking validator.) caused a broad propagation of tsd throughout the
internal API, but tsd_fetch() was designed to fail prior to tsd
bootstrapping.  Fix this by splitting tsd_t into non-nullable tsd_t and
nullable tsdn_t, and modifying all internal APIs that do not critically
rely on tsd to take nullable pointers.  Furthermore, add the
tsd_booted_get() function so that tsdn_fetch() can probe whether tsd
bootstrapping is complete and return NULL if not.  All dangerous
conversions of nullable pointers are tsdn_tsd() calls that assert-fail
on invalid conversion.
2016-05-10 22:51:33 -07:00
Jason Evans
919e4a0ea9 Add LG_QUANTUM definition for the RISC-V architecture. 2016-05-06 17:15:32 -07:00
Jason Evans
1326010cf4 Update private_symbols.txt. 2016-05-06 14:50:58 -07:00
Jason Evans
3ef51d7f73 Optimize the fast paths of calloc() and [m,d,sd]allocx().
This is a broader application of optimizations to malloc() and free() in
f4a0f32d34 (Fast-path improvement:
reduce # of branches and unnecessary operations.).

This resolves #321.
2016-05-06 14:37:39 -07:00
Jason Evans
c2f970c32b Modify pages_map() to support mapping uncommitted virtual memory.
If the OS overcommits:
- Commit all mappings in pages_map() regardless of whether the caller
  requested committed memory.
- Linux-specific: Specify MAP_NORESERVE to avoid
  unfortunate interactions with heuristic overcommit mode during
  fork(2).

This resolves #193.
2016-05-05 18:56:17 -07:00
Jason Evans
04c3c0f9a0 Add the stats.retained and stats.arenas.<i>.retained statistics.
This resolves #367.
2016-05-03 22:11:35 -07:00
Jason Evans
90827a3f3e Fix huge_palloc() regression.
Split arena_choose() into arena_[i]choose() and use arena_ichoose() for
arena lookup during internal allocation.  This fixes huge_palloc() so
that it always succeeds during extent node allocation.

This regression was introduced by
66cd953514 (Do not allocate metadata via
non-auto arenas, nor tcaches.).
2016-05-03 17:19:15 -07:00
Jason Evans
108c4a11e9 Fix witness/fork() interactions.
Fix witness to clear its list of owned mutexes in the child if
platform-specific malloc_mutex code re-initializes mutexes rather than
unlocking them.
2016-04-26 10:47:22 -07:00
Jason Evans
174c0c3a9c Fix fork()-related lock rank ordering reversals. 2016-04-25 23:16:20 -07:00
Jason Evans
71d94828a2 Fix degenerate mb_write() compilation error.
This resolves #375.
2016-04-22 21:27:17 -07:00
Jason Evans
19ff2cefba Implement the arena.<i>.reset mallctl.
This makes it possible to discard all of an arena's allocations in a
single operation.

This resolves #146.
2016-04-22 15:20:06 -07:00
Jason Evans
66cd953514 Do not allocate metadata via non-auto arenas, nor tcaches.
This assures that all internally allocated metadata come from the
first opt_narenas arenas, i.e. the automatically multiplexed arenas.
2016-04-22 15:19:59 -07:00
Jason Evans
b6e07d2389 Fix malloc_mutex_assert_[not_]owner() for --enable-lazy-lock case. 2016-04-18 15:42:09 -07:00
Jason Evans
ab0cfe01fa Update private_symbols.txt.
Change test-related mangling to simplify symbol filtering.

The following commands can be used to detect missing/obsolete symbol
mangling, with the caveat that the full set of symbols is based on the
union of symbols generated by all configurations, some of which are
platform-specific:

./autogen.sh --enable-debug --enable-prof --enable-lazy-lock
make all tests
nm -a lib/libjemalloc.a src/*.jet.o \
  |grep " [TDBCR] " \
  |awk '{print $3}' \
  |sed -e 's/^\(je_\|jet_\(n_\)\?\)\([a-zA-Z0-9_]*\)/\3/g' \
  |LC_COLLATE=C sort -u \
  |grep -v \
   -e '^\(malloc\|calloc\|posix_memalign\|aligned_alloc\|realloc\|free\)$' \
   -e '^\(m\|r\|x\|s\|d\|sd\|n\)allocx$' \
   -e '^mallctl\(\|nametomib\|bymib\)$' \
   -e '^malloc_\(stats_print\|usable_size\|message\)$' \
   -e '^\(memalign\|valloc\)$' \
   -e '^__\(malloc\|memalign\|realloc\|free\)_hook$' \
   -e '^pthread_create$' \
  > /tmp/private_symbols.txt
2016-04-18 15:23:35 -07:00
Rajat Goel
a0c632c9d5 Update private_symbols.txt
Add 4 missing symbols
2016-04-18 11:54:09 -07:00
Jason Evans
1423ee9016 Fix style nits. 2016-04-17 13:44:59 -07:00
Jason Evans
1b5830178f Fix malloc_mutex_[un]lock() to conditionally check witness.
Also remove tautological cassert(config_debug) calls.
2016-04-17 13:44:59 -07:00
Jason Evans
2288424325 s/MALLOC_MUTEX_RANK_OMIT/WITNESS_RANK_OMIT/
This fixes a compilation error caused by
b2c0d6322d (Add witness, a simple online
locking validator.).

This resolves #375.
2016-04-14 12:18:55 -07:00
Jason Evans
a15841cc7d Fix a compilation error.
Fix a compilation error that occurs if Valgrind is not enabled.  This
regression was caused by b2c0d6322d (Add
witness, a simple online locking validator.).
2016-04-14 02:12:33 -07:00
Jason Evans
b2c0d6322d Add witness, a simple online locking validator.
This resolves #358.
2016-04-14 02:09:28 -07:00
Jason Evans
8413463f3a Fix a style nit. 2016-04-12 23:18:25 -07:00
Jason Evans
667eca2ac2 Simplify RTREE_HEIGHT_MAX definition.
Use 1U rather than ZU(1) in macro definitions, so that the preprocessor
can evaluate the resulting expressions.
2016-04-11 02:35:00 -07:00
Jason Evans
245ae6036c Support --with-lg-page values larger than actual page size.
During over-allocation in preparation for creating aligned mappings,
allocate one more page than necessary if PAGE is the actual page size,
so that trimming still succeeds even if the system returns a mapping
that has less than PAGE alignment.  This allows compiling with e.g. 64
KiB "pages" on systems that actually use 4 KiB pages.

Note that for e.g. --with-lg-page=21, it is also necessary to increase
the chunk size (e.g. --with-malloc-conf=lg_chunk:22) so that there are
at least two "pages" per chunk.  In practice this isn't a particularly
compelling configuration because so much (unusable) virtual memory is
dedicated to chunk headers.
2016-04-11 02:35:00 -07:00
Jason Evans
96aa67aca8 Clean up char vs. uint8_t in junk filling code.
Consistently use uint8_t rather than char for junk filling code.
2016-04-11 02:26:35 -07:00
Jason Evans
c6a2c39404 Refactor/fix ph.
Refactor ph to support configurable comparison functions.  Use a cpp
macro code generation form equivalent to the rb macros so that pairing
heaps can be used for both run heaps and chunk heaps.

Remove per node parent pointers, and instead use leftmost siblings' prev
pointers to track parents.

Fix multi-pass sibling merging to iterate over intermediate results
using a FIFO, rather than a LIFO.  Use this fixed sibling merging
implementation for both merge phases of the auxiliary twopass algorithm
(first merging the aux list, then replacing the root with its merged
children).  This fixes both degenerate merge behavior and the potential
for deep recursion.

This regression was introduced by
6bafa6678f (Pairing heap).

This resolves #371.
2016-04-11 02:15:42 -07:00
Jason Evans
2ee2f1ec57 Reduce differences between alternative bitmap implementations. 2016-04-06 10:38:47 -07:00
Jason Evans
4a8abbb400 Fix bitmap_sfu() regression.
Fix bitmap_sfu() to shift by LG_BITMAP_GROUP_NBITS rather than
hard-coded 6 when using linear (non-USE_TREE) bitmap search.  In
practice this affects only 64-bit systems for which sizeof(long) is not
8 (i.e. Windows), since USE_TREE is defined for 32-bit systems.

This regression was caused by b8823ab026
(Use linear scan for small bitmaps).

This resolves #368.
2016-04-06 10:32:06 -07:00
Chris Peterson
a82070ef5f Add JEMALLOC_ALLOC_JUNK and JEMALLOC_FREE_JUNK macros
Replace hardcoded 0xa5 and 0x5a junk values with JEMALLOC_ALLOC_JUNK and
JEMALLOC_FREE_JUNK macros, respectively.
2016-03-31 11:23:29 -07:00
Jason Evans
ce7c0f999b Fix potential chunk leaks.
Move chunk_dalloc_arena()'s implementation into chunk_dalloc_wrapper(),
so that if the dalloc hook fails, proper decommit/purge/retain cascading
occurs.  This fixes three potential chunk leaks on OOM paths, one during
dss-based chunk allocation, one during chunk header commit (currently
relevant only on Windows), and one during rtree write (e.g. if rtree
node allocation fails).

Merge chunk_purge_arena() into chunk_purge_default() (refactor, no
change to functionality).
2016-03-30 18:36:04 -07:00
Chris Peterson
f3060284c5 Remove unused arenas_extend() function declaration.
The arenas_extend() function was renamed to arenas_init() in commit
8bb3198f72, but its function declaration
was not removed from jemalloc_internal.h.in.
2016-03-26 01:03:24 -07:00
Jason Evans
af3184cac0 Use abort() for fallback implementations of unreachable(). 2016-03-24 01:42:08 -07:00
Jason Evans
61a6dfcd5f Constify various internal arena APIs. 2016-03-23 16:15:42 -07:00
Jason Evans
6a885198c2 Always inline performance-critical rtree operations. 2016-03-23 16:15:42 -07:00
Jason Evans
6c460ad91b Optimize rtree_get().
Specialize fast path to avoid code that cannot execute for dependent
loads.

Manually unroll.
2016-03-22 17:54:35 -07:00
Jason Evans
22af74e106 Refactor out signed/unsigned comparisons. 2016-03-15 09:40:02 -07:00
Jason Evans
824b947be0 Add (size_t) casts to MALLOCX_ALIGN().
Add (size_t) casts to MALLOCX_ALIGN() macros so that passing the integer
constant 0x80000000 does not cause a compiler warning about invalid
shift amount.

This resolves #354.
2016-03-11 10:11:56 -08:00
Rajeev Misra
ca18f2834e typecast address to pointer to byte to avoid unaligned memory access error 2016-03-10 22:49:05 -08:00
Jason Evans
613cdc80f6 Convert arena_bin_t's runs from a tree to a heap. 2016-03-08 13:48:27 -08:00
Dave Watson
4a0dbb5ac8 Use pairing heap for arena->runs_avail
Use pairing heap instead of red black tree in arena runs_avail.  The
extra links are unioned with the bitmap_t, so this change doesn't use
any extra memory.

Canaries show this change to be a 1% cpu win, and 2% latency win.  In
particular, large free()s, and small bin frees are now O(1) (barring
coalescing).

I also tested changing bin->runs to be a pairing heap, but saw a much
smaller win, and it would mean increasing the size of arena_run_s by two
pointers, so I left that as an rb-tree for now.
2016-03-08 13:48:27 -08:00
Jason Evans
f8d80d62a8 Refactor ph_merge_ordered() out of ph_merge(). 2016-03-08 13:48:27 -08:00
Dave Watson
6bafa6678f Pairing heap
Initial implementation of a twopass pairing heap with aux list.
Research papers linked in comments.

Where search/nsearch/last aren't needed, this gives much faster first(),
delete(), and insert().  Insert is O(1), and first/delete don't have to
walk the whole tree.

Also tested rb_old with parent pointers - it was better than the current
rb.h for memory loads, but still much worse than a pairing heap.

An array-based heap would be much faster if everything fits in memory,
but on a cold cache it has many more memory loads for most operations.
2016-03-08 13:46:19 -08:00
Jason Evans
022f6891fa Avoid a potential innocuous compiler warning.
Add a cast to avoid comparing a ssize_t value to a uint64_t value that
is always larger than a 32-bit ssize_t.  This silences an innocuous
compiler warning from e.g. gcc 4.2.1 about the comparison always having
the same result.
2016-03-02 22:45:37 -08:00
Dmitri Smirnov
86478b2998 Remove errno overrides. 2016-02-29 10:45:49 -08:00
Jason Evans
3c07f803aa Fix stats.arenas.<i>.[...] for --disable-stats case.
Add missing stats.arenas.<i>.{dss,lg_dirty_mult,decay_time}
initialization.

Fix stats.arenas.<i>.{pactive,pdirty} to read under the protection of
the arena mutex.
2016-02-27 20:40:13 -08:00
Jason Evans
69acd25a64 Add/alphabetize private symbols. 2016-02-27 15:35:52 -08:00
Jason Evans
40ee9aa957 Fix stats.cactive accounting regression.
Fix stats.cactive accounting to always increase/decrease by multiples of
the chunk size, even for huge size classes that are not multiples of the
chunk size, e.g. {2.5, 3, 3.5, 5, 7} MiB with 2 MiB chunk size.  This
regression was introduced by 155bfa7da1
(Normalize size classes.) and first released in 4.0.0.

This resolves #336.
2016-02-27 15:35:52 -08:00
Jason Evans
20fad3430c Refactor some bitmap cpp logic. 2016-02-26 14:43:39 -08:00
Dave Watson
b8823ab026 Use linear scan for small bitmaps
For small bitmaps, a linear scan of the bitmap is slightly faster than
a tree search - bitmap_t is more compact, and there are fewer writes
since we don't have to propogate state transitions up the tree.
On x86_64 with the current settings, I'm seeing ~.5%-1% CPU improvement
in production canaries with this change.

The old tree code is left since 32bit sizes are much larger (and ffsl
smaller), and maybe the run sizes will change in the future.

This resolves #339.
2016-02-26 14:21:10 -08:00
Jason Evans
01ecdf32d6 Miscellaneous bitmap refactoring. 2016-02-26 14:21:10 -08:00
Jason Evans
42ce80e15a Silence miscellaneous 64-to-32-bit data loss warnings.
This resolves #341.
2016-02-25 20:51:00 -08:00
Jason Evans
0c516a00c4 Make *allocx() size class overflow behavior defined.
Limit supported size and alignment to HUGE_MAXCLASS, which in turn is
now limited to be less than PTRDIFF_MAX.

This resolves #278 and #295.
2016-02-25 15:29:49 -08:00
Jason Evans
767d85061a Refactor arenas array (fixes deadlock).
Refactor the arenas array, which contains pointers to all extant arenas,
such that it starts out as a sparse array of maximum size, and use
double-checked atomics-based reads as the basis for fast and simple
arena_get().  Additionally, reduce arenas_lock's role such that it only
protects against arena initalization races.  These changes remove the
possibility for arena lookups to trigger locking, which resolves at
least one known (fork-related) deadlock.

This resolves #315.
2016-02-24 23:58:10 -08:00
Jason Evans
c7a9a6c86b Attempt mmap-based in-place huge reallocation.
Attempt mmap-based in-place huge reallocation by plumbing new_addr into
chunk_alloc_mmap().  This can dramatically speed up incremental huge
reallocation.

This resolves #335.
2016-02-24 17:23:18 -08:00
Jason Evans
aa63d5d377 Fix ffs_zu() compilation error on MinGW.
This regression was caused by 9f4ee6034c
(Refactor jemalloc_ffs*() into ffs_*().).
2016-02-24 14:01:47 -08:00
Jason Evans
9e1810ca9d Silence miscellaneous 64-to-32-bit data loss warnings. 2016-02-24 13:03:48 -08:00
Jason Evans
1c42a04cc6 Change lg_floor() return type from size_t to unsigned. 2016-02-24 13:03:48 -08:00
Jason Evans
8f683b94a7 Make opt_narenas unsigned rather than size_t. 2016-02-24 13:03:48 -08:00
Jason Evans
603b3bd413 Make nhbins unsigned rather than size_t. 2016-02-24 13:03:48 -08:00
Jason Evans
9f4ee6034c Refactor jemalloc_ffs*() into ffs_*().
Use appropriate versions to resolve 64-to-32-bit data loss warnings.
2016-02-24 13:03:48 -08:00
Dmitri Smirnov
b41a07c31a Fix Windows build issues
This resolves #333.
2016-02-23 18:55:45 -08:00
Jason Evans
ae45142adc Collapse arena_avail_tree_* into arena_run_tree_*.
These tree types converged to become identical, yet they still had
independently generated red-black tree implementations.
2016-02-23 18:27:24 -08:00
Dave Watson
3417a304cc Separate arena_avail trees
Separate run trees by index, replacing the previous quantize logic.
Quantization by index is now performed only on insertion / removal from
the tree, and not on node comparison, saving some cpu.  This also means
we don't have to dereference the miscelm* pointers, saving half of the
memory loads from miscelms/mapbits that have fallen out of cache.  A
linear scan of the indicies appears to be fast enough.

The only cost of this is an extra tree array in each arena.
2016-02-23 18:09:36 -08:00
Dave Watson
2b1fc90b7b Remove rbt_nil
Since this is an intrusive tree, rbt_nil is the whole size of the node
and can be quite large.  For example, miscelm is ~100 bytes.
2016-02-23 18:09:25 -08:00
Jason Evans
0da8ce1e96 Use table lookup for run_quantize_{floor,ceil}().
Reduce run quantization overhead by generating lookup tables during
bootstrapping, and using the tables for all subsequent run quantization.
2016-02-22 16:47:34 -08:00
Jason Evans
a9a4684792 Test run quantization.
Also rename run_quantize_*() to improve clarity.  These tests
demonstrate that run_quantize_ceil() is flawed.
2016-02-22 14:58:05 -08:00
Jason Evans
817d9030a5 Indentation style cleanup. 2016-02-22 10:44:58 -08:00
Jason Evans
9bad079039 Refactor time_* into nstime_*.
Use a single uint64_t in nstime_t to store nanoseconds rather than using
struct timespec.  This reduces fragility around conversions between long
and uint64_t, especially missing casts that only cause problems on
32-bit platforms.
2016-02-21 21:39:05 -08:00
Jason Evans
788d29d397 Fix Windows-specific prof-related compilation portability issues. 2016-02-20 23:46:14 -08:00
Jason Evans
56139dc403 Remove _WIN32-specific struct timespec declaration.
struct timespec is already defined by the system (at least on MinGW).
2016-02-20 23:43:17 -08:00
Jason Evans
ecae12323d Fix overflow in prng_range().
Add jemalloc_ffs64() and use it instead of jemalloc_ffsl() in
prng_range(), since long is not guaranteed to be a 64-bit type.
2016-02-20 23:41:33 -08:00
Jason Evans
aac93f414e Add symbol mangling for prng_[lg_]range(). 2016-02-20 11:26:00 -08:00
rustyx
3c2c5a5071 Fix warning in ipalloc 2016-02-20 10:55:18 -08:00
rustyx
7f283980f0 getpid() fix for Win32 2016-02-20 10:52:53 -08:00
rustyx
46e0b2301c Detect LG_SIZEOF_PTR depending on MSVC platform target 2016-02-20 10:50:24 -08:00
Christopher Ferris
effaf7d40f Fix a typo in the ckh_search() prototype. 2016-02-20 10:26:17 -08:00
Jason Evans
a0aaad1afa Handle unaligned keys in hash().
Reported by Christopher Ferris <cferris@google.com>.
2016-02-20 10:23:48 -08:00
Jason Evans
243f7a0508 Implement decay-based unused dirty page purging.
This is an alternative to the existing ratio-based unused dirty page
purging, and is intended to eventually become the sole purging
mechanism.

Add mallctls:
- opt.purge
- opt.decay_time
- arena.<i>.decay
- arena.<i>.decay_time
- arenas.decay_time
- stats.arenas.<i>.decay_time

This resolves #325.
2016-02-19 20:56:21 -08:00
Jason Evans
8e82af1166 Implement smoothstep table generation.
Check in a generated smootherstep table as smoothstep.h rather than
generating it at configure time, since not all systems (e.g. Windows)
have dc.
2016-02-19 20:56:15 -08:00
Jason Evans
db927b6727 Refactor arenas_cache tsd.
Refactor arenas_cache tsd into arenas_tdata, which is a structure of
type arena_tdata_t.
2016-02-19 20:32:37 -08:00
Jason Evans
578cd16581 Refactor arena_malloc_hard() out of arena_malloc(). 2016-02-19 20:32:32 -08:00
Jason Evans
34676d3369 Refactor prng* from cpp macros into inline functions.
Remove 32-bit variant, convert prng64() to prng_lg_range(), and add
prng_range().
2016-02-19 20:29:06 -08:00
Jason Evans
c87ab25d18 Use ticker for incremental tcache GC. 2016-02-19 20:29:06 -08:00
Jason Evans
9998000b2b Implement ticker.
Implement ticker, which provides a simple API for ticking off some
number of events before indicating that the ticker has hit its limit.
2016-02-19 20:29:06 -08:00
Jason Evans
94451d184b Flesh out time_*() API. 2016-02-19 20:29:06 -08:00
Cameron Evans
e5d5a4a517 Add time_update(). 2016-02-19 20:29:06 -08:00
Jason Evans
f829009929 Add --with-malloc-conf.
Add --with-malloc-conf, which makes it possible to embed a default
options string during configuration.
2016-02-19 20:29:06 -08:00
Jason Evans
ef349f3f94 Fix arena_sdalloc() line wrapping. 2016-02-19 20:29:06 -08:00
Jason Evans
f9e3459f75 Tweak code to allow compilation of concatenated src/*.c sources.
This resolves #294.
2015-11-12 11:06:41 -08:00
Jason Evans
a6ec1c869e Fix a comment. 2015-11-12 10:51:32 -08:00
Qi Wang
f4a0f32d34 Fast-path improvement: reduce # of branches and unnecessary operations.
- Combine multiple runtime branches into a single malloc_slow check.
- Avoid calling arena_choose / size2index / index2size on fast path.
- A few micro optimizations.
2015-11-10 14:28:34 -08:00
Joshua Kahn
e8ab0ab9c0 Add function to destroy tree
ex_destroy iterates over the tree using post-order traversal so nodes
can be removed and processed by the callback function without paying the
cost to rebalance the tree. The destruction process cannot be stopped
once started.
2015-11-09 15:56:18 -08:00
Joshua Kahn
13b4015531 Allow const keys for lookup
Signed-off-by: Steve Dougherty <sdougherty@barracuda.com>

This resolves #281.
2015-11-09 15:48:05 -08:00
Steve Dougherty
bd418ce11e Assert compact color bit is unused
Signed-off-by: Joshua Kahn <jkahn@barracuda.com>

This resolves #280.
2015-11-09 15:44:30 -08:00
Nathan Froyd
566d4c0240 use correct macro definitions for clang-cl
clang-cl, an MSVC-compatible frontend built on top of clang, defined
_MSC_VER *and* supports __attribute__ syntax.  The ordering of the
checks in jemalloc_macros.h.in, however, do the wrong thing for
clang-cl, as we want the Windows-specific macro definitions for
clang-cl.  To support this use case, we reorder the checks so that
_MSC_VER is checked first (which includes clang-cl), and then
JEMALLOC_HAVE_ATTR) is checked.  No functionality change intended.
2015-11-09 15:27:14 -08:00
Jason Evans
a784e411f2 Fix a xallocx(..., MALLOCX_ZERO) bug.
Fix xallocx(..., MALLOCX_ZERO to zero the last full trailing page of
large allocations that have been randomly assigned an offset of 0 when
--enable-cache-oblivious configure option is enabled.  This addresses a
special case missed in d260f442ce (Fix
xallocx(..., MALLOCX_ZERO) bugs.).
2015-09-24 22:21:55 -07:00
Craig Rodrigues
66814c1a52 Fix tsd_boot1() to use explicit 'void' parameter list. 2015-09-20 21:57:32 -07:00
Jason Evans
6d91929e52 Address portability issues on Solaris.
Don't assume Bourne shell is in /bin/sh when running size_classes.sh .

Consider __sparcv9 a synonym for __sparc64__ when defining LG_QUANTUM.

This resolves #275.
2015-09-15 10:42:36 -07:00
Jason Evans
708ed79834 Resolve an unsupported special case in arena_prof_tctx_set().
Add arena_prof_tctx_reset() and use it instead of arena_prof_tctx_set()
when resetting the tctx pointer during reallocation, which happens
whenever an originally sampled reallocated object is not sampled during
reallocation.

This regression was introduced by
594c759f37 (Optimize
arena_prof_tctx_set().)
2015-09-14 23:57:58 -07:00
Jason Evans
ea8d97b897 Fix prof_{malloc,free}_sample_object() call order in prof_realloc().
Fix prof_realloc() to call prof_free_sampled_object() after calling
prof_malloc_sample_object().  Prior to this fix, if tctx and old_tctx
were the same, the tctx could have been prematurely destroyed.
2015-09-14 23:57:52 -07:00
Jason Evans
cec0d63d8b Make one call to prof_active_get_unlocked() per allocation event.
Make one call to prof_active_get_unlocked() per allocation event, and
use the result throughout the relevant functions that handle an
allocation event.  Also add a missing check in prof_realloc().  These
fixes protect allocation events against concurrent prof_active changes.
2015-09-14 23:55:48 -07:00
Jason Evans
676df88e48 Rename arena_maxclass to large_maxclass.
arena_maxclass is no longer an appropriate name, because arenas also
manage huge allocations.
2015-09-11 20:50:20 -07:00
Jason Evans
560a4e1e01 Fix xallocx() bugs.
Fix xallocx() bugs related to the 'extra' parameter when specified as
non-zero.
2015-09-11 20:40:34 -07:00
Jason Evans
a00b10735a Fix "prof.reset" mallctl-related corruption.
Fix heap profiling to distinguish among otherwise identical sample sites
with interposed resets (triggered via the "prof.reset" mallctl).  This
bug could cause data structure corruption that would most likely result
in a segfault.
2015-09-09 23:16:10 -07:00
Jason Evans
b4330b02a8 Fix pointer comparision with undefined behavior.
This didn't cause bad code generation in the one case spot-checked (gcc
4.8.1), but had the potential to to so.  This bug was introduced by
594c759f37 (Optimize
arena_prof_tctx_set().).
2015-09-04 10:31:41 -07:00
Jason Evans
594c759f37 Optimize arena_prof_tctx_set().
Optimize arena_prof_tctx_set() to avoid reading run metadata when
deciding whether it's actually necessary to write.
2015-09-02 14:52:24 -07:00
Jason Evans
5d2e875ac9 Add JEMALLOC_CXX_THROW to the memalign() function prototype.
Add JEMALLOC_CXX_THROW to the memalign() function prototype, in order to
match glibc and avoid compilation errors when including both
jemalloc/jemalloc.h and malloc.h in C++ code.

This change was unintentionally omitted from
ae93d6bf36 (Avoid function prototype
incompatibilities.).
2015-08-26 13:47:20 -07:00
Jason Evans
b5c2a347d7 Silence compiler warnings for unreachable code.
Reported by Ingvar Hagelund.
2015-08-19 23:28:34 -07:00
Jason Evans
d01fd19755 Rename index_t to szind_t to avoid an existing type on Solaris.
This resolves #256.
2015-08-19 15:21:32 -07:00
Jason Evans
5ef33a9f2b Don't bitshift by negative amounts.
Don't bitshift by negative amounts when encoding/decoding run sizes in
chunk header maps.  This affected systems with page sizes greater than 8
KiB.

Reported by Ingvar Hagelund <ingvar@redpill-linpro.com>.
2015-08-19 14:16:30 -07:00
Jason Evans
85ae064e96 Fix a comment. 2015-08-13 14:54:06 -07:00
Jason Evans
fead75fd52 Fix gcc build failure (define __has_builtin). 2015-08-12 16:46:09 -07:00
Jason Evans
7928f62273 Check whether gcc version supports __builtin_unreachable(). 2015-08-12 16:38:39 -07:00
Jason Evans
694d0829c0 Update list of private symbols. 2015-08-12 13:03:43 -07:00
Jason Evans
1f27abc1b1 Refactor arena_mapbits_{small,large}_set() to not preserve unzeroed.
Fix arena_run_split_large_helper() to treat newly committed memory as
zeroed.
2015-08-11 16:45:47 -07:00
Jason Evans
6bdeddb697 Fix build failure.
This regression was introduced by
de249c8679 (Arena chunk decommit cleanups
and fixes.).

This resolves #254.
2015-08-10 23:42:33 -07:00
Jason Evans
45186f0c07 Refactor arena_mapbits unzeroed flag management.
Only set the unzeroed flag when initializing the entire mapbits entry,
rather than mutating just the unzeroed bit.  This simplifies the
possible mapbits state transitions.
2015-08-10 23:03:34 -07:00
Jason Evans
de249c8679 Arena chunk decommit cleanups and fixes.
Decommit arena chunk header during chunk deallocation if the rest of the
chunk is decommitted.
2015-08-10 17:13:59 -07:00
Jason Evans
8fadb1a8c2 Implement chunk hook support for page run commit/decommit.
Cascade from decommit to purge when purging unused dirty pages, so that
it is possible to decommit cleaned memory rather than just purging.  For
non-Windows debug builds, decommit runs rather than purging them, since
this causes access of deallocated runs to segfault.

This resolves #251.
2015-08-07 00:50:58 -07:00
Daniel Micay
67c46a9e53 work around _FORTIFY_SOURCE false positive
In builds with profiling disabled (default), the opt_prof_prefix array
has a one byte length as a micro-optimization. This will cause the usage
of write in the unused profiling code to be statically detected as a
buffer overflow by Bionic's _FORTIFY_SOURCE implementation as it tries
to detect read overflows in addition to write overflows.

This works around the problem by informing the compiler that
not_reached() means code in unreachable in release builds.
2015-08-04 17:09:43 -04:00
Matthijs
c1a6a51e40 MSVC compatibility changes
- Decorate public function with __declspec(allocator) and __declspec(restrict), just like MSVC 1900
- Support JEMALLOC_HAS_RESTRICT by defining the restrict keyword
- Move __declspec(nothrow) between 'void' and '*' so it compiles once more
2015-08-04 09:01:48 -07:00
Jason Evans
b49a334a64 Generalize chunk management hooks.
Add the "arena.<i>.chunk_hooks" mallctl, which replaces and expands on
the "arena.<i>.chunk.{alloc,dalloc,purge}" mallctls.  The chunk hooks
allow control over chunk allocation/deallocation, decommit/commit,
purging, and splitting/merging, such that the application can rely on
jemalloc's internal chunk caching and retaining functionality, yet
implement a variety of chunk management mechanisms and policies.

Merge the chunks_[sz]ad_{mmap,dss} red-black trees into
chunks_[sz]ad_retained.  This slightly reduces how hard jemalloc tries
to honor the dss precedence setting; prior to this change the precedence
setting was also consulted when recycling chunks.

Fix chunk purging.  Don't purge chunks in arena_purge_stashed(); instead
deallocate them in arena_unstash_purged(), so that the dirty memory
linkage remains valid until after the last time it is used.

This resolves #176 and #201.
2015-08-03 21:49:02 -07:00
Jason Evans
d059b9d6a1 Implement support for non-coalescing maps on MinGW.
- Do not reallocate huge objects in place if the number of backing
  chunks would change.
- Do not cache multi-chunk mappings.

This resolves #213.
2015-07-24 18:39:14 -07:00
Jason Evans
87ccb55547 Fix huge_palloc() to handle size rather than usize input.
huge_ralloc() passes a size that may not be precisely a size class, so
make huge_palloc() handle the more general case of a size input rather
than usize.

This regression appears to have been introduced by the addition of
in-place huge reallocation; as such it was never incorporated into a
release.
2015-07-23 17:18:49 -07:00
Jason Evans
4becdf21dc Fix sa2u() regression.
Take large_pad into account when determining whether an aligned
allocation can be satisfied by a large size class.

This regression was introduced by
8a03cf039c (Implement cache index
randomization for large allocations.).
2015-07-23 17:14:11 -07:00
Jason Evans
71cd2f08ff Leave PRI* macros defined after using them to define FMT*.
Macro expansion happens too late for the #undef directives to work as a
mechanism for preventing accidental direct use of the PRI* macros.
2015-07-23 15:50:09 -07:00
Jason Evans
5fae7dc1b3 Fix MinGW-related portability issues.
Create and use FMT* macros that are equivalent to the PRI* macros that
inttypes.h defines.  This allows uniform use of the Unix-specific format
specifiers, e.g. "%zu", as well as avoiding Windows-specific definitions
of e.g. PRIu64.

Add ffs()/ffsl() support for compiling with gcc.

Extract compatibility definitions of ENOENT, EINVAL, EAGAIN, EPERM,
ENOMEM, and ENORANGE into include/msvc_compat/windows_extra.h and
use the file for tests as well as for core jemalloc code.
2015-07-23 13:56:25 -07:00
Jason Evans
e42c309eba Add JEMALLOC_FORMAT_PRINTF().
Replace JEMALLOC_ATTR(format(printf, ...). with
JEMALLOC_FORMAT_PRINTF(), so that configuration feature tests can
omit the attribute if it would cause extraneous compilation warnings.
2015-07-22 15:44:47 -07:00
Jason Evans
00632609df Move JEMALLOC_NOTHROW just after return type.
Only use __declspec(nothrow) in C++ mode.

This resolves #244.
2015-07-21 08:21:13 -07:00
Mike Hommey
50cd636eed Remove JEMALLOC_ALLOC_SIZE annotations on functions not returning pointers
As per gcc documentation:
  The alloc_size attribute is used to tell the compiler that the function
  return value points to memory (...)

This resolves #245.
2015-07-21 09:16:07 +09:00
Dave Rigby
37fd1115c3 Remove extraneous ';' on closing 'extern "C"'
Fixes warning with newer GCCs:

    include/jemalloc/jemalloc.h:229:2: warning: extra ';' [-Wpedantic]
      };
       ^
2015-07-16 11:37:19 +01:00
Jason Evans
5bd879646c Change default chunk size from 256 KiB to 2 MiB.
This change improves interaction with transparent huge pages, e.g.
reduced page faults (at least in the absence of unused dirty page
purging).
2015-07-15 17:15:26 -07:00
Jason Evans
aa2826621e Revert to first-best-fit run/chunk allocation.
This effectively reverts 97c04a9383 (Use
first-fit rather than first-best-fit run/chunk allocation.).  In some
pathological cases, first-fit search dominates allocation time, and it
also tends not to converge as readily on a steady state of memory
layout, since precise allocation order has a bigger effect than for
first-best-fit.
2015-07-15 17:15:19 -07:00
Jason Evans
0b8f0bc0a4 Add configure test for alloc_size attribute. 2015-07-10 16:41:12 -07:00
Jason Evans
ae93d6bf36 Avoid function prototype incompatibilities.
Add various function attributes to the exported functions to give the
compiler more information to work with during optimization, and also
specify throw() when compiling with C++ on Linux, in order to adequately
match what __THROW does in glibc.

This resolves #237.
2015-07-10 16:09:40 -07:00
Jason Evans
dde067264d Fix an integer overflow bug in {size2index,s2u}_compute().
This {bug,regression} was introduced by
155bfa7da1 (Normalize size classes.).

This resolves #241.
2015-07-09 21:36:33 -07:00
Jason Evans
0313607e66 Fix MinGW build warnings.
Conditionally define ENOENT, EINVAL, etc. (was unconditional).

Add/use PRIzu, PRIzd, and PRIzx for use in malloc_printf() calls.  gcc issued
(harmless) warnings since e.g. "%zu" should be "%Iu" on Windows, and the
alternative to this workaround would have been to disable the function
attributes which cause gcc to look for type mismatches in formatted printing
function calls.
2015-07-07 20:10:28 -07:00
Matthijs
a1aaf949a5 Optimizations for Windows
- Set opt_lg_chunk based on run-time OS setting
- Verify LG_PAGE is compatible with run-time OS setting
- When targeting Windows Vista or newer, use SRWLOCK instead of CRITICAL_SECTION
- When targeting Windows Vista or newer, statically initialize init_lock
2015-06-25 22:53:58 +02:00
Jason Evans
241abc601b Fix size class overflow handling when profiling is enabled.
Fix size class overflow handling for malloc(), posix_memalign(),
memalign(), calloc(), and realloc() when profiling is enabled.

Remove an assertion that erroneously caused arena_sdalloc() to fail when
profiling was enabled.

This resolves #232.
2015-06-23 18:56:14 -07:00
Jason Evans
0a9f9a4d51 Convert arena_maybe_purge() recursion to iteration.
This resolves #235.
2015-06-22 18:50:58 -07:00
Jason Evans
713b844bff Update a comment. 2015-06-15 12:01:05 -07:00
Chi-hung Hsieh
c073f8167a Fix type errors in C11 versions of atomic_*() functions. 2015-05-27 20:33:18 -07:00
Jason Evans
836bbe9951 Impose a minimum tcache count for small size classes.
Now that small allocation runs have fewer regions due to run metadata
residing in chunk headers, an explicit minimum tcache count is needed to
make sure that tcache adequately amortizes synchronization overhead.
2015-05-19 17:47:16 -07:00
Jason Evans
6591ff09d8 Fix arena_dalloc() performance regression.
Take into account large_pad when computing whether to pass the
deallocation request to tcache_dalloc_large(), so that the largest
cacheable size makes it back to tcache.  This regression was introduced
by 8a03cf039c (Implement cache index
randomization for large allocations.).
2015-05-19 17:44:45 -07:00
Jason Evans
fd5f9e43c3 Avoid atomic operations for dependent rtree reads. 2015-05-15 17:02:30 -07:00
Jason Evans
c451831264 Fix type punning in calls to atomic operation functions. 2015-05-07 22:35:40 -07:00
Jason Evans
8a03cf039c Implement cache index randomization for large allocations.
Extract szad size quantization into {extent,run}_quantize(), and .
quantize szad run sizes to the union of valid small region run sizes and
large run sizes.

Refactor iteration in arena_run_first_fit() to use
run_quantize{,_first,_next(), and add support for padded large runs.

For large allocations that have no specified alignment constraints,
compute a pseudo-random offset from the beginning of the first backing
page that is a multiple of the cache line size.  Under typical
configurations with 4-KiB pages and 64-byte cache lines this results in
a uniform distribution among 64 page boundary offsets.

Add the --disable-cache-oblivious option, primarily intended for
performance testing.

This resolves #13.
2015-05-06 13:27:39 -07:00
Jason Evans
562d266511 Add the "stats.arenas.<i>.lg_dirty_mult" mallctl. 2015-03-24 16:41:38 -07:00
Jason Evans
4acd75a694 Add the "stats.allocated" mallctl. 2015-03-23 17:26:53 -07:00
Igor Podlesny
8ad6bf360f Fix indentation inconsistencies. 2015-03-22 00:09:04 -07:00
Jason Evans
e0a08a1496 Restore --enable-ivsalloc.
However, unlike before it was removed do not force --enable-ivsalloc
when Darwin zone allocator integration is enabled, since the zone
allocator code uses ivsalloc() regardless of whether
malloc_usable_size() and sallocx() do.

This resolves #211.
2015-03-18 21:06:58 -07:00
Jason Evans
8d6a3e8321 Implement dynamic per arena control over dirty page purging.
Add mallctls:
- arenas.lg_dirty_mult is initialized via opt.lg_dirty_mult, and can be
  modified to change the initial lg_dirty_mult setting for newly created
  arenas.
- arena.<i>.lg_dirty_mult controls an individual arena's dirty page
  purging threshold, and synchronously triggers any purging that may be
  necessary to maintain the constraint.
- arena.<i>.chunk.purge allows the per arena dirty page purging function
  to be replaced.

This resolves #93.
2015-03-18 18:55:33 -07:00
Mike Hommey
c9db461ffb Use InterlockedCompareExchange instead of non-existing InterlockedCompareExchange32 2015-03-17 12:09:30 +09:00
Jason Evans
04211e2266 Fix heap profiling regressions.
Remove the prof_tctx_state_destroying transitory state and instead add
the tctx_uid field, so that the tuple <thr_uid, tctx_uid> uniquely
identifies a tctx.  This assures that tctx's are well ordered even when
more than two with the same thr_uid coexist.  A previous attempted fix
based on prof_tctx_state_destroying was only sufficient for protecting
against two coexisting tctx's, but it also introduced a new dumping
race.

These regressions were introduced by
602c8e0971 (Implement per thread heap
profiling.) and 764b00023f (Fix a heap
profiling regression.).
2015-03-16 15:11:06 -07:00
Jason Evans
764b00023f Fix a heap profiling regression.
Add the prof_tctx_state_destroying transitionary state to fix a race
between a thread destroying a tctx and another thread creating a new
equivalent tctx.

This regression was introduced by
602c8e0971 (Implement per thread heap
profiling.).
2015-03-14 14:01:35 -07:00
Jason Evans
fbd8d773ad Fix unsigned comparison underflow.
These bugs only affected tests and debug builds.
2015-03-11 23:14:50 -07:00
Jason Evans
f5c8f37259 Normalize rdelm/rd structure field naming. 2015-03-10 18:29:49 -07:00
Jason Evans
38e42d311c Refactor dirty run linkage to reduce sizeof(extent_node_t). 2015-03-10 18:15:40 -07:00
Jason Evans
97c04a9383 Use first-fit rather than first-best-fit run/chunk allocation.
This tends to more effectively pack active memory toward low addresses.
However, additional tree searches are required in many cases, so whether
this change stands the test of time will depend on real-world
benchmarks.
2015-03-06 20:21:41 -08:00
Jason Evans
f044bb219e Change default chunk size from 4 MiB to 256 KiB.
Recent changes have improved huge allocation scalability, which removes
upward pressure to set the chunk size so large that huge allocations are
rare.  Smaller chunks are more likely to completely drain, so set the
default to the smallest size that doesn't leave excessive unusable
trailing space in chunk headers.
2015-03-06 20:18:34 -08:00
Mike Hommey
4d871f73af Preserve LastError when calling TlsGetValue
TlsGetValue has a semantic difference with pthread_getspecific, in that it
can return a non-error NULL value, so it always sets the LastError.
But allocator callers may not be expecting calling e.g. free() to change
the value of the last error, so preserve it.
2015-03-04 09:50:33 -08:00
Mike Hommey
7c46fd59cc Make --without-export actually work
9906660 added a --without-export configure option to avoid exporting
jemalloc symbols, but the option didn't actually work.
2015-03-04 21:49:15 +09:00
Jason Evans
99bd94fb65 Fix chunk cache races.
These regressions were introduced by
ee41ad409a (Integrate whole chunks into
unused dirty page purging machinery.).
2015-02-18 16:40:53 -08:00
Jason Evans
738e089a2e Rename "dirty chunks" to "cached chunks".
Rename "dirty chunks" to "cached chunks", in order to avoid overloading
the term "dirty".

Fix the regression caused by 339c2b23b2
(Fix chunk_unmap() to propagate dirty state.), and actually address what
that change attempted, which is to only purge chunks once, and propagate
whether zeroed pages resulted into chunk_record().
2015-02-18 01:15:50 -08:00
Jason Evans
339c2b23b2 Fix chunk_unmap() to propagate dirty state.
Fix chunk_unmap() to propagate whether a chunk is dirty, and modify
dirty chunk purging to record this information so it can be passed to
chunk_unmap().  Since the broken version of chunk_unmap() claimed that
all chunks were clean, this resulted in potential memory corruption for
purging implementations that do not zero (e.g. MADV_FREE).

This regression was introduced by
ee41ad409a (Integrate whole chunks into
unused dirty page purging machinery.).
2015-02-17 22:25:56 -08:00
Jason Evans
47701b22ee arena_chunk_dirty_node_init() --> extent_node_dirty_linkage_init() 2015-02-17 22:23:10 -08:00
Jason Evans
eafebfdfbe Remove obsolete type arena_chunk_miscelms_t. 2015-02-17 16:12:31 -08:00
Jason Evans
a4e1888d1a Simplify extent_node_t and add extent_node_init(). 2015-02-17 15:13:52 -08:00
Jason Evans
ee41ad409a Integrate whole chunks into unused dirty page purging machinery.
Extend per arena unused dirty page purging to manage unused dirty chunks
in aaddtion to unused dirty runs.  Rather than immediately unmapping
deallocated chunks (or purging them in the --disable-munmap case), store
them in a separate set of trees, chunks_[sz]ad_dirty.  Preferrentially
allocate dirty chunks.  When excessive unused dirty pages accumulate,
purge runs and chunks in ingegrated LRU order (and unmap chunks in the
--enable-munmap case).

Refactor extent_node_t to provide accessor functions.
2015-02-16 21:02:17 -08:00
Jason Evans
40ab8f98e4 Remove more obsolete (incorrect) assertions.
This regression was introduced by
88fef7ceda (Refactor huge_*() calls into
arena internals.), and went undetected because of the --enable-debug
regression.
2015-02-15 20:26:45 -08:00
Jason Evans
cb9b44914e Remove obsolete (incorrect) assertions.
This regression was introduced by
88fef7ceda (Refactor huge_*() calls into
arena internals.), and went undetected because of the --enable-debug
regression.
2015-02-15 20:13:28 -08:00
Jason Evans
2195ba4e1f Normalize *_link and link_* fields to all be *_link. 2015-02-15 16:43:52 -08:00
Jason Evans
41cfe03f39 If MALLOCX_ARENA(a) is specified, use it during tcache fill. 2015-02-13 15:28:56 -08:00
Jason Evans
5f7140b045 Make prof_tctx accesses atomic.
Although exceedingly unlikely, it appears that writes to the prof_tctx
field of arena_chunk_map_misc_t could be reordered such that a stale
value could be read during deallocation, with profiler metadata
corruption and invalid pointer dereferences being the most likely
effects.
2015-02-12 15:54:53 -08:00
Jason Evans
88fef7ceda Refactor huge_*() calls into arena internals.
Make redirects to the huge_*() API the arena code's responsibility,
since arenas now take responsibility for all allocation sizes.
2015-02-12 14:06:37 -08:00
Jason Evans
cbf3a6d703 Move centralized chunk management into arenas.
Migrate all centralized data structures related to huge allocations and
recyclable chunks into arena_t, so that each arena can manage huge
allocations and recyclable virtual memory completely independently of
other arenas.

Add chunk node caching to arenas, in order to avoid contention on the
base allocator.

Use chunks_rtree to look up huge allocations rather than a red-black
tree.  Maintain a per arena unsorted list of huge allocations (which
will be needed to enumerate huge allocations during arena reset).

Remove the --enable-ivsalloc option, make ivsalloc() always available,
and use it for size queries if --enable-debug is enabled.  The only
practical implications to this removal are that 1) ivsalloc() is now
always available during live debugging (and the underlying radix tree is
available during core-based debugging), and 2) size query validation can
no longer be enabled independent of --enable-debug.

Remove the stats.chunks.{current,total,high} mallctls, and replace their
underlying statistics with simpler atomically updated counters used
exclusively for gdump triggering.  These statistics are no longer very
useful because each arena manages chunks independently, and per arena
statistics provide similar information.

Simplify chunk synchronization code, now that base chunk allocation
cannot cause recursive lock acquisition.
2015-02-12 00:15:56 -08:00
Jason Evans
051eae8cc5 Remove unnecessary xchg* lock prefixes. 2015-02-10 16:05:52 -08:00
Jason Evans
1cb181ed63 Implement explicit tcache support.
Add the MALLOCX_TCACHE() and MALLOCX_TCACHE_NONE macros, which can be
used in conjunction with the *allocx() API.

Add the tcache.create, tcache.flush, and tcache.destroy mallctls.

This resolves #145.
2015-02-09 17:44:48 -08:00
Jason Evans
23694b0745 Fix arena_get() for (!init_if_missing && refresh_if_missing) case.
Fix arena_get() to refresh the cache as needed in the (!init_if_missing
&& refresh_if_missing) case.

This flaw was introduced by the initial arena_get() implementation,
which was part of 8bb3198f72 (Refactor/fix
arenas manipulation.).
2015-02-09 17:43:10 -08:00
Jason Evans
8d0e04d42f Refactor rtree to be lock-free.
Recent huge allocation refactoring associates huge allocations with
arenas, but it remains necessary to quickly look up huge allocation
metadata during reallocation/deallocation.  A global radix tree remains
a good solution to this problem, but locking would have become the
primary bottleneck after (upcoming) migration of chunk management from
global to per arena data structures.

This lock-free implementation uses double-checked reads to traverse the
tree, so that in the steady state, each read or write requires only a
single atomic operation.

This implementation also assures that no more than two tree levels
actually exist, through a combination of careful virtual memory
allocation which makes large sparse nodes cheap, and skipping the root
node on x64 (possible because the top 16 bits are all 0 in practice).
2015-02-04 16:51:53 -08:00
Jason Evans
c810fcea1f Add (x != 0) assertion to lg_floor(x).
lg_floor(0) is undefined, but depending on compiler options may not
cause a crash.  This assertion makes it harder to accidentally abuse
lg_floor().
2015-02-04 16:51:53 -08:00
Jason Evans
f500a10b2e Refactor base_alloc() to guarantee demand-zeroed memory.
Refactor base_alloc() to guarantee that allocations are carved from
demand-zeroed virtual memory.  This supports sparse data structures such
as multi-page radix tree nodes.

Enhance base_alloc() to keep track of fragments which were too small to
support previous allocation requests, and try to consume them during
subsequent requests.  This becomes important when request sizes commonly
approach or exceed the chunk size (as could radix tree node
allocations).
2015-02-04 16:51:53 -08:00
Jason Evans
918a1a5b3f Reduce extent_node_t size to fit in one cache line. 2015-02-04 16:51:53 -08:00
Jason Evans
a55dfa4b0a Implement more atomic operations.
- atomic_*_p().
- atomic_cas_*().
- atomic_write_*().
2015-02-04 16:50:05 -08:00
Jason Evans
f8723572d8 Add missing prototypes for bootstrap_{malloc,calloc,free}(). 2015-02-04 16:50:04 -08:00
Jason Evans
5b8ed5b7c9 Implement the prof.gdump mallctl.
This feature makes it possible to toggle the gdump feature on/off during
program execution, whereas the the opt.prof_dump mallctl value can only
be set during program startup.

This resolves #72.
2015-01-25 21:21:35 -08:00
Jason Evans
4581b97809 Implement metadata statistics.
There are three categories of metadata:

- Base allocations are used for bootstrap-sensitive internal allocator
  data structures.
- Arena chunk headers comprise pages which track the states of the
  non-metadata pages.
- Internal allocations differ from application-originated allocations
  in that they are for internal use, and that they are omitted from heap
  profiles.

The metadata statistics comprise the metadata categories as follows:

- stats.metadata: All metadata -- base + arena chunk headers + internal
  allocations.
- stats.arenas.<i>.metadata.mapped: Arena chunk headers.
- stats.arenas.<i>.metadata.allocated: Internal allocations.  This is
  reported separately from the other metadata statistics because it
  overlaps with the allocated and active statistics, whereas the other
  metadata statistics do not.

Base allocations are not reported separately, though their magnitude can
be computed by subtracting the arena-specific metadata.

This resolves #163.
2015-01-23 23:34:43 -08:00
Jason Evans
10aff3f3e1 Refactor bootstrapping to delay tsd initialization.
Refactor bootstrapping to delay tsd initialization, primarily to support
integration with FreeBSD's libc.

Refactor a0*() for internal-only use, and add the
bootstrap_{malloc,calloc,free}() API for use by FreeBSD's libc.  This
separation limits use of the a0*() functions to metadata allocation,
which doesn't require malloc/calloc/free API compatibility.

This resolves #170.
2015-01-22 14:04:27 -08:00
Abhishek Kulkarni
b617df81bb Add missing symbols to private_symbols.txt.
This resolves #185.
2015-01-21 12:44:35 -08:00
Guilherme Goncalves
51f86346c0 Add a isblank definition for MSVC < 2013 2015-01-09 14:33:46 -08:00
Guilherme Goncalves
2c5cb613df Introduce two new modes of junk filling: "alloc" and "free".
In addition to true/false, opt.junk can now be either "alloc" or "free",
giving applications the possibility of junking memory only on allocation
or deallocation.

This resolves #172.
2014-12-14 17:07:26 -08:00
Daniel Micay
b74041fb6e Ignore MALLOC_CONF in set{uid,gid,cap} binaries.
This eliminates the malloc tunables as tools for an attacker.

Closes #173
2014-12-14 15:36:15 -08:00
Jason Evans
e12eaf93dc Style and spelling fixes. 2014-12-08 16:34:04 -08:00
Chih-hung Hsieh
59cd80e6c6 Add a C11 atomics-based implementation of atomic.h API. 2014-12-06 21:17:49 -08:00
Jason Evans
a18c2b1f15 Style fixes. 2014-12-05 17:49:47 -08:00
Daniel Micay
879e76a9e5 teach the dss chunk allocator to handle new_addr
This provides in-place expansion of huge allocations when the end of the
allocation is at the end of the sbrk heap. There's already the ability
to extend in-place via recycled chunks but this handles the initial
growth of the heap via repeated vector / string reallocations.

A possible future extension could allow realloc to go from the following:

    | huge allocation | recycled chunks |
                                        ^ dss_end

To a larger allocation built from recycled *and* new chunks:

    |                      huge allocation                      |
                                                                ^ dss_end

Doing that would involve teaching the chunk recycling code to request
new chunks to satisfy the request. The chunk_dss code wouldn't require
any further changes.

    #include <stdlib.h>

    int main(void) {
        size_t chunk = 4 * 1024 * 1024;
        void *ptr = NULL;
        for (size_t size = chunk; size < chunk * 128; size *= 2) {
            ptr = realloc(ptr, size);
            if (!ptr) return 1;
        }
    }

dss:secondary: 0.083s
dss:primary: 0.083s

After:

dss:secondary: 0.083s
dss:primary: 0.003s

The dss heap grows in the upwards direction, so the oldest chunks are at
the low addresses and they are used first. Linux prefers to grow the
mmap heap downwards, so the trick will not work in the *current* mmap
chunk allocator as a huge allocation will only be at the top of the heap
in a contrived case.
2014-11-28 16:11:19 -08:00
Guilherme Goncalves
a2136025c4 Remove extra definition of je_tsd_boot on win32. 2014-11-18 19:08:18 -02:00
Jason Evans
9cf2be0a81 Make quarantine_init() static. 2014-11-07 14:50:38 -08:00
Jason Evans
c002a5c800 Fix two quarantine regressions.
Fix quarantine to actually update tsd when expanding, and to avoid
double initialization (leaking the first quarantine) due to recursive
initialization.

This resolves #161.
2014-11-04 18:03:11 -08:00
Jason Evans
d7a9bab92d Fix arena_sdalloc() to use promoted size (second attempt).
Unlike the preceeding attempted fix, this version avoids the potential
for converting an invalid bin index to a size class.
2014-10-31 22:26:24 -07:00
Jason Evans
6da2e9d4f6 Fix arena_sdalloc() to use promoted size. 2014-10-31 17:08:13 -07:00
Jason Evans
cfc5706f69 Miscellaneous cleanups. 2014-10-30 23:18:45 -07:00
Daniel Micay
d33f834591 avoid redundant chunk header reads
* use sized deallocation in iralloct_realign
* iralloc and ixalloc always need the old size, so pass it in from the
  caller where it's often already calculated
2014-10-30 17:06:38 -07:00
Daniel Micay
809b0ac391 mark huge allocations as unlikely
This cleans up the fast path a bit more by moving away more code.
2014-10-30 17:06:38 -07:00
Jason Evans
9b41ac909f Fix huge allocation statistics. 2014-10-14 22:20:00 -07:00
Jason Evans
3c4d92e82a Add per size class huge allocation statistics.
Add per size class huge allocation statistics, and normalize various
stats:
- Change the arenas.nlruns type from size_t to unsigned.
- Add the arenas.nhchunks and arenas.hchunks.<i>.size mallctl's.
- Replace the stats.arenas.<i>.bins.<j>.allocated mallctl with
  stats.arenas.<i>.bins.<j>.curregs .
- Add the stats.arenas.<i>.hchunks.<j>.nmalloc,
  stats.arenas.<i>.hchunks.<j>.ndalloc,
  stats.arenas.<i>.hchunks.<j>.nrequests, and
  stats.arenas.<i>.hchunks.<j>.curhchunks mallctl's.
2014-10-12 23:02:10 -07:00
Jason Evans
44c97b712e Fix a prof_tctx_t/prof_tdata_t cleanup race.
Fix a prof_tctx_t/prof_tdata_t cleanup race by storing a copy of thr_uid
in prof_tctx_t, so that the associated tdata need not be present during
tctx teardown.
2014-10-12 13:03:20 -07:00
Jason Evans
381c23dd9d Remove arena_dalloc_bin_run() clean page preservation.
Remove code in arena_dalloc_bin_run() that preserved the "clean" state
of trailing clean pages by splitting them into a separate run during
deallocation.  This was a useful mechanism for reducing dirty page
churn when bin runs comprised many pages, but bin runs are now quite
small.

Remove the nextind field from arena_run_t now that it is no longer
needed, and change arena_run_t's bin field (arena_bin_t *) to binind
(index_t).  These two changes remove 8 bytes of chunk header overhead
per page, which saves 1/512 of all arena chunk memory.
2014-10-10 23:01:03 -07:00
Jason Evans
81e547566e Add --with-lg-tiny-min, generalize --with-lg-quantum. 2014-10-10 22:35:07 -07:00
Jason Evans
fc0b3b7383 Add configure options.
Add:
  --with-lg-page
  --with-lg-page-sizes
  --with-lg-size-class-group
  --with-lg-quantum

Get rid of STATIC_PAGE_SHIFT, in favor of directly setting LG_PAGE.

Fix various edge conditions exposed by the configure options.
2014-10-09 22:44:37 -07:00
Daniel Micay
f22214a29d Use regular arena allocation for huge tree nodes.
This avoids grabbing the base mutex, as a step towards fine-grained
locking for huge allocations. The thread cache also provides a tiny
(~3%) improvement for serial huge allocations.
2014-10-07 23:57:09 -07:00
Jason Evans
8bb3198f72 Refactor/fix arenas manipulation.
Abstract arenas access to use arena_get() (or a0get() where appropriate)
rather than directly reading e.g. arenas[ind].  Prior to the addition of
the arenas.extend mallctl, the worst possible outcome of directly
accessing arenas was a stale read, but arenas.extend may allocate and
assign a new array to arenas.

Add a tsd-based arenas_cache, which amortizes arenas reads.  This
introduces some subtle bootstrapping issues, with tsd_boot() now being
split into tsd_boot[01]() to support tsd wrapper allocation
bootstrapping, as well as an arenas_cache_bypass tsd variable which
dynamically terminates allocation of arenas_cache itself.

Promote a0malloc(), a0calloc(), and a0free() to be generally useful for
internal allocation, and use them in several places (more may be
appropriate).

Abstract arena->nthreads management and fix a missing decrement during
thread destruction (recent tsd refactoring left arenas_cleanup()
unused).

Change arena_choose() to propagate OOM, and handle OOM in all callers.
This is important for providing consistent allocation behavior when the
MALLOCX_ARENA() flag is being used.  Prior to this fix, it was possible
for an OOM to result in allocation silently allocating from a different
arena than the one specified.
2014-10-07 23:14:57 -07:00
Jason Evans
155bfa7da1 Normalize size classes.
Normalize size classes to use the same number of size classes per size
doubling (currently hard coded to 4), across the intire range of size
classes.  Small size classes already used this spacing, but in order to
support this change, additional small size classes now fill [4 KiB .. 16
KiB).  Large size classes range from [16 KiB .. 4 MiB).  Huge size
classes now support non-multiples of the chunk size in order to fill (4
MiB .. 16 MiB).
2014-10-06 01:45:13 -07:00
Daniel Micay
a95018ee81 Attempt to expand huge allocations in-place.
This adds support for expanding huge allocations in-place by requesting
memory at a specific address from the chunk allocator.

It's currently only implemented for the chunk recycling path, although
in theory it could also be done by optimistically allocating new chunks.
On Linux, it could attempt an in-place mremap. However, that won't work
in practice since the heap is grown downwards and memory is not unmapped
(in a normal build, at least).

Repeated vector reallocation micro-benchmark:

    #include <string.h>
    #include <stdlib.h>

    int main(void) {
        for (size_t i = 0; i < 100; i++) {
            void *ptr = NULL;
            size_t old_size = 0;
            for (size_t size = 4; size < (1 << 30); size *= 2) {
                ptr = realloc(ptr, size);
                if (!ptr) return 1;
                memset(ptr + old_size, 0xff, size - old_size);
                old_size = size;
            }
            free(ptr);
        }
    }

The glibc allocator fails to do any in-place reallocations on this
benchmark once it passes the M_MMAP_THRESHOLD (default 128k) but it
elides the cost of copies via mremap, which is currently not something
that jemalloc can use.

With this improvement, jemalloc still fails to do any in-place huge
reallocations for the first outer loop, but then succeeds 100% of the
time for the remaining 99 iterations. The time spent doing allocations
and copies drops down to under 5%, with nearly all of it spent doing
purging + faulting (when huge pages are disabled) and the array memset.

An improved mremap API (MREMAP_RETAIN - #138) would be far more general
but this is a portable optimization and would still be useful on Linux
for xallocx.

Numbers with transparent huge pages enabled:

glibc (copies elided via MREMAP_MAYMOVE): 8.471s

jemalloc: 17.816s
jemalloc + no-op madvise: 13.236s

jemalloc + this commit: 6.787s
jemalloc + this commit + no-op madvise: 6.144s

Numbers with transparent huge pages disabled:

glibc (copies elided via MREMAP_MAYMOVE): 15.403s

jemalloc: 39.456s
jemalloc + no-op madvise: 12.768s

jemalloc + this commit: 15.534s
jemalloc + this commit + no-op madvise: 6.354s

Closes #137
2014-10-05 14:47:01 -07:00
Jason Evans
e9a3fa2e09 Add missing header includes in jemalloc/jemalloc.h .
Add stdlib.h, stdbool.h, and stdint.h to jemalloc/jemalloc.h so that
applications only have to #include <jemalloc/jemalloc.h>.

This resolves #132.
2014-10-05 12:05:37 -07:00
Jason Evans
16854ebeb7 Don't disable tcache for lazy-lock.
Don't disable tcache when lazy-lock is configured.  There already exists
a mechanism to disable tcache, but doing so automatically due to
lazy-lock causes surprising performance behavior.
2014-10-04 15:00:51 -07:00
Jason Evans
34e85b4182 Make prof-related inline functions always-inline. 2014-10-04 11:26:05 -07:00
Jason Evans
029d44cf8b Fix tsd cleanup regressions.
Fix tsd cleanup regressions that were introduced in
5460aa6f66 (Convert all tsd variables to
reside in a single tsd structure.).  These regressions were twofold:

1) tsd_tryget() should never (and need never) return NULL.  Rename it to
   tsd_fetch() and simplify all callers.
2) tsd_*_set() must only be called when tsd is in the nominal state,
   because cleanup happens during the nominal-->purgatory transition,
   and re-initialization must not happen while in the purgatory state.
   Add tsd_nominal() and use it as needed.  Note that tsd_*{p,}_get()
   can still be used as long as no re-initialization that would require
   cleanup occurs.  This means that e.g. the thread_allocated counter
   can be updated unconditionally.
2014-10-04 11:22:55 -07:00
Jason Evans
fc12c0b8bc Implement/test/fix prof-related mallctl's.
Implement/test/fix the opt.prof_thread_active_init,
prof.thread_active_init, and thread.prof.active mallctl's.

Test/fix the thread.prof.name mallctl.

Refactor opt_prof_active to be read-only and move mutable state into the
prof_active variable.  Stop leaning on ctl-related locking for
protection.
2014-10-03 23:25:30 -07:00
Jason Evans
551ebc4364 Convert to uniform style: cond == false --> !cond 2014-10-03 10:16:09 -07:00
Jason Evans
20c31deaae Test prof.reset mallctl and fix numerous discovered bugs. 2014-10-02 23:01:10 -07:00
Eric Wong
4dcf04bfc0 correctly detect adaptive mutexes in pthreads
PTHREAD_MUTEX_ADAPTIVE_NP is an enum on glibc and not a macro,
we must test for their existence by attempting compilation.
2014-09-29 16:10:40 -07:00
Jason Evans
5d9732f2cf Merge pull request #129 from daverigby/msvc_lg_floor
Use MSVC intrinsics for lg_floor
2014-09-29 15:15:31 -07:00
Jason Evans
0c5dd03e88 Move small run metadata into the arena chunk header.
Move small run metadata into the arena chunk header, with multiple
expected benefits:
- Lower run fragmentation due to reduced run sizes; runs are more likely
  to completely drain when there are fewer total regions.
- Improved cache behavior.  Prior to this change, run headers were
  always page-aligned, which put extra pressure on some CPU cache sets.
  The degree to which this was a problem was hardware dependent, but it
  likely hurt some even for the most advanced modern hardware.
- Buffer overruns/underruns are less likely to corrupt allocator
  metadata.
- Size classes between 4 KiB and 16 KiB become reasonable to support
  without any special handling, and the runs are small enough that dirty
  unused pages aren't a significant concern.
2014-09-29 01:31:39 -07:00
Jason Evans
f97e5ac4ec Implement compile-time bitmap size computation. 2014-09-28 14:43:11 -07:00
Jason Evans
6ef80d68f0 Fix profile dumping race.
Fix a race that caused a non-critical assertion failure.  To trigger the
race, a thread had to be part way through initializing a new sample,
such that it was discoverable by the dumping thread, but not yet linked
into its gctx by the time a later dump phase would normally have reset
its state to 'nominal'.

Additionally, lock access to the state field during modification to
transition to the dumping state.  It's not apparent that this oversight
could have caused an actual problem due to outer locking that protects
the dumping machinery, but the added locking pedantically follows the
stated locking protocol for the state field.
2014-09-24 22:23:43 -07:00