Commit Graph

760 Commits

Author SHA1 Message Date
David Goldblatt
31b43219db Header refactoring: size_classes module - remove from the catchall 2017-04-24 10:33:21 -07:00
David Goldblatt
68da2361d2 Header refactoring: ckh module - remove from the catchall and unify. 2017-04-24 10:33:21 -07:00
David Goldblatt
bf2dc7e678 Header refactoring: ticker module - remove from the catchall and unify. 2017-04-24 10:33:21 -07:00
David Goldblatt
fa3ad730c4 Header refactoring: prng module - remove from the catchall and unify. 2017-04-24 10:33:21 -07:00
David Goldblatt
4d2e4bf5eb Get rid of most of the various inline macros. 2017-04-24 10:33:21 -07:00
David Goldblatt
425253e2cd Enable -Wundef, when supported.
This can catch bugs in which one header defines a numeric constant and another
uses it without including the defining header.  Undefined preprocessor symbols
expand to '0', so such code compiles fine while silently doing the math wrong.
2017-04-21 17:03:56 -07:00
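A minimal illustration of the bug class this flag catches (the macro names are hypothetical, not jemalloc's): the file below forgets to include the header that defines CHUNK_SHIFT, so the preprocessor silently evaluates it as 0.

  #include <stdio.h>

  #if CHUNK_SHIFT > 12          /* -Wundef warns here: CHUNK_SHIFT is not defined */
  #  define REGION_SIZE (1u << CHUNK_SHIFT)
  #else
  #  define REGION_SIZE 4096u   /* silently-chosen fallback */
  #endif

  int
  main(void) {
      printf("REGION_SIZE = %u\n", REGION_SIZE);  /* prints 4096, not the intended value */
      return 0;
  }
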
Jason Evans
3823effe12 Remove --enable-ivsalloc.
Continue to use ivsalloc() when --enable-debug is specified (and add
assertions to guard against 0 size), but stop providing a documented
explicit semantics-changing band-aid to dodge undefined behavior in
sallocx() and malloc_usable_size().  ivsalloc() remains compiled in,
unlike when #211 restored --enable-ivsalloc, and if
JEMALLOC_FORCE_IVSALLOC is defined during compilation, sallocx() and
malloc_usable_size() will still use ivsalloc().

This partially resolves #580.
2017-04-21 14:34:35 -07:00
Jim Chen
ae248a2160 Use openat syscall if available
Some architectures like AArch64 may not have the open syscall because it
was superseded by the openat syscall, so check and use SYS_openat if
SYS_open is not available.

Additionally, Android headers for AArch64 define SYS_open to __NR_open,
even though __NR_open is undefined. Undefine SYS_open in that case so
SYS_openat is used.
2017-04-21 10:58:42 -07:00
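A sketch of the fallback described above; raw_open_readonly is an illustrative helper, not jemalloc's actual code, and assumes a glibc/bionic-style syscall(2) interface.

  #define _GNU_SOURCE
  #include <fcntl.h>        /* AT_FDCWD, O_RDONLY, open() */
  #include <sys/syscall.h>  /* SYS_open / SYS_openat, where available */
  #include <unistd.h>       /* syscall() */

  /* Android headers for AArch64 may define SYS_open in terms of the
   * undefined __NR_open; drop it so the openat path below is taken. */
  #if defined(__ANDROID__) && defined(SYS_open) && !defined(__NR_open)
  #  undef SYS_open
  #endif

  static int
  raw_open_readonly(const char *path) {
  #if defined(SYS_open)
      return (int)syscall(SYS_open, path, O_RDONLY);
  #elif defined(SYS_openat)
      /* AArch64 and other newer ABIs only provide openat. */
      return (int)syscall(SYS_openat, AT_FDCWD, path, O_RDONLY);
  #else
      return open(path, O_RDONLY);
  #endif
  }
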
Jason Evans
4403c9ab44 Remove --disable-tcache.
Simplify configuration by removing the --disable-tcache option, but
replace the testing for that configuration with
--with-malloc-conf=tcache:false.

Fix the thread.arena and thread.tcache.flush mallctls to work correctly
if tcache is disabled.

This partially resolves #580.
2017-04-21 10:06:12 -07:00
Qi Wang
5aa46f027d Bypass extent tracking for auto arenas.
Tracking extents is required by arena_reset.  To support this, the extent
linkage was used for tracking 1) large allocations, and 2) full slabs.  However,
modifying the extent linkage could be an expensive operation, as it likely incurs
cache misses.  Since we forbid arena_reset on auto arenas, let's bypass the
linkage operations for auto arenas.
2017-04-21 00:29:18 -07:00
Jason Evans
da4cff0279 Support --with-lg-page values larger than system page size.
All mappings continue to be PAGE-aligned, even if the system page size
is smaller.  This change is primarily intended to provide a mechanism
for supporting multiple page sizes with the same binary; smaller page
sizes work better in conjunction with jemalloc's design.

This resolves #467.
2017-04-18 19:01:04 -07:00
Jason Evans
45f087eb03 Revert "Remove BITMAP_USE_TREE."
Some systems use a native 64 KiB page size, which means that the bitmap
for the smallest size class can be 8192 bits, not just 512 bits as when
the page size is 4 KiB.  Linear search in bitmap_{sfu,ffu}() is
unacceptably slow for such large bitmaps.

This reverts commit 7c00f04ff4.
2017-04-18 19:01:04 -07:00
David Goldblatt
38e847c1c5 Header refactoring: unify spin.h and move it out of the catch-all. 2017-04-18 18:35:03 -07:00
David Goldblatt
418d96a86c Header refactoring: unify nstime.h and move it out of the catch-all 2017-04-18 18:35:03 -07:00
David Goldblatt
7ebc83894f Header refactoring: move jemalloc_internal_types.h out of the catch-all 2017-04-18 18:35:03 -07:00
David Goldblatt
d9ec36e22d Header refactoring: move assert.h out of the catch-all 2017-04-18 18:35:03 -07:00
David Goldblatt
f692e6c214 Header refactoring: move util.h out of the catchall 2017-04-18 18:35:03 -07:00
David Goldblatt
54373be084 Header refactoring: move malloc_io.h out of the catchall 2017-04-18 18:35:03 -07:00
David Goldblatt
0b00ffe55f Header refactoring: move bit_util.h out of the catchall 2017-04-18 18:35:03 -07:00
David Goldblatt
22366518b7 Move CPP_PROLOGUE and CPP_EPILOGUE to the .cpp
This lets us avoid having to specify them in every C file.
2017-04-18 18:35:03 -07:00
Jason Evans
881fbf762f Prefer old/low extent_t structures during reuse.
Rather than using a LIFO queue to track available extent_t structures,
use a red-black tree, and always choose the oldest/lowest available
during reuse.
2017-04-17 14:47:45 -07:00
Jason Evans
76b35f4b2f Track extent structure serial number (esn) in extent_t.
This enables stable sorting of extent_t structures.
2017-04-17 14:47:45 -07:00
Jason Evans
69aa552809 Allocate increasingly large base blocks.
Limit the total number of base blocks by leveraging the exponential
size class sequence, similarly to extent_grow_retained().
2017-04-17 14:47:45 -07:00
Qi Wang
3c9c41edb2 Improve rtree cache with a two-level cache design.
Two levels of rcache are implemented: a direct-mapped cache as L1, combined with
an LRU cache as L2.  The L1 cache offers low cost on cache hit, but can suffer
collisions under some circumstances.  This is complemented by the L2 LRU cache,
which is slower on cache access (overhead from linear search + reordering), but
handles L1 collisions rather well.
2017-04-17 12:05:23 -07:00
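A schematic of the two-level lookup described above (direct-mapped L1, small LRU-ordered L2); the names, sizes, and hash are illustrative, not jemalloc's internals.

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  #define L1_SLOTS 16  /* direct-mapped: one slot per hash bucket */
  #define L2_SLOTS 8   /* small LRU array, searched linearly */

  typedef struct {
      uintptr_t key;   /* cached leaf key (0 == empty) */
      void     *leaf;  /* cached rtree leaf */
  } cache_elm_t;

  typedef struct {
      cache_elm_t l1[L1_SLOTS];
      cache_elm_t l2[L2_SLOTS];
  } rtree_cache_t;

  /* Stand-in for the full rtree walk (omitted in this sketch). */
  static void *
  rtree_lookup_slow(uintptr_t key) {
      (void)key;
      return NULL;
  }

  static void *
  cache_lookup(rtree_cache_t *c, uintptr_t key) {
      /* L1: direct-mapped, a single compare on the hot path, but two keys
       * hashing to the same slot evict each other (collisions). */
      size_t slot = (key >> 12) & (L1_SLOTS - 1);
      if (c->l1[slot].key == key) {
          return c->l1[slot].leaf;
      }
      /* L2: linear search over an LRU-ordered array; slower, but absorbs
       * the keys that keep colliding in L1. */
      for (size_t i = 0; i < L2_SLOTS; i++) {
          if (c->l2[i].key == key) {
              cache_elm_t hit = c->l2[i];
              /* Promote the hit to the front (LRU reordering)... */
              memmove(&c->l2[1], &c->l2[0], i * sizeof(cache_elm_t));
              c->l2[0] = hit;
              /* ...and refill the L1 slot it maps to. */
              c->l1[slot] = hit;
              return hit.leaf;
          }
      }
      /* Miss in both levels: fall back to the full tree walk. */
      void *leaf = rtree_lookup_slow(key);
      c->l1[slot] = (cache_elm_t){key, leaf};
      return leaf;
  }
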
Qi Wang
d16f1e53df Skip percpu arena when choosing iarena. 2017-04-16 21:34:44 -07:00
Qi Wang
c2fcf9c2cf Switch to fine-grained reentrancy support.
Previously we had a general detection and support of reentrancy, at the cost of
having branches and inc / dec operations on fast paths.  To avoid taxing fast
paths, we move the reentrancy operations onto tsd slow state, and only modify
reentrancy level around external calls (that might trigger reentrancy).
2017-04-14 19:48:06 -07:00
Qi Wang
b348ba29bb Bundle 3 branches on fast path into tsd_state.
Added tsd_state_nominal_slow, which on fast path malloc() incorporates
tcache_enabled check, and on fast path free() bundles both malloc_slow and
tcache_enabled branches.
2017-04-14 16:58:08 -07:00
Qi Wang
ccfe68a916 Pass alloc_ctx down profiling path.
With this change, when profiling is enabled, we avoid doing redundant rtree
lookups.  Also changed dalloc_ctx_t to alloc_ctx_t, as it's now used on the
allocation path as well (to speed up profiling).
2017-04-12 13:55:39 -07:00
Qi Wang
f35213bae4 Pass dalloc_ctx down the sdalloc path.
This avoids redundant rtree lookups.
2017-04-12 13:55:39 -07:00
David Goldblatt
e709fae1d7 Header refactoring: move atomic.h out of the catch-all 2017-04-11 11:52:30 -07:00
David Goldblatt
743d940dc3 Header refactoring: Split up jemalloc_internal.h
This is a biggy.  jemalloc_internal.h has been doing multiple jobs for a while
now:
- The source of system-wide definitions.
- The catch-all include file.
- The module header file for jemalloc.c

This commit splits up this functionality.  The system-wide definitions
responsibility has moved to jemalloc_preamble.h.  The catch-all include file is
now jemalloc_internal_includes.h.  The module headers for jemalloc.c are now in
jemalloc_internal_[externs|inlines|types].h, just as they are for the other
modules.
2017-04-11 11:52:30 -07:00
David Goldblatt
0237870c60 Header refactoring: break out ql.h dependencies 2017-04-11 11:52:30 -07:00
David Goldblatt
610cb83419 Header refactoring: break out qr.h dependencies 2017-04-11 11:52:30 -07:00
David Goldblatt
63a5cd4cc2 Header refactoring: break out rb.h dependencies 2017-04-11 11:52:30 -07:00
David Goldblatt
2f00ce4da7 Header refactoring: break out ph.h dependencies 2017-04-11 11:52:30 -07:00
David Goldblatt
57e36e1a12 Header refactoring: Add CPP_PROLOGUE and CPP_EPILOGUE macros 2017-04-11 11:52:30 -07:00
Qi Wang
bfa530b75b Pass dealloc_ctx down free() fast path.
This gets rid of the redundant rtree lookup on the fast path.
2017-04-11 09:58:12 -07:00
Qi Wang
04ef218d87 Move reentrancy_level to the beginning of TSD. 2017-04-07 16:25:43 -07:00
David Goldblatt
b407a65401 Add basic reentrancy-checking support, and allow arena_new to reenter.
This checks whether or not we're reentrant using thread-local data, and, if we
are, moves certain internal allocations to use arena 0 (which should be properly
initialized after bootstrapping).

The immediate thing this allows is spinning up threads in arena_new, which will
enable spinning up background threads there.
2017-04-07 14:10:27 -07:00
David Goldblatt
0a0fcd3e6a Add hooking functionality
This allows us to hook chosen functions and do interesting things there (in
particular: reentrancy checking).
2017-04-07 14:10:27 -07:00
Qi Wang
36bd90b962 Optimizing TSD and thread cache layout.
1) Re-organize TSD so that frequently accessed fields are closer to the
beginning and more compact.  Assuming 64-bit, the first 2.5 cachelines now
contain everything needed on the tcache fast path, except the tcache struct itself.

2) Re-organize tcache and tbins.  Take lg_fill_div out of tbin, and reduce tbin
to 24 bytes (down from 32). Split tbins into tbins_small and tbins_large, and
place tbins_small close to the beginning.
2017-04-07 14:06:17 -07:00
Qi Wang
0fba57e579 Get rid of tcache_enabled_t as we have runtime init support. 2017-04-07 10:42:29 -07:00
Qi Wang
fde3e20cc0 Integrate auto tcache into TSD.
The embedded tcache is initialized upon tsd initialization.  The avail arrays
for the tbins will be allocated / deallocated accordingly during init / cleanup.

With this change, the pointer to the auto tcache will always be available, as
long as we have access to the TSD.  tcache_available() (called in tcache_get())
is provided to check if we should use tcache.
2017-04-07 09:55:14 -07:00
David Goldblatt
eeabdd2466 Remove the pre-C11-atomics API, which is now unused 2017-04-05 16:25:37 -07:00
David Goldblatt
5dcc13b342 Make the mutex n_waiting_thds field a C11-style atomic 2017-04-05 16:25:37 -07:00
David Goldblatt
30d74db08e Convert accumbytes in prof_accum_t to C11 atomics, when possible 2017-04-05 16:25:37 -07:00
David Goldblatt
92aafb0efe Make base_t's extent_hooks field C11-atomic 2017-04-05 16:25:37 -07:00
David Goldblatt
56b72c7b17 Transition arena struct fields to C11 atomics 2017-04-05 16:25:37 -07:00
David Goldblatt
bc32ec3503 Move arena-tracking atomics in jemalloc.c to C11-style 2017-04-05 16:25:37 -07:00
David Goldblatt
864adb7f42 Transition e_prof_tctx in struct extent to C11 atomics 2017-04-04 16:46:04 -07:00
David Goldblatt
7da04a6b09 Convert prng module to use C11-style atomics 2017-04-04 16:45:52 -07:00
Qi Wang
492e9f301e Make the tsd member init functions to take tsd_t * type. 2017-04-04 14:06:07 -07:00
Qi Wang
d3cda3423c Do proper cleanup for tsd_state_reincarnated.
Also enable arena_bind under non-nominal state, as the cleanup will be handled
correctly now.
2017-04-04 00:34:49 -07:00
Qi Wang
51d3682950 Remove the leafkey NULL check in leaf_elm_lookup. 2017-04-04 00:27:35 -07:00
Qi Wang
9ed84b0d45 Add init function support to tsd members.
This will facilitate embedding tcache into tsd, which requires proper
initialization that cannot be done via the static initializer.  tsd->rtree_ctx is
now initialized via rtree_ctx_data_init().
2017-04-04 00:19:21 -07:00
Jason Evans
07f4f93434 Move arena_slab_data_t's nfree into extent_t's e_bits.
Compact extent_t to 128 bytes on 64-bit systems by moving
arena_slab_data_t's nfree into extent_t's e_bits.

Cacheline-align extent_t structures so that they always cross the
minimum number of cacheline boundaries.

Re-order extent_t fields such that all fields except the slab bitmap
(and overlaid heap profiling context pointer) are in the first
cacheline.

This resolves #461.
2017-03-27 22:43:39 -07:00
Qi Wang
af3d737a9a Simplify rtree cache replacement policy.
To avoid memmove on free() fast path, simplify the cache replacement policy to
only bubble up the cache hit element by 1.
2017-03-27 13:42:31 -07:00
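A sketch of the simplified replacement policy (hypothetical types): on a hit at index i, swap with the previous element instead of memmove-ing the whole prefix, so hot entries still drift toward the front while the fast path touches at most two slots.

  #include <stddef.h>
  #include <stdint.h>

  typedef struct {
      uintptr_t key;
      void     *leaf;
  } cache_elm_t;

  /* Bubble a cache hit up by one position rather than moving it to the
   * front; avoids a memmove on the free() fast path. */
  static inline void
  cache_bubble_up(cache_elm_t *cache, size_t i) {
      if (i == 0) {
          return;
      }
      cache_elm_t tmp = cache[i - 1];
      cache[i - 1] = cache[i];
      cache[i] = tmp;
  }
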
Jason Evans
c6d1819e48 Simplify rtree_clear() to avoid locking. 2017-03-27 13:22:52 -07:00
Jason Evans
4020523f67 Fix a race in rtree_szind_slab_update() for RTREE_LEAF_COMPACT. 2017-03-27 13:22:36 -07:00
Jason Evans
7c00f04ff4 Remove BITMAP_USE_TREE.
Remove tree-structured bitmap support, in order to reduce complexity and
ease maintenance.  No bitmaps larger than 512 bits have been necessary
since before 4.0.0, and there is no current plan that would increase
maximum bitmap size.  Although tree-structured bitmaps were used on
32-bit platforms prior to this change, the overall benefits were
questionable (higher metadata overhead, higher bitmap modification cost,
marginally lower search cost).
2017-03-27 12:18:40 -07:00
Jason Evans
6258176c87 Fix bitmap_ffu() to work with 3+ levels. 2017-03-27 12:18:40 -07:00
Jason Evans
735ad8210c Pack various extent_t fields into a bitfield.
This reduces sizeof(extent_t) from 160 to 136 on x64.
2017-03-25 23:30:13 -07:00
Jason Evans
0591c204b4 Store arena index rather than (arena_t *) in extent_t. 2017-03-25 23:30:13 -07:00
Jason Evans
5e12223925 Fix BITMAP_USE_TREE version of bitmap_ffu().
This fixes an extent searching regression on 32-bit systems, caused by
the initial bitmap_ffu() implementation in
c8021d01f6 (Implement bitmap_ffu(), which
finds the first unset bit.), as first used in
5d33233a5e (Use a bitmap in extents_t to
speed up search.).
2017-03-25 23:29:32 -07:00
Jason Evans
5d33233a5e Use a bitmap in extents_t to speed up search.
Rather than iteratively checking all sufficiently large heaps during
search, maintain and use a bitmap in order to skip empty heaps.
2017-03-24 17:52:46 -07:00
Jason Evans
57e353163f Implement BITMAP_GROUPS(). 2017-03-24 17:52:46 -07:00
Jason Evans
c8021d01f6 Implement bitmap_ffu(), which finds the first unset bit. 2017-03-24 17:52:46 -07:00
Qi Wang
362e356675 Profile per arena base mutex, instead of just a0. 2017-03-23 00:03:28 -07:00
Qi Wang
d3fde1c124 Refactor mutex profiling code with x-macros. 2017-03-23 00:03:28 -07:00
Qi Wang
f6698ec1e6 Switch to nstime_t for the time related fields in mutex profiling. 2017-03-23 00:03:28 -07:00
Qi Wang
74f78cafda Added custom mutex spin.
A fixed max spin count is used -- with benchmark results showing it
solves almost all problems.  As the benchmark used was rather intense,
the upper bound could be a little high.  However, it should offer a
good tradeoff between spinning and blocking.
2017-03-23 00:03:28 -07:00
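A sketch of the spin-then-block pattern this describes (the spin bound and counter names are illustrative): try the lock a bounded number of times before falling back to a blocking acquire, counting which path was taken.

  #include <pthread.h>

  #define MAX_SPIN 250  /* fixed upper bound on spin attempts (illustrative) */

  static void
  mutex_lock_spin(pthread_mutex_t *m, unsigned long *n_spin_acquired,
      unsigned long *n_blocked) {
      for (int i = 0; i < MAX_SPIN; i++) {
          if (pthread_mutex_trylock(m) == 0) {
              (*n_spin_acquired)++;  /* absorbed a short critical section */
              return;
          }
          /* A CPU pause/yield hint would go here on real hardware. */
      }
      /* Spin budget exhausted: block, and record the contention. */
      pthread_mutex_lock(m);
      (*n_blocked)++;
  }
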
Qi Wang
20b8c70e9f Added extents_dirty / _muzzy mutexes, as well as decay_dirty / _muzzy. 2017-03-23 00:03:28 -07:00
Qi Wang
64c5f5c174 Added "stats.mutexes.reset" mallctl to reset all mutex stats.
Also switched from the term "lock" to "mutex".
2017-03-23 00:03:28 -07:00
Qi Wang
ca9074deff Added lock profiling and output for global locks (ctl, prof and base). 2017-03-23 00:03:28 -07:00
Qi Wang
0fb5c0e853 Add arena lock stats output. 2017-03-23 00:03:28 -07:00
Qi Wang
a4f176af57 Output bin lock profiling results to malloc_stats.
Two counters are included for the small bins: lock contention rate, and
max lock waiting time.
2017-03-23 00:03:28 -07:00
Qi Wang
6309df628f First stage of mutex profiling.
Switched to trylock and update counters based on state.
2017-03-23 00:03:28 -07:00
Jason Evans
32e7cf51cd Further specialize arena_[s]dalloc() tcache fast path.
Use tsd_rtree_ctx() rather than tsdn_rtree_ctx() when tcache is
non-NULL, in order to avoid an extra branch (and potentially extra stack
space) in the fast path.
2017-03-22 18:33:32 -07:00
Jason Evans
5e67fbc367 Push down iealloc() calls.
Call iealloc() as deep into call chains as possible without causing
redundant calls.
2017-03-22 18:33:32 -07:00
Jason Evans
51a2ec92a1 Remove extent dereferences from the deallocation fast paths. 2017-03-22 18:33:32 -07:00
Jason Evans
4f341412e5 Remove extent arg from isalloc() and arena_salloc(). 2017-03-22 18:33:32 -07:00
Jason Evans
0ee0e0c155 Implement compact rtree leaf element representation.
If a single virtual address pointer has enough unused bits to pack
{szind_t, extent_t *, bool, bool}, use a single pointer-sized field in
each rtree leaf element, rather than using three separate fields.  This
has little impact on access speed (fewer loads/stores, but more bit
twiddling), except that denser representation increases TLB
effectiveness.
2017-03-22 18:33:32 -07:00
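A sketch of this kind of pointer packing, assuming 48-bit virtual addresses (so the top 16 bits can carry the size-class index) and at least 2-byte-aligned extents (so the low bit can carry a slab flag); the layout is illustrative, not jemalloc's actual encoding.

  #include <stdbool.h>
  #include <stdint.h>

  #define ADDR_BITS 48
  #define ADDR_MASK ((((uint64_t)1) << ADDR_BITS) - 1)
  #define SLAB_BIT  ((uint64_t)1)

  static inline uint64_t
  leaf_elm_pack(void *extent, unsigned szind, bool slab) {
      return ((uint64_t)szind << ADDR_BITS)
          | ((uint64_t)(uintptr_t)extent & ADDR_MASK & ~SLAB_BIT)
          | (slab ? SLAB_BIT : 0);
  }

  static inline void *
  leaf_elm_extent_get(uint64_t bits) {
      /* Strip the szind and slab bits to recover the pointer. */
      return (void *)(uintptr_t)(bits & ADDR_MASK & ~SLAB_BIT);
  }

  static inline unsigned
  leaf_elm_szind_get(uint64_t bits) {
      return (unsigned)(bits >> ADDR_BITS);
  }

  static inline bool
  leaf_elm_slab_get(uint64_t bits) {
      return (bits & SLAB_BIT) != 0;
  }
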
Jason Evans
ce41ab0c57 Embed root node into rtree_t.
This avoids one atomic operation per tree access.
2017-03-22 18:33:32 -07:00
Jason Evans
99d68445ef Incorporate szind/slab into rtree leaves.
Expand and restructure the rtree API such that all common operations can
be achieved with minimal work, regardless of whether the rtree leaf
fields are independent versus packed into a single atomic pointer.
2017-03-22 18:33:32 -07:00
Jason Evans
944c8a3383 Split rtree_elm_t into rtree_{node,leaf}_elm_t.
This allows leaf elements to differ in size from internal node elements.

In principle it would be more correct to use a different type for each
level of the tree, but due to implementation details related to atomic
operations, we use casts anyway, thus counteracting the value of
additional type correctness.  Furthermore, such a scheme would require
function code generation (via cpp macros), as well as either unwieldy
type names for leaves or type aliases, e.g.

  typedef struct rtree_elm_d2_s rtree_leaf_elm_t;

This alternate strategy would be more correct, and with less code
duplication, but probably not worth the complexity.
2017-03-22 18:33:32 -07:00
Jason Evans
f50d6009fe Remove binind field from arena_slab_data_t.
binind is now redundant; the containing extent_t's szind field always
provides the same value.
2017-03-22 18:33:32 -07:00
Jason Evans
e8921cf2eb Convert extent_t's usize to szind.
Rather than storing usize only for large (and prof-promoted)
allocations, store the size class index for allocations that reside
within the extent, such that the size class index is valid for all
extents that contain extant allocations, and invalid otherwise (mainly
to make debugging simpler).
2017-03-22 18:33:32 -07:00
Jason Evans
64e458f5cd Implement two-phase decay-based purging.
Split decay-based purging into two phases, the first of which uses lazy
purging to convert dirty pages to "muzzy", and the second of which uses
forced purging, decommit, or unmapping to convert pages to clean or
destroy them altogether.  Not all operating systems support lazy
purging, yet the application may provide extent hooks that implement
lazy purging, so care must be taken to dynamically omit the first phase
when necessary.

The mallctl interfaces change as follows:
- opt.decay_time --> opt.{dirty,muzzy}_decay_time
- arena.<i>.decay_time --> arena.<i>.{dirty,muzzy}_decay_time
- arenas.decay_time --> arenas.{dirty,muzzy}_decay_time
- stats.arenas.<i>.pdirty --> stats.arenas.<i>.p{dirty,muzzy}
- stats.arenas.<i>.{npurge,nmadvise,purged} -->
  stats.arenas.<i>.{dirty,muzzy}_{npurge,nmadvise,purged}

This resolves #521.
2017-03-15 13:13:47 -07:00
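For reference, reading two of the renamed controls through the public mallctl() API might look like the sketch below; the option type is assumed to be ssize_t, matching the old opt.decay_time control, with -1 meaning purging is disabled for that category.

  #include <jemalloc/jemalloc.h>
  #include <stdio.h>

  int
  main(void) {
      ssize_t dirty_decay, muzzy_decay;  /* assumed ssize_t (see above) */
      size_t sz = sizeof(ssize_t);

      if (mallctl("opt.dirty_decay_time", &dirty_decay, &sz, NULL, 0) == 0 &&
          mallctl("opt.muzzy_decay_time", &muzzy_decay, &sz, NULL, 0) == 0) {
          printf("dirty: %zd s, muzzy: %zd s\n", dirty_decay, muzzy_decay);
      }
      return 0;
  }
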
Jason Evans
38a5bfc816 Move arena_t's purging field into arena_decay_t. 2017-03-15 13:13:47 -07:00
Jason Evans
765edd67b4 Refactor decay-related function parametrization.
Refactor most of the decay-related functions to take as parameters the
decay_t and associated extents_t structures to operate on.  This
prepares for supporting both lazy and forced purging on different decay
schedules.
2017-03-15 13:13:47 -07:00
David Goldblatt
ee202efc79 Convert remaining arena_stats_t fields to atomics
These were all size_ts, so we have atomics support for them on all platforms,
and the conversion is straightforward.

Left non-atomic is curlextents, which AFAICT is not used atomically anywhere.
2017-03-13 18:22:33 -07:00
David Goldblatt
4fc2acf5ae Switch atomic uint64_ts in arena_stats_t to C11 atomics
I expect this to be the trickiest conversion we will see, since we want atomics
on 64-bit platforms, but are also always able to piggyback on some sort of
external synchronization on non-64 bit platforms.
2017-03-13 18:22:33 -07:00
Jason Evans
7cbcd2e2b7 Fix pages_purge_forced() to discard pages on non-Linux systems.
madvise(..., MADV_DONTNEED) only causes demand-zeroing on Linux, so fall
back to overlaying a new mapping.
2017-03-13 18:19:57 -07:00
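A sketch of the overlay fallback described here (hypothetical helper, simplified error handling): where MADV_DONTNEED does not guarantee demand-zeroing, map fresh anonymous pages over the range instead.

  #include <stdbool.h>
  #include <stddef.h>
  #include <sys/mman.h>

  /* Discard [addr, addr+size) so the pages read back as zero.  Returns
   * true on error. */
  static bool
  purge_forced(void *addr, size_t size) {
  #if defined(__linux__)
      /* On Linux, MADV_DONTNEED demand-zeroes the range. */
      return madvise(addr, size, MADV_DONTNEED) != 0;
  #else
      /* Elsewhere, overlay a new anonymous mapping at the same address. */
      void *p = mmap(addr, size, PROT_READ | PROT_WRITE,
          MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);
      return p == MAP_FAILED;
  #endif
  }
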
David Goldblatt
21a68e2d22 Convert rtree code to use C11 atomics
In the process, I changed the implementation of rtree_elm_acquire so that it
won't even try to CAS if its initial read (getting the extent + lock bit)
indicates that the CAS is doomed to fail.  This can significantly improve
performance under contention.
2017-03-13 12:05:27 -07:00
Jason Evans
3a2b183d5f Convert arena_t's purging field to non-atomic bool.
The decay mutex already protects all accesses.
2017-03-10 10:14:30 -08:00
Jason Evans
75fddc786c Fix ATOMIC_{ACQUIRE,RELEASE,ACQ_REL} definitions. 2017-03-09 00:57:37 -08:00
Qi Wang
ec532e2c5c Implement per-CPU arena.
The new feature, opt.percpu_arena, determines thread-arena association
dynamically based on CPU id.  Three modes are supported: "percpu", "phycpu"
and disabled.

"percpu" uses the current core id (with help from sched_getcpu())
directly as the arena index, while "phycpu" will assign threads on the
same physical CPU to the same arena. In other words, "percpu" means # of
arenas == # of CPUs, while "phycpu" has # of arenas == 1/2 * (# of
CPUs). Note that no runtime check on whether hyper threading is enabled
is added yet.

When enabled, threads will be migrated between arenas when a CPU change
is detected. In the current design, to reduce overhead from reading CPU
id, each arena tracks the thread accessed most recently. When a new
thread comes in, we will read CPU id and update arena if necessary.
2017-03-08 23:19:01 -08:00
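A sketch of the thread-to-arena mapping the two modes imply, using sched_getcpu(); the names are illustrative, and the phycpu split assumes hyperthread siblings are numbered cpu and cpu + ncpus/2, which (as noted above) is not verified at runtime.

  #define _GNU_SOURCE
  #include <sched.h>   /* sched_getcpu() (glibc/bionic, _GNU_SOURCE) */
  #include <unistd.h>  /* sysconf() */

  typedef enum {
      PERCPU_ARENA_PERCPU,  /* one arena per logical CPU */
      PERCPU_ARENA_PHYCPU   /* one arena per physical core */
  } percpu_arena_mode_t;

  /* Map the current CPU to an arena index; the disabled mode simply never
   * reaches this path. */
  static unsigned
  percpu_arena_choose(percpu_arena_mode_t mode) {
      long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
      int cpu = sched_getcpu();
      if (cpu < 0) {
          return 0;  /* fall back to arena 0 if the CPU id is unavailable */
      }
      if (mode == PERCPU_ARENA_PHYCPU && ncpus >= 2) {
          return (unsigned)(cpu % (ncpus / 2));
      }
      return (unsigned)cpu;  /* "percpu": arena index == CPU id */
  }
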
Qi Wang
8721e19c04 Fix arena_prefork lock rank order for witness.
When witness is enabled, lock rank order needs to be preserved during
prefork, not only for each arena, but also across arenas. This change
breaks arena_prefork into further stages to ensure valid rank order
across arenas. Also changed test/unit/fork to use a manual arena to
catch this case.
2017-03-08 23:07:27 -08:00
David Goldblatt
8adab26972 Convert extents_t's npages field to use C11-style atomics
In the process, we can do some strength reduction, changing the fetch-adds and
fetch-subs to be simple loads followed by stores, since the modifications all
occur while holding the mutex.
2017-03-08 21:27:09 -08:00
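The strength reduction mentioned here, sketched with C11 atomics and hypothetical field names: because every writer already holds the mutex, a fetch-add can become a relaxed load followed by a store, leaving the atomic type only for the lock-free readers.

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stddef.h>

  typedef struct {
      pthread_mutex_t mtx;     /* serializes all modifications */
      atomic_size_t   npages;  /* read locklessly elsewhere */
  } extents_sketch_t;

  static void
  extents_npages_add(extents_sketch_t *e, size_t n) {
      pthread_mutex_lock(&e->mtx);
      /* No RMW needed: the mutex already excludes concurrent writers. */
      size_t cur = atomic_load_explicit(&e->npages, memory_order_relaxed);
      atomic_store_explicit(&e->npages, cur + n, memory_order_relaxed);
      pthread_mutex_unlock(&e->mtx);
  }
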
David Goldblatt
dafadce622 Reintroduce JEMALLOC_ATOMIC_U64
The C11 atomics backport removed this #define, which degraded atomic 64-bit
reads to require a lock even on platforms that support them.  This commit fixes
that.
2017-03-08 21:26:37 -08:00