Commit Graph

813 Commits

Author SHA1 Message Date
Qi Wang
855c127348 Remove the function alignment of prof_backtrace.
This was an attempt to avoid triggering slow path in libunwind, however turns
out to be ineffective.
2017-04-17 16:19:32 -07:00
Jason Evans
881fbf762f Prefer old/low extent_t structures during reuse.
Rather than using a LIFO queue to track available extent_t structures,
use a red-black tree, and always choose the oldest/lowest available
during reuse.
2017-04-17 14:47:45 -07:00
Jason Evans
76b35f4b2f Track extent structure serial number (esn) in extent_t.
This enables stable sorting of extent_t structures.
2017-04-17 14:47:45 -07:00
Jason Evans
69aa552809 Allocate increasingly large base blocks.
Limit the total number of base block by leveraging the exponential
size class sequence, similarly to extent_grow_retained().
2017-04-17 14:47:45 -07:00
Jason Evans
675701660c Update base_unmap() to match extent_dalloc_wrapper().
Reverse the order of forced versus lazy purging attempts in
base_unmap(), in order to match the order in extent_dalloc_wrapper(),
which was reversed by 64e458f5cd
(Implement two-phase decay-based purging.).
2017-04-17 14:47:45 -07:00
Qi Wang
3c9c41edb2 Improve rtree cache with a two-level cache design.
Two levels of rcache is implemented: a direct mapped cache as L1, combined with
a LRU cache as L2.  The L1 cache offers low cost on cache hit, but could suffer
collision under circumstances.  This is complemented by the L2 LRU cache, which
is slower on cache access (overhead from linear search + reordering), but solves
collison of L1 rather well.
2017-04-17 12:05:23 -07:00
Qi Wang
c2fcf9c2cf Switch to fine-grained reentrancy support.
Previously we had a general detection and support of reentrancy, at the cost of
having branches and inc / dec operations on fast paths.  To avoid taxing fast
paths, we move the reentrancy operations onto tsd slow state, and only modify
reentrancy level around external calls (that might trigger reentrancy).
2017-04-14 19:48:06 -07:00
Qi Wang
b348ba29bb Bundle 3 branches on fast path into tsd_state.
Added tsd_state_nominal_slow, which on fast path malloc() incorporates
tcache_enabled check, and on fast path free() bundles both malloc_slow and
tcache_enabled branches.
2017-04-14 16:58:08 -07:00
Qi Wang
ccfe68a916 Pass alloc_ctx down profiling path.
With this change, when profiling is enabled, we avoid doing redundant rtree
lookups. Also changed dalloc_atx_t to alloc_atx_t, as it's now used on
allocation path as well (to speed up profiling).
2017-04-12 13:55:39 -07:00
Qi Wang
f35213bae4 Pass dalloc_ctx down the sdalloc path.
This avoids redundant rtree lookups.
2017-04-12 13:55:39 -07:00
David Goldblatt
e709fae1d7 Header refactoring: move atomic.h out of the catch-all 2017-04-11 11:52:30 -07:00
David Goldblatt
743d940dc3 Header refactoring: Split up jemalloc_internal.h
This is a biggy.  jemalloc_internal.h has been doing multiple jobs for a while
now:
- The source of system-wide definitions.
- The catch-all include file.
- The module header file for jemalloc.c

This commit splits up this functionality.  The system-wide definitions
responsibility has moved to jemalloc_preamble.h.  The catch-all include file is
now jemalloc_internal_includes.h.  The module headers for jemalloc.c are now in
jemalloc_internal_[externs|inlines|types].h, just as they are for the other
modules.
2017-04-11 11:52:30 -07:00
David Goldblatt
2f00ce4da7 Header refactoring: break out ph.h dependencies 2017-04-11 11:52:30 -07:00
Qi Wang
bfa530b75b Pass dealloc_ctx down free() fast path.
This gets rid of the redundent rtree lookup down fast path.
2017-04-11 09:58:12 -07:00
Qi Wang
04ef218d87 Move reentrancy_level to the beginning of TSD. 2017-04-07 16:25:43 -07:00
David Goldblatt
b407a65401 Add basic reentrancy-checking support, and allow arena_new to reenter.
This checks whether or not we're reentrant using thread-local data, and, if we
are, moves certain internal allocations to use arena 0 (which should be properly
initialized after bootstrapping).

The immediate thing this allows is spinning up threads in arena_new, which will
enable spinning up background threads there.
2017-04-07 14:10:27 -07:00
David Goldblatt
0a0fcd3e6a Add hooking functionality
This allows us to hook chosen functions and do interesting things there (in
particular: reentrancy checking).
2017-04-07 14:10:27 -07:00
Qi Wang
36bd90b962 Optimizing TSD and thread cache layout.
1) Re-organize TSD so that frequently accessed fields are closer to the
beginning and more compact.  Assuming 64-bit, the first 2.5 cachelines now
contains everything needed on tcache fast path, expect the tcache struct itself.

2) Re-organize tcache and tbins.  Take lg_fill_div out of tbin, and reduce tbin
to 24 bytes (down from 32). Split tbins into tbins_small and tbins_large, and
place tbins_small close to the beginning.
2017-04-07 14:06:17 -07:00
Qi Wang
4dec507546 Bypass witness_fork in TSD when !config_debug.
With the tcache change, we plan to leave some blank space when !config_debug
(unused tbins, witnesses) at the end of the tsd. Let's not touch the memory.
2017-04-07 14:06:17 -07:00
Qi Wang
0fba57e579 Get rid of tcache_enabled_t as we have runtime init support. 2017-04-07 10:42:29 -07:00
Qi Wang
fde3e20cc0 Integrate auto tcache into TSD.
The embedded tcache is initialized upon tsd initialization.  The avail arrays
for the tbins will be allocated / deallocated accordingly during init / cleanup.

With this change, the pointer to the auto tcache will always be available, as
long as we have access to the TSD.  tcache_available() (called in tcache_get())
is provided to check if we should use tcache.
2017-04-07 09:55:14 -07:00
David Goldblatt
074f2256ca Make prof's cum_gctx a C11-style atomic 2017-04-05 16:25:37 -07:00
David Goldblatt
5dcc13b342 Make the mutex n_waiting_thds field a C11-style atomic 2017-04-05 16:25:37 -07:00
David Goldblatt
492a941f49 Convert extent module to use C11-style atomcis 2017-04-05 16:25:37 -07:00
David Goldblatt
30d74db08e Convert accumbytes in prof_accum_t to C11 atomics, when possible 2017-04-05 16:25:37 -07:00
David Goldblatt
55d992c48c Make extent_dss use C11-style atomics 2017-04-05 16:25:37 -07:00
David Goldblatt
92aafb0efe Make base_t's extent_hooks field C11-atomic 2017-04-05 16:25:37 -07:00
David Goldblatt
56b72c7b17 Transition arena struct fields to C11 atomics 2017-04-05 16:25:37 -07:00
David Goldblatt
bc32ec3503 Move arena-tracking atomics in jemalloc.c to C11-style 2017-04-05 16:25:37 -07:00
David Goldblatt
7da04a6b09 Convert prng module to use C11-style atomics 2017-04-04 16:45:52 -07:00
Qi Wang
492e9f301e Make the tsd member init functions to take tsd_t * type. 2017-04-04 14:06:07 -07:00
Qi Wang
d3cda3423c Do proper cleanup for tsd_state_reincarnated.
Also enable arena_bind under non-nominal state, as the cleanup will be handled
correctly now.
2017-04-04 00:34:49 -07:00
Qi Wang
9ed84b0d45 Add init function support to tsd members.
This will facilitate embedding tcache into tsd, which will require proper
initialization cannot be done via the static initializer.  Make tsd->rtree_ctx
to be initialized via rtree_ctx_data_init().
2017-04-04 00:19:21 -07:00
Qi Wang
d4e98bc0b2 Lookup extent once per time during tcache_flush_small / _large.
Caching the extents on stack to avoid redundant looking up overhead.
2017-03-28 09:58:25 -07:00
Jason Evans
07f4f93434 Move arena_slab_data_t's nfree into extent_t's e_bits.
Compact extent_t to 128 bytes on 64-bit systems by moving
arena_slab_data_t's nfree into extent_t's e_bits.

Cacheline-align extent_t structures so that they always cross the
minimum number of cacheline boundaries.

Re-order extent_t fields such that all fields except the slab bitmap
(and overlaid heap profiling context pointer) are in the first
cacheline.

This resolves #461.
2017-03-27 22:43:39 -07:00
Jason Evans
7c00f04ff4 Remove BITMAP_USE_TREE.
Remove tree-structured bitmap support, in order to reduce complexity and
ease maintenance.  No bitmaps larger than 512 bits have been necessary
since before 4.0.0, and there is no current plan that would increase
maximum bitmap size.  Although tree-structured bitmaps were used on
32-bit platforms prior to this change, the overall benefits were
questionable (higher metadata overhead, higher bitmap modification cost,
marginally lower search cost).
2017-03-27 12:18:40 -07:00
Qi Wang
e6b074472e Force inline ifree to avoid function call costs on fast path.
Without ALWAYS_INLINE, sometimes ifree() gets compiled into its own function,
which adds overhead on the fast path.
2017-03-24 17:54:28 -07:00
Jason Evans
5d33233a5e Use a bitmap in extents_t to speed up search.
Rather than iteratively checking all sufficiently large heaps during
search, maintain and use a bitmap in order to skip empty heaps.
2017-03-24 17:52:46 -07:00
Jason Evans
c8021d01f6 Implement bitmap_ffu(), which finds the first unset bit. 2017-03-24 17:52:46 -07:00
Jason Evans
a832ebaee9 Use first fit layout policy instead of best fit.
For extents which do not delay coalescing, use first fit layout policy
rather than first-best fit layout policy.  This packs extents toward
older virtual memory mappings, but at the cost of higher search overhead
in the common case.

This resolves #711.
2017-03-24 17:52:46 -07:00
Qi Wang
362e356675 Profile per arena base mutex, instead of just a0. 2017-03-23 00:03:28 -07:00
Qi Wang
d3fde1c124 Refactor mutex profiling code with x-macros. 2017-03-23 00:03:28 -07:00
Qi Wang
f6698ec1e6 Switch to nstime_t for the time related fields in mutex profiling. 2017-03-23 00:03:28 -07:00
Qi Wang
74f78cafda Added custom mutex spin.
A fixed max spin count is used -- with benchmark results showing it
solves almost all problems. As the benchmark used was rather intense,
the upper bound could be a little bit high. However it should offer a
good tradeoff between spinning and blocking.
2017-03-23 00:03:28 -07:00
Qi Wang
20b8c70e9f Added extents_dirty / _muzzy mutexes, as well as decay_dirty / _muzzy. 2017-03-23 00:03:28 -07:00
Qi Wang
64c5f5c174 Added "stats.mutexes.reset" mallctl to reset all mutex stats.
Also switched from the term "lock" to "mutex".
2017-03-23 00:03:28 -07:00
Qi Wang
bd2006a41b Added JSON output for lock stats.
Also added option 'x' to malloc_stats() to bypass lock section.
2017-03-23 00:03:28 -07:00
Qi Wang
ca9074deff Added lock profiling and output for global locks (ctl, prof and base). 2017-03-23 00:03:28 -07:00
Qi Wang
0fb5c0e853 Add arena lock stats output. 2017-03-23 00:03:28 -07:00
Qi Wang
a4f176af57 Output bin lock profiling results to malloc_stats.
Two counters are included for the small bins: lock contention rate, and
max lock waiting time.
2017-03-23 00:03:28 -07:00