Commit Graph

1750 Commits

Author SHA1 Message Date
David Goldblatt
271a676dcd hpdata: early bailout for longest free range.
A number of common special cases allow us to stop iterating through an hpdata's
bitmap earlier rather than later.
2021-02-19 15:10:54 -08:00
David Goldblatt
d21d5b46b6 Edata: Move sn into its own field.
This lets the bins use a fragmentation avoidance policy that matches the HPA's
(without affecting the PAC).
2021-02-19 15:10:54 -08:00
David Goldblatt
fb327368db SEC: Expand option configurability.
This change pulls the SEC options into a struct, which simplifies their handling
across various modules (e.g. PA needs to forward on SEC options from the
malloc_conf string, but it doesn't really need to know their names).  While
we're here, make some of the fixed constants configurable, and unify naming from
the configuration options to the internals.
2021-02-19 15:10:54 -08:00
David Goldblatt
ce9386370a HPA: Implement batch allocation. 2021-02-19 15:10:54 -08:00
David Goldblatt
cdae6706a6 SEC: Use batch fills.
Currently, this doesn't help much, since no PAI implementation supports
flushing.  This will change in subsequent commits.
2021-02-19 15:10:54 -08:00
David Goldblatt
480f3b11cd Add a batch allocation interface to the PAI.
For now, no real allocator actually implements this interface; this will change
in subsequent diffs.
2021-02-19 15:10:54 -08:00
David Goldblatt
bf448d7a5a SEC: Reduce lock hold times.
Only flush a subset of extents during flushing, and drop the lock while doing
so.
2021-02-19 15:10:54 -08:00
David Goldblatt
1944ebbe7f HPA: Implement batch deallocation.
This saves O(n) mutex locks/unlocks during SEC flush.
2021-02-19 15:10:54 -08:00
David Goldblatt
f47b4c2cd8 PAI/SEC: Add a dalloc_batch function.
This lets the SEC flush all of its items in a single call, rather than flushing
everything at once.
2021-02-19 15:10:54 -08:00
Qi Wang
a11be50332 Implement opt.cache_oblivious.
Keep config.cache_oblivious for now to remain backward-compatible.
2021-02-11 11:32:01 -08:00
Jordan Rome
8c5e5f50a2 Fix stats for "tcache_max" (was "lg_tcache_max")
This opt was changed here: c8209150f9
and looks like this got missed.

Also update the write type to be unsigned.
2021-02-10 23:01:46 -08:00
Qi Wang
041145c272 Report the correct and wrong sizes on sized dealloc bug detection. 2021-02-08 14:42:27 -08:00
Qi Wang
f3b2668b32 Report the offending pointer on sized dealloc bug detection. 2021-02-08 14:42:27 -08:00
David Goldblatt
edbfe6912c Inline malloc fastpath into operator new.
This saves a small but non-negligible amount of CPU in C++ programs.
2021-02-08 14:17:47 -08:00
David Goldblatt
79f81a3732 HPA: Make dirty_mult configurable. 2021-02-04 20:58:31 -08:00
David Goldblatt
32dd153796 HPA: Make dehugification threshold configurable. 2021-02-04 20:58:31 -08:00
David Goldblatt
4790db15ed HPA: make the hugification threshold configurable. 2021-02-04 20:58:31 -08:00
David Goldblatt
b3df80bc79 Pull HPA options into a containing struct.
Currently that just means max_alloc, but we're about to add more.  While we're
touching these lines anyways, tweak things to be more in line with testing.
2021-02-04 20:58:31 -08:00
David Goldblatt
56e85c0e47 HPA: Use a whole-shard purging heuristic.
Previously, we used only hpdata-local information to decide whether to purge.
2021-02-04 20:58:31 -08:00
David Goldblatt
dc886e5608 hpdata: Return the number of pages to be purged.
We'll use this in the next commit.
2021-02-04 20:58:31 -08:00
David Goldblatt
9fd9c876bb psset: keep aggregate stats.
This will let us quickly query these stats to make purging decisions quickly.
2021-02-04 20:58:31 -08:00
David Goldblatt
da63f23e68 HPA: Track pending purges/hugifies in the psset.
This finishes the refactoring of the HPA/psset interactions the past few commits
have been building towards.

Rather than the HPA removing and then reinserting hpdatas, it simply begins
updates and ends them.  These updates can set flags on the hpdata that prevent
it from being returned for certain types of requests.  For example, it can call
hpdata_alloc_allowed_set(hpdata, false) during an update, at which point the
given hpdata will no longer be returned for psset_pick_alloc requests.

This has various of benefits:
- It maintains stats correctness during purges and hugifies.
- It allows simpler and more explicit concurrency control for the various
  special cases (e.g. allocations are disallowed during purge, but not during
  hugify).
- It lets allocations and deallocations avoid disturbing the purging and
  hugification orderings.  If an hpdata "loses its place" in one of the queues
  just do to an alloc / dalloc, it can result in pathological edge cases where
  very hot, very full hugepages never get hugified  (and cold extents on the
  same hugepage as hot ones never get purged).

The key benefit though is that tracking hpdatas to be purged / hugified in a
principled way will let us do delayed purging and hugification.  Eventually this
will let us move these operations to background threads, but in the short term
the benefit is that it will let us have global purging policies (e.g. purge when
the entire arena has too many dirty pages, rather than any particular hugepage).
2021-02-04 20:58:31 -08:00
David Goldblatt
0ea3d6307c CTL, Stats: report HPA empty slab stats. 2021-02-04 20:58:31 -08:00
David Goldblatt
bf64557ed6 Move empty slab tracking to the psset.
We're moving towards a world in which purging decisions are less rigidly
enforced at a single-hugepage level.  In that world, it makes sense to keep
around some hpdatas which are not completely purged, in which case we'll need to
track them.
2021-02-04 20:58:31 -08:00
David Goldblatt
99fc0717e6 psset: Reconceptualize insertion/removal.
Really, this isn't a functional change, just a naming change.  We start thinking
of pageslabs as being always in the psset.  What we used to think of as removal
is now thought of as being in the psset, but in the process of being updated
(and therefore, unavalable for serving new allocations).

This is in preparation of subsequent changes to support deferred purging;
allocations will still be in the psset for the purposes of choosing when to
purge, but not for purposes of allocation/deallocation.
2021-02-04 20:58:31 -08:00
David Goldblatt
061cabb712 HPA stats: report retained instead of inactive.
This more closely maps to the PAC.
2021-02-04 20:58:31 -08:00
David Goldblatt
d3e5ea03c5 HPA: Track dirty stats. 2021-02-04 20:58:31 -08:00
David Goldblatt
68a1666e91 hpdata: Rename "dirty" to "touched".
This matches the usage in the rest of the codebase.
2021-02-04 20:58:31 -08:00
David Goldblatt
be0d7a53f3 HPA: Don't track inactive pages.
This is really only useful for human consumption.  Correspondingly, emit it only
in the human-readable stats, and let everybody else compute from the hugepage
size and nactive.
2021-02-04 20:58:31 -08:00
David Goldblatt
55e0f60ca1 psset stats: Simplify handling.
We can treat the huge and nonhuge cases uniformly using huge state as an array
index.
2021-02-04 20:58:31 -08:00
David Goldblatt
94cd9444c5 HPA: Some minor reformattings. 2021-02-04 20:58:31 -08:00
David Goldblatt
b25ee5d88e HPA: Add purge stats. 2021-02-04 20:58:31 -08:00
David Goldblatt
746ea3de6f HPA stats: Allow some derived stats.
However, we put them in their own struct, to avoid the messiness that the arena
has (mixing derived and non-derived stats in the arena_stats_t).
2021-02-04 20:58:31 -08:00
David Goldblatt
30b9e8162b HPA: Generalize purging.
Previously, we would purge a hugepage only when it's completely empty.  With
this change, we can purge even when only partially empty.  Although the
heuristic here is still fairly primitive, this infrastructure can scale to
become more advanced.
2021-02-04 20:58:31 -08:00
David Goldblatt
70692cfb13 hpdata: Add state changing helpers.
We're about to allow hugepage subextent purging; get as much of our metadata
handling ready as possible.
2021-02-04 20:58:31 -08:00
David Goldblatt
2ae966222f hpdata: track per-page dirty state. 2021-02-04 20:58:31 -08:00
David Goldblatt
ff4086aa6b hpdata: count active pages instead of free ones.
This will be more consistent with later naming choices.
2021-02-04 20:58:31 -08:00
David Goldblatt
20140629b4 Bin: Move stats closer to the mutex.
This is a slight cache locality optimization.
2021-02-04 14:10:43 -08:00
David Goldblatt
c259323ab3 Use ticker_geom_t for arena tcache decay. 2021-02-04 14:10:43 -08:00
David Goldblatt
8edfc5b170 Add ticker_geom_t.
This lets a single ticker object drive events across a large number of different
tick streams while sharing state.
2021-02-04 14:10:43 -08:00
David Goldblatt
3967329813 Arena: share bin offsets in a global.
This saves us a cache miss when lookup up the arena bin offset in a remote
arena during tcache flush.  All arenas share the base offset, and so we don't
need to look it up repeatedly for each arena.  Secondarily, it shaves 288 bytes
off the arena on, e.g., x86-64.
2021-02-04 14:10:43 -08:00
David Goldblatt
2fcbd18115 Cache bin: Don't reverse flush order.
The items we pick to flush matter a lot, but the order in which they get flushed
doesn't; just use forward scans.  This simplifies the accessing code, both in
terms of the C and the generated assembly (i.e. this speeds up the flush
pathways).
2021-02-04 14:10:43 -08:00
David Goldblatt
4c46e11365 Cache an arena's index in the arena.
This saves us a pointer hop down some perf-sensitive paths.
2021-02-04 14:10:43 -08:00
David Goldblatt
229994a204 Tcache flush: keep common path state in registers.
By carefully force-inlining the division constants and the operation sum count,
we can eliminate redundant operations in the arena-level dalloc function.  Do
so.
2021-02-04 14:10:43 -08:00
David Goldblatt
31a629c3de Tcache flush: prefetch edata contents.
This frontloads more of the miss latency.  It also moves it to a pathway where
we have not yet acquired any locks, so that it should (hopefully) reduce hold
times.
2021-02-04 14:10:43 -08:00
David Goldblatt
9f9247a62e Tcache fluhing: increase cache miss parallelism.
In practice, many rtree_leaf_elm accesses are cache misses.  By restructuring,
we can make it more likely that these misses occur without blocking us from
starting later lookups, taking more of those misses in parallel.
2021-02-04 14:10:43 -08:00
David Goldblatt
181ba7fd4d Tcache flush: Add an emap "batch lookup" path.
For now this is a no-op; but the interface is a little more flexible for our
purposes.
2021-02-04 14:10:43 -08:00
David Goldblatt
c007c537ff Tcache flush: Unify edata lookup path. 2021-02-04 14:10:43 -08:00
David CARLIER
35a8552605 Mac OS: Tag mapped pages.
This can be used to help profiling tools (e.g. vmmap) identify the
sources of mappings more specifically.
2021-02-03 15:05:53 -08:00
Yinan Zhang
f6699803e2 Fix duration in prof log 2021-01-25 16:38:38 -08:00
Azat Khuzhin
a943172b73 Add runtime detection for MADV_DONTNEED zeroes pages (mostly for qemu)
qemu does not support this, yet [1], and you can get very tricky assert
if you will run program with jemalloc in use under qemu:

    <jemalloc>: ../contrib/jemalloc/src/extent.c:1195: Failed assertion: "p[i] == 0"

  [1]: https://patchwork.kernel.org/patch/10576637/

Here is a simple example that shows the problem [2]:

    // Gist to check possible issues with MADV_DONTNEED
    // For example it does not supported by qemu user
    // There is a patch for this [1], but it hasn't been applied.
    //   [1]: https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg05422.html

    #include <sys/mman.h>
    #include <stdio.h>
    #include <stddef.h>
    #include <assert.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        void *addr = mmap(NULL, 1<<16, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        memset(addr, 'A', 1<<16);

        if (!madvise(addr, 1<<16, MADV_DONTNEED)) {
            puts("MADV_DONTNEED does not return error. Check memory.");
            for (int i = 0; i < 1<<16; ++i) {
                assert(((unsigned char *)addr)[i] == 0);
            }
        } else {
            perror("madvise");
        }

        if (munmap(addr, 1<<16)) {
            perror("munmap");
            return 1;
        }

        return 0;
    }

  ### unpatched qemu

      $ qemu-x86_64-static /tmp/test-MADV_DONTNEED
      MADV_DONTNEED does not return error. Check memory.
      test-MADV_DONTNEED: /tmp/test-MADV_DONTNEED.c:19: main: Assertion `((unsigned char *)addr)[i] == 0' failed.
      qemu: uncaught target signal 6 (Aborted) - core dumped
      Aborted (core dumped)

  ### patched qemu (by returning ENOSYS error)

      $ qemu-x86_64 /tmp/test-MADV_DONTNEED
      madvise: Success

  ### patch for qemu to return ENOSYS

      diff --git a/linux-user/syscall.c b/linux-user/syscall.c
      index 897d20c076..5540792e0e 100644
      --- a/linux-user/syscall.c
      +++ b/linux-user/syscall.c
      @@ -11775,7 +11775,7 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
                  turns private file-backed mappings into anonymous mappings.
                  This will break MADV_DONTNEED.
                  This is a hint, so ignoring and returning success is ok.  */
      -        return 0;
      +        return ENOSYS;
       #endif
       #ifdef TARGET_NR_fcntl64
           case TARGET_NR_fcntl64:

  [2]: https://gist.github.com/azat/12ba2c825b710653ece34dba7f926ece

v2:
- review fixes
- add opt_dont_trust_madvise
v3:
- review fixes
- rename opt_dont_trust_madvise to opt_trust_madvise
2021-01-20 20:08:30 -08:00
David Goldblatt
a011c4c22d cache_bin: Separate out local and remote accesses.
This fixes an incorrect debug-mode assert:
- T1 starts an arena stats update and reads stack_head from another thread's
  cache bin, when that cache bin has 1 item in it.
- T2 allocates from that cache bin.  The cache_bin's stack_head now points to a
  NULL pointer, since the cache bin is empty.
- T1 Re-reads the cache_bin's stack_head to perform an assertion check (since it
  previously saw that the bin was empty, whatever stack_head points to should be
  non-NULL).
2021-01-08 14:18:08 -08:00
Yinan Zhang
14d689c0f9 Add prof stats mutex stats 2021-01-07 20:39:49 -08:00
Yinan Zhang
9f71b5779b Output prof stats in stats print 2021-01-07 20:39:49 -08:00
Yinan Zhang
1f1a0231ed Split macros for initializing stats headers 2021-01-07 20:39:49 -08:00
Yinan Zhang
54f3351f1f Add mallctl for prof stats fetching 2021-01-07 20:39:49 -08:00
Yinan Zhang
40fa4d29d3 Track per size class internal fragmentation 2021-01-07 20:39:49 -08:00
Yinan Zhang
afa489c3c5 Record request size in prof info 2021-01-07 20:39:49 -08:00
David Goldblatt
f9bb8dedef Un-force-inline do_rallocx.
The additional overhead of the function-call setup and flags checking is
relatively small, but costs us the replication of the entire realloc pathway in
terms of size.
2021-01-04 14:55:49 -08:00
David Goldblatt
a9fa2defdb Add JEMALLOC_COLD, and mark some functions cold.
This hints to the compiler that it should care more about space than CPU (among
other things).  In cases where the compiler lacks profile-guided information,
this can be a substantial space savings.

For now, we mark the mallctl or atexit driven profiling and stats functions that
take up the most space.
2021-01-04 14:55:49 -08:00
David Goldblatt
5d8e70ab26 prof_recent: cassert(config_prof) more often.
This tells the compiler that these functions are never called, which lets them
be optimized away in builds where profiling is disabled.
2021-01-04 14:55:49 -08:00
David Goldblatt
83cad746ae prof_log: cassert(config_prof) in public functions
This lets the compiler infer that the code is dead in builds where profiling is
enabled, saving on space there.
2021-01-04 14:55:49 -08:00
David Goldblatt
526180b76d Extent.c: Avoid an rtree NULL-check.
The edge case in which pages_map returns (void *)PAGE can trigger an incorrect
assertion failure.  Avoid it.
2021-01-04 14:50:49 -08:00
Yinan Zhang
b35ac00d58 Do not bump to large size for page aligned request 2020-12-29 17:09:58 -08:00
Yinan Zhang
8a56d6b636 Add last-N mutex stats 2020-12-29 09:44:19 -08:00
Yinan Zhang
22d62d8cbd Handle ending gap properly for HPA stats 2020-12-18 16:40:57 -08:00
Yinan Zhang
6c5a3a24dd Omit bin stats rows with no data 2020-12-18 16:40:57 -08:00
Yinan Zhang
ea013d8fa4 Enforce realloc sizing stability 2020-12-18 11:41:52 -08:00
Yinan Zhang
74bd63b203 Optimize stats print using partial name-to-mib 2020-12-18 10:39:58 -08:00
Yinan Zhang
4557c0a67d Enable ctl on partial mib and partial name 2020-12-18 10:39:58 -08:00
Yinan Zhang
006dd0414e Add partial name-to-mib functionality 2020-12-18 10:39:58 -08:00
Yinan Zhang
f2e1a5be77 Do not fail on partial ctl path for ctl_nametomib()
We do not fail on partial ctl path when the given `mib` array is
shorter than the given name, and we should keep the behavior the
same in the reverse case, which I feel is also the more natural way.
2020-12-18 10:39:58 -08:00
Yinan Zhang
6ab181d2b7 Extract node lookup given mib input 2020-12-18 10:39:58 -08:00
Yinan Zhang
3a627b9674 No need to record all nodes in ctl_lookup() 2020-12-18 10:39:58 -08:00
Yinan Zhang
91e006c4c2 Enable ctl_lookup() to start from arbitrary node 2020-12-18 10:39:58 -08:00
Jin Qian
4e3fe218e9 Use posix_madvise to purge pages when available 2020-12-18 10:05:59 -08:00
David Goldblatt
1e3b8636ff HPA: Remove unused malloc_conf options. 2020-12-08 12:10:48 -08:00
Aditya Kumar
9522ae41d6 Move n_search outside of assert as reported by static analyzer 2020-12-07 06:49:27 -08:00
David Goldblatt
a559caf74a hpdata: Strengthen assertions.
Now that we have flat bitmap bit counting functions, we can easily assert that
nfree is always correct.  While we're tightening up this code, enforce
consistency on API boundaries as well.
2020-12-07 06:21:08 -08:00
David Goldblatt
3ed0b4e8a3 HPA: Add an nevictions counter.
I.e. the number of times we've purged a hugepage-sized region.
2020-12-07 06:21:08 -08:00
David Goldblatt
fffcefed33 malloc_conf: Clarify HPA options. 2020-12-07 06:21:08 -08:00
David Goldblatt
f7cf23aa4d psset: Relegate alloc/dalloc to test code.
This is no longer part of the "core" functionality; we only need the stub
implementations as an end-to-end test of hpdata + psset interactions when
metadata is being modified.  Treat them accordingly.
2020-12-07 06:21:08 -08:00
David Goldblatt
f9299ca572 HPA: Use psset fit/insert/remove.
This will let us remove alloc_new and alloc_reuse functions from the psset.
2020-12-07 06:21:08 -08:00
David Goldblatt
0971e1e4e3 hpdata: Use addr/size instead of begin/npages.
This is easier for the users of the hpdata.
2020-12-07 06:21:08 -08:00
David Goldblatt
5228d869ee psset: Use fit/insert/remove as basis functions.
All other functionality can be implemented in terms of these; doing so (while
retaining the same API) will be convenient for subsequent refactors.
2020-12-07 06:21:08 -08:00
David Goldblatt
089f8fa442 Move hpdata bitmap logic out of the psset. 2020-12-07 06:21:08 -08:00
David Goldblatt
ca30b5db2b Introduce hpdata_t.
Using an edata_t both for hugepages and the allocations within those hugepages
was convenient at first, but has outlived its usefulness.  Representing
hugepages explicitly, with their own data structure, will make future
development easier.
2020-12-07 06:21:08 -08:00
David Goldblatt
43af63fff4 HPA: Manage whole hugepages at a time.
This redesigns the HPA implementation to allow us to manage hugepages all at
once, locally, without relying on a global fallback.
2020-12-07 06:21:08 -08:00
David Goldblatt
c1b2a77933 psset: Move in stats.
A later change will benefit from having these functions pulled into a
psset-module set of functions.
2020-12-07 06:21:08 -08:00
David Goldblatt
d0a991d47b psset: Add insert/remove functions.
These will allow us to (for instance) move pageslabs from a psset dedicated to
not-yet-hugeified pages to one dedicated to hugeified ones.
2020-12-07 06:21:08 -08:00
David Goldblatt
d438296b1f narenas_ratio: Accept fractional values.
With recent scalability improvements to the HPA, we're experimenting with much
lower arena counts; this gets annoying when trying to test across different
hardware configurations using only the narenas setting.
2020-12-04 23:48:19 -08:00
David Goldblatt
ecd39418ac Add fxp: A fixed-point math library.
This will be used in the next commit to allow non-integer values for
narenas_ratio.
2020-12-04 23:48:19 -08:00
David Carlier
520b75fa2d utrace support with label based signature. 2020-11-30 11:43:00 -08:00
Yinan Zhang
92e189be8b Add some comments to the batch allocation logic flow 2020-11-16 20:58:01 -08:00
Yinan Zhang
d96e4525ad Route batch allocation of small batch size to tcache 2020-11-16 20:58:01 -08:00
Yinan Zhang
566c4a8594 Slight changes to cache bin internal functions 2020-11-16 20:58:01 -08:00
Yinan Zhang
9545c2cd36 Add sample interval to prof last-N dump 2020-11-13 15:33:27 -08:00
David Goldblatt
cf2549a149 Add a per-arena oversize_threshold.
This can let manual arenas trade off memory and CPU the way auto arenas do.
2020-11-13 13:45:35 -08:00
David Goldblatt
4ca3d91e96 Rename geom_grow -> exp_grow.
This was promised in the review of the introduction of geom_grow, but would have
been painful to do there because of the series that introduced it.  Now that
those are comitted, renaming is easier.
2020-11-13 13:42:33 -08:00
David Goldblatt
b4c37a6e81 Rename edata_tree_t -> edata_avail_t.
This isn't a tree any more, and it mildly irritates me any time I see it.
2020-11-13 13:42:11 -08:00