Commit Graph

102 Commits

Author SHA1 Message Date
Jason Evans
a4e1888d1a Simplify extent_node_t and add extent_node_init(). 2015-02-17 15:13:52 -08:00
Jason Evans
ee41ad409a Integrate whole chunks into unused dirty page purging machinery.
Extend per arena unused dirty page purging to manage unused dirty chunks
in aaddtion to unused dirty runs.  Rather than immediately unmapping
deallocated chunks (or purging them in the --disable-munmap case), store
them in a separate set of trees, chunks_[sz]ad_dirty.  Preferrentially
allocate dirty chunks.  When excessive unused dirty pages accumulate,
purge runs and chunks in ingegrated LRU order (and unmap chunks in the
--enable-munmap case).

Refactor extent_node_t to provide accessor functions.
2015-02-16 21:02:17 -08:00
Jason Evans
2195ba4e1f Normalize *_link and link_* fields to all be *_link. 2015-02-15 16:43:52 -08:00
Jason Evans
88fef7ceda Refactor huge_*() calls into arena internals.
Make redirects to the huge_*() API the arena code's responsibility,
since arenas now take responsibility for all allocation sizes.
2015-02-12 14:06:37 -08:00
Jason Evans
cbf3a6d703 Move centralized chunk management into arenas.
Migrate all centralized data structures related to huge allocations and
recyclable chunks into arena_t, so that each arena can manage huge
allocations and recyclable virtual memory completely independently of
other arenas.

Add chunk node caching to arenas, in order to avoid contention on the
base allocator.

Use chunks_rtree to look up huge allocations rather than a red-black
tree.  Maintain a per arena unsorted list of huge allocations (which
will be needed to enumerate huge allocations during arena reset).

Remove the --enable-ivsalloc option, make ivsalloc() always available,
and use it for size queries if --enable-debug is enabled.  The only
practical implications to this removal are that 1) ivsalloc() is now
always available during live debugging (and the underlying radix tree is
available during core-based debugging), and 2) size query validation can
no longer be enabled independent of --enable-debug.

Remove the stats.chunks.{current,total,high} mallctls, and replace their
underlying statistics with simpler atomically updated counters used
exclusively for gdump triggering.  These statistics are no longer very
useful because each arena manages chunks independently, and per arena
statistics provide similar information.

Simplify chunk synchronization code, now that base chunk allocation
cannot cause recursive lock acquisition.
2015-02-12 00:15:56 -08:00
Jason Evans
1cb181ed63 Implement explicit tcache support.
Add the MALLOCX_TCACHE() and MALLOCX_TCACHE_NONE macros, which can be
used in conjunction with the *allocx() API.

Add the tcache.create, tcache.flush, and tcache.destroy mallctls.

This resolves #145.
2015-02-09 17:44:48 -08:00
Mike Hommey
6505733012 Make opt.lg_dirty_mult work as documented
The documentation for opt.lg_dirty_mult says:
    Per-arena minimum ratio (log base 2) of active to dirty
    pages.  Some dirty unused pages may be allowed to accumulate,
    within the limit set by the ratio (or one chunk worth of dirty
    pages, whichever is greater) (...)

The restriction in parentheses currently doesn't happen. This makes
jemalloc aggressively madvise(), which in turns increases the amount
of page faults significantly.

For instance, this resulted in several(!) hundred(!) milliseconds
startup regression on Firefox for Android.

This may require further tweaking, but starting with actually doing
what the documentation says is a good start.
2015-02-04 07:16:55 +09:00
Jason Evans
4581b97809 Implement metadata statistics.
There are three categories of metadata:

- Base allocations are used for bootstrap-sensitive internal allocator
  data structures.
- Arena chunk headers comprise pages which track the states of the
  non-metadata pages.
- Internal allocations differ from application-originated allocations
  in that they are for internal use, and that they are omitted from heap
  profiles.

The metadata statistics comprise the metadata categories as follows:

- stats.metadata: All metadata -- base + arena chunk headers + internal
  allocations.
- stats.arenas.<i>.metadata.mapped: Arena chunk headers.
- stats.arenas.<i>.metadata.allocated: Internal allocations.  This is
  reported separately from the other metadata statistics because it
  overlaps with the allocated and active statistics, whereas the other
  metadata statistics do not.

Base allocations are not reported separately, though their magnitude can
be computed by subtracting the arena-specific metadata.

This resolves #163.
2015-01-23 23:34:43 -08:00
Guilherme Goncalves
9c6a8d3b0c Move variable declaration to the top its block for MSVC compatibility. 2014-12-17 14:46:35 -02:00
Guilherme Goncalves
2c5cb613df Introduce two new modes of junk filling: "alloc" and "free".
In addition to true/false, opt.junk can now be either "alloc" or "free",
giving applications the possibility of junking memory only on allocation
or deallocation.

This resolves #172.
2014-12-14 17:07:26 -08:00
Jason Evans
e12eaf93dc Style and spelling fixes. 2014-12-08 16:34:04 -08:00
Jason Evans
d49cb68b9e Fix more pointer arithmetic undefined behavior.
Reported by Guilherme Gonçalves.

This resolves #166.
2014-11-17 10:31:59 -08:00
Jason Evans
2012d5a560 Fix pointer arithmetic undefined behavior.
Reported by Denis Denisov.
2014-11-17 09:54:49 -08:00
Jason Evans
2b2f6dc1e4 Disable arena_dirty_count() validation. 2014-11-01 02:29:10 -07:00
Daniel Micay
809b0ac391 mark huge allocations as unlikely
This cleans up the fast path a bit more by moving away more code.
2014-10-30 17:06:38 -07:00
Jason Evans
af1f592763 Use JEMALLOC_INLINE_C everywhere it's appropriate. 2014-10-30 16:38:08 -07:00
Daniel Micay
a9ea10d27c use sized deallocation internally for ralloc
The size of the source allocation is known at this point, so reading the
chunk header can be avoided for the small size class fast path. This is
not very useful right now, but it provides a significant performance
boost with an alternate ralloc entry point taking the old size.
2014-10-16 15:39:59 -04:00
Jason Evans
9b41ac909f Fix huge allocation statistics. 2014-10-14 22:20:00 -07:00
Jason Evans
3c4d92e82a Add per size class huge allocation statistics.
Add per size class huge allocation statistics, and normalize various
stats:
- Change the arenas.nlruns type from size_t to unsigned.
- Add the arenas.nhchunks and arenas.hchunks.<i>.size mallctl's.
- Replace the stats.arenas.<i>.bins.<j>.allocated mallctl with
  stats.arenas.<i>.bins.<j>.curregs .
- Add the stats.arenas.<i>.hchunks.<j>.nmalloc,
  stats.arenas.<i>.hchunks.<j>.ndalloc,
  stats.arenas.<i>.hchunks.<j>.nrequests, and
  stats.arenas.<i>.hchunks.<j>.curhchunks mallctl's.
2014-10-12 23:02:10 -07:00
Jason Evans
381c23dd9d Remove arena_dalloc_bin_run() clean page preservation.
Remove code in arena_dalloc_bin_run() that preserved the "clean" state
of trailing clean pages by splitting them into a separate run during
deallocation.  This was a useful mechanism for reducing dirty page
churn when bin runs comprised many pages, but bin runs are now quite
small.

Remove the nextind field from arena_run_t now that it is no longer
needed, and change arena_run_t's bin field (arena_bin_t *) to binind
(index_t).  These two changes remove 8 bytes of chunk header overhead
per page, which saves 1/512 of all arena chunk memory.
2014-10-10 23:01:03 -07:00
Jason Evans
fc0b3b7383 Add configure options.
Add:
  --with-lg-page
  --with-lg-page-sizes
  --with-lg-size-class-group
  --with-lg-quantum

Get rid of STATIC_PAGE_SHIFT, in favor of directly setting LG_PAGE.

Fix various edge conditions exposed by the configure options.
2014-10-09 22:44:37 -07:00
Jason Evans
8bb3198f72 Refactor/fix arenas manipulation.
Abstract arenas access to use arena_get() (or a0get() where appropriate)
rather than directly reading e.g. arenas[ind].  Prior to the addition of
the arenas.extend mallctl, the worst possible outcome of directly
accessing arenas was a stale read, but arenas.extend may allocate and
assign a new array to arenas.

Add a tsd-based arenas_cache, which amortizes arenas reads.  This
introduces some subtle bootstrapping issues, with tsd_boot() now being
split into tsd_boot[01]() to support tsd wrapper allocation
bootstrapping, as well as an arenas_cache_bypass tsd variable which
dynamically terminates allocation of arenas_cache itself.

Promote a0malloc(), a0calloc(), and a0free() to be generally useful for
internal allocation, and use them in several places (more may be
appropriate).

Abstract arena->nthreads management and fix a missing decrement during
thread destruction (recent tsd refactoring left arenas_cleanup()
unused).

Change arena_choose() to propagate OOM, and handle OOM in all callers.
This is important for providing consistent allocation behavior when the
MALLOCX_ARENA() flag is being used.  Prior to this fix, it was possible
for an OOM to result in allocation silently allocating from a different
arena than the one specified.
2014-10-07 23:14:57 -07:00
Jason Evans
155bfa7da1 Normalize size classes.
Normalize size classes to use the same number of size classes per size
doubling (currently hard coded to 4), across the intire range of size
classes.  Small size classes already used this spacing, but in order to
support this change, additional small size classes now fill [4 KiB .. 16
KiB).  Large size classes range from [16 KiB .. 4 MiB).  Huge size
classes now support non-multiples of the chunk size in order to fill (4
MiB .. 16 MiB).
2014-10-06 01:45:13 -07:00
Daniel Micay
a95018ee81 Attempt to expand huge allocations in-place.
This adds support for expanding huge allocations in-place by requesting
memory at a specific address from the chunk allocator.

It's currently only implemented for the chunk recycling path, although
in theory it could also be done by optimistically allocating new chunks.
On Linux, it could attempt an in-place mremap. However, that won't work
in practice since the heap is grown downwards and memory is not unmapped
(in a normal build, at least).

Repeated vector reallocation micro-benchmark:

    #include <string.h>
    #include <stdlib.h>

    int main(void) {
        for (size_t i = 0; i < 100; i++) {
            void *ptr = NULL;
            size_t old_size = 0;
            for (size_t size = 4; size < (1 << 30); size *= 2) {
                ptr = realloc(ptr, size);
                if (!ptr) return 1;
                memset(ptr + old_size, 0xff, size - old_size);
                old_size = size;
            }
            free(ptr);
        }
    }

The glibc allocator fails to do any in-place reallocations on this
benchmark once it passes the M_MMAP_THRESHOLD (default 128k) but it
elides the cost of copies via mremap, which is currently not something
that jemalloc can use.

With this improvement, jemalloc still fails to do any in-place huge
reallocations for the first outer loop, but then succeeds 100% of the
time for the remaining 99 iterations. The time spent doing allocations
and copies drops down to under 5%, with nearly all of it spent doing
purging + faulting (when huge pages are disabled) and the array memset.

An improved mremap API (MREMAP_RETAIN - #138) would be far more general
but this is a portable optimization and would still be useful on Linux
for xallocx.

Numbers with transparent huge pages enabled:

glibc (copies elided via MREMAP_MAYMOVE): 8.471s

jemalloc: 17.816s
jemalloc + no-op madvise: 13.236s

jemalloc + this commit: 6.787s
jemalloc + this commit + no-op madvise: 6.144s

Numbers with transparent huge pages disabled:

glibc (copies elided via MREMAP_MAYMOVE): 15.403s

jemalloc: 39.456s
jemalloc + no-op madvise: 12.768s

jemalloc + this commit: 15.534s
jemalloc + this commit + no-op madvise: 6.354s

Closes #137
2014-10-05 14:47:01 -07:00
Jason Evans
f11a6776c7 Fix OOM-related regression in arena_tcache_fill_small().
Fix an OOM-related regression in arena_tcache_fill_small() that caused
cache corruption that would almost certainly expose the application to
undefined behavior, usually in the form of an allocation request
returning an already-allocated region, or somewhat less likely, a freed
region that had already been returned to the arena, thus making it
available to the arena for any purpose.

This regression was introduced by
9c43c13a35 (Reverse tcache fill order.),
and was present in all releases from 2.2.0 through 3.6.0.

This resolves #98.
2014-10-05 13:05:10 -07:00
Jason Evans
551ebc4364 Convert to uniform style: cond == false --> !cond 2014-10-03 10:16:09 -07:00
Jason Evans
0c5dd03e88 Move small run metadata into the arena chunk header.
Move small run metadata into the arena chunk header, with multiple
expected benefits:
- Lower run fragmentation due to reduced run sizes; runs are more likely
  to completely drain when there are fewer total regions.
- Improved cache behavior.  Prior to this change, run headers were
  always page-aligned, which put extra pressure on some CPU cache sets.
  The degree to which this was a problem was hardware dependent, but it
  likely hurt some even for the most advanced modern hardware.
- Buffer overruns/underruns are less likely to corrupt allocator
  metadata.
- Size classes between 4 KiB and 16 KiB become reasonable to support
  without any special handling, and the runs are small enough that dirty
  unused pages aren't a significant concern.
2014-09-29 01:31:39 -07:00
Jason Evans
5460aa6f66 Convert all tsd variables to reside in a single tsd structure. 2014-09-23 02:36:08 -07:00
Jason Evans
9c640bfdd4 Apply likely()/unlikely() to allocation/deallocation fast paths. 2014-09-11 17:01:58 -07:00
Jason Evans
b718cf77e9 Optimize [nmd]alloc() fast paths.
Optimize [nmd]alloc() fast paths such that the (flags == 0) case is
streamlined, flags decoding only happens to the minimum degree
necessary, and no conditionals are repeated.
2014-09-07 14:40:19 -07:00
Qinfan Wu
ff6a31d3b9 Refactor chunk map.
Break the chunk map into two separate arrays, in order to improve cache
locality. This is related to issue #23.
2014-09-04 22:22:52 -07:00
Jason Evans
070b3c3fbd Fix and refactor runs_dirty-based purging.
Fix runs_dirty-based purging to also purge dirty pages in the spare
chunk.

Refactor runs_dirty manipulation into arena_dirty_{insert,remove}(), and
move the arena->ndirty accounting into those functions.

Remove the u.ql_link field from arena_chunk_map_t, and get rid of the
enclosing union for u.rb_link, since only rb_link remains.

Remove the ndirty field from arena_chunk_t.
2014-08-14 14:45:58 -07:00
Qinfan Wu
e8a2fd83a2 arena->npurgatory is no longer needed since we drop arena's lock
after stashing all the purgeable runs.
2014-08-12 09:50:01 -07:00
Qinfan Wu
90737fcda1 Remove chunks_dirty tree, nruns_avail and nruns_adjac since we no
longer need to maintain the tree for dirty page purging.
2014-08-12 09:50:00 -07:00
Qinfan Wu
e970800c78 Purge dirty pages from the beginning of the dirty list. 2014-08-12 09:50:00 -07:00
Qinfan Wu
a244e5078e Add dirty page counting for debug 2014-08-12 09:50:00 -07:00
Qinfan Wu
04d60a132b Maintain all the dirty runs in a linked list for each arena 2014-08-12 09:50:00 -07:00
Jason Evans
1522937e9c Fix the cactive statistic.
Fix the cactive statistic to decrease (rather than increase) when active
memory decreases.  This regression was introduced by
aa5113b1fd (Refactor overly large/complex
functions) and first released in 3.5.0.
2014-08-06 23:43:39 -07:00
Qinfan Wu
ea73eb8f3e Reintroduce the comment that was removed in f9ff603. 2014-08-06 16:43:01 -07:00
Qinfan Wu
55c9aa1038 Fix the bug that causes not allocating free run with lowest address. 2014-08-06 16:10:08 -07:00
Richard Diamond
9c3a10fdf6 Try to use __builtin_ffsl if ffsl is unavailable.
Some platforms (like those using Newlib) don't have ffs/ffsl.  This
commit adds a check to configure.ac for __builtin_ffsl if ffsl isn't
found.  __builtin_ffsl performs the same function as ffsl, and has the
added benefit of being available on any platform utilizing
Gcc-compatible compiler.

This change does not address the used of ffs in the MALLOCX_ARENA()
macro.
2014-06-02 07:44:50 -07:00
Jason Evans
d04047cc29 Add size class computation capability.
Add size class computation capability, currently used only as validation
of the size class lookup tables.  Generalize the size class spacing used
for bins, for eventual use throughout the full range of allocation
sizes.
2014-05-28 21:06:46 -07:00
Jason Evans
e2deab7a75 Refactor huge allocation to be managed by arenas.
Refactor huge allocation to be managed by arenas (though the global
red-black tree of huge allocations remains for lookup during
deallocation).  This is the logical conclusion of recent changes that 1)
made per arena dss precedence apply to huge allocation, and 2) made it
possible to replace the per arena chunk allocation/deallocation
functions.

Remove the top level huge stats, and replace them with per arena huge
stats.

Normalize function names and types to *dalloc* (some were *dealloc*).

Remove the --enable-mremap option.  As jemalloc currently operates, this
is a performace regression for some applications, but planned work to
logarithmically space huge size classes should provide similar amortized
performance.  The motivation for this change was that mremap-based huge
reallocation forced leaky abstractions that prevented refactoring.
2014-05-15 22:36:41 -07:00
aravind
fb7fe50a88 Add support for user-specified chunk allocators/deallocators.
Add new mallctl endpoints "arena<i>.chunk.alloc" and
"arena<i>.chunk.dealloc" to allow userspace to configure
jemalloc's chunk allocator and deallocator on a per-arena
basis.
2014-05-12 10:46:03 -07:00
Jason Evans
3541a904d6 Refactor small_size2bin and small_bin2size.
Refactor small_size2bin and small_bin2size to be inline functions rather
than directly accessed arrays.
2014-04-16 17:14:33 -07:00
Jason Evans
3e3caf03af Merge pull request #73 from bmaurer/smallmalloc
Smaller malloc hot path
2014-04-16 16:33:21 -07:00
Ben Maurer
021136ce4d Create a const array with only a small bin to size map 2014-04-16 14:31:24 -07:00
Jason Evans
bd87b01999 Optimize Valgrind integration.
Forcefully disable tcache if running inside Valgrind, and remove
Valgrind calls in tcache-specific code.

Restructure Valgrind-related code to move most Valgrind calls out of the
fast path functions.

Take advantage of static knowledge to elide some branches in
JEMALLOC_VALGRIND_REALLOC().
2014-04-15 16:49:57 -07:00
Jason Evans
4d434adb14 Make dss non-optional, and fix an "arena.<i>.dss" mallctl bug.
Make dss non-optional on all platforms which support sbrk(2).

Fix the "arena.<i>.dss" mallctl to return an error if "primary" or
"secondary" precedence is specified, but sbrk(2) is not supported.
2014-04-15 12:09:48 -07:00
Jason Evans
9b0cbf0850 Remove support for non-prof-promote heap profiling metadata.
Make promotion of sampled small objects to large objects mandatory, so
that profiling metadata can always be stored in the chunk map, rather
than requiring one pointer per small region in each small-region page
run.  In practice the non-prof-promote code was only useful when using
jemalloc to track all objects and report them as leaks at program exit.
However, Valgrind is at least as good a tool for this particular use
case.

Furthermore, the non-prof-promote code is getting in the way of
some optimizations that will make heap profiling much cheaper for the
predominant use case (sampling a small representative proportion of all
allocations).
2014-04-11 14:24:51 -07:00