Commit Graph

400 Commits

Author SHA1 Message Date
Jason Evans
6591ff09d8 Fix arena_dalloc() performance regression.
Take into account large_pad when computing whether to pass the
deallocation request to tcache_dalloc_large(), so that the largest
cacheable size makes it back to tcache.  This regression was introduced
by 8a03cf039c (Implement cache index
randomization for large allocations.).
2015-05-19 17:44:45 -07:00
Jason Evans
fd5f9e43c3 Avoid atomic operations for dependent rtree reads. 2015-05-15 17:02:30 -07:00
Jason Evans
c451831264 Fix type punning in calls to atomic operation functions. 2015-05-07 22:35:40 -07:00
Jason Evans
8a03cf039c Implement cache index randomization for large allocations.
Extract szad size quantization into {extent,run}_quantize(), and .
quantize szad run sizes to the union of valid small region run sizes and
large run sizes.

Refactor iteration in arena_run_first_fit() to use
run_quantize{,_first,_next(), and add support for padded large runs.

For large allocations that have no specified alignment constraints,
compute a pseudo-random offset from the beginning of the first backing
page that is a multiple of the cache line size.  Under typical
configurations with 4-KiB pages and 64-byte cache lines this results in
a uniform distribution among 64 page boundary offsets.

Add the --disable-cache-oblivious option, primarily intended for
performance testing.

This resolves #13.
2015-05-06 13:27:39 -07:00
Jason Evans
562d266511 Add the "stats.arenas.<i>.lg_dirty_mult" mallctl. 2015-03-24 16:41:38 -07:00
Jason Evans
4acd75a694 Add the "stats.allocated" mallctl. 2015-03-23 17:26:53 -07:00
Igor Podlesny
8ad6bf360f Fix indentation inconsistencies. 2015-03-22 00:09:04 -07:00
Jason Evans
e0a08a1496 Restore --enable-ivsalloc.
However, unlike before it was removed do not force --enable-ivsalloc
when Darwin zone allocator integration is enabled, since the zone
allocator code uses ivsalloc() regardless of whether
malloc_usable_size() and sallocx() do.

This resolves #211.
2015-03-18 21:06:58 -07:00
Jason Evans
8d6a3e8321 Implement dynamic per arena control over dirty page purging.
Add mallctls:
- arenas.lg_dirty_mult is initialized via opt.lg_dirty_mult, and can be
  modified to change the initial lg_dirty_mult setting for newly created
  arenas.
- arena.<i>.lg_dirty_mult controls an individual arena's dirty page
  purging threshold, and synchronously triggers any purging that may be
  necessary to maintain the constraint.
- arena.<i>.chunk.purge allows the per arena dirty page purging function
  to be replaced.

This resolves #93.
2015-03-18 18:55:33 -07:00
Mike Hommey
c9db461ffb Use InterlockedCompareExchange instead of non-existing InterlockedCompareExchange32 2015-03-17 12:09:30 +09:00
Jason Evans
04211e2266 Fix heap profiling regressions.
Remove the prof_tctx_state_destroying transitory state and instead add
the tctx_uid field, so that the tuple <thr_uid, tctx_uid> uniquely
identifies a tctx.  This assures that tctx's are well ordered even when
more than two with the same thr_uid coexist.  A previous attempted fix
based on prof_tctx_state_destroying was only sufficient for protecting
against two coexisting tctx's, but it also introduced a new dumping
race.

These regressions were introduced by
602c8e0971 (Implement per thread heap
profiling.) and 764b00023f (Fix a heap
profiling regression.).
2015-03-16 15:11:06 -07:00
Jason Evans
764b00023f Fix a heap profiling regression.
Add the prof_tctx_state_destroying transitionary state to fix a race
between a thread destroying a tctx and another thread creating a new
equivalent tctx.

This regression was introduced by
602c8e0971 (Implement per thread heap
profiling.).
2015-03-14 14:01:35 -07:00
Jason Evans
fbd8d773ad Fix unsigned comparison underflow.
These bugs only affected tests and debug builds.
2015-03-11 23:14:50 -07:00
Jason Evans
f5c8f37259 Normalize rdelm/rd structure field naming. 2015-03-10 18:29:49 -07:00
Jason Evans
38e42d311c Refactor dirty run linkage to reduce sizeof(extent_node_t). 2015-03-10 18:15:40 -07:00
Jason Evans
97c04a9383 Use first-fit rather than first-best-fit run/chunk allocation.
This tends to more effectively pack active memory toward low addresses.
However, additional tree searches are required in many cases, so whether
this change stands the test of time will depend on real-world
benchmarks.
2015-03-06 20:21:41 -08:00
Jason Evans
f044bb219e Change default chunk size from 4 MiB to 256 KiB.
Recent changes have improved huge allocation scalability, which removes
upward pressure to set the chunk size so large that huge allocations are
rare.  Smaller chunks are more likely to completely drain, so set the
default to the smallest size that doesn't leave excessive unusable
trailing space in chunk headers.
2015-03-06 20:18:34 -08:00
Mike Hommey
4d871f73af Preserve LastError when calling TlsGetValue
TlsGetValue has a semantic difference with pthread_getspecific, in that it
can return a non-error NULL value, so it always sets the LastError.
But allocator callers may not be expecting calling e.g. free() to change
the value of the last error, so preserve it.
2015-03-04 09:50:33 -08:00
Mike Hommey
7c46fd59cc Make --without-export actually work
9906660 added a --without-export configure option to avoid exporting
jemalloc symbols, but the option didn't actually work.
2015-03-04 21:49:15 +09:00
Jason Evans
99bd94fb65 Fix chunk cache races.
These regressions were introduced by
ee41ad409a (Integrate whole chunks into
unused dirty page purging machinery.).
2015-02-18 16:40:53 -08:00
Jason Evans
738e089a2e Rename "dirty chunks" to "cached chunks".
Rename "dirty chunks" to "cached chunks", in order to avoid overloading
the term "dirty".

Fix the regression caused by 339c2b23b2
(Fix chunk_unmap() to propagate dirty state.), and actually address what
that change attempted, which is to only purge chunks once, and propagate
whether zeroed pages resulted into chunk_record().
2015-02-18 01:15:50 -08:00
Jason Evans
339c2b23b2 Fix chunk_unmap() to propagate dirty state.
Fix chunk_unmap() to propagate whether a chunk is dirty, and modify
dirty chunk purging to record this information so it can be passed to
chunk_unmap().  Since the broken version of chunk_unmap() claimed that
all chunks were clean, this resulted in potential memory corruption for
purging implementations that do not zero (e.g. MADV_FREE).

This regression was introduced by
ee41ad409a (Integrate whole chunks into
unused dirty page purging machinery.).
2015-02-17 22:25:56 -08:00
Jason Evans
47701b22ee arena_chunk_dirty_node_init() --> extent_node_dirty_linkage_init() 2015-02-17 22:23:10 -08:00
Jason Evans
eafebfdfbe Remove obsolete type arena_chunk_miscelms_t. 2015-02-17 16:12:31 -08:00
Jason Evans
a4e1888d1a Simplify extent_node_t and add extent_node_init(). 2015-02-17 15:13:52 -08:00
Jason Evans
ee41ad409a Integrate whole chunks into unused dirty page purging machinery.
Extend per arena unused dirty page purging to manage unused dirty chunks
in aaddtion to unused dirty runs.  Rather than immediately unmapping
deallocated chunks (or purging them in the --disable-munmap case), store
them in a separate set of trees, chunks_[sz]ad_dirty.  Preferrentially
allocate dirty chunks.  When excessive unused dirty pages accumulate,
purge runs and chunks in ingegrated LRU order (and unmap chunks in the
--enable-munmap case).

Refactor extent_node_t to provide accessor functions.
2015-02-16 21:02:17 -08:00
Jason Evans
40ab8f98e4 Remove more obsolete (incorrect) assertions.
This regression was introduced by
88fef7ceda (Refactor huge_*() calls into
arena internals.), and went undetected because of the --enable-debug
regression.
2015-02-15 20:26:45 -08:00
Jason Evans
cb9b44914e Remove obsolete (incorrect) assertions.
This regression was introduced by
88fef7ceda (Refactor huge_*() calls into
arena internals.), and went undetected because of the --enable-debug
regression.
2015-02-15 20:13:28 -08:00
Jason Evans
2195ba4e1f Normalize *_link and link_* fields to all be *_link. 2015-02-15 16:43:52 -08:00
Jason Evans
41cfe03f39 If MALLOCX_ARENA(a) is specified, use it during tcache fill. 2015-02-13 15:28:56 -08:00
Jason Evans
5f7140b045 Make prof_tctx accesses atomic.
Although exceedingly unlikely, it appears that writes to the prof_tctx
field of arena_chunk_map_misc_t could be reordered such that a stale
value could be read during deallocation, with profiler metadata
corruption and invalid pointer dereferences being the most likely
effects.
2015-02-12 15:54:53 -08:00
Jason Evans
88fef7ceda Refactor huge_*() calls into arena internals.
Make redirects to the huge_*() API the arena code's responsibility,
since arenas now take responsibility for all allocation sizes.
2015-02-12 14:06:37 -08:00
Jason Evans
cbf3a6d703 Move centralized chunk management into arenas.
Migrate all centralized data structures related to huge allocations and
recyclable chunks into arena_t, so that each arena can manage huge
allocations and recyclable virtual memory completely independently of
other arenas.

Add chunk node caching to arenas, in order to avoid contention on the
base allocator.

Use chunks_rtree to look up huge allocations rather than a red-black
tree.  Maintain a per arena unsorted list of huge allocations (which
will be needed to enumerate huge allocations during arena reset).

Remove the --enable-ivsalloc option, make ivsalloc() always available,
and use it for size queries if --enable-debug is enabled.  The only
practical implications to this removal are that 1) ivsalloc() is now
always available during live debugging (and the underlying radix tree is
available during core-based debugging), and 2) size query validation can
no longer be enabled independent of --enable-debug.

Remove the stats.chunks.{current,total,high} mallctls, and replace their
underlying statistics with simpler atomically updated counters used
exclusively for gdump triggering.  These statistics are no longer very
useful because each arena manages chunks independently, and per arena
statistics provide similar information.

Simplify chunk synchronization code, now that base chunk allocation
cannot cause recursive lock acquisition.
2015-02-12 00:15:56 -08:00
Jason Evans
051eae8cc5 Remove unnecessary xchg* lock prefixes. 2015-02-10 16:05:52 -08:00
Jason Evans
1cb181ed63 Implement explicit tcache support.
Add the MALLOCX_TCACHE() and MALLOCX_TCACHE_NONE macros, which can be
used in conjunction with the *allocx() API.

Add the tcache.create, tcache.flush, and tcache.destroy mallctls.

This resolves #145.
2015-02-09 17:44:48 -08:00
Jason Evans
23694b0745 Fix arena_get() for (!init_if_missing && refresh_if_missing) case.
Fix arena_get() to refresh the cache as needed in the (!init_if_missing
&& refresh_if_missing) case.

This flaw was introduced by the initial arena_get() implementation,
which was part of 8bb3198f72 (Refactor/fix
arenas manipulation.).
2015-02-09 17:43:10 -08:00
Jason Evans
8d0e04d42f Refactor rtree to be lock-free.
Recent huge allocation refactoring associates huge allocations with
arenas, but it remains necessary to quickly look up huge allocation
metadata during reallocation/deallocation.  A global radix tree remains
a good solution to this problem, but locking would have become the
primary bottleneck after (upcoming) migration of chunk management from
global to per arena data structures.

This lock-free implementation uses double-checked reads to traverse the
tree, so that in the steady state, each read or write requires only a
single atomic operation.

This implementation also assures that no more than two tree levels
actually exist, through a combination of careful virtual memory
allocation which makes large sparse nodes cheap, and skipping the root
node on x64 (possible because the top 16 bits are all 0 in practice).
2015-02-04 16:51:53 -08:00
Jason Evans
c810fcea1f Add (x != 0) assertion to lg_floor(x).
lg_floor(0) is undefined, but depending on compiler options may not
cause a crash.  This assertion makes it harder to accidentally abuse
lg_floor().
2015-02-04 16:51:53 -08:00
Jason Evans
f500a10b2e Refactor base_alloc() to guarantee demand-zeroed memory.
Refactor base_alloc() to guarantee that allocations are carved from
demand-zeroed virtual memory.  This supports sparse data structures such
as multi-page radix tree nodes.

Enhance base_alloc() to keep track of fragments which were too small to
support previous allocation requests, and try to consume them during
subsequent requests.  This becomes important when request sizes commonly
approach or exceed the chunk size (as could radix tree node
allocations).
2015-02-04 16:51:53 -08:00
Jason Evans
918a1a5b3f Reduce extent_node_t size to fit in one cache line. 2015-02-04 16:51:53 -08:00
Jason Evans
a55dfa4b0a Implement more atomic operations.
- atomic_*_p().
- atomic_cas_*().
- atomic_write_*().
2015-02-04 16:50:05 -08:00
Jason Evans
f8723572d8 Add missing prototypes for bootstrap_{malloc,calloc,free}(). 2015-02-04 16:50:04 -08:00
Jason Evans
5b8ed5b7c9 Implement the prof.gdump mallctl.
This feature makes it possible to toggle the gdump feature on/off during
program execution, whereas the the opt.prof_dump mallctl value can only
be set during program startup.

This resolves #72.
2015-01-25 21:21:35 -08:00
Jason Evans
4581b97809 Implement metadata statistics.
There are three categories of metadata:

- Base allocations are used for bootstrap-sensitive internal allocator
  data structures.
- Arena chunk headers comprise pages which track the states of the
  non-metadata pages.
- Internal allocations differ from application-originated allocations
  in that they are for internal use, and that they are omitted from heap
  profiles.

The metadata statistics comprise the metadata categories as follows:

- stats.metadata: All metadata -- base + arena chunk headers + internal
  allocations.
- stats.arenas.<i>.metadata.mapped: Arena chunk headers.
- stats.arenas.<i>.metadata.allocated: Internal allocations.  This is
  reported separately from the other metadata statistics because it
  overlaps with the allocated and active statistics, whereas the other
  metadata statistics do not.

Base allocations are not reported separately, though their magnitude can
be computed by subtracting the arena-specific metadata.

This resolves #163.
2015-01-23 23:34:43 -08:00
Jason Evans
10aff3f3e1 Refactor bootstrapping to delay tsd initialization.
Refactor bootstrapping to delay tsd initialization, primarily to support
integration with FreeBSD's libc.

Refactor a0*() for internal-only use, and add the
bootstrap_{malloc,calloc,free}() API for use by FreeBSD's libc.  This
separation limits use of the a0*() functions to metadata allocation,
which doesn't require malloc/calloc/free API compatibility.

This resolves #170.
2015-01-22 14:04:27 -08:00
Abhishek Kulkarni
b617df81bb Add missing symbols to private_symbols.txt.
This resolves #185.
2015-01-21 12:44:35 -08:00
Guilherme Goncalves
51f86346c0 Add a isblank definition for MSVC < 2013 2015-01-09 14:33:46 -08:00
Guilherme Goncalves
2c5cb613df Introduce two new modes of junk filling: "alloc" and "free".
In addition to true/false, opt.junk can now be either "alloc" or "free",
giving applications the possibility of junking memory only on allocation
or deallocation.

This resolves #172.
2014-12-14 17:07:26 -08:00
Daniel Micay
b74041fb6e Ignore MALLOC_CONF in set{uid,gid,cap} binaries.
This eliminates the malloc tunables as tools for an attacker.

Closes #173
2014-12-14 15:36:15 -08:00
Jason Evans
e12eaf93dc Style and spelling fixes. 2014-12-08 16:34:04 -08:00
Chih-hung Hsieh
59cd80e6c6 Add a C11 atomics-based implementation of atomic.h API. 2014-12-06 21:17:49 -08:00
Jason Evans
a18c2b1f15 Style fixes. 2014-12-05 17:49:47 -08:00
Daniel Micay
879e76a9e5 teach the dss chunk allocator to handle new_addr
This provides in-place expansion of huge allocations when the end of the
allocation is at the end of the sbrk heap. There's already the ability
to extend in-place via recycled chunks but this handles the initial
growth of the heap via repeated vector / string reallocations.

A possible future extension could allow realloc to go from the following:

    | huge allocation | recycled chunks |
                                        ^ dss_end

To a larger allocation built from recycled *and* new chunks:

    |                      huge allocation                      |
                                                                ^ dss_end

Doing that would involve teaching the chunk recycling code to request
new chunks to satisfy the request. The chunk_dss code wouldn't require
any further changes.

    #include <stdlib.h>

    int main(void) {
        size_t chunk = 4 * 1024 * 1024;
        void *ptr = NULL;
        for (size_t size = chunk; size < chunk * 128; size *= 2) {
            ptr = realloc(ptr, size);
            if (!ptr) return 1;
        }
    }

dss:secondary: 0.083s
dss:primary: 0.083s

After:

dss:secondary: 0.083s
dss:primary: 0.003s

The dss heap grows in the upwards direction, so the oldest chunks are at
the low addresses and they are used first. Linux prefers to grow the
mmap heap downwards, so the trick will not work in the *current* mmap
chunk allocator as a huge allocation will only be at the top of the heap
in a contrived case.
2014-11-28 16:11:19 -08:00
Guilherme Goncalves
a2136025c4 Remove extra definition of je_tsd_boot on win32. 2014-11-18 19:08:18 -02:00
Jason Evans
9cf2be0a81 Make quarantine_init() static. 2014-11-07 14:50:38 -08:00
Jason Evans
c002a5c800 Fix two quarantine regressions.
Fix quarantine to actually update tsd when expanding, and to avoid
double initialization (leaking the first quarantine) due to recursive
initialization.

This resolves #161.
2014-11-04 18:03:11 -08:00
Jason Evans
d7a9bab92d Fix arena_sdalloc() to use promoted size (second attempt).
Unlike the preceeding attempted fix, this version avoids the potential
for converting an invalid bin index to a size class.
2014-10-31 22:26:24 -07:00
Jason Evans
6da2e9d4f6 Fix arena_sdalloc() to use promoted size. 2014-10-31 17:08:13 -07:00
Jason Evans
cfc5706f69 Miscellaneous cleanups. 2014-10-30 23:18:45 -07:00
Daniel Micay
d33f834591 avoid redundant chunk header reads
* use sized deallocation in iralloct_realign
* iralloc and ixalloc always need the old size, so pass it in from the
  caller where it's often already calculated
2014-10-30 17:06:38 -07:00
Daniel Micay
809b0ac391 mark huge allocations as unlikely
This cleans up the fast path a bit more by moving away more code.
2014-10-30 17:06:38 -07:00
Jason Evans
9b41ac909f Fix huge allocation statistics. 2014-10-14 22:20:00 -07:00
Jason Evans
3c4d92e82a Add per size class huge allocation statistics.
Add per size class huge allocation statistics, and normalize various
stats:
- Change the arenas.nlruns type from size_t to unsigned.
- Add the arenas.nhchunks and arenas.hchunks.<i>.size mallctl's.
- Replace the stats.arenas.<i>.bins.<j>.allocated mallctl with
  stats.arenas.<i>.bins.<j>.curregs .
- Add the stats.arenas.<i>.hchunks.<j>.nmalloc,
  stats.arenas.<i>.hchunks.<j>.ndalloc,
  stats.arenas.<i>.hchunks.<j>.nrequests, and
  stats.arenas.<i>.hchunks.<j>.curhchunks mallctl's.
2014-10-12 23:02:10 -07:00
Jason Evans
44c97b712e Fix a prof_tctx_t/prof_tdata_t cleanup race.
Fix a prof_tctx_t/prof_tdata_t cleanup race by storing a copy of thr_uid
in prof_tctx_t, so that the associated tdata need not be present during
tctx teardown.
2014-10-12 13:03:20 -07:00
Jason Evans
381c23dd9d Remove arena_dalloc_bin_run() clean page preservation.
Remove code in arena_dalloc_bin_run() that preserved the "clean" state
of trailing clean pages by splitting them into a separate run during
deallocation.  This was a useful mechanism for reducing dirty page
churn when bin runs comprised many pages, but bin runs are now quite
small.

Remove the nextind field from arena_run_t now that it is no longer
needed, and change arena_run_t's bin field (arena_bin_t *) to binind
(index_t).  These two changes remove 8 bytes of chunk header overhead
per page, which saves 1/512 of all arena chunk memory.
2014-10-10 23:01:03 -07:00
Jason Evans
81e547566e Add --with-lg-tiny-min, generalize --with-lg-quantum. 2014-10-10 22:35:07 -07:00
Jason Evans
fc0b3b7383 Add configure options.
Add:
  --with-lg-page
  --with-lg-page-sizes
  --with-lg-size-class-group
  --with-lg-quantum

Get rid of STATIC_PAGE_SHIFT, in favor of directly setting LG_PAGE.

Fix various edge conditions exposed by the configure options.
2014-10-09 22:44:37 -07:00
Daniel Micay
f22214a29d Use regular arena allocation for huge tree nodes.
This avoids grabbing the base mutex, as a step towards fine-grained
locking for huge allocations. The thread cache also provides a tiny
(~3%) improvement for serial huge allocations.
2014-10-07 23:57:09 -07:00
Jason Evans
8bb3198f72 Refactor/fix arenas manipulation.
Abstract arenas access to use arena_get() (or a0get() where appropriate)
rather than directly reading e.g. arenas[ind].  Prior to the addition of
the arenas.extend mallctl, the worst possible outcome of directly
accessing arenas was a stale read, but arenas.extend may allocate and
assign a new array to arenas.

Add a tsd-based arenas_cache, which amortizes arenas reads.  This
introduces some subtle bootstrapping issues, with tsd_boot() now being
split into tsd_boot[01]() to support tsd wrapper allocation
bootstrapping, as well as an arenas_cache_bypass tsd variable which
dynamically terminates allocation of arenas_cache itself.

Promote a0malloc(), a0calloc(), and a0free() to be generally useful for
internal allocation, and use them in several places (more may be
appropriate).

Abstract arena->nthreads management and fix a missing decrement during
thread destruction (recent tsd refactoring left arenas_cleanup()
unused).

Change arena_choose() to propagate OOM, and handle OOM in all callers.
This is important for providing consistent allocation behavior when the
MALLOCX_ARENA() flag is being used.  Prior to this fix, it was possible
for an OOM to result in allocation silently allocating from a different
arena than the one specified.
2014-10-07 23:14:57 -07:00
Jason Evans
155bfa7da1 Normalize size classes.
Normalize size classes to use the same number of size classes per size
doubling (currently hard coded to 4), across the intire range of size
classes.  Small size classes already used this spacing, but in order to
support this change, additional small size classes now fill [4 KiB .. 16
KiB).  Large size classes range from [16 KiB .. 4 MiB).  Huge size
classes now support non-multiples of the chunk size in order to fill (4
MiB .. 16 MiB).
2014-10-06 01:45:13 -07:00
Daniel Micay
a95018ee81 Attempt to expand huge allocations in-place.
This adds support for expanding huge allocations in-place by requesting
memory at a specific address from the chunk allocator.

It's currently only implemented for the chunk recycling path, although
in theory it could also be done by optimistically allocating new chunks.
On Linux, it could attempt an in-place mremap. However, that won't work
in practice since the heap is grown downwards and memory is not unmapped
(in a normal build, at least).

Repeated vector reallocation micro-benchmark:

    #include <string.h>
    #include <stdlib.h>

    int main(void) {
        for (size_t i = 0; i < 100; i++) {
            void *ptr = NULL;
            size_t old_size = 0;
            for (size_t size = 4; size < (1 << 30); size *= 2) {
                ptr = realloc(ptr, size);
                if (!ptr) return 1;
                memset(ptr + old_size, 0xff, size - old_size);
                old_size = size;
            }
            free(ptr);
        }
    }

The glibc allocator fails to do any in-place reallocations on this
benchmark once it passes the M_MMAP_THRESHOLD (default 128k) but it
elides the cost of copies via mremap, which is currently not something
that jemalloc can use.

With this improvement, jemalloc still fails to do any in-place huge
reallocations for the first outer loop, but then succeeds 100% of the
time for the remaining 99 iterations. The time spent doing allocations
and copies drops down to under 5%, with nearly all of it spent doing
purging + faulting (when huge pages are disabled) and the array memset.

An improved mremap API (MREMAP_RETAIN - #138) would be far more general
but this is a portable optimization and would still be useful on Linux
for xallocx.

Numbers with transparent huge pages enabled:

glibc (copies elided via MREMAP_MAYMOVE): 8.471s

jemalloc: 17.816s
jemalloc + no-op madvise: 13.236s

jemalloc + this commit: 6.787s
jemalloc + this commit + no-op madvise: 6.144s

Numbers with transparent huge pages disabled:

glibc (copies elided via MREMAP_MAYMOVE): 15.403s

jemalloc: 39.456s
jemalloc + no-op madvise: 12.768s

jemalloc + this commit: 15.534s
jemalloc + this commit + no-op madvise: 6.354s

Closes #137
2014-10-05 14:47:01 -07:00
Jason Evans
e9a3fa2e09 Add missing header includes in jemalloc/jemalloc.h .
Add stdlib.h, stdbool.h, and stdint.h to jemalloc/jemalloc.h so that
applications only have to #include <jemalloc/jemalloc.h>.

This resolves #132.
2014-10-05 12:05:37 -07:00
Jason Evans
16854ebeb7 Don't disable tcache for lazy-lock.
Don't disable tcache when lazy-lock is configured.  There already exists
a mechanism to disable tcache, but doing so automatically due to
lazy-lock causes surprising performance behavior.
2014-10-04 15:00:51 -07:00
Jason Evans
34e85b4182 Make prof-related inline functions always-inline. 2014-10-04 11:26:05 -07:00
Jason Evans
029d44cf8b Fix tsd cleanup regressions.
Fix tsd cleanup regressions that were introduced in
5460aa6f66 (Convert all tsd variables to
reside in a single tsd structure.).  These regressions were twofold:

1) tsd_tryget() should never (and need never) return NULL.  Rename it to
   tsd_fetch() and simplify all callers.
2) tsd_*_set() must only be called when tsd is in the nominal state,
   because cleanup happens during the nominal-->purgatory transition,
   and re-initialization must not happen while in the purgatory state.
   Add tsd_nominal() and use it as needed.  Note that tsd_*{p,}_get()
   can still be used as long as no re-initialization that would require
   cleanup occurs.  This means that e.g. the thread_allocated counter
   can be updated unconditionally.
2014-10-04 11:22:55 -07:00
Jason Evans
fc12c0b8bc Implement/test/fix prof-related mallctl's.
Implement/test/fix the opt.prof_thread_active_init,
prof.thread_active_init, and thread.prof.active mallctl's.

Test/fix the thread.prof.name mallctl.

Refactor opt_prof_active to be read-only and move mutable state into the
prof_active variable.  Stop leaning on ctl-related locking for
protection.
2014-10-03 23:25:30 -07:00
Jason Evans
551ebc4364 Convert to uniform style: cond == false --> !cond 2014-10-03 10:16:09 -07:00
Jason Evans
20c31deaae Test prof.reset mallctl and fix numerous discovered bugs. 2014-10-02 23:01:10 -07:00
Eric Wong
4dcf04bfc0 correctly detect adaptive mutexes in pthreads
PTHREAD_MUTEX_ADAPTIVE_NP is an enum on glibc and not a macro,
we must test for their existence by attempting compilation.
2014-09-29 16:10:40 -07:00
Jason Evans
5d9732f2cf Merge pull request #129 from daverigby/msvc_lg_floor
Use MSVC intrinsics for lg_floor
2014-09-29 15:15:31 -07:00
Jason Evans
0c5dd03e88 Move small run metadata into the arena chunk header.
Move small run metadata into the arena chunk header, with multiple
expected benefits:
- Lower run fragmentation due to reduced run sizes; runs are more likely
  to completely drain when there are fewer total regions.
- Improved cache behavior.  Prior to this change, run headers were
  always page-aligned, which put extra pressure on some CPU cache sets.
  The degree to which this was a problem was hardware dependent, but it
  likely hurt some even for the most advanced modern hardware.
- Buffer overruns/underruns are less likely to corrupt allocator
  metadata.
- Size classes between 4 KiB and 16 KiB become reasonable to support
  without any special handling, and the runs are small enough that dirty
  unused pages aren't a significant concern.
2014-09-29 01:31:39 -07:00
Jason Evans
f97e5ac4ec Implement compile-time bitmap size computation. 2014-09-28 14:43:11 -07:00
Jason Evans
6ef80d68f0 Fix profile dumping race.
Fix a race that caused a non-critical assertion failure.  To trigger the
race, a thread had to be part way through initializing a new sample,
such that it was discoverable by the dumping thread, but not yet linked
into its gctx by the time a later dump phase would normally have reset
its state to 'nominal'.

Additionally, lock access to the state field during modification to
transition to the dumping state.  It's not apparent that this oversight
could have caused an actual problem due to outer locking that protects
the dumping machinery, but the added locking pedantically follows the
stated locking protocol for the state field.
2014-09-24 22:23:43 -07:00
Dave Rigby
112704cfbf Use MSVC intrinsics for lg_floor
When using MSVC make use of its intrinsic functions (supported on
x86, amd64 & ARM) for lg_floor.
2014-09-24 11:55:02 +01:00
Jason Evans
5460aa6f66 Convert all tsd variables to reside in a single tsd structure. 2014-09-23 02:36:08 -07:00
Jason Evans
9c640bfdd4 Apply likely()/unlikely() to allocation/deallocation fast paths. 2014-09-11 17:01:58 -07:00
Daniel Micay
23fdf8b359 mark some conditions as unlikely
* assertion failure
* malloc_init failure
* malloc not already initialized (in malloc_init)
* running in valgrind
* thread cache disabled at runtime

Clang and GCC already consider a comparison with NULL or -1 to be cold,
so many branches (out-of-memory) are already correctly considered as
cold and marking them is not important.
2014-09-10 21:49:42 -04:00
Daniel Micay
6b5609d23b add likely / unlikely macros 2014-09-10 17:36:32 -04:00
Jason Evans
6e73dc194e Fix a profile sampling race.
Fix a profile sampling race that was due to preparing to sample, yet
doing nothing to assure that the context remains valid until the stats
are updated.

These regressions were caused by
602c8e0971 (Implement per thread heap
profiling.), which did not make it into any releases prior to these
fixes.
2014-09-09 19:47:09 -07:00
Jason Evans
6fd53da030 Fix prof_tdata_get()-related regressions.
Fix prof_tdata_get() to avoid dereferencing an invalid tdata pointer
(when it's PROF_TDATA_STATE_{REINCARNATED,PURGATORY}).

Fix prof_tdata_get() callers to check for invalid results besides NULL
(PROF_TDATA_STATE_{REINCARNATED,PURGATORY}).

These regressions were caused by
602c8e0971 (Implement per thread heap
profiling.), which did not make it into any releases prior to these
fixes.
2014-09-09 15:29:34 -07:00
Daniel Micay
a62812eacc fix isqalloct (should call isdalloct) 2014-09-08 21:46:17 -04:00
Daniel Micay
4cfe55166e Add support for sized deallocation.
This adds a new `sdallocx` function to the external API, allowing the
size to be passed by the caller.  It avoids some extra reads in the
thread cache fast path.  In the case where stats are enabled, this
avoids the work of calculating the size from the pointer.

An assertion validates the size that's passed in, so enabling debugging
will allow users of the API to debug cases where an incorrect size is
passed in.

The performance win for a contrived microbenchmark doing an allocation
and immediately freeing it is ~10%.  It may have a different impact on a
real workload.

Closes #28
2014-09-08 17:34:24 -07:00
Jason Evans
c3f8650749 Add relevant function attributes to [msn]allocx(). 2014-09-08 16:47:51 -07:00
Jason Evans
82e88d1ecf Move typedefs from jemalloc_protos.h.in to jemalloc_typedefs.h.in.
Move typedefs from jemalloc_protos.h.in to jemalloc_typedefs.h.in, so
that typedefs aren't redefined when compiling stress tests.
2014-09-07 19:55:03 -07:00
Jason Evans
b718cf77e9 Optimize [nmd]alloc() fast paths.
Optimize [nmd]alloc() fast paths such that the (flags == 0) case is
streamlined, flags decoding only happens to the minimum degree
necessary, and no conditionals are repeated.
2014-09-07 14:40:19 -07:00
Jason Evans
c21b05ea09 Whitespace cleanups. 2014-09-04 22:27:26 -07:00
Qinfan Wu
ff6a31d3b9 Refactor chunk map.
Break the chunk map into two separate arrays, in order to improve cache
locality. This is related to issue #23.
2014-09-04 22:22:52 -07:00
Sara Golemon
3e24afa28e Test for availability of malloc hooks via autoconf
__*_hook() is glibc, but on at least one glibc platform (homebrew),
the __GLIBC__ define isn't set correctly and we miss being able to
use these hooks.

Do a feature test for it during configuration so that we enable it
anywhere the hooks are actually available.
2014-08-22 15:19:21 -07:00
Jason Evans
602c8e0971 Implement per thread heap profiling.
Rename data structures (prof_thr_cnt_t-->prof_tctx_t,
prof_ctx_t-->prof_gctx_t), and convert to storing a prof_tctx_t for
sampled objects.

Convert PROF_ALLOC_PREP() to prof_alloc_prep(), since precise backtrace
depth within jemalloc functions is no longer an issue (pprof prunes
irrelevant frames).

Implement mallctl's:
- prof.reset implements full sample data reset, and optional change of
  sample interval.
- prof.lg_sample reads the current sample interval (opt.lg_prof_sample
  was the permanent source of truth prior to prof.reset).
- thread.prof.name provides naming capability for threads within heap
  profile dumps.
- thread.prof.active makes it possible to activate/deactivate heap
  profiling for individual threads.

Modify the heap dump files to contain per thread heap profile data.
This change is incompatible with the existing pprof, which will require
enhancements to read and process the enriched data.
2014-08-19 21:31:16 -07:00
Jason Evans
1628e8615e Add rb_empty(). 2014-08-19 21:05:54 -07:00
Jason Evans
3a81cbd2d4 Dump heap profile backtraces in a stable order.
Also iterate over per thread stats in a stable order, which prepares the
way for stable ordering of per thread heap profile dumps.
2014-08-19 21:05:54 -07:00
Jason Evans
ab532e9799 Directly embed prof_ctx_t's bt. 2014-08-19 21:05:54 -07:00
Jason Evans
b41ccdb125 Convert prof_tdata_t's bt2cnt to a comprehensive map.
Treat prof_tdata_t's bt2cnt as a comprehensive map of the thread's
extant allocation samples (do not limit the total number of entries).
This helps prepare the way for per thread heap profiling.
2014-08-19 21:05:54 -07:00
Jason Evans
070b3c3fbd Fix and refactor runs_dirty-based purging.
Fix runs_dirty-based purging to also purge dirty pages in the spare
chunk.

Refactor runs_dirty manipulation into arena_dirty_{insert,remove}(), and
move the arena->ndirty accounting into those functions.

Remove the u.ql_link field from arena_chunk_map_t, and get rid of the
enclosing union for u.rb_link, since only rb_link remains.

Remove the ndirty field from arena_chunk_t.
2014-08-14 14:45:58 -07:00
Qinfan Wu
e8a2fd83a2 arena->npurgatory is no longer needed since we drop arena's lock
after stashing all the purgeable runs.
2014-08-12 09:50:01 -07:00
Qinfan Wu
90737fcda1 Remove chunks_dirty tree, nruns_avail and nruns_adjac since we no
longer need to maintain the tree for dirty page purging.
2014-08-12 09:50:00 -07:00
Qinfan Wu
04d60a132b Maintain all the dirty runs in a linked list for each arena 2014-08-12 09:50:00 -07:00
Jason Evans
a2ea54c986 Add atomic operations tests and fix latent bugs. 2014-08-06 23:36:19 -07:00
Manuel A. Fernandez Montecelo
ffa259841c Add OpenRISC/or1k LG_QUANTUM size definition 2014-07-29 23:11:26 +01:00
Richard Diamond
994fad9bda Add check for madvise(2) to configure.ac.
Some platforms, such as Google's Portable Native Client, use Newlib and
thus lack access to madvise(2).  In those instances, pages_purge() is
transformed into a no-op.
2014-06-03 09:32:49 -07:00
Richard Diamond
9c3a10fdf6 Try to use __builtin_ffsl if ffsl is unavailable.
Some platforms (like those using Newlib) don't have ffs/ffsl.  This
commit adds a check to configure.ac for __builtin_ffsl if ffsl isn't
found.  __builtin_ffsl performs the same function as ffsl, and has the
added benefit of being available on any platform utilizing
Gcc-compatible compiler.

This change does not address the used of ffs in the MALLOCX_ARENA()
macro.
2014-06-02 07:44:50 -07:00
Jason Evans
0b5c92213f Fix fallback lg_floor() implementations. 2014-06-01 22:05:08 -07:00
Jason Evans
1f6d77e1f6 Use KQU() rather than QU() where applicable.
Fix KZI() and KQI() to append LL rather than ULL.
2014-05-28 21:17:42 -07:00
Jason Evans
d04047cc29 Add size class computation capability.
Add size class computation capability, currently used only as validation
of the size class lookup tables.  Generalize the size class spacing used
for bins, for eventual use throughout the full range of allocation
sizes.
2014-05-28 21:06:46 -07:00
Mike Hommey
12f74e680c Move platform headers and tricks from jemalloc_internal.h.in to a new jemalloc_internal_decls.h header 2014-05-28 09:38:10 -07:00
Mike Hommey
22bc570fba Move __func__ to jemalloc_internal_macros.h
test/integration/aligned_alloc.c needs it.
2014-05-27 15:52:49 -07:00
Mike Hommey
a9df1ae622 Use ULL prefix instead of LLU for unsigned long longs
MSVC only supports the former.
2014-05-27 15:45:14 -07:00
Jason Evans
e2deab7a75 Refactor huge allocation to be managed by arenas.
Refactor huge allocation to be managed by arenas (though the global
red-black tree of huge allocations remains for lookup during
deallocation).  This is the logical conclusion of recent changes that 1)
made per arena dss precedence apply to huge allocation, and 2) made it
possible to replace the per arena chunk allocation/deallocation
functions.

Remove the top level huge stats, and replace them with per arena huge
stats.

Normalize function names and types to *dalloc* (some were *dealloc*).

Remove the --enable-mremap option.  As jemalloc currently operates, this
is a performace regression for some applications, but planned work to
logarithmically space huge size classes should provide similar amortized
performance.  The motivation for this change was that mremap-based huge
reallocation forced leaky abstractions that prevented refactoring.
2014-05-15 22:36:41 -07:00
aravind
fb7fe50a88 Add support for user-specified chunk allocators/deallocators.
Add new mallctl endpoints "arena<i>.chunk.alloc" and
"arena<i>.chunk.dealloc" to allow userspace to configure
jemalloc's chunk allocator and deallocator on a per-arena
basis.
2014-05-12 10:46:03 -07:00
Jason Evans
6f001059aa Simplify backtracing.
Simplify backtracing to not ignore any frames, and compensate for this
in pprof in order to increase flexibility with respect to function-based
refactoring even in the presence of non-deterministic inlining.  Modify
pprof to blacklist all jemalloc allocation entry points including
non-standard ones like mallocx(), and ignore all allocator-internal
frames.  Prior to this change, pprof excluded the specifically
blacklisted functions from backtraces, but it left allocator-internal
frames intact.
2014-04-22 20:55:09 -07:00
Lucian Adrian Grijincu
9d4e13f45a prof_backtrace: use unw_backtrace
unw_backtrace:
- does internal per-thread caching
- doesn't acquire an internal lock
2014-04-22 18:39:47 -07:00
Jason Evans
3541a904d6 Refactor small_size2bin and small_bin2size.
Refactor small_size2bin and small_bin2size to be inline functions rather
than directly accessed arrays.
2014-04-16 17:14:33 -07:00
Jason Evans
0b49403958 Fix debug-only compilation failures.
Fix debug-only compilation failures introduced by changes to
prof_sample_accum_update() in:

    6c39f9e059
    refactor profiling. only use a bytes till next sample variable.
2014-04-16 16:38:22 -07:00
Jason Evans
3e3caf03af Merge pull request #73 from bmaurer/smallmalloc
Smaller malloc hot path
2014-04-16 16:33:21 -07:00
Ben Maurer
021136ce4d Create a const array with only a small bin to size map 2014-04-16 14:31:24 -07:00
Ben Maurer
6c39f9e059 refactor profiling. only use a bytes till next sample variable. 2014-04-16 13:43:30 -07:00
Ben Maurer
a7619b7fa5 outline rare tcache_get codepaths 2014-04-16 13:36:56 -07:00
Jason Evans
bd87b01999 Optimize Valgrind integration.
Forcefully disable tcache if running inside Valgrind, and remove
Valgrind calls in tcache-specific code.

Restructure Valgrind-related code to move most Valgrind calls out of the
fast path functions.

Take advantage of static knowledge to elide some branches in
JEMALLOC_VALGRIND_REALLOC().
2014-04-15 16:49:57 -07:00
Jason Evans
ecd3e59ca3 Remove the "opt.valgrind" mallctl.
Remove the "opt.valgrind" mallctl because it is unnecessary -- jemalloc
automatically detects whether it is running inside valgrind.
2014-04-15 14:33:50 -07:00
Jason Evans
4d434adb14 Make dss non-optional, and fix an "arena.<i>.dss" mallctl bug.
Make dss non-optional on all platforms which support sbrk(2).

Fix the "arena.<i>.dss" mallctl to return an error if "primary" or
"secondary" precedence is specified, but sbrk(2) is not supported.
2014-04-15 12:09:48 -07:00
Jason Evans
9790b9667f Remove the *allocm() API, which is superceded by the *allocx() API. 2014-04-14 22:32:31 -07:00
Jason Evans
9b0cbf0850 Remove support for non-prof-promote heap profiling metadata.
Make promotion of sampled small objects to large objects mandatory, so
that profiling metadata can always be stored in the chunk map, rather
than requiring one pointer per small region in each small-region page
run.  In practice the non-prof-promote code was only useful when using
jemalloc to track all objects and report them as leaks at program exit.
However, Valgrind is at least as good a tool for this particular use
case.

Furthermore, the non-prof-promote code is getting in the way of
some optimizations that will make heap profiling much cheaper for the
predominant use case (sampling a small representative proportion of all
allocations).
2014-04-11 14:24:51 -07:00
Ben Maurer
be8e59f5a6 Don't dereference chunk->arena in free() hot path
When you call free() we load chunk->arena even though that
data isn't used on the tcache hot path.

In profiling some FB applications, I found that ~30% of the
dTLB misses in the free() function come from this line. With
4 MB chunks, the arena_chunk_t->map is ~ 32 KB (1024 pages
in the chunk, 4 8 byte pointers in arena_chunk_map_t). This
means there's only a 1/8 chance of the page containing
chunk->arena also comtaining the map bits.
2014-04-05 15:59:08 -07:00
Jason Evans
8a26eaca7f Add private namespace mangling for huge_dss_prec_get(). 2014-03-31 09:31:38 -07:00
Jason Evans
df3f27024f Adapt hash tests to big-endian systems.
The hash code, which has MurmurHash3 at its core, generates different
output depending on system endianness, so adapt the expected output on
big-endian systems.  MurmurHash3 code also makes the assumption that
unaligned access is okay (not true on all systems), but jemalloc only
hashes data structures that have sufficient alignment to dodge this
limitation.
2014-03-30 16:27:08 -07:00
Max Wang
fbb31029a5 Use arena dss prec instead of default for huge allocs.
Pass a dss_prec_t parameter to huge_{m,p,r}alloc instead of defaulting
to the chunk dss prec.
2014-03-28 13:43:58 -07:00
Jason Evans
99b0fbbe69 Add workaround for missing 'restrict' keyword.
Add a cpp #define that removes 'restrict' keyword usage unless the
compiler definitely supports C99.  As written, 'restrict' is only
enabled if the compiler supports the -std=gnu99 option (e.g. gcc and
llvm).

Reported by Tobias Hieta.
2014-02-24 16:08:38 -08:00
Jason Evans
5f60afa01e Avoid a compiler warning.
Avoid copying "jeprof" to a 1-byte buffer within prof_boot0() when heap
profiling is disabled.  Although this is dead code under such
conditions, the compiler doesn't figure that part out.

Reported by Eduardo Silva.
2014-01-28 23:04:02 -08:00
Jason Evans
0dec3507c6 Remove __FBSDID from rb.h. 2014-01-21 20:49:58 -08:00
Jason Evans
772163b4f3 Add heap profiling tests.
Fix a regression in prof_dump_ctx() due to an uninitized variable.  This
was caused by revision 4f37ef693e, so no
releases are affected.
2014-01-17 15:40:52 -08:00
Jason Evans
eefdd02e70 Fix a variable prototype/definition mismatch. 2014-01-16 18:04:30 -08:00
Jason Evans
f234dc51b9 Fix name mangling for stress tests.
Fix stress tests such that testlib code uses the jet_ allocator, but
test code uses libjemalloc.

Generate jemalloc_{rename,mangle}.h, the former because it's needed for
the stress test name mangling fix, and the latter for consistency.  As
an artifact of this change, some (but not all) definitions related to
the experimental API are absent from the headers unless the feature is
enabled at configure time.
2014-01-16 17:38:01 -08:00
Jason Evans
4f37ef693e Refactor prof_dump() to reduce contention.
Refactor prof_dump() to use a two pass algorithm, and prof_leave() prior
to the second pass.  This avoids write(2) system calls while holding
critical prof resources.

Fix prof_dump() to close the dump file descriptor for all relevant error
paths.

Minimize the size of prof-related static buffers when prof is disabled.
This saves roughly 65 KiB of application memory for non-prof builds.

Refactor prof_ctx_init() out of prof_lookup_global().
2014-01-16 13:36:38 -08:00
Jason Evans
aa5113b1fd Refactor overly large/complex functions.
Refactor overly large functions by breaking out helper functions.

Refactor overly complex multi-purpose functions into separate more
specific functions.
2014-01-14 16:23:03 -08:00
Jason Evans
b2c31660be Extract profiling code from [re]allocation functions.
Extract profiling code from malloc(), imemalign(), calloc(), realloc(),
mallocx(), rallocx(), and xallocx().  This slightly reduces the amount
of code compiled into the fast paths, but the primary benefit is the
combinatorial complexity reduction.

Simplify iralloc[t]() by creating a separate ixalloc() that handles the
no-move cases.

Further simplify [mrxn]allocx() (and by implication [mrn]allocm()) to
make request size overflows due to size class and/or alignment
constraints trigger undefined behavior (detected by debug-only
assertions).

Report ENOMEM rather than EINVAL if an OOM occurs during heap profiling
backtrace creation in imemalign().  This bug impacted posix_memalign()
and aligned_alloc().
2014-01-12 15:41:05 -08:00
Jason Evans
6b694c4d47 Add junk/zero filling unit tests, and fix discovered bugs.
Fix growing large reallocation to junk fill new space.

Fix huge deallocation to junk fill when munmap is disabled.
2014-01-07 16:54:17 -08:00
Jason Evans
e18c25d23d Add util unit tests, and fix discovered bugs.
Add unit tests for pow2_ceil(), malloc_strtoumax(), and
malloc_snprintf().

Fix numerous bugs in malloc_strotumax() error handling/reporting.  These
bugs could have caused application-visible issues for some seldom used
(0X... and 0... prefixes) or malformed MALLOC_CONF or mallctl() argument
strings, but otherwise they had no impact.

Fix numerous bugs in malloc_snprintf().  These bugs were not exercised
by existing malloc_*printf() calls, so they had no impact.
2014-01-06 20:41:09 -08:00
Jason Evans
b954bc5d3a Convert rtree from (void *) to (uint8_t) storage.
Reduce rtree memory usage by storing booleans (1 byte each) rather than
pointers.  The rtree code is only used to record whether jemalloc manages
a chunk of memory, so there's no need to store pointers in the rtree.

Increase rtree node size to 64 KiB in order to reduce tree depth from 13
to 3 on 64-bit systems.  The conversion to more compact leaf nodes was
enough by itself to make the rtree depth 1 on 32-bit systems; due to the
fact that root nodes are smaller than the specified node size if
possible, the node size change has no impact on 32-bit systems (assuming
default chunk size).
2014-01-02 17:36:38 -08:00
Jason Evans
b980cc774a Add rtree unit tests. 2014-01-02 16:17:15 -08:00
Jason Evans
1b75b4e6d1 Add missing prototypes. 2013-12-17 15:30:49 -08:00