Commit Graph

920 Commits

Author SHA1 Message Date
Jason Evans
36195c8f4d Disable percpu_arena by default. 2017-05-23 15:32:50 -07:00
Qi Wang
eeefdf3ce8 Fix # of unpurged pages in decay algorithm.
When the # of dirty pages moves below npages_limit (e.g. because pages are reused),
we should not lower the number of unpurged pages, because that would cause the
reused pages to be double counted in the backlog (and, as a result, decay would
happen more slowly than it should).  Instead, set the number of unpurged pages to
the greater of the current npages and npages_limit.

Added an assertion: the ceiling # of pages should be greater than npages_limit.
2017-05-23 13:48:30 -07:00
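A minimal sketch of the clamping described above, using hypothetical names rather than jemalloc's actual decay internals:

#include <stddef.h>

/* Keep the unpurged-page count from dropping below npages_limit, so that
 * reused dirty pages are not double counted in the decay backlog.
 * Purely illustrative; not jemalloc's real decay code. */
static size_t
decay_nunpurged_clamp(size_t npages_current, size_t npages_limit) {
    return npages_current > npages_limit ? npages_current : npages_limit;
}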
Qi Wang
0eae838b0d Check for background thread inactivity on extents_dalloc.
To avoid background threads sleeping forever with idle arenas, we eagerly check
background threads' sleep time after extents_dalloc, and signal the thread if
necessary.
2017-05-23 12:26:20 -07:00
Qi Wang
5f5ed2198e Add profiling for the background thread mutex. 2017-05-23 12:26:20 -07:00
Qi Wang
2bee0c6251 Add background thread related stats. 2017-05-23 12:26:20 -07:00
Qi Wang
b693c7868e Implementing opt.background_thread.
Added opt.background_thread to enable background threads, which currently handle
purging.  When enabled, decay ticks will not trigger purging (which is instead
left to the background threads).  We limit the max number of threads to NCPUs.
When percpu arena is enabled, CPU affinity is set for the background threads as
well.

The sleep interval of background threads is dynamic and determined by computing
number of pages to purge in the future (based on backlog).
2017-05-23 12:26:20 -07:00
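For reference, one way an application could turn the option above on, assuming a default build without --with-jemalloc-prefix (the MALLOC_CONF environment variable works equally well):

/* Configuration string read by jemalloc at startup. */
const char *malloc_conf = "background_thread:true";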
David Goldblatt
3f685e8824 Protect the rtree/extent interactions with a mutex pool.
Instead of embedding a lock bit in rtree leaf elements, we associate extents
with a small set of mutexes.  This gets us two things:

- We can use the system mutexes.  This (hypothetically) protects us from
  priority inversion, and lets us stop doing a backoff/sleep loop, instead
  opting for precise wakeups from the mutex.
- It cuts down on the number of mutex acquisitions we have to do (from four in
  the worst case to two).

We end up simplifying most of the rtree code (which no longer has to deal with
locking or concurrency at all), at the cost of additional complexity in the
extent code: since the mutex protecting the rtree leaf elements is determined by
reading the extent out of those elements, the initial read is racy, so that we
may acquire an out of date mutex.  We re-check the extent in the leaf after
acquiring the mutex to protect us from this race.
2017-05-19 14:21:27 -07:00
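A sketch of the read/lock/re-check protocol described above; every type and helper below is a hypothetical stand-in, not jemalloc's actual rtree or mutex-pool API:

/* Hypothetical stand-ins for the rtree/extent machinery. */
typedef struct extent_s extent_t;
typedef struct rtree_leaf_elm_s rtree_leaf_elm_t;
typedef struct mutex_s mutex_t;

extent_t *leaf_elm_extent_read(rtree_leaf_elm_t *elm);
mutex_t *mutex_pool_get(const extent_t *extent);
void mutex_lock(mutex_t *mtx);
void mutex_unlock(mutex_t *mtx);

/* Read the extent, pick its pool mutex, lock, then re-check: the initial
 * read is racy, so an out-of-date mutex may have been chosen. */
static extent_t *
extent_lock_from_leaf(rtree_leaf_elm_t *elm) {
    for (;;) {
        extent_t *extent = leaf_elm_extent_read(elm);
        mutex_t *mtx = mutex_pool_get(extent);
        mutex_lock(mtx);
        if (leaf_elm_extent_read(elm) == extent) {
            return extent; /* still current; caller holds mtx */
        }
        mutex_unlock(mtx); /* lost the race; retry */
    }
}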
David Goldblatt
26c792e61a Allow mutexes to take a lock ordering enum at construction.
This lets us specify whether and how mutexes of the same rank are allowed to be
acquired.  Currently, we only allow two policies (only a single mutex at a given
rank at a time, and mutexes acquired in ascending order), but we can plausibly
allow more (e.g. "release uncontended mutexes before blocking").
2017-05-19 14:21:27 -07:00
Jason Evans
6e62c62862 Refactor *decay_time into *decay_ms.
Support millisecond resolution for decay times.  Among other use cases
this makes it possible to specify a short initial dirty-->muzzy decay
phase, followed by a longer muzzy-->clean decay phase.

This resolves #812.
2017-05-18 11:33:45 -07:00
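An illustrative use of the millisecond-resolution knob after this rename, assuming the installed <jemalloc/jemalloc.h> header ("arenas.dirty_decay_ms" is the renamed mallctl for the default dirty-decay setting; error handling omitted):

#include <stddef.h>
#include <sys/types.h>
#include <jemalloc/jemalloc.h>

/* Set a short dirty->muzzy decay phase (500ms) for newly created arenas. */
static void
set_default_dirty_decay(void) {
    ssize_t decay_ms = 500;
    mallctl("arenas.dirty_decay_ms", NULL, NULL, &decay_ms, sizeof(decay_ms));
}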
Qi Wang
baf3e294e0 Add stats: arena uptime. 2017-05-18 10:04:28 -07:00
Jason Evans
18a83681cf Refactor (MALLOCX_ARENA_MAX + 1) to be MALLOCX_ARENA_LIMIT.
This resolves #673.
2017-05-14 10:14:23 -07:00
Jason Evans
909f0482e4 Automatically generate private symbol name mangling macros.
Rather than using a manually maintained list of internal symbols to
drive name mangling, add a compilation phase to automatically extract
the list of internal symbols.

This resolves #677.
2017-05-11 23:06:54 -07:00
Jason Evans
a4ae9707da Remove unused private_unnamespace infrastructure. 2017-05-11 23:06:54 -07:00
Jason Evans
a268af5085 Stop depending on JEMALLOC_N() for function interception during testing.
Instead, always define function pointers for interceptable functions,
but mark them const unless testing, so that the compiler can optimize
out the pointer dereferences.
2017-05-11 23:06:54 -07:00
Jason Evans
81ef365622 Avoid compiler warnings on Windows. 2017-05-11 18:06:20 -07:00
Jason Evans
11d2f39d96 Remove mutex_prof_data_t redeclaration.
Redeclaration causes compilation failures with e.g. gcc 4.2.1 on
FreeBSD.  This regression was introduced by
89e2d3c12b (Header refactoring: ctl -
unify and remove from catchall.).
2017-05-11 10:49:43 -07:00
Jason Evans
0798fe6e70 Fix rtree_leaf_elm_szind_slab_update().
Re-read the leaf element when atomic CAS fails due to a race with
another thread that has locked the leaf element, since
atomic_compare_exchange_strong_p() overwrites the expected value with
the actual value on failure.  This regression was introduced by
0ee0e0c155 (Implement compact rtree leaf
element representation.).

This resolves #798.
2017-05-03 08:52:33 -07:00
Jason Evans
344dd342dd rtree_leaf_elm_extent_write() --> rtree_leaf_elm_extent_lock_write()
Refactor rtree_leaf_elm_extent_write() as
rtree_leaf_elm_extent_lock_write(), so that whether the leaf element is
currently acquired is separate from what lock state to write.  This
allows for a relaxed atomic read when releasing the lock.
2017-05-03 08:52:33 -07:00
Qi Wang
fc1aaf13fe Revert "Use trylock in tcache_bin_flush when possible."
This reverts commit 8584adc451.  Production
results not favorable.  Will investigate separately.
2017-05-01 14:49:42 -07:00
David Goldblatt
209f2926b8 Header refactoring: tsd - cleanup and dependency breaking.
This removes the tsd macros (which are used only for tsd_t in real builds).  We
break up the circular dependencies involving tsd.

We also move all tsd access through getters and setters.  This allows us to
assert that we only touch data when tsd is in a valid state.

We simplify the usages of the x macro trick, removing all the customizability
(get/set, init, cleanup), moving the lifetime logic to tsd_init and tsd_cleanup.
This lets us make initialization order independent of order within tsd_t.
2017-05-01 10:49:56 -07:00
Jason Evans
c86c8f4ffb Add extent_destroy_t and use it during arena destruction.
Add the extent_destroy_t extent destruction hook to extent_hooks_t, and
use it during arena destruction.  This hook explicitly communicates to
the callee that the extent must be destroyed or tracked for later reuse,
lest it be permanently leaked.  Prior to this change, retained extents
could unintentionally be leaked if extent retention was enabled.

This resolves #560.
2017-04-29 09:24:12 -07:00
Jason Evans
b9ab04a191 Refactor !opt.munmap to opt.retain. 2017-04-29 09:24:12 -07:00
Qi Wang
d901a37775 Revert "Use try_flush first in tcache_dalloc."
This reverts commit b0c2a28280.  Production
benchmark shows this caused significant regression in both CPU and memory
consumption.  Will investigate separately later on.
2017-04-28 10:59:04 -07:00
Qi Wang
b0c2a28280 Use try_flush first in tcache_dalloc.
Only do must_flush if try_flush didn't manage to free anything.
2017-04-25 17:21:33 -07:00
Qi Wang
8584adc451 Use trylock in tcache_bin_flush when possible.
During tcache gc, use tcache_bin_try_flush_small / _large so that we can skip
items with their bins locked already.
2017-04-25 17:21:33 -07:00
Qi Wang
05775a3736 Avoid prof_dump during reentrancy. 2017-04-25 12:54:36 -07:00
David Goldblatt
268843ac68 Header refactoring: pages.h - unify and remove from catchall. 2017-04-25 09:51:38 -07:00
David Goldblatt
dab4beb277 Header refactoring: hash - unify and remove from catchall. 2017-04-25 09:51:38 -07:00
David Goldblatt
89e2d3c12b Header refactoring: ctl - unify and remove from catchall.
In order to do this, we introduce the mutex_prof module, which breaks a circular
dependency between ctl and prof.
2017-04-25 09:51:38 -07:00
Jason Evans
c67c3e4a63 Replace --disable-munmap with opt.munmap.
Control use of munmap(2) via a run-time option rather than a
compile-time option (with the same per platform default).  The old
behavior of --disable-munmap can be achieved with
--with-malloc-conf=munmap:false.

This partially resolves #580.
2017-04-24 20:37:16 -07:00
Jason Evans
e2cc6280ed Remove --enable-code-coverage.
This option hasn't been particularly useful since the original pre-3.0.0
push to broaden test coverage.

This partially resolves #580.
2017-04-24 16:33:04 -07:00
Jason Evans
0f63396b23 Remove --disable-cc-silence.
The explicit compiler warning suppression controlled by this option is
universally desirable, so remove the ability to disable suppression.

This partially resolves #580.
2017-04-24 15:02:45 -07:00
Qi Wang
f970c497dc Implement malloc_mutex_trylock() w/ proper stats update. 2017-04-24 13:23:55 -07:00
Jason Evans
af76f0e5d2 Remove --with-lg-tiny-min.
This option isn't useful in practice.

This partially resolves #580.
2017-04-24 11:48:28 -07:00
David Goldblatt
120c7a747f Header refactoring: bitmap - unify and remove from catchall. 2017-04-24 10:33:21 -07:00
David Goldblatt
d6b5c7e0f6 Header refactoring: stats - unify and remove from catchall 2017-04-24 10:33:21 -07:00
David Goldblatt
36abf78aa9 Header refactoring: move smoothstep.h out of the catchall. 2017-04-24 10:33:21 -07:00
David Goldblatt
31b43219db Header refactoring: size_classes module - remove from the catchall 2017-04-24 10:33:21 -07:00
David Goldblatt
68da2361d2 Header refactoring: ckh module - remove from the catchall and unify. 2017-04-24 10:33:21 -07:00
David Goldblatt
bf2dc7e678 Header refactoring: ticker module - remove from the catchall and unify. 2017-04-24 10:33:21 -07:00
David Goldblatt
fa3ad730c4 Header refactoring: prng module - remove from the catchall and unify. 2017-04-24 10:33:21 -07:00
David Goldblatt
4d2e4bf5eb Get rid of most of the various inline macros. 2017-04-24 10:33:21 -07:00
David Goldblatt
425253e2cd Enable -Wundef, when supported.
This can catch bugs in which one header defines a numeric constant, and another
uses it without including the defining header. Undefined preprocessor symbols
expand to '0', so that this will compile fine, silently doing the math wrong.
2017-04-21 17:03:56 -07:00
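The class of bug this catches, in miniature (MY_LIMIT and the forgotten header are made up for illustration):

/* #include "my_limits.h"   <-- forgotten header that defines MY_LIMIT */

#if MY_LIMIT > 16   /* without -Wundef, MY_LIMIT silently expands to 0 */
#  define USE_LARGE_TABLE
#endif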
Jason Evans
3823effe12 Remove --enable-ivsalloc.
Continue to use ivsalloc() when --enable-debug is specified (and add
assertions to guard against 0 size), but stop providing a documented
explicit semantics-changing band-aid to dodge undefined behavior in
sallocx() and malloc_usable_size().  ivsalloc() remains compiled in,
unlike when #211 restored --enable-ivsalloc, and if
JEMALLOC_FORCE_IVSALLOC is defined during compilation, sallocx() and
malloc_usable_size() will still use ivsalloc().

This partially resolves #580.
2017-04-21 14:34:35 -07:00
Jim Chen
ae248a2160 Use openat syscall if available
Some architectures like AArch64 may not have the open syscall because it
was superseded by the openat syscall, so check and use SYS_openat if
SYS_open is not available.

Additionally, Android headers for AArch64 define SYS_open to __NR_open,
even though __NR_open is undefined. Undefine SYS_open in that case so
SYS_openat is used.
2017-04-21 10:58:42 -07:00
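A standalone sketch of the fallback logic described above (the real change lives in jemalloc's syscall wrappers; this helper is only illustrative):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Prefer SYS_open when the architecture has it; otherwise fall back to
 * SYS_openat (e.g. AArch64). */
static long
raw_open_rdonly(const char *path) {
#if defined(SYS_open)
    return syscall(SYS_open, path, O_RDONLY);
#elif defined(SYS_openat)
    return syscall(SYS_openat, AT_FDCWD, path, O_RDONLY);
#else
    return -1; /* no raw open syscall available */
#endif
}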
Jason Evans
4403c9ab44 Remove --disable-tcache.
Simplify configuration by removing the --disable-tcache option, but
replace the testing for that configuration with
--with-malloc-conf=tcache:false.

Fix the thread.arena and thread.tcache.flush mallctls to work correctly
if tcache is disabled.

This partially resolves #580.
2017-04-21 10:06:12 -07:00
Qi Wang
5aa46f027d Bypass extent tracking for auto arenas.
Tracking extents is required by arena_reset.  To support this, the extent
linkage was used for tracking 1) large allocations, and 2) full slabs.  However,
modifying the extent linkage could be an expensive operation, as it likely incurs
cache misses.  Since we forbid arena_reset on auto arenas, let's bypass the
linkage operations for auto arenas.
2017-04-21 00:29:18 -07:00
Jason Evans
da4cff0279 Support --with-lg-page values larger than system page size.
All mappings continue to be PAGE-aligned, even if the system page size
is smaller.  This change is primarily intended to provide a mechanism
for supporting multiple page sizes with the same binary; smaller page
sizes work better in conjunction with jemalloc's design.

This resolves #467.
2017-04-18 19:01:04 -07:00
Jason Evans
45f087eb03 Revert "Remove BITMAP_USE_TREE."
Some systems use a native 64 KiB page size, which means that the bitmap
for the smallest size class can be 8192 bits, not just 512 bits as when
the page size is 4 KiB.  Linear search in bitmap_{sfu,ffu}() is
unacceptably slow for such large bitmaps.

This reverts commit 7c00f04ff4.
2017-04-18 19:01:04 -07:00
David Goldblatt
38e847c1c5 Header refactoring: unify spin.h and move it out of the catch-all. 2017-04-18 18:35:03 -07:00
David Goldblatt
418d96a86c Header refactoring: unify nstime.h and move it out of the catch-all 2017-04-18 18:35:03 -07:00
David Goldblatt
7ebc83894f Header refactoring: move jemalloc_internal_types.h out of the catch-all 2017-04-18 18:35:03 -07:00
David Goldblatt
d9ec36e22d Header refactoring: move assert.h out of the catch-all 2017-04-18 18:35:03 -07:00
David Goldblatt
f692e6c214 Header refactoring: move util.h out of the catchall 2017-04-18 18:35:03 -07:00
David Goldblatt
54373be084 Header refactoring: move malloc_io.h out of the catchall 2017-04-18 18:35:03 -07:00
David Goldblatt
0b00ffe55f Header refactoring: move bit_util.h out of the catchall 2017-04-18 18:35:03 -07:00
David Goldblatt
22366518b7 Move CPP_PROLOGUE and CPP_EPILOGUE to the .cpp
This lets us avoid having to specify them in every C file.
2017-04-18 18:35:03 -07:00
Jason Evans
881fbf762f Prefer old/low extent_t structures during reuse.
Rather than using a LIFO queue to track available extent_t structures,
use a red-black tree, and always choose the oldest/lowest available
during reuse.
2017-04-17 14:47:45 -07:00
Jason Evans
76b35f4b2f Track extent structure serial number (esn) in extent_t.
This enables stable sorting of extent_t structures.
2017-04-17 14:47:45 -07:00
Jason Evans
69aa552809 Allocate increasingly large base blocks.
Limit the total number of base blocks by leveraging the exponential
size class sequence, similarly to extent_grow_retained().
2017-04-17 14:47:45 -07:00
Qi Wang
3c9c41edb2 Improve rtree cache with a two-level cache design.
Two levels of rcache are implemented: a direct-mapped cache as L1, combined with
an LRU cache as L2.  The L1 cache offers low cost on a cache hit, but can suffer
collisions under some circumstances.  This is complemented by the L2 LRU cache,
which is slower on cache access (overhead from linear search + reordering), but
handles L1 collisions rather well.
2017-04-17 12:05:23 -07:00
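A hypothetical miniature of the two-level cache described above; the real rtree_ctx_t differs in layout, sizing, and naming:

#include <stddef.h>
#include <stdint.h>

#define L1_SLOTS 16
#define L2_SLOTS 8

typedef struct { uintptr_t leafkey; void *leaf; } cache_elm_t;

typedef struct {
    cache_elm_t l1[L1_SLOTS]; /* direct mapped: cheap hit, may collide */
    cache_elm_t l2[L2_SLOTS]; /* LRU-ish: linear search absorbs collisions */
} rcache_t;

static void *
rcache_lookup(rcache_t *c, uintptr_t leafkey) {
    cache_elm_t *e1 = &c->l1[leafkey % L1_SLOTS];
    if (e1->leafkey == leafkey) {
        return e1->leaf; /* L1 hit: cheapest path */
    }
    for (size_t i = 0; i < L2_SLOTS; i++) {
        if (c->l2[i].leafkey == leafkey) {
            cache_elm_t hit = c->l2[i];
            /* Reorder toward the front and promote into L1. */
            if (i > 0) {
                c->l2[i] = c->l2[i - 1];
                c->l2[i - 1] = hit;
            }
            *e1 = hit;
            return hit.leaf;
        }
    }
    return NULL; /* miss: caller falls back to the full rtree walk */
}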
Qi Wang
d16f1e53df Skip percpu arena when choosing iarena. 2017-04-16 21:34:44 -07:00
Qi Wang
c2fcf9c2cf Switch to fine-grained reentrancy support.
Previously we had a general detection and support of reentrancy, at the cost of
having branches and inc / dec operations on fast paths.  To avoid taxing fast
paths, we move the reentrancy operations onto tsd slow state, and only modify
reentrancy level around external calls (that might trigger reentrancy).
2017-04-14 19:48:06 -07:00
Qi Wang
b348ba29bb Bundle 3 branches on fast path into tsd_state.
Added tsd_state_nominal_slow, which on the malloc() fast path incorporates the
tcache_enabled check, and on the free() fast path bundles both the malloc_slow
and tcache_enabled branches.
2017-04-14 16:58:08 -07:00
Qi Wang
ccfe68a916 Pass alloc_ctx down profiling path.
With this change, when profiling is enabled, we avoid doing redundant rtree
lookups.  Also changed dalloc_ctx_t to alloc_ctx_t, as it's now used on the
allocation path as well (to speed up profiling).
2017-04-12 13:55:39 -07:00
Qi Wang
f35213bae4 Pass dalloc_ctx down the sdalloc path.
This avoids redundant rtree lookups.
2017-04-12 13:55:39 -07:00
David Goldblatt
e709fae1d7 Header refactoring: move atomic.h out of the catch-all 2017-04-11 11:52:30 -07:00
David Goldblatt
743d940dc3 Header refactoring: Split up jemalloc_internal.h
This is a biggy.  jemalloc_internal.h has been doing multiple jobs for a while
now:
- The source of system-wide definitions.
- The catch-all include file.
- The module header file for jemalloc.c

This commit splits up this functionality.  The system-wide definitions
responsibility has moved to jemalloc_preamble.h.  The catch-all include file is
now jemalloc_internal_includes.h.  The module headers for jemalloc.c are now in
jemalloc_internal_[externs|inlines|types].h, just as they are for the other
modules.
2017-04-11 11:52:30 -07:00
David Goldblatt
0237870c60 Header refactoring: break out ql.h dependencies 2017-04-11 11:52:30 -07:00
David Goldblatt
610cb83419 Header refactoring: break out qr.h dependencies 2017-04-11 11:52:30 -07:00
David Goldblatt
63a5cd4cc2 Header refactoring: break out rb.h dependencies 2017-04-11 11:52:30 -07:00
David Goldblatt
2f00ce4da7 Header refactoring: break out ph.h dependencies 2017-04-11 11:52:30 -07:00
David Goldblatt
57e36e1a12 Header refactoring: Add CPP_PROLOGUE and CPP_EPILOGUE macros 2017-04-11 11:52:30 -07:00
Qi Wang
bfa530b75b Pass dealloc_ctx down free() fast path.
This gets rid of the redundant rtree lookup on the fast path.
2017-04-11 09:58:12 -07:00
Qi Wang
04ef218d87 Move reentrancy_level to the beginning of TSD. 2017-04-07 16:25:43 -07:00
David Goldblatt
b407a65401 Add basic reentrancy-checking support, and allow arena_new to reenter.
This checks whether or not we're reentrant using thread-local data, and, if we
are, moves certain internal allocations to use arena 0 (which should be properly
initialized after bootstrapping).

The immediate thing this allows is spinning up threads in arena_new, which will
enable spinning up background threads there.
2017-04-07 14:10:27 -07:00
David Goldblatt
0a0fcd3e6a Add hooking functionality
This allows us to hook chosen functions and do interesting things there (in
particular: reentrancy checking).
2017-04-07 14:10:27 -07:00
Qi Wang
36bd90b962 Optimizing TSD and thread cache layout.
1) Re-organize TSD so that frequently accessed fields are closer to the
beginning and more compact.  Assuming 64-bit, the first 2.5 cachelines now
contain everything needed on the tcache fast path, except the tcache struct itself.

2) Re-organize tcache and tbins.  Take lg_fill_div out of tbin, and reduce tbin
to 24 bytes (down from 32). Split tbins into tbins_small and tbins_large, and
place tbins_small close to the beginning.
2017-04-07 14:06:17 -07:00
Qi Wang
0fba57e579 Get rid of tcache_enabled_t as we have runtime init support. 2017-04-07 10:42:29 -07:00
Qi Wang
fde3e20cc0 Integrate auto tcache into TSD.
The embedded tcache is initialized upon tsd initialization.  The avail arrays
for the tbins will be allocated / deallocated accordingly during init / cleanup.

With this change, the pointer to the auto tcache will always be available, as
long as we have access to the TSD.  tcache_available() (called in tcache_get())
is provided to check if we should use tcache.
2017-04-07 09:55:14 -07:00
David Goldblatt
eeabdd2466 Remove the pre-C11-atomics API, which is now unused 2017-04-05 16:25:37 -07:00
David Goldblatt
5dcc13b342 Make the mutex n_waiting_thds field a C11-style atomic 2017-04-05 16:25:37 -07:00
David Goldblatt
30d74db08e Convert accumbytes in prof_accum_t to C11 atomics, when possible 2017-04-05 16:25:37 -07:00
David Goldblatt
92aafb0efe Make base_t's extent_hooks field C11-atomic 2017-04-05 16:25:37 -07:00
David Goldblatt
56b72c7b17 Transition arena struct fields to C11 atomics 2017-04-05 16:25:37 -07:00
David Goldblatt
bc32ec3503 Move arena-tracking atomics in jemalloc.c to C11-style 2017-04-05 16:25:37 -07:00
David Goldblatt
864adb7f42 Transition e_prof_tctx in struct extent to C11 atomics 2017-04-04 16:46:04 -07:00
David Goldblatt
7da04a6b09 Convert prng module to use C11-style atomics 2017-04-04 16:45:52 -07:00
Qi Wang
492e9f301e Make the tsd member init functions to take tsd_t * type. 2017-04-04 14:06:07 -07:00
Qi Wang
d3cda3423c Do proper cleanup for tsd_state_reincarnated.
Also enable arena_bind under non-nominal state, as the cleanup will be handled
correctly now.
2017-04-04 00:34:49 -07:00
Qi Wang
51d3682950 Remove the leafkey NULL check in leaf_elm_lookup. 2017-04-04 00:27:35 -07:00
Qi Wang
9ed84b0d45 Add init function support to tsd members.
This will facilitate embedding tcache into tsd, which will require proper
initialization that cannot be done via the static initializer.  Make
tsd->rtree_ctx be initialized via rtree_ctx_data_init().
2017-04-04 00:19:21 -07:00
Jason Evans
07f4f93434 Move arena_slab_data_t's nfree into extent_t's e_bits.
Compact extent_t to 128 bytes on 64-bit systems by moving
arena_slab_data_t's nfree into extent_t's e_bits.

Cacheline-align extent_t structures so that they always cross the
minimum number of cacheline boundaries.

Re-order extent_t fields such that all fields except the slab bitmap
(and overlaid heap profiling context pointer) are in the first
cacheline.

This resolves #461.
2017-03-27 22:43:39 -07:00
Qi Wang
af3d737a9a Simplify rtree cache replacement policy.
To avoid memmove on the free() fast path, simplify the cache replacement policy
to only bubble up the cache-hit element by 1.
2017-03-27 13:42:31 -07:00
Jason Evans
c6d1819e48 Simplify rtree_clear() to avoid locking. 2017-03-27 13:22:52 -07:00
Jason Evans
4020523f67 Fix a race in rtree_szind_slab_update() for RTREE_LEAF_COMPACT. 2017-03-27 13:22:36 -07:00
Jason Evans
7c00f04ff4 Remove BITMAP_USE_TREE.
Remove tree-structured bitmap support, in order to reduce complexity and
ease maintenance.  No bitmaps larger than 512 bits have been necessary
since before 4.0.0, and there is no current plan that would increase
maximum bitmap size.  Although tree-structured bitmaps were used on
32-bit platforms prior to this change, the overall benefits were
questionable (higher metadata overhead, higher bitmap modification cost,
marginally lower search cost).
2017-03-27 12:18:40 -07:00
Jason Evans
6258176c87 Fix bitmap_ffu() to work with 3+ levels. 2017-03-27 12:18:40 -07:00
Jason Evans
735ad8210c Pack various extent_t fields into a bitfield.
This reduces sizeof(extent_t) from 160 to 136 on x64.
2017-03-25 23:30:13 -07:00
Jason Evans
0591c204b4 Store arena index rather than (arena_t *) in extent_t. 2017-03-25 23:30:13 -07:00
Jason Evans
5e12223925 Fix BITMAP_USE_TREE version of bitmap_ffu().
This fixes an extent searching regression on 32-bit systems, caused by
the initial bitmap_ffu() implementation in
c8021d01f6 (Implement bitmap_ffu(), which
finds the first unset bit.), as first used in
5d33233a5e (Use a bitmap in extents_t to
speed up search.).
2017-03-25 23:29:32 -07:00
Jason Evans
5d33233a5e Use a bitmap in extents_t to speed up search.
Rather than iteratively checking all sufficiently large heaps during
search, maintain and use a bitmap in order to skip empty heaps.
2017-03-24 17:52:46 -07:00
Jason Evans
57e353163f Implement BITMAP_GROUPS(). 2017-03-24 17:52:46 -07:00
Jason Evans
c8021d01f6 Implement bitmap_ffu(), which finds the first unset bit. 2017-03-24 17:52:46 -07:00
Qi Wang
362e356675 Profile per arena base mutex, instead of just a0. 2017-03-23 00:03:28 -07:00
Qi Wang
d3fde1c124 Refactor mutex profiling code with x-macros. 2017-03-23 00:03:28 -07:00
Qi Wang
f6698ec1e6 Switch to nstime_t for the time related fields in mutex profiling. 2017-03-23 00:03:28 -07:00
Qi Wang
74f78cafda Added custom mutex spin.
A fixed max spin count is used, with benchmark results showing that it
solves almost all problems.  As the benchmark used was rather intense,
the upper bound could be a little high.  However, it should offer a
good tradeoff between spinning and blocking.
2017-03-23 00:03:28 -07:00
Qi Wang
20b8c70e9f Added extents_dirty / _muzzy mutexes, as well as decay_dirty / _muzzy. 2017-03-23 00:03:28 -07:00
Qi Wang
64c5f5c174 Added "stats.mutexes.reset" mallctl to reset all mutex stats.
Also switched from the term "lock" to "mutex".
2017-03-23 00:03:28 -07:00
Qi Wang
ca9074deff Added lock profiling and output for global locks (ctl, prof and base). 2017-03-23 00:03:28 -07:00
Qi Wang
0fb5c0e853 Add arena lock stats output. 2017-03-23 00:03:28 -07:00
Qi Wang
a4f176af57 Output bin lock profiling results to malloc_stats.
Two counters are included for the small bins: lock contention rate, and
max lock waiting time.
2017-03-23 00:03:28 -07:00
Qi Wang
6309df628f First stage of mutex profiling.
Switched to trylock and update counters based on state.
2017-03-23 00:03:28 -07:00
Jason Evans
32e7cf51cd Further specialize arena_[s]dalloc() tcache fast path.
Use tsd_rtree_ctx() rather than tsdn_rtree_ctx() when tcache is
non-NULL, in order to avoid an extra branch (and potentially extra stack
space) in the fast path.
2017-03-22 18:33:32 -07:00
Jason Evans
5e67fbc367 Push down iealloc() calls.
Call iealloc() as deep into call chains as possible without causing
redundant calls.
2017-03-22 18:33:32 -07:00
Jason Evans
51a2ec92a1 Remove extent dereferences from the deallocation fast paths. 2017-03-22 18:33:32 -07:00
Jason Evans
4f341412e5 Remove extent arg from isalloc() and arena_salloc(). 2017-03-22 18:33:32 -07:00
Jason Evans
0ee0e0c155 Implement compact rtree leaf element representation.
If a single virtual address pointer has enough unused bits to pack
{szind_t, extent_t *, bool, bool}, use a single pointer-sized field in
each rtree leaf element, rather than using three separate fields.  This
has little impact on access speed (fewer loads/stores, but more bit
twiddling), except that denser representation increases TLB
effectiveness.
2017-03-22 18:33:32 -07:00
Jason Evans
ce41ab0c57 Embed root node into rtree_t.
This avoids one atomic operation per tree access.
2017-03-22 18:33:32 -07:00
Jason Evans
99d68445ef Incorporate szind/slab into rtree leaves.
Expand and restructure the rtree API such that all common operations can
be achieved with minimal work, regardless of whether the rtree leaf
fields are independent versus packed into a single atomic pointer.
2017-03-22 18:33:32 -07:00
Jason Evans
944c8a3383 Split rtree_elm_t into rtree_{node,leaf}_elm_t.
This allows leaf elements to differ in size from internal node elements.

In principle it would be more correct to use a different type for each
level of the tree, but due to implementation details related to atomic
operations, we use casts anyway, thus counteracting the value of
additional type correctness.  Furthermore, such a scheme would require
function code generation (via cpp macros), as well as either unwieldy
type names for leaves or type aliases, e.g.

  typedef struct rtree_elm_d2_s rtree_leaf_elm_t;

This alternate strategy would be more correct, and with less code
duplication, but probably not worth the complexity.
2017-03-22 18:33:32 -07:00
Jason Evans
f50d6009fe Remove binind field from arena_slab_data_t.
binind is now redundant; the containing extent_t's szind field always
provides the same value.
2017-03-22 18:33:32 -07:00
Jason Evans
e8921cf2eb Convert extent_t's usize to szind.
Rather than storing usize only for large (and prof-promoted)
allocations, store the size class index for allocations that reside
within the extent, such that the size class index is valid for all
extents that contain extant allocations, and invalid otherwise (mainly
to make debugging simpler).
2017-03-22 18:33:32 -07:00
Jason Evans
64e458f5cd Implement two-phase decay-based purging.
Split decay-based purging into two phases, the first of which uses lazy
purging to convert dirty pages to "muzzy", and the second of which uses
forced purging, decommit, or unmapping to convert pages to clean or
destroy them altogether.  Not all operating systems support lazy
purging, yet the application may provide extent hooks that implement
lazy purging, so care must be taken to dynamically omit the first phase
when necessary.

The mallctl interfaces change as follows:
- opt.decay_time --> opt.{dirty,muzzy}_decay_time
- arena.<i>.decay_time --> arena.<i>.{dirty,muzzy}_decay_time
- arenas.decay_time --> arenas.{dirty,muzzy}_decay_time
- stats.arenas.<i>.pdirty --> stats.arenas.<i>.p{dirty,muzzy}
- stats.arenas.<i>.{npurge,nmadvise,purged} -->
  stats.arenas.<i>.{dirty,muzzy}_{npurge,nmadvise,purged}

This resolves #521.
2017-03-15 13:13:47 -07:00
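For reference, reading one of the renamed statistics listed above via mallctl, assuming the installed <jemalloc/jemalloc.h> header (the counter name is taken from the list; the epoch refresh needed for up-to-date stats is omitted for brevity):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <jemalloc/jemalloc.h>

static void
print_arena0_dirty_npurge(void) {
    uint64_t npurge;
    size_t sz = sizeof(npurge);
    if (mallctl("stats.arenas.0.dirty_npurge", &npurge, &sz, NULL, 0) == 0) {
        printf("arena 0 dirty_npurge: %" PRIu64 "\n", npurge);
    }
}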
Jason Evans
38a5bfc816 Move arena_t's purging field into arena_decay_t. 2017-03-15 13:13:47 -07:00
Jason Evans
765edd67b4 Refactor decay-related function parametrization.
Refactor most of the decay-related functions to take as parameters the
decay_t and associated extents_t structures to operate on.  This
prepares for supporting both lazy and forced purging on different decay
schedules.
2017-03-15 13:13:47 -07:00
David Goldblatt
ee202efc79 Convert remaining arena_stats_t fields to atomics
These were all size_ts, for which we have atomics support on all platforms, so
the conversion is straightforward.

curlextents is left non-atomic; AFAICT it is not used atomically anywhere.
2017-03-13 18:22:33 -07:00
David Goldblatt
4fc2acf5ae Switch atomic uint64_ts in arena_stats_t to C11 atomics
I expect this to be the trickiest conversion we will see, since we want atomics
on 64-bit platforms, but are also always able to piggyback on some sort of
external synchronization on non-64 bit platforms.
2017-03-13 18:22:33 -07:00
Jason Evans
7cbcd2e2b7 Fix pages_purge_forced() to discard pages on non-Linux systems.
madvise(..., MADV_DONTNEED) only causes demand-zeroing on Linux, so fall
back to overlaying a new mapping.
2017-03-13 18:19:57 -07:00
David Goldblatt
21a68e2d22 Convert rtree code to use C11 atomics
In the process, I changed the implementation of rtree_elm_acquire so that it
won't even try to CAS if its initial read (getting the extent + lock bit)
indicates that the CAS is doomed to fail.  This can significantly improve
performance under contention.
2017-03-13 12:05:27 -07:00
Jason Evans
3a2b183d5f Convert arena_t's purging field to non-atomic bool.
The decay mutex already protects all accesses.
2017-03-10 10:14:30 -08:00
Jason Evans
75fddc786c Fix ATOMIC_{ACQUIRE,RELEASE,ACQ_REL} definitions. 2017-03-09 00:57:37 -08:00
Qi Wang
ec532e2c5c Implement per-CPU arena.
The new feature, opt.percpu_arena, determines thread-arena association
dynamically based on CPU id.  Three modes are supported: "percpu", "phycpu"
and disabled.

"percpu" uses the current core id (with help from sched_getcpu())
directly as the arena index, while "phycpu" will assign threads on the
same physical CPU to the same arena. In other words, "percpu" means # of
arenas == # of CPUs, while "phycpu" has # of arenas == 1/2 * (# of
CPUs). Note that no runtime check on whether hyper threading is enabled
is added yet.

When enabled, threads will be migrated between arenas when a CPU change
is detected. In the current design, to reduce overhead from reading CPU
id, each arena tracks the thread accessed most recently. When a new
thread comes in, we will read CPU id and update arena if necessary.
2017-03-08 23:19:01 -08:00
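Assuming the standard malloc_conf hook, the dynamic mode described above could be selected like so ("phycpu" is the other dynamic mode named in the commit):

/* Bind threads to arenas by current CPU id. */
const char *malloc_conf = "percpu_arena:percpu";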
Qi Wang
8721e19c04 Fix arena_prefork lock rank order for witness.
When witness is enabled, lock rank order needs to be preserved during
prefork, not only for each arena, but also across arenas. This change
breaks arena_prefork into further stages to ensure valid rank order
across arenas. Also changed test/unit/fork to use a manual arena to
catch this case.
2017-03-08 23:07:27 -08:00
David Goldblatt
8adab26972 Convert extents_t's npages field to use C11-style atomics
In the process, we can do some strength reduction, changing the fetch-adds and
fetch-subs to be simple loads followed by stores, since the modifications all
occur while holding the mutex.
2017-03-08 21:27:09 -08:00
David Goldblatt
dafadce622 Reintroduce JEMALLOC_ATOMIC_U64
The C11 atomics backport removed this #define, which degraded atomic 64-bit
reads to require a lock even on platforms that support them.  This commit fixes
that.
2017-03-08 21:26:37 -08:00
Qi Wang
01f47f11a6 Store associated arena in tcache.
This fixes tcache_flush for manual tcaches, which weren't able to find
the correct arena they were associated with.  Also changed the decay test to
cover this case (by using manually created arenas).
2017-03-07 12:58:11 -08:00
Jason Evans
cc75c35db5 Add any() and remove_any() to ph.
These functions select the easiest-to-remove element in the heap, which
is either the most recently inserted aux list element or the root.  If
no calls are made to first() or remove_first(), the behavior (and time
complexity) is the same as for a LIFO queue.
2017-03-07 10:25:33 -08:00
Jason Evans
e201e24904 Perform delayed coalescing prior to purging.
Rather than purging uncoalesced extents, perform just enough incremental
coalescing to purge only fully coalesced extents.  In the absence of
cached extent reuse, the immediate versus delayed incremental purging
algorithms result in the same purge order.

This resolves #655.
2017-03-07 10:25:12 -08:00
David Goldblatt
4f1e94658a Change arena to use the atomic functions for ssize_t instead of the union strategy 2017-03-06 18:49:19 -08:00
David Goldblatt
438efede78 Add atomic types for ssize_t 2017-03-06 18:49:19 -08:00
David Goldblatt
424e3428b1 Make type abbreviations consistent: ssize_t is zd everywhere 2017-03-06 18:49:19 -08:00
David Goldblatt
84326c566a Insert not_reached after an exhaustive switch
In the C11 atomics backport, we couldn't use not_reached() in
atomic_enum_to_builtin (in atomic_gcc_atomic.h), since atomic.h was hermetic and
assert.h wasn't; there was a dependency issue.  assert.h is hermetic now, so we
can include it.
2017-03-06 15:08:43 -08:00
David Goldblatt
e9852b5776 Disentangle assert and util
This is the first header refactoring diff, #533.  It splits the assert and util
components into separate, hermetic, header files.  In the process, it splits out
two of the large sub-components of util (the stdio.h replacement, and bit
manipulation routines) into their own components (malloc_io.h and bit_util.h).
This is mostly to break up cyclic dependencies, but it also breaks off a good
chunk of the catch-all-ness of util, which is nice.
2017-03-06 15:08:43 -08:00
Jason Evans
04d8fcb745 Optimize malloc_large_stats_t maintenance.
Convert the nrequests field to be partially derived, and the curlextents
to be fully derived, in order to reduce the number of stats updates
needed during common operations.

This change affects ndalloc stats during arena reset, because it is no
longer possible to cancel out ndalloc effects (curlextents would become
negative).
2017-03-04 08:18:31 -08:00
David Goldblatt
d4ac7582f3 Introduce a backport of C11 atomics
This introduces a backport of C11 atomics.  It has four implementations; ranked
in order of preference, they are:
- GCC/Clang __atomic builtins
- GCC/Clang __sync builtins
- MSVC _Interlocked builtins
- C11 atomics, from <stdatomic.h>

The primary advantages are:
- Close adherence to the standard API gives us a defined memory model.
- Type safety: atomic objects are now separate types from non-atomic ones, so
  that it's impossible to mix up atomic and non-atomic updates (which is
  undefined behavior that compilers are starting to take advantage of).
- Efficiency: we can specify ordering for operations, avoiding fences and
  atomic operations on strongly ordered architectures (example:
  `atomic_write_u32(ptr, val);` involves a CAS loop, whereas
  `atomic_store(ptr, val, ATOMIC_RELEASE);` is a plain store).

This diff leaves in the current atomics API (implementing them in terms of the
backport).  This lets us transition uses over piecemeal.

Testing:
This is by nature hard to test. I've manually tested the first three options on
Linux on gcc by futzing with the #defines manually, on freebsd with gcc and
clang, on MSVC, and on OS X with clang.  All of these were x86 machines though,
and we don't have any test infrastructure set up for non-x86 platforms.
2017-03-03 13:40:59 -08:00
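The ordering point above, restated with plain C11 atomics (jemalloc's generated wrappers differ in naming): a release store compiles to a plain store on strongly ordered architectures, whereas a read-modify-write does not.

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t flag;

static void
publish(uint32_t val) {
    /* A plain store on x86; no lock-prefixed RMW and no fence. */
    atomic_store_explicit(&flag, val, memory_order_release);
}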
David Goldblatt
957b8c5f21 Stop #define-ining away 'inline'
In the long term, we'll transition to C99-style inline semantics.  In the
short-term, this will allow both styles to coexist without breaking one another.
2017-03-03 13:40:59 -08:00
Jason Evans
fd058f572b Immediately purge cached extents if decay_time is 0.
This fixes a regression caused by
54269dc0ed (Remove obsolete
arena_maybe_purge() call.), as well as providing a general fix.

This resolves #665.
2017-03-02 19:43:06 -08:00
Jason Evans
d61a5f76b2 Convert arena_decay_t's time to be atomically synchronized. 2017-03-02 19:43:06 -08:00
Jason Evans
472fef2e12 Fix {allocated,nmalloc,ndalloc,nrequests}_large stats regression.
This fixes a regression introduced by
d433471f58 (Derive
{allocated,nmalloc,ndalloc,nrequests}_large stats.).
2017-02-27 11:18:07 -08:00
Jason Evans
079b8bee37 Tidy up extent quantization.
Remove obsolete unit test scaffolding for extent quantization.  Remove
redundant assertions.  Add an assertion to
extents_first_best_fit_locked() that should help prevent aligned
allocation regressions.
2017-02-27 11:17:47 -08:00
Jason Evans
d727596bcb Update a comment. 2017-02-26 11:05:27 -08:00
Qi Wang
c2323e13a5 Get rid of witness in malloc_mutex_t when !(configured w/ debug).
We don't touch witness at all when config_debug == false.  Let's only pay the
memory cost in malloc_mutex_s when needed. Note that when !config_debug, we keep
the field in a union so that we don't have to do #ifdefs in multiple places.
2017-02-24 09:41:29 -08:00
Jason Evans
8ac7937eb5 Remove remainder of mb (memory barrier).
This complements 94c5d22a4d (Remove mb.h,
which is unused).
2017-02-22 00:24:14 -08:00
Jason Evans
003ca8717f Move arena_basic_stats_merge() prototype (hygienic cleanup). 2017-02-21 12:46:20 -08:00
Jason Evans
2dfc5b5aac Disable coalescing of cached extents.
Extent splitting and coalescing is a major component of large allocation
overhead, and disabling coalescing of cached extents provides a simple
and effective hysteresis mechanism.  Once two-phase purging is
implemented, it will probably make sense to leave coalescing disabled
for the first phase, but coalesce during the second phase.
2017-02-16 20:11:50 -08:00
Jason Evans
b0654b95ed Fix arena->stats.mapped accounting.
Mapped memory increases when extent_alloc_wrapper() succeeds, and
decreases when extent_dalloc_wrapper() is called (during purging).
2017-02-16 15:52:11 -08:00
Jason Evans
f8fee6908d Synchronize arena->decay with arena->decay.mtx.
This removes the last use of arena->lock.
2017-02-16 09:39:46 -08:00
Jason Evans
d433471f58 Derive {allocated,nmalloc,ndalloc,nrequests}_large stats.
This mildly reduces stats update overhead during normal operation.
2017-02-16 09:39:46 -08:00
Jason Evans
ab25d3c987 Synchronize arena->tcache_ql with arena->tcache_ql_mtx.
This replaces arena->lock synchronization.
2017-02-16 09:39:46 -08:00
Jason Evans
6b5cba4191 Convert arena->stats synchronization to atomics. 2017-02-16 09:39:46 -08:00
Jason Evans
fa2d64c94b Convert arena->prof_accumbytes synchronization to atomics. 2017-02-16 09:39:46 -08:00
Jason Evans
b779522b9b Convert arena->dss_prec synchronization to atomics. 2017-02-16 09:39:46 -08:00
Jason Evans
0721b895ff Do not generate unused tsd_*_[gs]et() functions.
This avoids a gcc diagnostic note:
    note: The ABI for passing parameters with 64-byte alignment has
    changed in GCC 4.6
This note related to the cacheline alignment of rtree_ctx_t, which was
introduced by 4a346f5593 (Replace rtree
path cache with LRU cache.).
2017-02-13 10:47:16 -08:00
Jason Evans
6b8ef771a9 Fix rtree_subkey() regression.
Fix rtree_subkey() to use uintptr_t rather than unsigned for key
bitmasking.  This regression was introduced by
4a346f5593 (Replace rtree path cache with
LRU cache.).
2017-02-10 09:05:02 -08:00
Jason Evans
7f55dbef9b Enable mutex witnesses even when !isthreaded.
This fixes interactions with witness_assert_depth[_to_rank](), which was
added in d0e93ada51 (Add
witness_assert_depth[_to_rank]().).
2017-02-09 17:05:47 -08:00
Jason Evans
db7da56359 Spin adaptively in rtree_elm_acquire(). 2017-02-08 18:50:03 -08:00
Jason Evans
de8a68e853 Enhance spin_adaptive() to yield after several iterations.
This avoids worst case behavior if e.g. another thread is preempted
while owning the resource the spinning thread is waiting for.
2017-02-08 18:50:03 -08:00
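A hypothetical helper in the spirit of the enhancement above (not jemalloc's actual spin_t/spin_adaptive() implementation): spin with exponential backoff for a few rounds, then start yielding so a preempted resource owner can run.

#include <sched.h>
#include <stdint.h>

typedef struct { uint32_t iteration; } spin_sketch_t;

static void
spin_adaptive_sketch(spin_sketch_t *spin) {
    if (spin->iteration < 5) {
        volatile uint32_t i;
        for (i = 0; i < (1U << spin->iteration); i++) {
            /* Busy-wait; a CPU pause hint could go here. */
        }
        spin->iteration++;
    } else {
        sched_yield(); /* stop burning CPU if the owner is preempted */
    }
}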
Jason Evans
5f11830754 Replace spin_init() with SPIN_INITIALIZER. 2017-02-08 18:50:03 -08:00
Jason Evans
650c070e10 Remove rtree support for 0 (NULL) keys.
NULL can never actually be inserted in practice, and removing support
allows a branch to be removed from the fast path.
2017-02-08 18:50:03 -08:00
Jason Evans
f5cf9b19c8 Determine rtree levels at compile time.
Rather than dynamically building a table to aid per level computations,
define a constant table at compile time.  Omit both high and low
insignificant bits.  Use one to three tree levels, depending on the
number of significant bits.
2017-02-08 18:50:03 -08:00
Jason Evans
ff4db5014e Remove rtree leading 0 bit optimization.
A subsequent change instead ignores insignificant high bits.
2017-02-08 18:50:03 -08:00
Jason Evans
cdc240d501 Make non-essential inline rtree functions static functions. 2017-02-08 18:50:03 -08:00
Jason Evans
c511a44e99 Split rtree_elm_lookup_hard() out of rtree_elm_lookup().
Anything but a hit in the first element of the lookup cache is
expensive enough to negate the benefits of inlining.
2017-02-08 18:50:03 -08:00
Jason Evans
4a346f5593 Replace rtree path cache with LRU cache.
Rework rtree_ctx_t to encapsulate an rtree leaf LRU lookup cache rather
than a single-path element lookup cache.  The replacement is logically
much simpler, as well as slightly faster in the fast path case and less
prone to degraded performance during non-trivial sequences of lookups.
2017-02-08 18:50:03 -08:00
Jason Evans
0ecf692726 Optimize a branch out of rtree_read() if !dependent. 2017-02-08 18:50:03 -08:00
Jason Evans
d27f29b468 Disentangle arena and extent locking.
Refactor arena and extent locking protocols such that arena and
extent locks are never held when calling into the extent_*_wrapper()
API.  This requires extra care during purging since the arena lock no
longer protects the inner purging logic.  It also requires extra care to
protect extents from being merged with adjacent extents.

Convert extent_t's 'active' flag to an enumerated 'state', so that
retained extents are explicitly marked as such, rather than depending on
ring linkage state.

Refactor the extent collections (and their synchronization) for cached
and retained extents into extents_t.  Incorporate LRU functionality to
support purging.  Incorporate page count accounting, which replaces
arena->ndirty and arena->stats.retained.

Assert that no core locks are held when entering any internal
[de]allocation functions.  This is in addition to existing assertions
that no locks are held when entering external [de]allocation functions.

Audit and document synchronization protocols for all arena_t fields.

This fixes a potential deadlock due to recursive allocation during
gdump, in a similar fashion to b49c649bc1
(Fix lock order reversal during gdump.), but with a necessarily much
broader code impact.
2017-02-01 16:43:46 -08:00
Jason Evans
1b6e43507e Fix/refactor tcaches synchronization.
Synchronize tcaches with tcaches_mtx rather than ctl_mtx.  Add missing
synchronization for tcache flushing.  This bug was introduced by
1cb181ed63 (Implement explicit tcache
support.), which was first released in 4.0.0.
2017-02-01 16:43:46 -08:00
Jason Evans
d0e93ada51 Add witness_assert_depth[_to_rank]().
This makes it possible to make lock state assertions about precisely
which locks are held.
2017-02-01 16:43:46 -08:00
Jason Evans
c0cc5db871 Replace tabs following #define with spaces.
This resolves #564.
2017-01-20 21:45:53 -08:00
Jason Evans
f408643a4c Remove extraneous parens around return arguments.
This resolves #540.
2017-01-20 21:43:07 -08:00
Jason Evans
c4c2592c83 Update brace style.
Add braces around single-line blocks, and remove line breaks before
function-opening braces.

This resolves #537.
2017-01-20 21:43:07 -08:00
Jason Evans
9eb1b1c881 Fix --disable-stats support.
Fix numerous regressions that were exposed by --disable-stats, both in
the core library and in the tests.
2017-01-19 18:31:07 -08:00
Qi Wang
58424e679d Added stats about number of bytes cached in tcache currently. 2017-01-18 10:55:21 -08:00
Mike Hommey
0f7376eb62 Don't rely on OSX SDK malloc/malloc.h for malloc_zone struct definitions
The SDK jemalloc is built against might be not be the latest for various
reasons, but the resulting binary ought to work on newer versions of
OSX.

In order to ensure this, we need the fullest definitions possible, so
copy what we need from the latest version of malloc/malloc.h available
on opensource.apple.com.
2017-01-17 20:13:28 -08:00
Jason Evans
1ff09534b5 Fix prof_realloc() regression.
Mostly revert the prof_realloc() changes in
498856f44a (Move slabs out of chunks.) so
that prof_free_sampled_object() is called when appropriate.  Leave the
prof_tctx_[re]set() optimization in place, but add an assertion to
verify that all eight cases are correctly handled.  Add a comment to
make clear the code ordering, so that the regression originally fixed by
ea8d97b897 (Fix
prof_{malloc,free}_sample_object() call order in prof_realloc().) is not
repeated.

This resolves #499.
2017-01-17 15:16:37 -08:00
Jason Evans
ffbb7dac3d Remove leading blank lines from function bodies.
This resolves #535.
2017-01-13 14:49:24 -08:00
David Goldblatt
77cccac8cd Break up headers into constituent parts
This is part of a broader change to make header files better represent the
dependencies between one another (see
https://github.com/jemalloc/jemalloc/issues/533). It breaks up component headers
into smaller parts that can be made to have a simpler dependency graph.

For the autogenerated headers (smoothstep.h and size_classes.h), no splitting
was necessary, so I didn't add support to emit multiple headers.
2017-01-12 15:43:51 -08:00
David Goldblatt
94c5d22a4d Remove mb.h, which is unused 2017-01-11 13:24:30 -08:00
John Paul Adrian Glaubitz
77de5f27d8 Use better pre-processor defines for sparc64
Currently, jemalloc detects sparc64 targets by checking whether
__sparc64__ is defined. However, this definition is used on BSD
targets only. Linux targets define both __sparc__ and __arch64__
for sparc64. Since this also works on BSD, use __sparc__
and __arch64__ rather than __sparc64__ to detect sparc64 targets.
2017-01-10 17:39:54 -08:00
Jason Evans
edf1bafb2b Implement arena.<i>.destroy .
Add MALLCTL_ARENAS_DESTROYED for accessing destroyed arena stats as an
analogue to MALLCTL_ARENAS_ALL.

This resolves #382.
2017-01-06 18:58:46 -08:00
Jason Evans
6edbedd916 Range-check mib[1] --> arena_ind casts. 2017-01-06 18:58:46 -08:00
Jason Evans
c0a05e6aba Move static ctl_epoch variable into ctl_stats_t (as epoch). 2017-01-06 18:58:45 -08:00
Jason Evans
d778dd2afc Refactor ctl_stats_t.
Refactor ctl_stats_t to be a demand-zeroed non-growing data structure.
To keep the size from being onerous (~60 MiB) on 32-bit systems, convert
the arenas field to contain pointers rather than directly embedded
ctl_arena_stats_t elements.
2017-01-06 18:58:45 -08:00
Jason Evans
0f04bb1d6f Rename the arenas.extend mallctl to arenas.create. 2017-01-06 18:58:45 -08:00
Jason Evans
3dc4e83ccb Add MALLCTL_ARENAS_ALL.
Add the MALLCTL_ARENAS_ALL cpp macro as a fixed index for use
in accessing the arena.<i>.{purge,decay,dss} and stats.arenas.<i>.*
mallctls, and deprecate access via the arenas.narenas index (to be
removed in 6.0.0).
2017-01-06 18:58:45 -08:00
Jason Evans
a0dd3a4483 Implement per arena base allocators.
Add/rename related mallctls:
- Add stats.arenas.<i>.base .
- Rename stats.arenas.<i>.metadata to stats.arenas.<i>.internal .
- Add stats.arenas.<i>.resident .

Modify the arenas.extend mallctl to take an optional (extent_hooks_t *)
argument so that it is possible for all base allocations to be serviced
by the specified extent hooks.

This resolves #463.
2016-12-26 18:08:28 -08:00
Jason Evans
a6e86810d8 Refactor purging and splitting/merging.
Split purging into lazy and forced variants.  Use the forced variant for
zeroing dss.

Add support for NULL function pointers as an opt-out mechanism for the
dalloc, commit, decommit, purge_lazy, purge_forced, split, and merge
fields of extent_hooks_t.

Add short-circuiting checks in large_ralloc_no_move_{shrink,expand}() so
that no attempt is made if splitting/merging is not supported.

This resolves #268.
2016-12-26 18:08:16 -08:00
Jason Evans
884fa22b8c Rename arena_decay_t's ndirty to nunpurged. 2016-12-26 17:59:43 -08:00
Jason Evans
411697adcd Use exponential series to size extents.
If virtual memory is retained, allocate extents such that their sizes
form an exponentially growing series.  This limits the number of
disjoint virtual memory ranges so that extent merging can be effective
even if multiple arenas' extent allocation requests are highly
interleaved.

This resolves #462.
2016-12-26 17:59:42 -08:00
Jason Evans
c1baa0a9b7 Add huge page configuration and pages_[no]huge().
Add the --with-lg-hugepage configure option, but automatically configure
LG_HUGEPAGE even if it isn't specified.

Add the pages_[no]huge() functions, which toggle huge page state via
madvise(..., MADV_[NO]HUGEPAGE) calls.
2016-12-26 17:59:34 -08:00
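A sketch of the toggling behavior described above, under the assumption of Linux transparent huge page flags (the real pages_[no]huge() also covers platforms without them):

#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

static bool
pages_huge_sketch(void *addr, size_t size, bool huge) {
#if defined(MADV_HUGEPAGE) && defined(MADV_NOHUGEPAGE)
    return madvise(addr, size, huge ? MADV_HUGEPAGE : MADV_NOHUGEPAGE) == 0;
#else
    (void)addr; (void)size; (void)huge;
    return false; /* no THP control on this platform */
#endif
}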
Jason Evans
bacb6afc6c Simplify arena_slab_regind().
Rewrite arena_slab_regind() to provide sufficient constant data for
the compiler to perform division strength reduction.  This replaces
more general manual strength reduction that was implemented before
arena_bin_info was compile-time-constant.  It would be possible to
slightly improve on the compiler-generated division code by taking
advantage of range limits that the compiler doesn't know about.
2016-12-23 10:34:34 -08:00
Jason Evans
69c26cdb01 Add some missing explicit casts. 2016-12-13 13:38:11 -08:00
Dave Watson
2319152d9f jemalloc cpp new/delete bindings
Adds cpp bindings for jemalloc, along with necessary autoconf settings.
This is mostly to add sized deallocation support, which can't be added
from C directly.  Sized deallocation is ~10% microbench improvement.

* Import ax_cxx_compile_stdcxx.m4 from the autoconf repo, seems like the
  easiest way to get c++14 detection.
* Adds various other changes, like CXXFLAGS, to configure.ac.
* Adds new rules to Makefile.in for src/jemalloc-cpp.cpp, and a basic
  unittest.
* Both new and delete are overridden, to ensure jemalloc is used for
  both.
* TODO future enhancement of avoiding extra PLT thunks for new and
  delete - sdallocx and malloc are publicly exported jemalloc symbols,
  using an alias would link them directly.  Unfortunately, was having
  trouble getting it to play nice with jemalloc's namespace support.

Testing:
Tested gcc 4.8, gcc 5, gcc 5.2, clang 4.0.  Only gcc >= 5 has sized
deallocation support, verified that the rest build correctly.

Tested mac osx and Centos.

Tested --with-jemalloc-prefix and --without-export.

This resolves #202.
2016-12-12 18:36:06 -08:00
Jason Evans
d4c5aceb7c Add a_type parameter to qr_{meld,split}(). 2016-12-12 18:16:51 -08:00
Jason Evans
acb7b1f53e Add --disable-syscall.
This resolves #517.
2016-12-03 16:50:58 -08:00
Jason Evans
32127949a3 Enable overriding JEMALLOC_{ALLOC,FREE}_JUNK.
This resolves #509.
2016-11-22 10:58:58 -08:00
Jason Evans
c3b85f2585 Style fixes. 2016-11-22 10:58:23 -08:00
Jason Evans
5234be2133 Add pthread_atfork(3) feature test.
Some versions of Android provide a pthreads library without providing
pthread_atfork(), so in practice a separate feature test is necessary
for the latter.
2016-11-17 15:14:57 -08:00
Jason Evans
fda60be799 Update a comment. 2016-11-17 11:50:52 -08:00
Jason Evans
a64123ce13 Refactor madvise(2) configuration.
Add feature tests for the MADV_FREE and MADV_DONTNEED flags to
madvise(2), so that MADV_FREE is detected and used for Linux kernel
versions 4.5 and newer.  Refactor pages_purge() so that on systems which
support both flags, MADV_FREE is preferred over MADV_DONTNEED.

This resolves #387.
2016-11-17 10:31:57 -08:00
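A condensed sketch of the preference described above (the real pages_purge() handles more cases, e.g. Windows and overcommit configuration):

#include <stddef.h>
#include <sys/mman.h>

static int
pages_purge_sketch(void *addr, size_t size) {
#if defined(MADV_FREE)
    return madvise(addr, size, MADV_FREE);      /* lazy; preferred */
#elif defined(MADV_DONTNEED)
    return madvise(addr, size, MADV_DONTNEED);  /* forced fallback */
#else
    (void)addr; (void)size;
    return -1;
#endif
}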
Jason Evans
a38acf716e Add extent serial numbers.
Add extent serial numbers and use them where appropriate as a sort key
that is higher priority than address, so that the allocation policy
prefers older extents.

This resolves #147.
2016-11-15 13:08:33 -08:00
Jason Evans
cda59f9970 Rename atomic_*_{uint32,uint64,u}() to atomic_*_{u32,u64,zu}().
This change conforms to naming conventions throughout the codebase.
2016-11-07 11:27:48 -08:00
Jason Evans
2e46b13ad5 Revert "Define 64-bits atomics unconditionally"
This reverts commit c2942e2c0e.

This resolves #495.
2016-11-07 10:53:35 -08:00
Jason Evans
04b463546e Refactor prng to not use 64-bit atomics on 32-bit platforms.
This resolves #495.
2016-11-07 10:52:44 -08:00
Jason Evans
ea9961acdb Fix psz/pind edge cases.
Add an "over-size" extent heap in which to store extents which exceed
the maximum size class (plus cache-oblivious padding, if enabled).
Remove psz2ind_clamp() and use psz2ind() instead so that trying to
allocate the maximum size class can in principle succeed.  In practice,
this allows assertions to hold so that OOM errors can be successfully
generated.
2016-11-03 22:33:34 -07:00
Jason Evans
8dd5ea87ca Fix extent_alloc_cache[_locked]() to support decommitted allocation.
Fix extent_alloc_cache[_locked]() to support decommitted allocation, and
use this ability in arena_stash_dirty(), so that decommitted extents are
not needlessly committed during purging.  In practice this does not
happen on any currently supported systems, because both extent merging
and decommit must be implemented; all supported systems implement one
xor the other.
2016-11-03 22:33:23 -07:00
Jason Evans
4f7d8c2dee Update symbol mangling. 2016-11-03 15:00:02 -07:00
Dave Watson
25f7bbcf28 Fix long spinning in rtree_node_init
rtree_node_init spinlocks the node, allocates, and then sets the node.
This is under heavy contention at the top of the tree if many threads
start to allocate at the same time.

Instead, take a per-rtree sleeping mutex to reduce spinning.  Tested
both pthreads and osx OSSpinLock, and both reduce spinning adequately.

Previous benchmark time:
./ttest1 500 100
~15s

New benchmark time:
./ttest1 500 100
.57s
2016-11-02 20:30:53 -07:00
Jason Evans
d82f2b3473 Do not use syscall(2) on OS X 10.12 (deprecated). 2016-11-02 19:18:33 -07:00
Jason Evans
795f6689de Add os_unfair_lock support.
OS X 10.12 deprecated OSSpinLock; os_unfair_lock is the recommended
replacement.
2016-11-02 18:09:45 -07:00
Jason Evans
d9f7b2a430 Fix/refactor zone allocator integration code.
Fix zone_force_unlock() to reinitialize, rather than unlocking mutexes,
since OS X 10.12 cannot tolerate a child unlocking mutexes that were
locked by its parent.

Refactor; this was a side effect of experimenting with zone
{de,re}registration during fork(2).
2016-11-02 18:06:40 -07:00
Jason Evans
90b60eeae4 Add an assertion in witness_owner(). 2016-10-31 15:28:22 -07:00
Jason Evans
6a834d94bb Refactor witness_unlock() to fix undefined test behavior.
This resolves #396.
2016-10-31 11:49:12 -07:00
Jason Evans
6c80321aed Use CLOCK_MONOTONIC_COARSE rather than CLOCK_MONOTONIC_RAW.
The raw clock variant is slow (even relative to plain CLOCK_MONOTONIC),
whereas the coarse clock variant is faster than CLOCK_MONOTONIC, but
still has resolution (~1ms) that is adequate for our purposes.

This resolves #479.
2016-10-29 22:58:18 -07:00
Dave Watson
8309388408 Support static linking of jemalloc with glibc
glibc defines its malloc implementation with several weak and strong
symbols:

strong_alias (__libc_calloc, __calloc) weak_alias (__libc_calloc, calloc)
strong_alias (__libc_free, __cfree) weak_alias (__libc_free, cfree)
strong_alias (__libc_free, __free) strong_alias (__libc_free, free)
strong_alias (__libc_malloc, __malloc) strong_alias (__libc_malloc, malloc)

The issue is not with the weak symbols, but that other parts of glibc
depend on __libc_malloc explicitly.  Defining them in terms of jemalloc
API's allows the linker to drop glibc's malloc.o completely from the link,
and static linking no longer results in symbol collisions.

Another wrinkle: jemalloc during initialization calls sysconf to
get the number of CPU's.  GLIBC allocates for the first time before
setting up isspace (and other related) tables, which are used by
sysconf.  Instead, use the pthread API to get the number of
CPUs with GLIBC, which seems to work.

This resolves #442.
2016-10-28 15:08:19 -07:00
Jason Evans
48d4adfbeb Avoid negation of unsigned numbers.
Rather than relying on two's complement negation for alignment mask
generation, use bitwise not and addition.  This dodges warnings from
MSVC, and should be strength-reduced by compiler optimization anyway.
2016-10-27 21:26:33 -07:00
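A small example of the transformation, using an ALIGNMENT_CEILING macro in the spirit of jemalloc's: for an unsigned value a, (~a + 1) yields the same mask as -a without MSVC's warning about negating an unsigned operand.

```c
#include <assert.h>
#include <stddef.h>

/* Round s up to a multiple of alignment (a power of two).  The mask is
 * built with bitwise-not plus one rather than unary minus on an unsigned
 * operand. */
#define ALIGNMENT_CEILING(s, alignment) \
	(((s) + ((alignment) - 1)) & ((~(size_t)(alignment)) + 1))

int
main(void) {
	assert(ALIGNMENT_CEILING((size_t)13, (size_t)16) == 16);
	assert(ALIGNMENT_CEILING((size_t)4096, (size_t)4096) == 4096);
	return 0;
}
```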
Jason Evans
b54d160dc4 Do not (recursively) allocate within tsd_fetch().
Refactor tsd so that tsdn_fetch() does not trigger allocation, since
allocation could cause infinite recursion.

This resolves #458.
2016-10-20 23:59:12 -07:00
Jason Evans
577d4572b0 Make dss operations lockless.
Rather than protecting dss operations with a mutex, use atomic
operations.  This has negligible impact on synchronization overhead
during typical dss allocation, but is a substantial improvement for
extent_in_dss() and the newly added extent_dss_mergeable(), which can be
called multiple times during extent deallocations.

This change also has the advantage of avoiding tsd in deallocation paths
associated with purging, which resolves potential deadlocks during
thread exit due to attempted tsd resurrection.

This resolves #425.
2016-10-13 15:37:00 -07:00
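A hedged sketch of the lockless check, assuming C11 atomics and simplified names: extent_in_dss() needs only atomic loads of the dss bounds, so deallocation paths touch neither a mutex nor tsd.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Updated by the dss allocation path with atomic stores / CAS. */
static _Atomic(void *) dss_base;
static _Atomic(void *) dss_max;

/* True iff addr lies within the sbrk-managed (dss) range.  Only atomic
 * loads are needed, so no lock acquisition happens on this path. */
static bool
extent_in_dss(const void *addr) {
	uintptr_t a = (uintptr_t)addr;
	uintptr_t lo = (uintptr_t)atomic_load_explicit(&dss_base,
	    memory_order_acquire);
	uintptr_t hi = (uintptr_t)atomic_load_explicit(&dss_max,
	    memory_order_acquire);

	return a >= lo && a < hi;
}
```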
Jason Evans
e5effef428 Add/use adaptive spinning.
Add spin_t and spin_{init,adaptive}(), which provide a simple
abstraction for adaptive spinning.

Adaptively spin during busy waits in bootstrapping and rtree node
initialization.
2016-10-13 14:55:39 -07:00
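A minimal sketch in the spirit of spin_t / spin_adaptive(); the spin budget of 5 iterations and the CPU_SPINWAIT placeholder are assumptions.

```c
#include <sched.h>

/* Placeholder for a CPU pause hint, e.g. the x86 "pause" instruction. */
#define CPU_SPINWAIT

typedef struct {
	unsigned iteration;
} spin_t;

#define SPIN_INITIALIZER	{0}

/* Spin with an exponentially growing busy-wait, then start yielding the
 * CPU once the spin budget is exhausted. */
static void
spin_adaptive(spin_t *spin) {
	if (spin->iteration < 5) {
		volatile unsigned i;

		for (i = 0; i < (1U << spin->iteration); i++) {
			CPU_SPINWAIT;
		}
		spin->iteration++;
	} else {
		sched_yield();
	}
}
```

A waiter would call this in a loop, e.g. `while (!ready) { spin_adaptive(&spin); }`, so short waits stay on-CPU and long waits yield.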
Jason Evans
9acd5cf178 Remove all vestiges of chunks.
Remove mallctls:
- opt.lg_chunk
- stats.cactive

This resolves #464.
2016-10-12 11:55:43 -07:00
Jason Evans
63b5657aa5 Remove ratio-based purging.
Make decay-based purging the default (and only) mode.

Remove associated mallctls:
- opt.purge
- opt.lg_dirty_mult
- arena.<i>.lg_dirty_mult
- arenas.lg_dirty_mult
- stats.arenas.<i>.lg_dirty_mult

This resolves #385.
2016-10-12 10:40:27 -07:00
Jason Evans
b4b4a77848 Fix and simplify decay-based purging.
Simplify decay-based purging so that attempts are triggered only when the
epoch is advanced, rather than every time purgeable memory increases.
In a correctly functioning system (not previously the case; see below),
this only causes a behavior difference if during subsequent purge
attempts the least recently used (LRU) purgeable memory extent is
initially too large to be purged, but that memory is reused between
attempts and one or more of the next LRU purgeable memory extents are
small enough to be purged.  In practice this is an arbitrary behavior
change that is within the set of acceptable behaviors.

As for the purging fix, ensure that arena->decay.ndirty is recorded
*after* the epoch advance and associated purging occur.  Prior to this
fix, purging during an epoch advance could leave
(arena->ndirty - arena->decay.ndirty) substantially underrepresentative,
i.e. too few dirty pages were attributed to the current epoch, and a
series of unintended purges could result.  This fix is also
relevant in the context of the simplification described above, but the
bug's impact would be limited to over-purging at epoch advances.
2016-10-11 15:30:01 -07:00
Jason Evans
5f11fb7d43 Do not advance decay epoch when time goes backwards.
Instead, move the epoch backward in time.  Additionally, add
nstime_monotonic() and use it in debug builds to assert that time only
goes backward if nstime_update() is using a non-monotonic time source.
2016-10-10 22:15:10 -07:00
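A hedged sketch of the rule stated above: if the clock appears to have moved backward, rewind the recorded epoch instead of advancing it. The nstime_t and decay_t definitions here are simplified stand-ins.

```c
#include <stdint.h>

typedef uint64_t nstime_t;	/* nanoseconds; illustration only */

typedef struct {
	nstime_t epoch;		/* time of the last epoch advance */
} decay_t;

static void
decay_maybe_advance_epoch(decay_t *decay, nstime_t now) {
	if (now < decay->epoch) {
		/* Non-monotonic time source: rewind the epoch rather than
		 * computing a negative interval or advancing spuriously. */
		decay->epoch = now;
		return;
	}
	/* ... normal path: advance the epoch and purge per the backlog ... */
}
```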
Jason Evans
ee0c74b77a Refactor arena->decay_* into arena->decay.* (arena_decay_t). 2016-10-10 20:32:19 -07:00
Jason Evans
e0164bc63c Refine nstime_update().
Add missing #include <time.h>.  The critical time facilities appear to
have been transitively included via unistd.h and sys/time.h, but in
principle this omission could have caused clock_gettime(CLOCK_MONOTONIC,
...) to be overlooked in favor of gettimeofday(), which in turn could
cause spurious non-monotonic time updates.

Refactor nstime_get() out of nstime_update() and add configure tests for
all variants.

Add CLOCK_MONOTONIC_RAW support (Linux-specific) and
mach_absolute_time() support (OS X-specific).

Do not fall back to clock_gettime(CLOCK_REALTIME, ...).  This was a
fragile Linux-specific workaround, which we're unlikely to use at all
now that clock_gettime(CLOCK_MONOTONIC_RAW, ...) is supported, and if we
have no choice besides non-monotonic clocks, gettimeofday() is only
incrementally worse.
2016-10-10 10:33:59 -07:00
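A hedged sketch of the resulting layering: a single nstime_get()-style reader (name and structure assumed, not copied from jemalloc) picks the best clock at compile time and never falls back to CLOCK_REALTIME.

```c
#include <stdint.h>
#include <time.h>
#include <sys/time.h>
#if defined(__APPLE__)
#include <mach/mach_time.h>
#endif

/* Read the best clock the platform offers; never CLOCK_REALTIME. */
static uint64_t
nstime_get(void) {
#if defined(CLOCK_MONOTONIC_RAW)
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#elif defined(CLOCK_MONOTONIC)
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#elif defined(__APPLE__)
	/* Ticks, not ns; a real implementation scales via mach_timebase_info(). */
	return mach_absolute_time();
#else
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return (uint64_t)tv.tv_sec * 1000000000ULL + (uint64_t)tv.tv_usec * 1000;
#endif
}
```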
Jason Evans
871a9498e1 Fix size class overflow bugs.
Avoid calling s2u() on raw extent sizes in extent_recycle().

Clamp psz2ind() (implemented as psz2ind_clamp()) when inserting/removing
into/from size-segregated extent heaps.
2016-10-03 14:18:55 -07:00
Eric Le Bihan
df0d273a07 Fix LG_QUANTUM definition for sparc64
GCC 4.9.3 cross-compiled for sparc64 defines __sparc_v9__ rather than
__sparc64__ or __sparcv9.  This prevents LG_QUANTUM from being defined
properly.  Adding this macro to the check solves the issue.
2016-09-26 15:13:07 -07:00
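The fixed preprocessor check plausibly looks like the following; treat the exact macro spelling and the LG_QUANTUM value as assumptions rather than a quote of the source.

```c
/* Accept any of the SPARC v9 predefined macros when choosing the
 * allocation quantum (LG_QUANTUM == log2(quantum)). */
#if (defined(__sparc64__) || defined(__sparcv9) || defined(__sparc_v9__))
#  define LG_QUANTUM	4	/* 16-byte quantum */
#endif
```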
Jason Evans
61f467e16a Avoid self assignment in tsd_set(). 2016-09-23 12:21:34 -07:00
Jason Evans
0222fb41d1 Add various mutex ownership assertions. 2016-09-23 12:21:34 -07:00
Jason Evans
73868b60f2 Fix extent_{before,last,past}() to return page-aligned results. 2016-09-23 12:21:34 -07:00
Jason Evans
f6d01ff4b7 Protect extents_dirty access with extents_mtx.
This fixes race conditions during purging.
2016-09-22 11:57:28 -07:00
Elliot Ronaghan
1167e9eff3 Check for __builtin_unreachable at configure time
Add a configure check for __builtin_unreachable instead of basing its
availability on the __GNUC__ version. On OS X, using gcc (a real gcc, not the
bundled compiler that merely provides a gcc front-end) leads to a linker assertion:

    https://github.com/jemalloc/jemalloc/issues/266

It turns out that this is caused by a gcc bug resulting from the use of
__builtin_unreachable():

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57438

To work around this bug, check that __builtin_unreachable() actually works at
configure time, and if it doesn't, use abort() instead. The check is based on
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57438#c21.

With this change, `make check` passes with Homebrew-installed gcc-5 and gcc-6.
2016-07-07 13:28:44 -07:00
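A hedged sketch of the source-level result: a configure-defined macro selects either __builtin_unreachable or abort(), and an unreachable() wrapper is used at call sites. The names follow the obvious convention but are assumptions here.

```c
#include <stdlib.h>

/* Configure would define this to __builtin_unreachable when the probe
 * succeeds, and to abort otherwise. */
#ifndef JEMALLOC_INTERNAL_UNREACHABLE
#  define JEMALLOC_INTERNAL_UNREACHABLE abort
#endif
#define unreachable()	JEMALLOC_INTERNAL_UNREACHABLE()

static int
parity(int x) {
	switch (x & 1) {
	case 0:
		return 0;
	case 1:
		return 1;
	default:
		unreachable();	/* lets the optimizer drop this arm */
	}
}
```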
Mike Hommey
c2942e2c0e Define 64-bit atomics unconditionally
They are used on all platforms in prng.h.
2016-06-09 23:17:39 +09:00
Mike Hommey
0dad5b7719 Fix extent_*_get to build with MSVC 2016-06-09 22:00:18 +09:00
Elliot Ronaghan
8a1a794b0c Don't use compact red-black trees with the pgi compiler
Some bug (either in the red-black tree code or in the pgi compiler) appears to
cause red-black trees to become unbalanced. The issue goes away if we don't use
compact red-black trees. Since red-black trees aren't used much anymore, I
opted for this easy workaround instead of digging in and trying to find the
root cause of the bug.

Some context in case it's helpful:

I experienced a ton of segfaults while using pgi as Chapel's target compiler
with jemalloc 4.0.4. The little bit of debugging I did pointed me somewhere
deep in red-black tree manipulation, but I didn't get a chance to investigate
further. It looks like 4.2.0 replaced most uses of red-black trees with
pairing-heaps, which seems to avoid whatever bug I was hitting.

However, `make check_unit` was still failing on the rb test, so I figured the
core issue was just being masked. Here's the `make check_unit` failure:

```sh
=== test/unit/rb ===
test_rb_empty: pass
tree_recurse:test/unit/rb.c:90: Failed assertion: (((_Bool) (((uintptr_t) (left_node)->link.rbn_right_red) & ((size_t)1)))) == (false) --> true != false: Node should be black
test_rb_random:test/unit/rb.c:274: Failed assertion: (imbalances) == (0) --> 1 != 0: Tree is unbalanced
tree_recurse:test/unit/rb.c:90: Failed assertion: (((_Bool) (((uintptr_t) (left_node)->link.rbn_right_red) & ((size_t)1)))) == (false) --> true != false: Node should be black
test_rb_random:test/unit/rb.c:274: Failed assertion: (imbalances) == (0) --> 1 != 0: Tree is unbalanced
node_remove:test/unit/rb.c:190: Failed assertion: (imbalances) == (0) --> 2 != 0: Tree is unbalanced
<jemalloc>: test/unit/rb.c:43: Failed assertion: "pathp[-1].cmp < 0"
test/test.sh: line 22: 12926 Aborted
Test harness error
```

While starting to debug I saw the RB_COMPACT option and decided to check whether
turning it off resolved the bug. It appears to have (`make check_unit` passes
and the segfaults under Chapel are gone), so this seems like an acceptable
workaround. I'd imagine this has performance implications for red-black trees
under pgi, but if they're not going to be used much anymore it's probably not a
big deal.
2016-06-08 14:48:55 -07:00
Jason Evans
dd752c1ffd Fix potential VM map fragmentation regression.
Revert 245ae6036c (Support --with-lg-page
values larger than actual page size.), because it could cause VM map
fragmentation if the kernel grows mmap()ed memory downward.

This resolves #391.
2016-06-07 14:15:49 -07:00
Jason Evans
4e910fc958 Fix extent_alloc_dss() regressions.
Page-align the gap, if any, and add/use extent_dalloc_gap(), which
registers the gap extent before deallocation.
2016-06-05 21:00:02 -07:00
Jason Evans
04942c3d90 Remove a stray memset(), and fix a junk filling test regression. 2016-06-05 21:00:02 -07:00