server-skynet-source-3rd-jemalloc

Author	SHA1	Message	Date
Daniel Micay	879e76a9e5	teach the dss chunk allocator to handle new_addr This provides in-place expansion of huge allocations when the end of the allocation is at the end of the sbrk heap. There's already the ability to extend in-place via recycled chunks but this handles the initial growth of the heap via repeated vector / string reallocations. A possible future extension could allow realloc to go from the following: \| huge allocation \| recycled chunks \| ^ dss_end To a larger allocation built from recycled and new chunks: \| huge allocation \| ^ dss_end Doing that would involve teaching the chunk recycling code to request new chunks to satisfy the request. The chunk_dss code wouldn't require any further changes. #include <stdlib.h> int main(void) { size_t chunk = 4 * 1024 * 1024; void *ptr = NULL; for (size_t size = chunk; size < chunk * 128; size *= 2) { ptr = realloc(ptr, size); if (!ptr) return 1; } } dss:secondary: 0.083s dss:primary: 0.083s After: dss:secondary: 0.083s dss:primary: 0.003s The dss heap grows in the upwards direction, so the oldest chunks are at the low addresses and they are used first. Linux prefers to grow the mmap heap downwards, so the trick will not work in the current mmap chunk allocator as a huge allocation will only be at the top of the heap in a contrived case.	2014-11-28 16:11:19 -08:00
Jason Evans	d49cb68b9e	Fix more pointer arithmetic undefined behavior. Reported by Guilherme Gonçalves. This resolves #166.	2014-11-17 10:31:59 -08:00
Jason Evans	2012d5a560	Fix pointer arithmetic undefined behavior. Reported by Denis Denisov.	2014-11-17 09:54:49 -08:00
Jason Evans	9cf2be0a81	Make quarantine_init() static.	2014-11-07 14:50:38 -08:00
Jason Evans	c002a5c800	Fix two quarantine regressions. Fix quarantine to actually update tsd when expanding, and to avoid double initialization (leaking the first quarantine) due to recursive initialization. This resolves #161.	2014-11-04 18:03:11 -08:00
Jason Evans	2b2f6dc1e4	Disable arena_dirty_count() validation.	2014-11-01 02:29:10 -07:00
Jason Evans	82cb603ed7	Don't dereference NULL tdata in prof_{enter,leave}(). It is possible for the thread's tdata to be NULL late during thread destruction, so take care not to dereference a NULL pointer in such cases.	2014-11-01 00:20:28 -07:00
Daniel Micay	dc65213111	rm unused arena wrangling from xallocx It has no use for the arena_t since unlike rallocx it never makes a new memory allocation. It's just an unused parameter in ixalloc_helper.	2014-10-30 23:19:34 -07:00
Jason Evans	cfc5706f69	Miscellaneous cleanups.	2014-10-30 23:18:45 -07:00
Daniel Micay	d33f834591	avoid redundant chunk header reads * use sized deallocation in iralloct_realign * iralloc and ixalloc always need the old size, so pass it in from the caller where it's often already calculated	2014-10-30 17:06:38 -07:00
Daniel Micay	809b0ac391	mark huge allocations as unlikely This cleans up the fast path a bit more by moving away more code.	2014-10-30 17:06:38 -07:00
Jason Evans	c93ed81cd0	Fix prof_{enter,leave}() calls to pass tdata_self.	2014-10-30 16:50:33 -07:00
Jason Evans	af1f592763	Use JEMALLOC_INLINE_C everywhere it's appropriate.	2014-10-30 16:38:08 -07:00
Jason Evans	8f47e3d82b	Merge pull request #151 from thestinger/ralloc use sized deallocation internally for ralloc	2014-10-16 13:12:05 -07:00
Daniel Micay	a9ea10d27c	use sized deallocation internally for ralloc The size of the source allocation is known at this point, so reading the chunk header can be avoided for the small size class fast path. This is not very useful right now, but it provides a significant performance boost with an alternate ralloc entry point taking the old size.	2014-10-16 15:39:59 -04:00
Jason Evans	c83bccd273	Initialize chunks_mtx for all configurations. This resolves #150.	2014-10-16 12:33:18 -07:00
Jason Evans	9673983443	Purge/zero sub-chunk huge allocations as necessary. Purge trailing pages during shrinking huge reallocation when resulting size is not a multiple of the chunk size. Similarly, zero pages if necessary during growing huge reallocation when the resulting size is not a multiple of the chunk size.	2014-10-15 18:02:02 -07:00
Jason Evans	bf8d6a1092	Add small run utilization to stats output. Add the 'util' column, which reports the proportion of available regions that are currently in use for each small size class. Small run utilization is the complement of external fragmentation. For example, utilization of 0.75 indicates that 25% of small run memory is consumed by external fragmentation, in other (more obtuse) words, 33% external fragmentation overhead. This resolves #27.	2014-10-15 16:18:42 -07:00
Jason Evans	9b41ac909f	Fix huge allocation statistics.	2014-10-14 22:20:00 -07:00
Jason Evans	3c4d92e82a	Add per size class huge allocation statistics. Add per size class huge allocation statistics, and normalize various stats: - Change the arenas.nlruns type from size_t to unsigned. - Add the arenas.nhchunks and arenas.hchunks.<i>.size mallctl's. - Replace the stats.arenas.<i>.bins.<j>.allocated mallctl with stats.arenas.<i>.bins.<j>.curregs . - Add the stats.arenas.<i>.hchunks.<j>.nmalloc, stats.arenas.<i>.hchunks.<j>.ndalloc, stats.arenas.<i>.hchunks.<j>.nrequests, and stats.arenas.<i>.hchunks.<j>.curhchunks mallctl's.	2014-10-12 23:02:10 -07:00
Jason Evans	44c97b712e	Fix a prof_tctx_t/prof_tdata_t cleanup race. Fix a prof_tctx_t/prof_tdata_t cleanup race by storing a copy of thr_uid in prof_tctx_t, so that the associated tdata need not be present during tctx teardown.	2014-10-12 13:03:20 -07:00
Jason Evans	381c23dd9d	Remove arena_dalloc_bin_run() clean page preservation. Remove code in arena_dalloc_bin_run() that preserved the "clean" state of trailing clean pages by splitting them into a separate run during deallocation. This was a useful mechanism for reducing dirty page churn when bin runs comprised many pages, but bin runs are now quite small. Remove the nextind field from arena_run_t now that it is no longer needed, and change arena_run_t's bin field (arena_bin_t *) to binind (index_t). These two changes remove 8 bytes of chunk header overhead per page, which saves 1/512 of all arena chunk memory.	2014-10-10 23:01:03 -07:00
Jason Evans	81e547566e	Add --with-lg-tiny-min, generalize --with-lg-quantum.	2014-10-10 22:35:07 -07:00
Jason Evans	9b75677e53	Don't fetch tsd in a0{d,}alloc(). Don't fetch tsd in a0{d,}alloc(), because doing so can cause infinite recursion on systems that require an allocated tsd wrapper.	2014-10-10 18:19:20 -07:00
Jason Evans	fc0b3b7383	Add configure options. Add: --with-lg-page --with-lg-page-sizes --with-lg-size-class-group --with-lg-quantum Get rid of STATIC_PAGE_SHIFT, in favor of directly setting LG_PAGE. Fix various edge conditions exposed by the configure options.	2014-10-09 22:44:37 -07:00
Jason Evans	57efa7bb0e	Avoid atexit(3) when possible, disable prof_final by default. atexit(3) can deadlock internally during its own initialization if jemalloc calls atexit() during jemalloc initialization. Mitigate the impact by restructuring prof initialization to avoid calling atexit() unless the registered function will actually dump a final heap profile. Additionally, disable prof_final by default so that this land mine is opt-in rather than opt-out. This resolves #144.	2014-10-08 18:08:00 -07:00
Jason Evans	3a8b9b1fd9	Fix a recursive lock acquisition regression. Fix a recursive lock acquisition regression, which was introduced by 8bb3198f72fc7587dc93527f9f19fb5be52fa553 (Refactor/fix arenas manipulation.).	2014-10-08 00:54:16 -07:00
Daniel Micay	f22214a29d	Use regular arena allocation for huge tree nodes. This avoids grabbing the base mutex, as a step towards fine-grained locking for huge allocations. The thread cache also provides a tiny (~3%) improvement for serial huge allocations.	2014-10-07 23:57:09 -07:00
Jason Evans	8bb3198f72	Refactor/fix arenas manipulation. Abstract arenas access to use arena_get() (or a0get() where appropriate) rather than directly reading e.g. arenas[ind]. Prior to the addition of the arenas.extend mallctl, the worst possible outcome of directly accessing arenas was a stale read, but arenas.extend may allocate and assign a new array to arenas. Add a tsd-based arenas_cache, which amortizes arenas reads. This introduces some subtle bootstrapping issues, with tsd_boot() now being split into tsd_boot[01]() to support tsd wrapper allocation bootstrapping, as well as an arenas_cache_bypass tsd variable which dynamically terminates allocation of arenas_cache itself. Promote a0malloc(), a0calloc(), and a0free() to be generally useful for internal allocation, and use them in several places (more may be appropriate). Abstract arena->nthreads management and fix a missing decrement during thread destruction (recent tsd refactoring left arenas_cleanup() unused). Change arena_choose() to propagate OOM, and handle OOM in all callers. This is important for providing consistent allocation behavior when the MALLOCX_ARENA() flag is being used. Prior to this fix, it was possible for an OOM to result in allocation silently allocating from a different arena than the one specified.	2014-10-07 23:14:57 -07:00
Jason Evans	bf40641c5c	Fix a prof_tctx_t destruction race.	2014-10-06 16:35:11 -07:00
Jason Evans	155bfa7da1	Normalize size classes. Normalize size classes to use the same number of size classes per size doubling (currently hard coded to 4), across the entire range of size classes. Small size classes already used this spacing, but in order to support this change, additional small size classes now fill [4 KiB .. 16 KiB). Large size classes range from [16 KiB .. 4 MiB). Huge size classes now support non-multiples of the chunk size in order to fill (4 MiB .. 16 MiB).	2014-10-06 01:45:13 -07:00
Daniel Micay	a95018ee81	Attempt to expand huge allocations in-place. This adds support for expanding huge allocations in-place by requesting memory at a specific address from the chunk allocator. It's currently only implemented for the chunk recycling path, although in theory it could also be done by optimistically allocating new chunks. On Linux, it could attempt an in-place mremap. However, that won't work in practice since the heap is grown downwards and memory is not unmapped (in a normal build, at least). Repeated vector reallocation micro-benchmark: #include <string.h> #include <stdlib.h> int main(void) { for (size_t i = 0; i < 100; i++) { void *ptr = NULL; size_t old_size = 0; for (size_t size = 4; size < (1 << 30); size *= 2) { ptr = realloc(ptr, size); if (!ptr) return 1; memset(ptr + old_size, 0xff, size - old_size); old_size = size; } free(ptr); } } The glibc allocator fails to do any in-place reallocations on this benchmark once it passes the M_MMAP_THRESHOLD (default 128k) but it elides the cost of copies via mremap, which is currently not something that jemalloc can use. With this improvement, jemalloc still fails to do any in-place huge reallocations for the first outer loop, but then succeeds 100% of the time for the remaining 99 iterations. The time spent doing allocations and copies drops down to under 5%, with nearly all of it spent doing purging + faulting (when huge pages are disabled) and the array memset. An improved mremap API (MREMAP_RETAIN - #138) would be far more general but this is a portable optimization and would still be useful on Linux for xallocx. Numbers with transparent huge pages enabled: glibc (copies elided via MREMAP_MAYMOVE): 8.471s jemalloc: 17.816s jemalloc + no-op madvise: 13.236s jemalloc + this commit: 6.787s jemalloc + this commit + no-op madvise: 6.144s Numbers with transparent huge pages disabled: glibc (copies elided via MREMAP_MAYMOVE): 15.403s jemalloc: 39.456s jemalloc + no-op madvise: 12.768s jemalloc + this commit: 15.534s jemalloc + this commit + no-op madvise: 6.354s Closes #137	2014-10-05 14:47:01 -07:00
Jason Evans	f11a6776c7	Fix OOM-related regression in arena_tcache_fill_small(). Fix an OOM-related regression in arena_tcache_fill_small() that caused cache corruption that would almost certainly expose the application to undefined behavior, usually in the form of an allocation request returning an already-allocated region, or somewhat less likely, a freed region that had already been returned to the arena, thus making it available to the arena for any purpose. This regression was introduced by 9c43c13a35220c10d97a886616899189daceb359 (Reverse tcache fill order.), and was present in all releases from 2.2.0 through 3.6.0. This resolves #98.	2014-10-05 13:05:10 -07:00
Jason Evans	f04a0bef99	Fix prof regressions. Fix prof regressions related to tdata (main per thread profiling data structure) destruction: - Deadlock. The fix for this was intended to be part of 20c31deaae38ed9aa4fe169ed65e0c45cd542955 (Test prof.reset mallctl and fix numerous discovered bugs.) but the fix was left incomplete. - Destruction race. Detaching tdata just prior to destruction without holding the tdatas lock made it possible for another thread to destroy the tdata out from under the thread that was on its way to doing so.	2014-10-04 15:03:49 -07:00
Jason Evans	0800afd03f	Silence a compiler warning.	2014-10-04 14:59:17 -07:00
Jason Evans	029d44cf8b	Fix tsd cleanup regressions. Fix tsd cleanup regressions that were introduced in 5460aa6f6676c7f253bfcb75c028dfd38cae8aaf (Convert all tsd variables to reside in a single tsd structure.). These regressions were twofold: 1) tsd_tryget() should never (and need never) return NULL. Rename it to tsd_fetch() and simplify all callers. 2) tsd_*_set() must only be called when tsd is in the nominal state, because cleanup happens during the nominal-->purgatory transition, and re-initialization must not happen while in the purgatory state. Add tsd_nominal() and use it as needed. Note that tsd_*{p,}_get() can still be used as long as no re-initialization that would require cleanup occurs. This means that e.g. the thread_allocated counter can be updated unconditionally.	2014-10-04 11:22:55 -07:00
Jason Evans	fc12c0b8bc	Implement/test/fix prof-related mallctl's. Implement/test/fix the opt.prof_thread_active_init, prof.thread_active_init, and thread.prof.active mallctl's. Test/fix the thread.prof.name mallctl. Refactor opt_prof_active to be read-only and move mutable state into the prof_active variable. Stop leaning on ctl-related locking for protection.	2014-10-03 23:25:30 -07:00
Jason Evans	551ebc4364	Convert to uniform style: cond == false --> !cond	2014-10-03 10:16:09 -07:00
Jason Evans	20c31deaae	Test prof.reset mallctl and fix numerous discovered bugs.	2014-10-02 23:01:10 -07:00
Daniel Micay	f8034540a1	Implement in-place huge allocation shrinking. Trivial example: #include <stdlib.h> int main(void) { void *ptr = malloc(1024 * 1024 * 8); if (!ptr) return 1; ptr = realloc(ptr, 1024 * 1024 * 4); if (!ptr) return 1; } Before: mmap(NULL, 8388608, PROT_READ\|PROT_WRITE, MAP_PRIVATE\|MAP_ANONYMOUS, -1, 0) = 0x7fcfff000000 mmap(NULL, 4194304, PROT_READ\|PROT_WRITE, MAP_PRIVATE\|MAP_ANONYMOUS, -1, 0) = 0x7fcffec00000 madvise(0x7fcfff000000, 8388608, MADV_DONTNEED) = 0 After: mmap(NULL, 8388608, PROT_READ\|PROT_WRITE, MAP_PRIVATE\|MAP_ANONYMOUS, -1, 0) = 0x7f1934800000 madvise(0x7f1934c00000, 4194304, MADV_DONTNEED) = 0 Closes #134	2014-10-01 16:55:03 -07:00
Dave Rigby	e3a16fce5e	Mark malloc_conf as a weak symbol This fixes issue #113 - je_malloc_conf is not respected on OS X	2014-09-29 15:05:55 -07:00
Jason Evans	0c5dd03e88	Move small run metadata into the arena chunk header. Move small run metadata into the arena chunk header, with multiple expected benefits: - Lower run fragmentation due to reduced run sizes; runs are more likely to completely drain when there are fewer total regions. - Improved cache behavior. Prior to this change, run headers were always page-aligned, which put extra pressure on some CPU cache sets. The degree to which this was a problem was hardware dependent, but it likely hurt some even for the most advanced modern hardware. - Buffer overruns/underruns are less likely to corrupt allocator metadata. - Size classes between 4 KiB and 16 KiB become reasonable to support without any special handling, and the runs are small enough that dirty unused pages aren't a significant concern.	2014-09-29 01:31:39 -07:00
Jason Evans	f97e5ac4ec	Implement compile-time bitmap size computation.	2014-09-28 14:43:11 -07:00
Jason Evans	6ef80d68f0	Fix profile dumping race. Fix a race that caused a non-critical assertion failure. To trigger the race, a thread had to be part way through initializing a new sample, such that it was discoverable by the dumping thread, but not yet linked into its gctx by the time a later dump phase would normally have reset its state to 'nominal'. Additionally, lock access to the state field during modification to transition to the dumping state. It's not apparent that this oversight could have caused an actual problem due to outer locking that protects the dumping machinery, but the added locking pedantically follows the stated locking protocol for the state field.	2014-09-24 22:23:43 -07:00
Jason Evans	5460aa6f66	Convert all tsd variables to reside in a single tsd structure.	2014-09-23 02:36:08 -07:00
Jason Evans	9d8f3d2033	Fix prof regressions. Don't use atomic_add_uint64(), because it isn't available on 32-bit platforms. Fix forking support functions to manage all prof-related mutexes. These regressions were introduced by 602c8e0971160e4b85b08b16cf8a2375aa24bc04 (Implement per thread heap profiling.), which did not make it into any releases prior to these fixes.	2014-09-11 18:09:14 -07:00
Jason Evans	c3e9e7b041	Fix irallocx_prof() sample logic. Fix irallocx_prof() sample logic to only update the threshold counter after it knows what size the allocation ended up being. This regression was caused by 6e73dc194ee9682d3eacaf725a989f04629718f7 (Fix a profile sampling race.), which did not make it into any releases prior to this fix.	2014-09-11 17:04:03 -07:00
Jason Evans	9c640bfdd4	Apply likely()/unlikely() to allocation/deallocation fast paths.	2014-09-11 17:01:58 -07:00
Jason Evans	91566fc079	Fix mallocx() to always honor MALLOCX_ARENA() when profiling.	2014-09-11 13:15:33 -07:00
Daniel Micay	23fdf8b359	mark some conditions as unlikely * assertion failure * malloc_init failure * malloc not already initialized (in malloc_init) * running in valgrind * thread cache disabled at runtime Clang and GCC already consider a comparison with NULL or -1 to be cold, so many branches (out-of-memory) are already correctly considered as cold and marking them is not important.	2014-09-10 21:49:42 -04:00
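
The 'util' arithmetic in bf8d6a1092 (0.75 utilization, 25% of small run memory fragmented, 33% overhead) can be sketched as below. The helper names are hypothetical and not part of jemalloc; they only restate the two ratios the commit message describes, assuming util is a fraction in (0, 1].

```c
#include <assert.h>

/*
 * Hypothetical helpers (not jemalloc API) relating the 'util' stats column
 * to external fragmentation, following the commit message's example.
 */

/* Share of small run memory not in use: the complement of utilization. */
static double
frag_fraction(double util)
{
	assert(util > 0.0 && util <= 1.0);
	return 1.0 - util;
}

/* Unused memory expressed relative to the memory that is in use. */
static double
frag_overhead(double util)
{
	assert(util > 0.0 && util <= 1.0);
	return (1.0 - util) / util;
}
```

For util = 0.75, frag_fraction() gives 0.25 and frag_overhead() gives 0.25 / 0.75, roughly 0.33, which matches the two figures quoted in the message.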

1421 Commits
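
The spacing described in 155bfa7da1 (four size classes per size doubling) can be illustrated with a small rounding routine. This is a sketch under stated assumptions, not jemalloc's implementation: size_class_round_up() and CLASSES_PER_DOUBLING are hypothetical names, and the sketch ignores the quantum and tiny size classes as well as overflow for very large requests.

```c
#include <assert.h>
#include <stddef.h>

/* The commit message says the count is "currently hard coded to 4". */
#define CLASSES_PER_DOUBLING 4

/*
 * Hypothetical helper (not jemalloc API): round size up to the next class
 * when each doubling (base, 2*base] is split into four equal steps.
 */
static size_t
size_class_round_up(size_t size)
{
	size_t base = 1, delta;

	/* Smaller sizes would give a zero step in this simplified model. */
	assert(size > CLASSES_PER_DOUBLING);

	/* Find the largest power of two strictly below size. */
	while (base * 2 < size)
		base *= 2;

	/* Step between adjacent classes within this doubling. */
	delta = base / CLASSES_PER_DOUBLING;

	/* Round up to a multiple of the step. */
	return (size + delta - 1) / delta * delta;
}
```

Under this model a 5000-byte request rounds up to 5120 bytes (5 KiB), and the classes between 4 KiB and 16 KiB fall at 5, 6, 7, 8, 10, 12, 14, and 16 KiB.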