The previous free list implementation, which embedded singly linked
lists in available regions, had the unfortunate side effect of causing
many cache misses during thread cache fills. Fix this in two places:
- arena_run_t: Use a new bitmap implementation to track which regions
  are available. Furthermore, revert to preferring the
  lowest available region (as jemalloc did with its old
  bitmap-based approach).
- tcache_t: Move read-only tcache_bin_t metadata into
  tcache_bin_info_t, and add a contiguous array of pointers
  to tcache_t in order to track cached objects. This
  substantially increases the size of tcache_t, but results
  in much higher data locality for common tcache operations.
  As a side benefit, it is again possible to efficiently
  flush the least recently used cached objects, so this
  change switches flushing from MRU to LRU.
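
The tcache_t change described above can be sketched roughly as follows. This is a minimal illustration, not jemalloc's actual code: the type and function names (bin_info_t, bin_t, bin_push, bin_pop, bin_flush_lru) and the field layout are hypothetical. The point is that a contiguous pointer array keeps the oldest cached objects at the low indices, so an LRU flush only has to take a prefix of the array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; not jemalloc's real tcache structures. */
typedef struct {
    unsigned ncached_max;   /* read-only metadata: capacity of the bin */
} bin_info_t;

typedef struct {
    unsigned ncached;       /* number of objects currently cached */
    void **avail;           /* contiguous array; avail[0] is the oldest */
} bin_t;

/* Cache a freed object. Returns 0 on success, -1 if the bin is full. */
static int bin_push(bin_t *bin, const bin_info_t *info, void *ptr) {
    assert(bin != NULL && info != NULL);
    if (bin->ncached == info->ncached_max)
        return -1;
    bin->avail[bin->ncached++] = ptr;
    return 0;
}

/* Hand out the most recently cached object, or NULL if the bin is empty. */
static void *bin_pop(bin_t *bin) {
    if (bin->ncached == 0)
        return NULL;
    return bin->avail[--bin->ncached];
}

/*
 * Flush up to nflush of the least recently cached objects into out[],
 * then slide the remaining pointers down to the start of the array.
 * Returns the number of objects actually flushed.
 */
static unsigned bin_flush_lru(bin_t *bin, unsigned nflush, void **out) {
    if (nflush > bin->ncached)
        nflush = bin->ncached;
    memcpy(out, bin->avail, nflush * sizeof(void *));
    memmove(bin->avail, bin->avail + nflush,
            (bin->ncached - nflush) * sizeof(void *));
    bin->ncached -= nflush;
    return nflush;
}
```

With an embedded singly linked list, reaching the oldest objects would require walking through the cached regions themselves, which is where the cache misses came from. Here every operation touches only the pointer array.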
The new bitmap implementation uses a multi-level summary approach to
make finding the lowest available region very fast. In practice,
bitmaps only have one or two levels, though the implementation is
general enough to handle extremely large bitmaps, mainly so that large
page sizes can still be entertained.
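
To illustrate the multi-level summary idea, here is a minimal two-level sketch. It is an assumption-laden illustration rather than the bitmap code added by this change: the names (bitmap2_t, bitmap2_alloc_lowest, and so on) are hypothetical, the number of levels is fixed at two, and capacity is capped at 64 * 64 regions. A set bit means "available", and each summary bit records whether the corresponding leaf word has any set bit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BITMAP2_NWORDS 64   /* hypothetical cap: 64 * 64 = 4096 regions */

typedef struct {
    uint64_t summary;               /* bit i set iff leaf[i] != 0 */
    uint64_t leaf[BITMAP2_NWORDS];  /* bit set means region is available */
} bitmap2_t;

/* Portable "find lowest set bit"; w must be non-zero. */
static unsigned lowest_set_bit(uint64_t w) {
    unsigned i = 0;
    assert(w != 0);
    while ((w & 1) == 0) {
        w >>= 1;
        i++;
    }
    return i;
}

/* Mark region r as available again. */
static void bitmap2_free(bitmap2_t *b, unsigned r) {
    assert(r < BITMAP2_NWORDS * 64);
    b->leaf[r / 64] |= (uint64_t)1 << (r % 64);
    b->summary |= (uint64_t)1 << (r / 64);
}

/* Start with regions [0, nregions) all available. */
static void bitmap2_init(bitmap2_t *b, unsigned nregions) {
    unsigned r;
    memset(b, 0, sizeof(*b));
    for (r = 0; r < nregions; r++)
        bitmap2_free(b, r);
}

/* Take the lowest available region; returns -1 if none is left. */
static long bitmap2_alloc_lowest(bitmap2_t *b) {
    unsigned w, bit;
    if (b->summary == 0)
        return -1;
    w = lowest_set_bit(b->summary);     /* first non-empty leaf word */
    bit = lowest_set_bit(b->leaf[w]);   /* first available region in it */
    b->leaf[w] &= ~((uint64_t)1 << bit);
    if (b->leaf[w] == 0)
        b->summary &= ~((uint64_t)1 << w);
    return (long)w * 64 + (long)bit;
}
```

Finding the lowest available region costs one scan per level regardless of how many regions are in use. A production version would likely use a hardware find-first-set instruction instead of the loop in lowest_set_bit().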
Fix tcache_bin_flush_large() to always flush statistics, in the same way
that tcache_bin_flush_small() was recently fixed.
Use JEMALLOC_DEBUG rather than NDEBUG.
Add dassert(), and use it for debug-only asserts.
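
A debug-only assertion macro of this kind might look roughly like the sketch below. This is an assumed shape, not the definition added here: it forwards to the standard assert() for brevity (which still honors NDEBUG), and region_index() is a hypothetical caller used only to show the macro in use.

```c
#include <assert.h>

/*
 * Sketch only: JEMALLOC_DEBUG is assumed to be defined by the build
 * system for debug builds. Otherwise dassert() compiles away and its
 * argument is never evaluated.
 */
#ifdef JEMALLOC_DEBUG
#  define dassert(e) assert(e)
#else
#  define dassert(e) ((void)0)
#endif

/* Hypothetical helper: map a byte offset within a run to a region index. */
static unsigned region_index(unsigned offset, unsigned region_size) {
    dassert(region_size != 0);
    dassert(offset % region_size == 0);
    return offset / region_size;
}
```

Because the argument is not evaluated in non-debug builds, expressions passed to dassert() must not have side effects the program relies on.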
Fix the automatic header dependency generation to handle the .pic.o
suffix. This regression was due to:
  Build both PIC and no PIC static libraries
  af5d6987f8
When jemalloc is linked into an executable (as opposed to a shared
library), compiling with -fno-pic can have significant advantages,
mainly because we don't have to go through the GOT (global offset
table).
Users who want to link jemalloc into a shared library that could
be dlopened need to link with libjemalloc_pic.a or libjemalloc.so.
For the non-TLS case (as on OS X), if the "thread.{de,}allocatedp"
mallctl was called before any allocation occurred for that thread, the
TSD was still NULL, thus putting the application at risk of
dereferencing NULL. Fix this by refactoring the initialization code,
and making it part of the conditional logic for all per-thread
allocation counter accesses.
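
The shape of this fix can be sketched with POSIX thread-specific data. Everything here is hypothetical (thread_counters_t, counters_get(), and the accessor names are not jemalloc's), and the sketch uses calloc() for the per-thread block, which a real allocator could not do without care about recursion. What it shows is the key idea: lazy initialization lives inside the one function every counter access goes through, so no caller can observe a NULL TSD pointer.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-thread counter block. */
typedef struct {
    uint64_t allocated;
    uint64_t deallocated;
} thread_counters_t;

static pthread_key_t counters_key;
static pthread_once_t counters_once = PTHREAD_ONCE_INIT;

static void counters_key_create(void) {
    /* free() releases the block when the owning thread exits. */
    if (pthread_key_create(&counters_key, free) != 0)
        abort();
}

/* Single access path: creates the block on first use by this thread. */
static thread_counters_t *counters_get(void) {
    thread_counters_t *c;

    pthread_once(&counters_once, counters_key_create);
    c = pthread_getspecific(counters_key);
    if (c == NULL) {
        c = calloc(1, sizeof(*c));
        if (c == NULL || pthread_setspecific(counters_key, c) != 0)
            abort();
    }
    return c;
}

/* Safe even if the thread has not allocated anything yet. */
static uint64_t *thread_allocatedp(void) {
    return &counters_get()->allocated;
}

static void thread_allocated_add(size_t usize) {
    counters_get()->allocated += usize;
}
```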
If mremap(2) is available and supports MREMAP_FIXED, use it for huge
realloc().
Initialize rtree later during bootstrapping, so that --enable-debug
--enable-dss works.
Fix a minor swap_avail stats bug.
Convert the man page source from roff to DocBook, and generate html and
roff output. Modify the build system such that the documentation can be
built as part of the release process, so that users need not have
DocBook tools installed.
Add the "thread.allocated" and "thread.deallocated" mallctls, which can
be used to query the total number of bytes ever allocated/deallocated by
the calling thread.
Add s2u() and sa2u(), which can be used to compute the usable size that
will result from an allocation request of a particular size/alignment.
Refactor ipalloc() to use sa2u().
Enhance the heap profiler to trigger samples based on usable size,
rather than request size. This has a subtle, but important, impact on
the accuracy of heap sampling. For example, prior to this change,
16- and 17-byte objects were sampled at nearly the same rate, but
17-byte objects actually consume 32 bytes each. Therefore it was
possible for the sample to be somewhat skewed compared to actual memory
usage of the allocated objects.
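
The skew can be made concrete with a small sketch. The assumptions are loud ones: usable_size() stands in for s2u() and simply rounds up to a 16-byte quantum (real size classes are more involved), and the sampler is a deterministic byte-count threshold rather than whatever the profiler actually uses. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define QUANTUM 16  /* assumed size-class spacing for small objects */

/* Stand-in for s2u(): round a request up to the next quantum multiple. */
static size_t usable_size(size_t size) {
    return (size + QUANTUM - 1) & ~((size_t)QUANTUM - 1);
}

/* Hypothetical sampler: take one sample per `interval` bytes charged. */
typedef struct {
    size_t interval;
    size_t accum;
    unsigned long nsamples;
} sampler_t;

static void sampler_charge(sampler_t *s, size_t bytes) {
    assert(s->interval != 0);
    s->accum += bytes;
    while (s->accum >= s->interval) {
        s->accum -= s->interval;
        s->nsamples++;
    }
}
```

Under these assumptions, charging 1024 allocations of 17 bytes against a 1024-byte interval yields 17 samples when charged by request size but 32 when charged by usable size. The second figure matches the 32 KiB those objects actually occupy, which is the correction this change makes.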
Add test/jemalloc_test.h.in, which is processed to include
jemalloc/jemalloc@install_suffix@.h, so that test programs can include
it without worrying about the install suffix.
Add allocm(), rallocm(), sallocm(), and dallocm(), which are a
functional superset of malloc(), calloc(), posix_memalign(),
malloc_usable_size(), and free().
Fix divide-by-zero error in pprof. It is possible for sample contexts
to currently have no associated objects, but the cumulative statistics
are still useful, depending on how the user invokes pprof. Since
jemalloc intentionally does not filter such contexts, take care not to
divide by 0 when re-scaling for v2 heap sampling.
Install pprof as part of 'make install'.
Update pprof documentation.
Don't look for a shared libunwind if --with-static-libunwind is
specified.
Set SONAME when linking the shared libjemalloc.
Add DESTDIR support.
Add install_{include,lib,man} build targets.
Clean up compiler flag configuration.
Remove all functionality related to tracing. This functionality was
useful for understanding memory fragmentation during early algorithmic
design of jemalloc, but it had little utility for non-trivial
applications, due to the sheer volume of data written to disk.
Replace chunk stats code that was missing locking; this fixes a race
condition that could corrupt chunk statistics.
Convert malloc_stats_print() to use mallctl*().
Add a missing semicolon in the DSS code.
Convert malloc_tcache_flush() to a mallctl.
Convert malloc_swap_enable() to a set of mallctls.
Fix a stats bug in large object curruns accounting.
Replace tcache_bin_fill() with arena_tcache_fill(), and fix a bug in an OOM
error path.
Fix API name mangling to coexist with __attribute__((malloc)).
jemalloc is configured.
Modify arena_malloc() API to avoid unnecessary choose_arena() calls. Remove
unnecessary code from choose_arena().
Enable lazy-lock by default, now that choose_arena() is both faster and out of
the critical path.
Implement objdir support in the build system.
Implement minimal Makefile.
Make compile-time-optional jemalloc features controllable via configure
options (debug, stats, tiny, mag, balance, dss).
Conditionally exclude most of the opt_* run-time options, based on configure
options (fill, xmalloc, sysv).
Implement optional --enable-dynamic-page-shift.
Implement optional --enable-lazy-lock.
Re-order malloc_init_hard() and use the malloc_initializer variable to support
recursive allocation in malloc_ncpus().
Add mag_rack_tsd in order to receive notifications of thread termination.
Add jemalloc.h.