Clean up configuration for backtracing when profiling is enabled, and
document the configuration logic in INSTALL.
Disable libgcc-based backtracing except on x64 (where it is known to
work).
Add the --disable-prof-gcc option.
Fix the automatic header dependency generation to handle the .pic.o
suffix. This regression was due to:
Build both PIC and no PIC static libraries
af5d6987f8
Convert all direct small_size2bin[...] accesses to SMALL_SIZE2BIN(...)
macro calls, and use a couple of cheap math operations to allow
compacting the table by 4X or 8X, on 32- and 64-bit systems,
respectively.
Compile with -fvisibility=hidden rather than -fvisibility=internal, in
order to avoid PLT lookups for internal functions. Also fix a
regression that caused the -fvisibility flag to be omitted, due to:
Port to Mac OS X.
2dbecf1f62
When a thread cache flushes objects to their arenas due to an abundance
of cached objects, it merges the allocation request count for the
associated size class, and increments a flush counter. If none of the
flushed objects came from the thread's assigned arena, then the merging
wouldn't happen (though the counter would typically eventually be
merged), nor would the flush counter be incremented (a hard bug). Fix
this via extra conditional code just after the flush loop.
When jemalloc is linked into an executable (as opposed to a shared
library), compiling with -fno-pic can have significant advantages,
mainly because we don't have to go throught the GOT (global offset
table).
Users who want to link jemalloc into a shared library that could
be dlopened need to link with libjemalloc_pic.a or libjemalloc.so.
For the non-TLS case (as on OS X), if the "thread.{de,}allocatedp"
mallctl was called before any allocation occurred for that thread, the
TSD was still NULL, thus putting the application at risk of
dereferencing NULL. Fix this by refactoring the initialization code,
and making it part of the conditional logic for all per thread
allocation counter accesses.
Fix huge_ralloc() to call huge_palloc() only if alignment requires it.
This bug caused under-sized allocation for aligned huge reallocation
(via rallocm()) if the requested alignment was less than the chunk size
(4 MiB by default).
Fix ALLOCM_LG_ALIGN to take a parameter and use it. Apparently, an
editing error left ALLOCM_LG_ALIGN with the same definition as
ALLOCM_LG_ALIGN_MASK.
Restructure the ctx initialization code such that the ctx isn't locked
across portions of the initialization code where allocation could occur.
Instead artificially inflate the cnt_merged.curobjs field, just as is
done elsewhere to avoid similar races to the one that would otherwise be
created by the reduction in locking scope.
This bug affected interval- and growth-triggered heap dumping, but not
manual heap dumping.
When setting a new arena association for the calling thread, also update
the tcache's cached arena pointer, primarily so that
tcache_alloc_small_hard() uses the intended arena.
Remove the constraint that small run headers fit in one page. This
constraint was necessary to avoid dirty page purging issues for unused
pages within runs for medium size classes (which no longer exist).
If mremap(2) is available and supports MREMAP_FIXED, use it for huge
realloc().
Initialize rtree later during bootstrapping, so that --enable-debug
--enable-dss works.
Fix a minor swap_avail stats bug.
Convert the man page source from roff to DocBook, and generate html and
roff output. Modify the build system such that the documentation can be
built as part of the release process, so that users need not have
DocBook tools installed.
Many mallctl*() end points require no locking, so push the locking down
to just the functions that need it. This is of particular import for
"thread.allocated" and "thread.deallocated", which are intended as a
low-overhead way to introspect per thread allocation activity.
Replace the single-character run-time flags with key/value pairs, which
can be set via the malloc_conf global, /etc/malloc.conf, and the
MALLOC_CONF environment variable.
Replace the JEMALLOC_PROF_PREFIX environment variable with the
"opt.prof_prefix" option.
Replace umax2s() with u2s().
Fix a regression due to the recent heap profiling accuracy improvements:
prof_{m,re}alloc() must set the object's profiling context regardless of
whether it is sampled.
Fix management of the CHUNK_MAP_CLASS chunk map bits, such that all
large object (re-)allocation paths correctly initialize the bits. Prior
to this fix, in-place realloc() cleared the bits, resulting in incorrect
reported object size from arena_salloc_demote(). After this fix the
non-demoted bit pattern is all zeros (instead of all ones), which makes
it easier to assure that the bits are properly set.
Inline the heap sampling code that is executed for every allocation
event (regardless of whether a sample is taken).
Combine all prof TLS data into a single data structure, in order to
reduce the TLS lookup volume.
Add the "thread.allocated" and "thread.deallocated" mallctls, which can
be used to query the total number of bytes ever allocated/deallocated by
the calling thread.
Add s2u() and sa2u(), which can be used to compute the usable size that
will result from an allocation request of a particular size/alignment.
Re-factor ipalloc() to use sa2u().
Enhance the heap profiler to trigger samples based on usable size,
rather than request size. This has a subtle, but important, impact on
the accuracy of heap sampling. For example, previous to this change,
16- and 17-byte objects were sampled at nearly the same rate, but
17-byte objects actually consume 32 bytes each. Therefore it was
possible for the sample to be somewhat skewed compared to actual memory
usage of the allocated objects.
Fix the newsize argument to arena_run_trim_tail() that
arena_dalloc_bin_run() passes. Previously, oldsize-newsize (i.e. the
complement) was passed, which could erroneously cause dirty pages to be
returned to the clean available runs tree. Prior to the
CHUNK_MAP_ZEROED --> CHUNK_MAP_UNZEROED conversion, this bug merely
caused dirty pages to be unaccounted for (and therefore never get
purged), but with CHUNK_MAP_UNZEROED, this could cause dirty pages to be
treated as zeroed (i.e. memory corruption).