Commit Graph

1792 Commits

Author SHA1 Message Date
Qi Wang
ddb170b1d9 Simplify arena_migrate() to take arena_t* instead of indices.
This makes debugging slightly easier and avoids the confusion of "should we
create new arenas" here.
2022-01-11 16:59:22 -08:00
Qi Wang
d66162e032 Fix the extent state checking on the merge error path.
With DSS as primary, the default merge impl will (correctly) decline to merge
when one of the extent is non-dss.  The error path should tolerate the
not-merged extent being in a merging state.
2022-01-11 16:58:47 -08:00
Qi Wang
61978bbe69 Purge all if the last thread migrated away from an arena. 2022-01-06 19:02:26 -08:00
Yuriy Chernyshov
c91e62dd37 #include <features.h> as requested 2022-01-05 18:45:27 -08:00
Yuriy Chernyshov
18510020e7 Fix symbol conflict with musl libc
`__libc` prefixed functions are used by musl libc as non-replaceable malloc stubs.

Fix this conflict by checking if we are linking against glibc.
2022-01-05 18:45:27 -08:00
Qi Wang
f509703af5 Fix two conversion warnings in tcache. 2022-01-04 13:55:06 -08:00
Qi Wang
8b34a788b5 Fix an used-uninitialized warning (false positive). 2021-12-29 14:44:43 -08:00
Qi Wang
e491cef9ab Add stats for stashed bytes in tcache. 2021-12-29 14:44:43 -08:00
Qi Wang
b75822bc6e Implement use-after-free detection using junk and stash.
On deallocation, sampled pointers (specially aligned) get junked and stashed
into tcache (to prevent immediate reuse).  The expected behavior is to have
read-after-free corrupted and stopped by the junk-filling, while
write-after-free is checked when flushing the stashed pointers.
2021-12-29 14:44:43 -08:00
Qi Wang
06aac61c4b Split the core logic of tcache flush into a separate function.
The core function takes a ptr array as input (containing items to be flushed),
which will be reused to flush sanitizer-stashed items.
2021-12-29 14:44:43 -08:00
Qi Wang
d038160f3b Fix shadowed variable usage.
Verified with EXTRA_CFLAGS=-Wshadow.
2021-12-23 10:55:08 -08:00
Qi Wang
60b9637cc0 Only invoke malloc_cpu_count_is_deterministic() when necessary.
Also refactor the handling of the non-deterministic case.  Notably allow the
case with narenas set to proceed w/o warnings, to not affect existing valid use
cases.
2021-12-22 13:52:12 -08:00
Qi Wang
837b37c4ce Fix the time-since computation in HPA.
nstime module guarantees monotonic clock update within a single nstime_t.  This
means, if two separate nstime_t variables are read and updated separately,
nstime_subtract between them may result in underflow.  Fixed by switching to the
time since utility provided by nstime.
2021-12-21 23:37:22 -08:00
Qi Wang
310af725b0 Add nstime_ns_since which obtains the duration since the input time. 2021-12-21 23:37:22 -08:00
Azat Khuzhin
cafe9a3158 Disable percpu arena in case of non deterministic CPU count
Determinitic number of CPUs is important for percpu arena to work
correctly, since it uses cpu index - sched_getcpu(), and if it will
greater then number of CPUs bad thing will happen, or assertion will be
failed in debug build:

    <jemalloc>: ../contrib/jemalloc/src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"
    Aborted (core dumped)

Number of CPUs can be obtained from the following places:
- sched_getaffinity()
- sysconf(_SC_NPROCESSORS_ONLN)
- sysconf(_SC_NPROCESSORS_CONF)

For the sched_getaffinity() you may simply use taskset(1) to run program
on a different cpu, and in case it will be not first, percpu will work
incorrectly, i.e.:

    $ taskset --cpu-list $(( $(getconf _NPROCESSORS_ONLN)-1 )) <your_program>

_SC_NPROCESSORS_ONLN uses /sys/devices/system/cpu/online, LXD/LXC
virtualize /sys/devices/system/cpu/online file [1], and so when you run
container with limited limits.cpus it will bind randomly selected CPU to
it

  [1]: https://github.com/lxc/lxcfs/issues/301

_SC_NPROCESSORS_CONF uses /sys/devices/system/cpu/cpu*, and AFAIK nobody
playing with dentries there.

So if all three of these are equal, percpu arenas should work correctly.

And a small note regardless _SC_NPROCESSORS_ONLN/_SC_NPROCESSORS_CONF,
musl uses sched_getaffinity() for both. So this will also increase the
entropy.

Also note, that you can check is percpu arena really applied using
abort_conf:true.

Refs: https://github.com/jemalloc/jemalloc/pull/1939
Refs: https://github.com/ClickHouse/ClickHouse/issues/32806

v2: move malloc_cpu_count_is_deterministic() into
    malloc_init_hard_recursible() since _SC_NPROCESSORS_CONF does
    allocations for readdir()
v3:
- mark cpu_count_is_deterministic static
- check only if percpu arena is enabled
- check narenas
2021-12-21 11:53:09 -08:00
mweisgut
bb5052ce90 Fix base_ehooks_get_for_metadata 2021-12-20 15:37:53 -08:00
Alex Lapenkou
d90655390f San: Create a function for committing and zeroing
Committing and zeroing an extent is usually done together, hence a new
function.
2021-12-15 10:39:17 -08:00
Alex Lapenkou
800ce49c19 San: Bump alloc frequently reused guarded allocations
To utilize a separate retained area for guarded extents, use bump alloc
to allocate those extents.
2021-12-15 10:39:17 -08:00
Alex Lapenkou
f56f5b9930 Pass 'frequent_reuse' hint to PAI
Currently used only for guarding purposes, the hint is used to determine
if the allocation is supposed to be frequently reused. For example, it
might urge the allocator to ensure the allocation is cached.
2021-12-15 10:39:17 -08:00
Alex Lapenkou
0f6da1257d San: Implement bump alloc
The new allocator will be used to allocate guarded extents used as slabs
for guarded small allocations.
2021-12-15 10:39:17 -08:00
Alex Lapenkou
62f9c54d2a San: Rename 'guard' to 'san'
This prepares the foundation for more sanitizer-related work in the
future.
2021-12-15 10:39:17 -08:00
Qi Wang
7dcf77809c Mark slab as true on sized dealloc fast path.
For sized dealloc, fastpath only handles lookup-able sizes, which must be slabs.
2021-12-06 14:28:34 -08:00
Qi Wang
af6ee27c0d Enforce abort_conf:true when malloc_conf is not fully recognized.
Ensures the malloc_conf "ends with key", "ends with comma" and "malform conf
string" cases abort under abort_conf:true.
2021-12-06 14:27:25 -08:00
Qi Wang
cdabe908d0 Track the initialized state of nstime_t on debug build.
Some nstime_t operations require and assume the input nstime is initialized
(e.g. nstime_update) -- uninitialized input may cause silent failures which is
difficult to reproduce / debug.  Add an explicit flag to track the state
(limited to debug build only).

Also fixed an use case in hpa (time of last_purge).
2021-11-17 15:49:27 -08:00
Qi Wang
400c59895a Fix uninitialized nstime reading / updating on the stack in hpa.
In order for nstime_update to handle non-monotonic clocks, it requires the input
nstime to be initialized -- when reading for the first time, zero init has to be
done.  Otherwise random stack value may be seen as clocks and returned.
2021-11-16 16:54:12 -08:00
Qi Wang
8b81d3f214 Fix the initialization of last_event in thread event init.
The event counters maintain a relationship with the current bytes: last_event <=
current < next_event.  When a reinit happens (e.g. reincarnated tsd), the last
event needs progressing because all events start fresh from the current bytes.
2021-11-16 10:28:00 -08:00
Qi Wang
6bdb4f5ab0 Check prof_active in addtion to opt_prof during batch_alloc(). 2021-11-12 09:20:18 -08:00
Qi Wang
37342a4d32 Add ctl interface for experimental_infallible_new. 2021-11-05 13:20:09 -07:00
Alex Lapenkou
6cb585b13a San: Unguard guarded slabs during arena destruction
When opt_retain is on, slab extents remain guarded in all states, even
retained. This works well if arena is never destroyed, because we
anticipate those slabs will be eventually reused. But if the arena is
destroyed, the slabs must be unguarded to prevent leaking guard pages.
2021-11-03 17:55:50 -07:00
Qi Wang
4d56aaeca5 Optimize away the tsd_fast() check on free fastpath.
To ensure that the free fastpath can tolerate uninitialized tsd, improved the
static initializer for rtree_ctx in tsd.
2021-10-28 10:05:59 -07:00
Alex Lapenkou
8daac7958f Redefine functions with test hooks only for tests
Android build has issues with these defines, this will allow the build to
succeed if it doesn't need to build the tests.
2021-10-15 15:25:36 -07:00
Alex Lapenkou
c9ebff0fd6 Initialize deferred_work_generated
As the code evolves, some code paths that have previously assigned
deferred_work_generated may cease being reached. This would leave the value
uninitialized. This change initializes the value for safety.
2021-10-07 11:50:38 -07:00
Stan Angelov
912324a1ac Add debug check outside of the loop in hpa_alloc_batch.
This optimizes the whole loop away for non-debug builds.
2021-10-01 14:40:43 -07:00
David CARLIER
cf9724531a Darwin malloc_size override support proposal.
Darwin has similar api than Linux/FreeBSD's malloc_usable_size.
2021-10-01 14:32:40 -07:00
Qi Wang
ab0f1604b4 Delay the atexit call to prof_log_start().
So that atexit() is only done when prof_log is used.
2021-09-29 13:35:50 -07:00
David Carlier
11b6db7448 CPU affinity on BSD platforms support. 2021-09-28 11:40:21 -07:00
Qi Wang
83f3294027 Small refactors around 7bb05e0. 2021-09-27 16:05:13 -07:00
Qi Wang
deb8e62a83 Implement guard pages.
Adding guarded extents, which are regular extents surrounded by guard pages
(mprotected).  To reduce syscalls, small guarded extents are cached as a
separate eset in ecache, and decay through the dirty / muzzy / retained pipeline
as usual.
2021-09-26 16:30:15 -07:00
Piotr Balcer
7bb05e04be add experimental.arenas_create_ext mallctl
This mallctl accepts an arena_config_t structure which
can be used to customize the behavior of the arena.
Right now it contains extent_hooks and a new option,
metadata_use_hooks, which controls whether the extent
hooks are also used for metadata allocation.

The medata_use_hooks option has two main use cases:

1. In heterogeneous memory systems, to avoid metadata
being placed on potentially slower memory.

2. Avoiding virtual memory from being leaked as a result
of metadata allocation failure originating in an extent hook.
2021-09-24 13:43:18 -07:00
Alex Lapenkou
a9031a0970 Allow setting a dump hook
If users want to be notified when a heap dump occurs, they can set this hook.
2021-09-22 15:04:01 -07:00
Alex Lapenkou
f7d46b8119 Allow setting custom backtrace hook
Existing backtrace implementations skip native stack frames from runtimes like
Python. The hook allows to augment the backtraces to attribute allocations to
native functions in heap profiles.
2021-09-22 15:04:01 -07:00
Qi Wang
523cfa55c5 Guard prof related mallctl with opt_prof.
The prof initialization is done only when opt_prof is true.  This change makes
sure the prof_* mallctls only have limited read access (i.e. no access to prof
internals) when opt_prof is false.

In addition, initialize the global prof mutexes even if opt_prof is false.  This
makes sure the mutex stats are set properly.
2021-09-20 10:42:16 -07:00
Alex Lapenkou
6e848a005e Remove opt_background_thread_hpa_interval_max_ms
Now that HPA can communicate the time until its deferred work should be done,
this option is not used anymore.
2021-09-17 16:56:41 -07:00
Alex Lapenkou
8229cc77c5 Wake up background threads on demand
This change allows every allocator conforming to PAI communicate that it
deferred some work for the future. Without it if a background thread goes into
indefinite sleep, there is no way to notify it about upcoming deferred work.
2021-09-17 16:56:41 -07:00
Alex Lapenkou
97da57c13a HPA: Add min_purge_interval_ms option
This rate limiting option is required to avoid purging too often.
2021-09-17 16:56:41 -07:00
Alex Lapenkou
b8b8027f19 Allow PAI to calculate time until deferred work
Previously the calculation of sleep time between wakeups was implemented within
background_thread. This resulted in some parts of decay and hpa specific
logic mixing with background thread implementation. In this change, background
thread delegates this calculation to arena and it, in turn, delegates it to PAI.
The next step is to implement the actual calculation of time until deferred work
in HPA.
2021-09-17 16:56:41 -07:00
Qi Wang
8b24cb8fdf Don't assume initialized arena in the default alloc hook.
Specifically, this change allows the default alloc hook to used during
arenas.create.  One use case is to invoke the default alloc hook in a customized
hook arena, i.e. the default hooks can be read out of a default arena, then
create customized ones based on these hooks.  Note that mixing the default with
customized hooks is not recommended, and should only be considered when the
customization is simple and straightforward.
2021-08-25 14:19:25 -07:00
Qi Wang
5884a076fb Rename prof.dump_prefix to prof.prefix
This better aligns with our naming convention.  The option has not been included
in any upstream release yet.
2021-08-12 23:04:29 -07:00
Alex Lapenkou
f58064b932 Verify that HPA is used before calling its functions
This change eliminates the possibility of PA calling functions of uninitialized
HPA.
2021-08-05 16:43:28 -07:00
David Goldblatt
27f71242b7 Mutex: Tweak internal spin count.
The recent pairing heap optimizations flattened the lock hold time profile.
This was a win for raw cycle counts, but ended up causing us to "just miss"
acquiring the mutex before sleeping more often.  Bump those counts.
2021-08-05 14:33:16 -07:00