Diffstat (limited to 'website/content/en/status/report-2021-10-2021-12/membarrier-rseq.adoc')
| -rw-r--r-- | website/content/en/status/report-2021-10-2021-12/membarrier-rseq.adoc | 103 |
1 files changed, 103 insertions, 0 deletions
diff --git a/website/content/en/status/report-2021-10-2021-12/membarrier-rseq.adoc b/website/content/en/status/report-2021-10-2021-12/membarrier-rseq.adoc
new file mode 100644
index 0000000000..814598c6aa
--- /dev/null
+++ b/website/content/en/status/report-2021-10-2021-12/membarrier-rseq.adoc
@@ -0,0 +1,103 @@
+=== sched_getcpu(2), membarrier(2), and rseq(2) syscalls
+
+Contact: Konstantin Belousov <kib@FreeBSD.org>
+
+Links: +
+link:https://kib.kiev.ua/kib/membarrier.pdf[Linux manpage for membarrier(2)] URL: link:https://kib.kiev.ua/kib/membarrier.pdf[https://kib.kiev.ua/kib/membarrier.pdf] +
+link:https://reviews.freebsd.org/D32360[membarrier(2) implementation] URL: link:https://reviews.freebsd.org/D32360[https://reviews.freebsd.org/D32360] +
+link:https://kib.kiev.ua/kib/rseq.pdf[Linux manpage for rseq(2)] URL: link:https://kib.kiev.ua/kib/rseq.pdf[https://kib.kiev.ua/kib/rseq.pdf] +
+link:https://reviews.freebsd.org/D32505[rseq(2) and userspace bindings implementation] URL: link:https://reviews.freebsd.org/D32505[https://reviews.freebsd.org/D32505]
+
+Linux provides a set of syscalls that allow developing mostly
+syscall-less scalable algorithms in userspace. The mechanisms are
+based on optimistic execution using CPU-local data with the assumption that
+rare events like context switches or signal delivery do not occur
+for the given calculation, and if they do occur, rollback and restart
+is performed. This very high-level approach is used, as I understand,
+for the implementation of tools like URCU, fast malloc allocators
+(tcmalloc) and other userspace infrastructure projects aimed at
+large partitioned machines.
+
+For instance, the sched_getcpu(2) syscall returns the CPU id of the CPU
+where the current thread is executing. On amd64, if
+available, we use the RDTSCP or RDPID instruction to query the CPU id without
+changing CPU mode; otherwise this is a light-weight syscall.  Of
+course, the answer provided is obsolete the moment it is created,
+even before it is returned to userspace. But it allows seeding values
+in some structures that are valid for a long time (at the
+CPU speed scale) and are automatically corrected on exceptional
+control flow events like context switches, and userspace can either detect
+and roll back or sync and roll back with the exceptions.
+
+There are two cornerstone syscalls that allow userspace to implement
+these efficient algorithms: membarrier(2) and rseq(2).
+
+Membarrier is a facility that helps implement fast CPU ordering
+barriers, typically used for asymmetric/biased locking. In these lock
+implementation schemes, the owner of the object often assumes that there
+are no contenders/parallel threads that need coordinating with. If some
+thread starts accessing the same resource, then it is its duty to
+ensure correctness. Examples of 'traps' that the fast code path
+utilizes are reads from a dedicated page that is unmapped by contenders,
+to switch the fast path to the slow one. Or we could send a signal to all
+threads that potentially have access to that object, to insert a
+barrier. Or we can use the membarrier(2) facility, which incurs
+significantly less overhead than signalling all threads.
+
+Membarrier(2) inserts a barrier, which is the typical underlying
+hardware operation to ensure ordering, into the specified set of CPUs,
+if these CPUs are executing the targeted threads. If these CPUs are not executing
+the targeted threads, it is assumed that sequential consistency guarantees
+from the context switch are enough to fulfill the requirement of
+membarrier(2). Overall, the fast path can be implemented without slow
+instructions, and the slow path injects required fences into the fast path at
+the cost of an IPI.
+
+The facility to detect exceptional conditions in the userspace thread
+execution was developed in Linux and called rseq(2).  It is a feature
+often called Restartable Atomic Sequences, which explains the acronym.
+The ability to cheaply do that allows code longer than a single
+instruction to execute atomically, without the need to propose and
+implement unsafe operations like disabling preemption, which is not
+feasible for userspace. For instance, code might use CPU-local
+resources, which otherwise do not cope well with context switches.
+There cannot be an analog of critical_enter(9) in userspace. (A
+facility to cheaply block signal delivery exists in FreeBSD, see
+sigfastblock(2), but correctly using it is probably too hard to
+implement in general-purpose code, esp. because it requires
+version-dependent coordination with rtld and libthr.)
+
+rseq(2) takes a per-thread block of memory, where the kernel writes the
+current CPU id (see sched_getcpu(2)) and specifies the block of
+critical code that must be unwound if an exceptional situation like a
+context switch occurred while the block was executing. The fast code
+path uses per-cpu data and typically does not need any corrections,
+but should a context switch occur, transfer of control to the abort
+handler informs userspace about the event. So instead of disabling
+context switches, code can cheaply check for one after the calculation
+and retry if needed.
+
+An interesting rseq(2) implementation detail is that it is
+impossible (and not needed) to access/update rseq structures from
+the kernel during the actual context switch, because we cannot access
+userspace from under a spinlock. In other words,
+the use of rseq does not add any performance cost to
+system-global context switches.  Instead, if the process registered for
+rseq(2), on any return to user mode we check if any exceptional
+events happened while the thread was in the kernel (context switches may happen
+only while the thread is in kernel mode), and if a context switch indeed
+occurred, we fire an AST to check whether the program counter is inside the
+critical section and jump to the abort handler if it is.
+
+The implementations of membarrier(2) and rseq(2) are clean-room: I used
+Linux manual pages as the reference and public discussions of the
+features for clarifying corner cases. On Linux/glibc, there was no
+stable glibc interface to the rseq facility. One proposed integration was
+committed and then reverted from glibc. It might be prudent to wait
+some more for the rseq(2) interface to stabilize in glibc before providing
+it in our libc, or to rely on tight integration between kernel
+and userspace in our base system, and use ABI tricks like symbol
+versioning to evolve the interface. There is no goal to be 100%
+compatible with Linux anyway.
+
+Sponsor: The FreeBSD Foundation
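
The report's point that the sched_getcpu(2) answer is stale as soon as it is produced, yet still useful as a seed, can be illustrated with a small sharded counter. This is only a sketch under stated assumptions, not code from the reviews linked above: the names `NSHARDS`, `struct shard`, `counter_inc` and `counter_read` are hypothetical, the 64-byte padding is an assumed cache line size, and `sched_getcpu()` is assumed to be declared in `<sched.h>` (as it is in glibc). Because the thread may migrate right after the query, the sketch uses an atomic add for correctness and treats the CPU id purely as a locality hint. It does not use rseq(2), which is the facility that would let the atomic be dropped.

```c
#define _GNU_SOURCE          /* exposes sched_getcpu() in glibc's <sched.h> */
#include <sched.h>
#include <stdatomic.h>

#define NSHARDS 64           /* hypothetical shard count, not from the report */

/* One counter per shard, aligned so shards do not share a cache line
 * (64 bytes is an assumption about the machine). */
struct shard {
	_Alignas(64) atomic_ulong count;
};

static struct shard shards[NSHARDS];

/*
 * Increment the counter. The CPU id may already be stale when it is used,
 * so it only selects a shard; the atomic add keeps the update correct even
 * if the thread has migrated to another CPU in the meantime.
 */
static void
counter_inc(void)
{
	int cpu = sched_getcpu();

	if (cpu < 0)
		cpu = 0;     /* fall back to shard 0 if the query fails */
	atomic_fetch_add_explicit(&shards[cpu % NSHARDS].count, 1,
	    memory_order_relaxed);
}

/* Sum all shards. The result is exact once all writers have finished. */
static unsigned long
counter_read(void)
{
	unsigned long sum = 0;

	for (int i = 0; i < NSHARDS; i++)
		sum += atomic_load_explicit(&shards[i].count,
		    memory_order_relaxed);
	return (sum);
}
```

Calling `counter_inc()` 1000 times from a single thread and then `counter_read()` yields 1000, regardless of which CPUs the thread ran on. Most increments land on a shard that is local to the running CPU, which is the benefit the stale-but-usually-right CPU id buys.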
