1 files changed, 103 insertions, 0 deletions
diff --git a/website/content/en/status/report-2021-10-2021-12/membarrier-rseq.adoc b/website/content/en/status/report-2021-10-2021-12/membarrier-rseq.adoc
new file mode 100644
index 0000000000..814598c6aa
--- /dev/null
+++ b/website/content/en/status/report-2021-10-2021-12/membarrier-rseq.adoc
@@ -0,0 +1,103 @@
+=== sched_getcpu(2), membarrier(2), and rseq(2) syscalls
+
+Contact: Konstantin Belousov <kib@FreeBSD.org>
+
+Links: +
+link:https://kib.kiev.ua/kib/membarrier.pdf[Linux manpage for membarrier(2)] URL: link:https://kib.kiev.ua/kib/membarrier.pdf[https://kib.kiev.ua/kib/membarrier.pdf] +
+link:https://reviews.freebsd.org/D32360[membarrier(2) implementation] URL: link:https://reviews.freebsd.org/D32360[https://reviews.freebsd.org/D32360] +
+link:https://kib.kiev.ua/kib/rseq.pdf[Linux manpage for rseq(2)] URL: link:https://kib.kiev.ua/kib/rseq.pdf[https://kib.kiev.ua/kib/rseq.pdf] +
+link:https://reviews.freebsd.org/D32505[rseq(2) and userspace bindings implementation] URL: link:https://reviews.freebsd.org/D32505[https://reviews.freebsd.org/D32505]
+
+Linux provides a set of syscalls that allow to develop mostly
+syscall-less scalable algorithms in userspace.  The mechanisms are
+based on optimistic execution using CPU-local data with the assumption that
+rare events like context switches or signal delivery do not occur
+for the given calculation, and if they do occur, rollback and restart
+is performed.  This very high-level approach is used, as I understand,
+for implementation of tools like URCU, fast malloc allocators
+(tcmalloc) and other userspace infrastructure projects aimed at
+large partitioned machines.
+
+For instance, sched_getcpu(2) syscall returns the CPU id of the CPU
+where the current thread is currently executing.  On amd64, if
+available, we use a RDTSCP or RDPID instruction to query the CPU id without
+changing CPU mode, otherwise this is a light-weight syscall.  Of
+course, the answer provided is obsolete the moment it is created,
+even before it is returned to userspace.  But it allows seeding values
+in some structures that are valid for a long time (at the
+CPU speed scale) and are automatically corrected on exceptional
+control flow events like context switches, and userspace can either detect
+and rollback or sync and rollback with the exceptions.
+
+There are two cornerstone syscalls that allow userspace to implement
+these efficient algorithms: membarrier(2) and rseq(2).
+
+Membarrier is a facility that helps implementing fast CPU ordering
+barriers, typically used for asymmetric/biased locking.  In these lock
+implementation schemes, the owner of the object often assumes that there
+are contenders/parallel threads that need coordinating with.  If some
+thread starts accessing the same resource, then it is its duty to
+ensure correctness.  Examples of 'traps' that fast code path
+utilize are reads from a dedicated page that is unmapped by contenders,
+to switch the fast path to the slow one.  Or we could send a signal to all
+threads that potentially have access to that object, to insert a
+barrier.  Or we can use the membarrier(2) facility, which incurs
+significantly less overhead than signalling all threads.
+
+Membarrier(2) inserts a barrier, which is the typical underlying
+hardware operation to ensure ordering, into the specified set of CPUs,
+if these CPUs are executing the specified thread.  If these CPUs are not executing
+the targeted threads, it is assumed that sequential consistency guarantees
+from the context switch are enough to fulfill the requirement of
+membarrier(2).  Overall, the fast path can be implemented without slow
+instructions, and the slow path injects required fences into the fast path at
+the cost of IPI.
+
+The facility to detect exceptional conditions in the userspace thread
+execution was developed in Linux and called rseq(2).  It is a feature
+often called Restartable Atomic Sequences, which explains the acronym.
+The ability to cheaply do that allows code longer than a single
+instruction to execute atomically, without the need to propose and
+implement unsafe operations like disabling preemption, which is not
+feasible for userspace.  For instance, code might use CPU-local
+resources, which otherwise does not cope well with context switches.
+There cannot be an analog of critical_enter(9) in userspace.  (A
+facility to cheaply block signal delivery exists in FreeBSD, see
+sigfastblock(2), but correctly using it is provably too hard to
+implement in general-purpose code, esp. because it requires
+version-dependent coordination with rtdl and libthr.)
+
+rseq(2) takes per-thread block of memory, where the thread writes the
+current CPU id (see sched_getcpu(2)) and specifies the block of
+critical code that must be unwound if an exceptional situation like a
+context switch occurred while the block was executing.  The fast code
+path uses per-cpu data and typically does not need any corrections,
+but would a context switch occur, transfer of control to the abort
+handler informs userspace about the event.  So instead of disabling
+context switches, code can cheaply check for one after the calculation
+and retry if needed.
+
+An interesting rseq(2) implementation detail is that it is
+impossible (and not needed) to access/update rseq structures from
+kernel during the actual context switch, because we cannot access
+userspace from under a spinlock.  In other words,
+threads using rseq do not incur any performance cost from
+system-global context switches.  Instead, if the process registered for
+rseq(2), on any return to user mode we check if any exceptional
+events happened while the thread was in the kernel (context switches may happen
+only while the thread is in kernel mode), and if a context switch indeed
+occurred, we fire an ast to check whether the program counter is inside the
+critical section and jump to the abort handler if it is.
+
+The implementations of membarrier(2) and rseq(2) are clean-room: I used
+Linux manual pages as the reference and public discussions of the
+features for clarifying corner cases.  On Linux/glibc, there was no
+stable glibc interface to the rseq facility.  One proposed integration was
+committed then reverted from glibc.  It might be prudent to wait
+some more for the rseq(2) interface to stabilize in glibc before providing
+it in our libc or to rely on tight integration between kernel
+and userspace in our base system, and use ABI tricks like symbol
+versioning to evolve the interface.  There is no goal to be 100%
+compatible with Linux anyway.
+
+Sponsor: The FreeBSD Foundation