author:    Jessica Clarke <jrtc27@FreeBSD.org> 2026-04-27 12:53:29 +0000
committer: Jessica Clarke <jrtc27@FreeBSD.org> 2026-04-27 12:53:29 +0000
commit:    551d47c5677a5eaf0a1ed2ea3b2b1406b192764d (patch)
tree:      6fb76692b1e3286ecd1dd553a0d2af59fb015ded
parent:    fdea83a5f63b881ade91c6cc9dfbf173137512c5 (diff)
arm64: Ditch arm64-specific unsound PCPU optimisation
The current arm64 PCPU implementation uses a global register asm variable to use x18, which we reserve with -ffixed-x18, from C. Inside a critical_enter() or sched_pin(), it is vital that any PCPU reads use the right PCPU pointer, as often the whole point of the critical_enter() or sched_pin() is to ensure consistent PCPU use (e.g. SMR relies on zpcpu giving the same SMR state). critical_enter() and sched_pin() both include atomic_interrupt_fence(), i.e. asm volatile("" ::: "memory"), barriers to ensure that memory accesses don't get moved by the compiler outside the critical section. On most architectures this will also order the read of the PCPU pointer itself (whether because the read is another asm volatile statement, or because it uses a segment-relative memory access as on x86). However, the arm64 approach is in no sense a memory access, and therefore the register access is not ordered with respect to the critical_enter() or sched_pin(), or more specifically the curthread->td_critnest++ / curthread->td_pinned++ within.

In practice upstream today this works out ok because the read of x18 is inlined into the actual PCPU_GET/ADD/SET memory accesses (i.e. you will get something like ldr xN, [x18, #imm-or-xM] for PCPU_GET, etc.), and since *that* instruction is ordered properly due to being a memory access, x18 ends up being read in the right place. However, that is not in any way guaranteed; it just relies on the hope that compiler optimisations will be perfect at inlining the use. Moreover, PCPU_PTR is definitely not a memory access in this world, it's just pointer arithmetic on x18, and so nothing orders it at all.
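As a simplified sketch of the pattern being removed (illustrative only, not the exact in-tree definitions from sys/arm64/include/pcpu.h):

    /* Sketch of the removed arm64 pattern (simplified). */
    register struct pcpu *pcpup __asm("x18");	/* reserved via -ffixed-x18 */

    #define	PCPU_PTR(member)	(&pcpup->pc_ ## member)
    #define	PCPU_GET(member)	(pcpup->pc_ ## member)

    /*
     * The compiler barrier in critical_enter() / sched_pin() only
     * orders *memory* accesses; reading the register variable pcpup is
     * not a memory access, so nothing stops the compiler hoisting it
     * past the td_critnest++ / td_pinned++ store.
     */
    #define	atomic_interrupt_fence()	__asm __volatile("" ::: "memory")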
This can be observed with the following test function compiled into the kernel:

    void
    pcpu_test(void)
    {
    	extern void __weak_symbol use_pcpu_ptr(void *);

    	critical_enter();
    	use_pcpu_ptr(PCPU_PTR(curthread));
    	critical_exit();
    }

Obviously, this is a bit contrived, as you could just read curthread directly via its atomic definition that bypasses any worries about PCPU atomicity, but it illustrates the point. With the in-tree LLVM*, this ends up being compiled for me to:

    paciasp
    stp	x29, x30, [sp, #-0x10]!
    mov	x29, sp
    ldr	x8, [x18]
    ldr	w9, [x8, #0x4fc]
    mov	x0, x18
    add	w9, w9, #0x1
    str	w9, [x8, #0x4fc]
    bl	use_pcpu_ptr
    ...

Note that, although the PCPU_PTR was within the critical section in the C source, the read of x18 into x0, the argument register passed to use_pcpu_ptr, has been hoisted to before the str, which stores the new, incremented value of td_critnest to curthread. There is thus a window within which we have to hope the thread is not preempted and migrated to a different CPU, otherwise it will pass a pointer to the wrong CPU's pc_curthread PCPU member.

Initially it would seem as though the solution would be to add an additional barrier to critical_enter() / sched_pin() to ensure the register reads could not be hoisted like this. However, I have not been able to find a sequence that works reliably across both GCC and Clang, independent of optimisation level. Using inline asm with x18 marked as a clobber, using "=r"(pcpup), and using "+r"(pcpup) all run into various issues; some combinations don't actually seem to be a barrier, and for Clang at -O0 some combinations will actually generate writes to x18**, at which point you then have to hope that the kernel is compiled with optimisations, and that the redundant writes are optimised away such that x18 is just passed through. But that just gets us back to hoping optimisation works, which isn't a solution to the problem; it just trades one point of fragility for another.
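The barrier variants tried can be sketched roughly as follows (the macro names are hypothetical, for illustration; none of these is being adopted, for the reasons above):

    /* 1: x18 as a clobber -- not reliably a barrier for reads of the
     *    register variable, and clobbering the PCPU register is dubious
     *    in itself. */
    #define	pcpu_fence_clobber()	__asm __volatile("" ::: "x18", "memory")

    /* 2: output constraint -- tells the compiler pcpup is rewritten
     *    here, but Clang at -O0 can materialise actual writes to x18. */
    #define	pcpu_fence_out()	__asm __volatile("" : "=r"(pcpup))

    /* 3: read-modify-write constraint -- similarly generates
     *    mov-to/from-x18 pairs at -O0 that must be optimised away to be
     *    harmless. */
    #define	pcpu_fence_inout()	__asm __volatile("" : "+r"(pcpup))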
In talking to GCC developers, who seemed rather horrified by the implications of trying to do this (which is effectively "register volatile", a combination that's explicitly forbidden), we could not find a solution, and so I have concluded that the only reliable way to have a sound PCPU implementation is to ditch this optimisation and follow other non-x86 architectures in using inline asm in one form or another. Specifically, this adopts riscv's approach of just calling get_pcpu(), which, curiously, was already implemented in inline asm here on arm64, rather than reading pcpup. Anyone who feels strongly enough about PCPU performance is welcome to try to find a working approach, but such proposals should be heavily scrutinised to be certain that they won't come back to bite us in future.

In particular, this caused a lot of problems downstream in CheriBSD's experimental compartmentalised kernel, which is trialling interposing on PCPU accesses in order to restrict access within compartments. As a result, even PCPU_GET/SET/ADD can look like PCPU_PTR, as they pass an opaque PCPU reference to wrapper functions, and so this case gets hit all over the kernel, giving highly confusing panics with locks that aren't owned by the current thread or SMR use allegedly not within an smr_enter().

The ia64 port encountered the same issue and reached the same conclusion in e31ece45b7a4 ("Fix the PCPU access macros."), though it went to the trouble of trying to fold the offset into the inline assembly (assuming it fit, with no fallback if not, since it used the add pseudo-op that will be expanded to either adds with a 14-bit immediate or, if somehow that doesn't fit, addl with a 22-bit immediate). Curiously, though, it left pcpup around as a footgun. sparc64 had similar code but was never fixed. It also defined a curpcb in the same manner, which was presumably similarly broken, but looks to have been entirely unreferenced from C, only referenced in actual assembly files.
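For reference, the get_pcpu()-based shape being moved to looks roughly like this (a simplified sketch; the exact definitions live in sys/arm64/include/pcpu.h):

    static inline struct pcpu *
    get_pcpu(void)
    {
    	struct pcpu *pcpu;

    	/*
    	 * An asm volatile read of x18: the compiler must emit it where
    	 * it appears and cannot reorder it past other asm volatile
    	 * statements, so the atomic_interrupt_fence() barriers in
    	 * critical_enter() / sched_pin() now order it correctly.
    	 */
    	__asm __volatile("mov	%0, x18" : "=&r"(pcpu));
    	return (pcpu);
    }

    #define	PCPU_PTR(member)	(&get_pcpu()->pc_ ## member)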
Alpha also had the same design, but it was removed whilst critical_enter() was extern rather than static inline, so uses of the pointer could not have been hoisted, and whilst sched_pin() didn't have any form of atomic_interrupt_fence() to even try to make PCPU well-ordered.

* At time of writing, that was LLVM 19; not verified at time of commit with LLVM 21.

** For "+r"(pcpup), Clang's initial code generation is to do:

    mov	xTmp1, x18
    mov	x18, xTmp1
    /* asm (empty) */
    mov	xTmp2, x18
    mov	x18, xTmp2

since its interpretation of what that means is "read the value of pcpup, and make sure that value is in x18 for the duration of the assembly due to the asm("x18") on pcpup", and similarly for the output side.

Reviewed by:	andrew, jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D56601