path: root/sys/kern/vfs_cache.c
Commit message | Author | Age | Files | Lines
* vfs cache: Add vn_fullpath_jail(), factor out common code | Olivier Certner | 11 days | 1 | -8/+44
  Introduce vn_fullpath_jail(), which returns a path to the passed vnode
  relative to the current jail's root. It will be used by mac_do(4) in a
  subsequent commit.

  Factor out common code between the new variant and vn_fullpath().
  While here, rework the comments a bit.

  Add vn_fullpath_jail() to the vn_fullpath.9 manual page. While here,
  document all the existing public vn_fullpath*() functions.

  Reviewed by: kib (except latest manual page changes)
  MFC after: 3 days
  Event: EuroBSDCon 2025
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D52757
* vfs cache: update commentary, no code changes | Mateusz Guzik | 2025-09-13 | 1 | -53/+109
  sdt hotpatching was implemented, thus a remark about the usefulness of
  doing it was removed.

  Apart from that, a bunch of expanded/reworded explanations.

  Improvement in terms of the quality of the use of the English language
  was a non-goal and was most likely not achieved.
* vfs cache: drop SDT_PROBES_ENABLED usage | Mateusz Guzik | 2025-07-16 | 1 | -7/+3
  since sdt probes started being hot patched

  This eliminates a now spurious branch on fpl.status
* vfs_cache: Fix the SDT definition of vfs:fplookup:lookup:done | Mateusz Piotrowski | 2025-07-15 | 1 | -1/+2
  1. The definition lists struct nameidata as the type of the first
     argument. However, the actual probes always pass a variable of type
     struct nameidata* to SDT_PROBE3.
  2. The third argument (args[2]) is actually enum cache_fpl_status.

  Reviewed by: markj
  Approved by: markj (mentor)
  Fixes: 07d2145a1717 vfs: add the infrastructure for lockless lookup
  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D51315
* vfs: Initial revision of inotify | Mark Johnston | 2025-07-04 | 1 | -0/+59
  Add an implementation of inotify_init(), inotify_add_watch(),
  inotify_rm_watch(), source-compatible with Linux. This provides
  functionality similar to kevent(2)'s EVFILT_VNODE, i.e., it lets
  applications monitor filesystem files for accesses. Compared to
  inotify, however, EVFILT_VNODE has the limitation of requiring the
  application to open the file to be monitored. This means that activity
  on a newly created file cannot be monitored reliably, and that a file
  descriptor per file in the hierarchy is required.

  inotify on the other hand allows a directory and its entries to be
  monitored at once. It introduces a new file descriptor type to which
  "watches" can be attached; a watch is a pseudo-file descriptor
  associated with a file or directory and a set of events to watch for.
  When a watched vnode is accessed, a description of the event is queued
  to the inotify descriptor, readable with read(2). Events for files in a
  watched directory include the file name.

  A watched vnode has its usecount bumped, so name cache entries
  originating from a watched directory are not evicted. Name cache
  entries are used to populate inotify events for files with a link in a
  watched directory. In particular, if a file is accessed with, say,
  read(2), an IN_ACCESS event will be generated for any watched hard link
  of the file.

  The inotify_add_watch_at() variant is included so that this
  functionality is available in capability mode; plain
  inotify_add_watch() is disallowed in capability mode.

  When a file in a nullfs mount is watched, the watch is attached to the
  lower vnode, such that accesses via either layer generate inotify
  events.

  Many thanks to Gleb Popov for testing this patch and finding lots of
  bugs.

  PR: 258010, 215011
  Reviewed by: kib
  Tested by: arrowd
  MFC after: 3 months
  Sponsored by: Klara, Inc.
  Differential Revision: https://reviews.freebsd.org/D50315
* file: Add a fd flag with O_RESOLVE_BENEATH semantics | Mark Johnston | 2025-06-24 | 1 | -4/+10
  The O_RESOLVE_BENEATH openat(2) flag restricts name lookups such that
  they remain under the directory referenced by the dirfd. This commit
  introduces an implicit version of the flag, FD_RESOLVE_BENEATH, stored
  in the file descriptor entry. When the flag is set, any lookup relative
  to that fd automatically has O_RESOLVE_BENEATH semantics. Furthermore,
  the flag is sticky, meaning that it cannot be cleared, and it is copied
  by dup() and openat(). File descriptors with FD_RESOLVE_BENEATH set may
  not be passed to fchdir(2) or fchroot(2). Various fd lookup routines
  are modified to return fd flags to the caller.

  This flag will be used to address a case where jails with different
  root directories and the ability to pass SCM_RIGHTS messages across the
  jail boundary can transfer directory fds in such a way as to allow a
  filesystem escape.

  PR: 262180
  Reviewed by: kib
  MFC after: 3 weeks
  Differential Revision: https://reviews.freebsd.org/D50371
* namei: Fix cn_flags width in various places | Mark Johnston | 2025-05-27 | 1 | -1/+1
  This truncation is mostly harmless today, but fix it anyway to avoid
  pain later down the road.

  Reviewed by: olce, kib
  MFC after: 2 weeks
  Differential Revision: https://reviews.freebsd.org/D50417
* vfs cache: Add NAMEILOOKUP to the whitelist of fastpath lookup flags | Mark Johnston | 2025-05-27 | 1 | -1/+1
  Otherwise the lockless name lookup path is inadvertently disabled since
  NAMEILOOKUP isn't recognized.

  Reviewed by: olce, kib
  Fixes: 7587f6d4840f ("namei: Make stackable filesystems check harder for jail roots")
  Differential Revision: https://reviews.freebsd.org/D50532
* vfs_cache.c: Use CACHE_FPL_SUPPORTED_CN_FLAGS | Rick Macklem | 2025-05-26 | 1 | -5/+1
  Commit 2ec2ba7e232d added some code to cache_can_fplookup() which
  worked (ensuring an abort when OPENNAMED was set), but showed I didn't
  understand what CACHE_FPL_SUPPORTED_CN_FLAGS was used for.

  This patch cleans it up.

  Reviewed by: markj
  Differential Revision: https://reviews.freebsd.org/D50524
  Fixes: 2ec2ba7e232d ("vfs: Add VFS/syscall support for Solaris style extended attributes")
* namei: Remove a now-unused variable | Mark Johnston | 2025-05-23 | 1 | -3/+1
  Reported by: bapt
  Fixes: 7587f6d4840f ("namei: Make stackable filesystems check harder for jail roots")
* namei: Make stackable filesystems check harder for jail roots | Mark Johnston | 2025-05-23 | 1 | -10/+1
  Suppose a process has its cwd pointing to a nullfs directory, where the
  lower directory is also visible in the jail's filesystem namespace.
  Suppose that the lower directory vnode is moved out from under the
  nullfs mount. The nullfs vnode still shadows the lower vnode, and
  dotdot lookups relative to that directory will instantiate new nullfs
  vnodes outside of the nullfs mountpoint, effectively shadowing the
  lower filesystem.

  This phenomenon can be abused to escape a chroot, since the nullfs
  vnodes instantiated by these dotdot lookups defeat the root vnode check
  in vfs_lookup(), which uses vnode pointer equality to test for the
  process root.

  Fix this by extending nullfs and unionfs to perform the same check,
  exploiting the fact that the passed componentname is embedded in a
  nameidata structure to avoid changing the VOP_LOOKUP interface. That
  is, add a flag to indicate that containerof can be used to get the full
  nameidata structure, and perform the root vnode check on the lower
  vnode when performing a dotdot lookup.

  PR: 262180
  Reviewed by: olce, kib
  MFC after: 2 weeks
  Differential Revision: https://reviews.freebsd.org/D50418
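The containerof trick described in the fix can be illustrated with a small standalone C sketch. Everything here is a simplified stand-in: the `*_sketch` structures, the `sk_containerof` macro, the `SK_NAMEILOOKUP` value and `sk_dotdot_hits_root()` are hypothetical names chosen for the example, not the kernel's definitions. Only the idea comes from the message: a flag says the componentname is embedded in a nameidata, so the enclosing structure (and with it the process root) can be recovered from the pointer that the lookup routine already receives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct componentname_sketch {
    unsigned long cn_flags;
    const char *cn_nameptr;
};

struct nameidata_sketch {
    const void *ni_rootdir;                 /* process root, opaque here */
    struct componentname_sketch ni_cnd;     /* embedded component name */
};

/* Hypothetical flag value: "cnp is embedded in a nameidata". */
#define SK_NAMEILOOKUP 0x1UL

/* Recover the enclosing structure from a pointer to one of its members. */
#define sk_containerof(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Return true if a dotdot lookup starting at `lowervp` must stop because
 * `lowervp` is the process root.  Without the flag the enclosing nameidata
 * cannot be assumed to exist, so the check is skipped.
 */
bool
sk_dotdot_hits_root(struct componentname_sketch *cnp, const void *lowervp)
{
    struct nameidata_sketch *ndp;

    if ((cnp->cn_flags & SK_NAMEILOOKUP) == 0)
        return (false);
    ndp = sk_containerof(cnp, struct nameidata_sketch, ni_cnd);
    /* Pointer equality, but against the lower vnode this time. */
    return (ndp->ni_rootdir == lowervp);
}
```

The appeal of this design, as the message notes, is that the lookup interface keeps its existing argument list; the extra context travels implicitly through the structure layout and is guarded by the flag.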
* sysctl(9): Ease exporting struct sizes; Discourage doing that | Olivier Certner | 2025-05-07 | 1 | -2/+1
  Introduce two helpers, the more general SYSCTL_SIZEOF() and a
  struct-specific one SYSCTL_SIZEOF_STRUCT() which prepends 'struct' in
  the description and in the use of sizeof() but uses the raw structure
  name as the knob's name. The size of the object/structure is exported
  under 'debug.sizeof'.

  Existing knobs under 'debug.sizeof' were all converted to use the
  helpers.

  Add a note before the helpers discouraging the introduction of new
  leaves for ad-hoc reasons. List alternative means for developers to
  obtain the size of arbitrary kernel structures easily (thanks to markj@
  for providing these).

  No functional change (intended).

  Reviewed by: kib, markj
  MFC after: 3 days
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D50121
* VFS cache: Fix initial sizing for non-default 'ncsizefactor' | Olivier Certner | 2025-05-06 | 1 | -1/+1
  Reviewed by: markj
  MFC after: 3 days
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D50120
* vfs cache: Simplify cache_enter_time() a bit | Mark Johnston | 2025-05-03 | 1 | -21/+13
  The condition `flag == NFC_ISDOTDOT && vp != NULL && vp->v_type != VDIR`
  is never true at this point in the function. This is asserted slightly
  earlier. So, remove some dead code and simplify control flow.

  N.B. we set v_cache_dd for all vnode types, not just VDIR. This seems
  to be intentional, see commit ce575cd0e2f9069. For regular files it
  appears to effectively represent the most recently entered cache entry
  for the vnode.

  No functional change intended.

  Reviewed by: olce, kib
  MFC after: 2 weeks
  Sponsored by: Klara, Inc.
  Differential Revision: https://reviews.freebsd.org/D50107
* vfs cache: Move hash row lookup loops into a subroutine | Mark Johnston | 2025-05-03 | 1 | -65/+64
  No functional change intended.

  Reviewed by: olce, kib
  MFC after: 2 weeks
  Sponsored by: Klara, Inc.
  Differential Revision: https://reviews.freebsd.org/D50106
* vfs cache: Add a predicate for testing cache entries | Mark Johnston | 2025-05-03 | 1 | -20/+20
  No functional change intended.

  Reviewed by: olce, kib
  MFC after: 2 weeks
  Sponsored by: Klara, Inc.
  Differential Revision: https://reviews.freebsd.org/D50105
* vfs: Add VFS/syscall support for Solaris style extended attributes | Rick Macklem | 2025-04-02 | 1 | -1/+5
  Some systems, such as Solaris, represent extended attributes as a set
  of files in a directory associated with a file object. This allows
  extended attributes to be acquired/modified via regular file system
  operations, such as read(2), write(2), lseek(2) and ftruncate(2).

  Since ZFS already has the capability to do this, this patch allows
  system calls (and the NFSv4 client/server) such access to extended
  attributes. This permits handling of large extended attributes and
  allows the NFSv4 server to provide the service to NFSv4 clients that
  want it, such as Windows, MacOS and Solaris.

  The top level syscall change is a new open(2)/openat(2) flag I called
  O_NAMEDATTR that allows the named attribute directory or any attribute
  within that directory to be open'd.

  The patch defines two new v_irflag flags called VIRF_NAMEDDIR and
  VIRF_NAMEDATTR to indicate that the vnode is for this alternate name
  space and not a normal file object.

  The patch also defines flags (OPENNAMED and CREATENAMED) for
  VOP_LOOKUP() to pass this new case down into VOP_LOOKUP() and
  MNT_NAMEDATTR for file systems that support named attributes.

  Most of the code in this patch is to avoid creation of links, symlinks
  or non-regular file objects in the named attribute directory. It also
  must avoid using the name cache, since the named attribute directory is
  associated with the same name as the file object.

  Man pages updates will be done as separate commits.

  Reviewed by: kib
  Differential Revision: https://reviews.freebsd.org/D49583
* kern___realpathat(): honor uio_seg argument | Konstantin Belousov | 2024-11-25 | 1 | -1/+9
  Reviewed by: markj
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Differential revision: https://reviews.freebsd.org/D47739
* kern___realpathat(): do not copyout past end of string | Konstantin Belousov | 2024-11-25 | 1 | -1/+1
  Reported and tested by: pho
  Reviewed by: markj
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Differential revision: https://reviews.freebsd.org/D47739
* kern___realpathat(): style | Konstantin Belousov | 2024-11-25 | 1 | -3/+5
  Reviewed by: markj
  Sponsored by: The FreeBSD Foundation
  MFC after: 3 days
  Differential revision: https://reviews.freebsd.org/D47739
* vfs cache: add sysctl vfs.cache.param.hitpct | Mateusz Guzik | 2024-07-08 | 1 | -0/+20
  Sponsored by: Rubicon Communications, LLC ("Netgate")
* ktrace: Record namei violations with KTR_CAPFAIL | Jake Freeland | 2024-04-07 | 1 | -1/+1
  Report namei path lookups while Capsicum violation tracing with
  CAPFAIL_NAMEI. vfs caching is also ignored when tracing to mimic
  capability mode behavior.

  Reviewed by: markj
  Approved by: markj (mentor)
  MFC after: 1 month
  Differential Revision: https://reviews.freebsd.org/D40680
* file: Remove the fd parameter to fgetvp_lookup() and fgetvp_lookup_smr() | Mark Johnston | 2024-01-04 | 1 | -1/+1
  The fd is always obtained from nameidata, so just fetch it from there
  instead. No functional change intended.

  Reviewed by: kib
  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D43257
* sys: Remove ancient SCCS tags. | Warner Losh | 2023-11-27 | 1 | -2/+0
  Remove ancient SCCS tags from the tree, automated scripting, with two
  minor fixups to keep things compiling. All the common forms in the tree
  were removed with a perl script.

  Sponsored by: Netflix
* vfs cache: Fallback to namei to resolve symlinks with leading / in target for non-native ABI | Dmitry Chagin | 2023-10-19 | 1 | -0/+5
  This is a temporary solution to fix PR before release. During 15.0 it's
  necessary to refactor symlinks handling between vfs & namecache.

  PR: 273414
  Reported by: Vincent Milum Jr, Dan Kotowski, glebius
  Tested by: Dan Kotowski, glebius
  Reviewed by:
  Differential Revision: https://reviews.freebsd.org/D41806
  MFC after: 3 days
* vfs cache: add 2 more optimizaiton ideas | Mateusz Guzik | 2023-10-05 | 1 | -0/+14
* vfs cache: denote a known bug in cache_remove_cnp | Mateusz Guzik | 2023-10-05 | 1 | -0/+9
* vfs cache: plug a hypothetical corner case when freeing | Mateusz Guzik | 2023-10-05 | 1 | -7/+18
  cache_zap_unlocked_bucket is called with a bunch of addresses and
  without any locks held, forcing it to revalidate everything from
  scratch.

  It did not account for a case where the entry is reallocated with
  everything the same except for the target vnode. Should the target use
  a different lock than the one expected, freeing would proceed without
  being properly synchronized.

  Note this is almost impossible to happen in practice.
* vfs cache: sanitize debug counters | Mateusz Guzik | 2023-10-05 | 1 | -12/+9
  They are very rarely triggered, so no need for per-cpu distribution.

  At the same time the non-cpu ones still should use atomics to not lose
  any updates.
* vfs cache: describe various optimization ideas | Mateusz Guzik | 2023-10-03 | 1 | -2/+77
  While here report a sample result from running on Sapphire Rapids:

  An access(2) loop slapped into will-it-scale, like so:
          while (1) {
                  int error = access(tmpfile, R_OK);
                  assert(error == 0);

                  (*iterations)++;
          }

  .. operating on /usr/obj/usr/src/amd64.amd64/sys/GENERIC/vnode_if.c

  In operations per second:
  lockless: 3462164
  locked: 1362376

  While the over 3.4 mln may seem like a big number, a critical look
  shows it should be significantly higher.

  A poor man's profiler, counting how many times given routine was
  sampled:
  dtrace -w -n 'profile:::profile-4999 /execname == "a.out"/ { @[sym(arg0)] = count(); } tick-5s { system("clear"); trunc(@, 40); printa("%40a %@16d\n", @); clear(@); }'
  [snip]
  kernel`kern_accessat                  231
  kernel`cpu_fetch_syscall_args         324
  kernel`cache_fplookup_cross_mount     340
  kernel`namei                          346
  kernel`amd64_syscall                  352
  kernel`tmpfs_fplookup_vexec           388
  kernel`vput                           467
  kernel`vget_finish                    499
  kernel`lockmgr_unlock                 529
  kernel`lockmgr_slock                  558
  kernel`vget_prep_smr                  571
  kernel`vput_final                     578
  kernel`vdropl                        1070
  kernel`memcmp                        1174
  kernel`0xffffffff80                  2080
  0x0                                  2231
  kernel`copyinstr_smap                2492
  kernel`cache_fplookup                9246
* vfs cache: s/vfs.cache_fast_lookup/vfs.cache.param.fast_lookup | Mateusz Guzik | 2023-10-03 | 1 | -1/+1
* vfs cache: retire dothits and dotdothits counters | Mateusz Guzik | 2023-09-23 | 1 | -6/+0
  They demonstrate nothing, and in case of dotdot they are not even hits.
  This is just a count of lookups with "..", which are not worth
  mentioning.
* vfs cache: mark vfs.cache.param.size as read-only | Mateusz Guzik | 2023-09-22 | 1 | -1/+1
  It was not meant to be writable and writes don't work correctly as they
  fail to resize the hash.
* vfs cache: Drop known argument of internal cache_recalc_neg_min() | Olivier Certner | 2023-09-22 | 1 | -5/+5
  'ncnegminpct' is to be passed always, so just drop the unneeded
  parameter.

  Sponsored by: The FreeBSD Foundation
  Reviewed by: mjg
  Differential Revision: https://reviews.freebsd.org/D41763
* vfs cache: garbage collect the fullpathfail2 counter | Mateusz Guzik | 2023-09-14 | 1 | -9/+1
  The conditions it checks cannot legally be true (modulo races against
  forced unmount), so assert on it instead.
* vfs cache: fix a hang when bumping vnode limit too high | Mateusz Guzik | 2023-09-02 | 1 | -5/+5
  Overflow in cache_changesize would make the value flip to 0 and stay
  there as 0 << 1 does not do anything.

  Note callers limit the outcome to something below u_int.

  Also note the entire vnode handling thing, both in the vfs layer as a
  whole and in this file, can't decide whether to use long, u_long or
  u_int.
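The failure mode described above can be reproduced in isolation. The C sketch below is not the code from cache_changesize; both functions and their names are hypothetical and only model the pattern the message describes, a power-of-two round-up by repeated doubling. The naive version carries an iteration cap purely so that the demonstration terminates instead of hanging.

```c
#include <assert.h>
#include <limits.h>

/*
 * Naive round-up to a power of two using a 32-bit counter.  Once `n`
 * reaches 2^31, the next shift wraps it to 0, and 0 << 1 stays 0, so the
 * loop condition never becomes false.  Returns the number of doublings
 * performed, or -1 if the loop got stuck (detected via `max_iter`).
 */
int
sk_round_up_naive(unsigned int want, unsigned int *out, int max_iter)
{
    unsigned int n = 1;
    int i;

    for (i = 0; n < want; i++) {
        if (i == max_iter)
            return (-1);    /* a real loop would spin here forever */
        n <<= 1;
    }
    *out = n;
    return (i);
}

/* Same loop with a counter wide enough to hold 2^32: no wraparound. */
unsigned long long
sk_round_up_wide(unsigned int want)
{
    unsigned long long n = 1;

    while (n < want)
        n <<= 1;
    return (n);
}
```

Any request above 2^31 sends the naive loop into the stuck state, while the wide variant simply returns 2^32 for such inputs. How the actual commit avoids the wraparound is not stated in the message, so the wider type here is only one possible remedy.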
* sys: Remove $FreeBSD$: one-line .c pattern | Warner Losh | 2023-08-16 | 1 | -2/+0
  Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
* vfs: Deleting a doubled inclusion of sys/capsicum.h | Dmitry Chagin | 2023-07-29 | 1 | -2/+0
  Reviewed by:
  Differential Revision: https://reviews.freebsd.org/D41223
  MFC after: 1 week
* vfs: use __enum_uint8 for vtype and vstate | Mateusz Guzik | 2023-07-05 | 1 | -9/+2
  This whacks hackery around only reading v_type once.

  Bump __FreeBSD_version to 1400093
* vn_path_to_global_path_hardlink(): initialize len | Konstantin Belousov | 2023-07-04 | 1 | -0/+1
  before calling vn_fullpath_hardlink(). Otherwise we get random failures
  when the len is automatically clipped.

  Reported and tested by: pho
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
* vn_path_to_global_path_hardlink(): avoid freeing non-initialized pointer | Konstantin Belousov | 2023-07-04 | 1 | -1/+1
  Reported by: pho
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
* vfs cache: restore sorted order of CACHE_FPL_SUPPORTED_CN_FLAGS | Mateusz Guzik | 2023-05-30 | 1 | -2/+2
* namei: Add the abilty for the ABI to specify an alternate root path | Dmitry Chagin | 2023-05-29 | 1 | -2/+2
  For now a non-native ABI (i.e., Linux) uses the kern_alternate_path()
  facility to dynamically reroot lookups. First, an attempt is made to
  look up the file in /compat/linux/original-path. If that fails, the
  lookup is done in /original-path.

  That requires a bit of code in every ABI syscall implementation where
  path name translation is needed. Also, our kern_alternate_path() does
  not properly look up absolute symlinks in the second attempt, i.e., it
  does not append the /compat/linux part to the resolved link.

  The change is intended to avoid this by specifying the ABI root
  directory for namei(), using one call to pwd_altroot() during exec-time
  into the ABI. In that case namei() will dynamically reroot lookups as
  mentioned above.

  PR: 72920
  Reviewed by: kib
  Differential revision: https://reviews.freebsd.org/D38933
  MFC after: 2 month
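The two-step lookup order described above (alternate root first, then the original path) can be mimicked in userland as a rough analogy. The C sketch below is not kern_alternate_path() or namei(); `sk_alt_lookup` is a hypothetical helper that only tests for existence with access(2) and reports which candidate path matched.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: try the absolute `path` under `altroot` first and
 * fall back to `path` itself.  The candidate that exists is left in `out`.
 * Returns 0 if one of the two exists, -1 otherwise.
 */
int
sk_alt_lookup(const char *altroot, const char *path, char *out, size_t outlen)
{
    int n;

    /* First attempt: <altroot><path>, e.g. /compat/linux/original-path. */
    n = snprintf(out, outlen, "%s%s", altroot, path);
    if (n > 0 && (size_t)n < outlen && access(out, F_OK) == 0)
        return (0);

    /* Second attempt: the path as given. */
    n = snprintf(out, outlen, "%s", path);
    if (n > 0 && (size_t)n < outlen && access(out, F_OK) == 0)
        return (0);
    return (-1);
}
```

A helper like this has to be invoked wherever a path is translated, and it knows nothing about symlink targets encountered during resolution. That matches the two drawbacks the message gives as motivation for moving the rerooting into namei() itself.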
* vfs cache: fix vfs.cache.stats.* name typos | Igor Ostapenko | 2023-04-19 | 1 | -2/+2
  Two vfs.cache.stats names are fixed:
  - s/.dotdothis/.dotdothits/
  - s/.posszaps/.poszaps/

  Signed-off-by: Igor Ostapenko <pm@igoro.pro>

  [mjg: massaged the header a little bit]
* vfs: more informative panic for missing fplookup ops | Mateusz Guzik | 2023-04-07 | 1 | -2/+38
* vfs: validate that vop vectors provide all or none fplookup vops | Mateusz Guzik | 2023-04-06 | 1 | -0/+34
  In order to prevent later surprises.
* vfs cache: always assert on ndp->ni_resflags | Mateusz Guzik | 2023-03-25 | 1 | -1/+1
* vfs cache: return ENOTDIR for not_a_dir/{.,..} lookups | Mateusz Guzik | 2023-03-23 | 1 | -0/+11
  Reported by: Oliver Kiddle
  PR: 270419
  MFC: 3 days
* vfs cache: whack set-but-not-used warn in cache_purgevfs | Mateusz Guzik | 2023-02-21 | 1 | -1/+1
  Reported by: kib
  Sponsored by: Rubicon Communications, LLC ("Netgate")
* Allow realpath to work for file mounts | Doug Rabson | 2022-12-19 | 1 | -2/+26
  For file mounts, the directory vnode is not available from namei and
  this prevents the use of vn_fullpath_hardlink. In this case, we can use
  the vnode which was covered by the file mount with vn_fullpath.

  This also disallows file mounts over files with link counts greater
  than one to ensure a deterministic path to the mount point.

  Reviewed by: mjg, kib
  Tested by: pho