Introduce vn_fullpath_jail(), which returns a path to the passed vnode
relative to the current jail's root. It will be used by mac_do(4) in
a subsequent commit.
Factor out common code between the new variant and vn_fullpath(). While
here, rework the comments a bit.
Add vn_fullpath_jail() to the vn_fullpath.9 manual page. While here,
document all the existing public vn_fullpath*() functions.
Reviewed by: kib (except latest manual page changes)
MFC after: 3 days
Event: EuroBSDCon 2025
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D52757
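The userland sketch below illustrates what "relative to the current jail's root" means by mapping a host-absolute path to its jail-relative form. It is only a model. The kernel builds the path by walking from the vnode up towards the root, not by manipulating strings. The helper name path_relative_to_root() is hypothetical and is not part of the vn_fullpath(9) family.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userland model: given a host-absolute path and the host path
 * of a jail's root directory, return the jail-relative form of the path, or
 * NULL if the path is not under that root.  The result aliases `path`,
 * except that the root itself maps to the literal "/".
 */
const char *path_relative_to_root(const char *path, const char *root)
{
    size_t rlen = strlen(root);

    /* Treat "/jails/j1/" like "/jails/j1". */
    while (rlen > 1 && root[rlen - 1] == '/')
        rlen--;
    if (rlen == 1 && root[0] == '/')
        return path;            /* Unjailed: nothing to strip. */
    if (strncmp(path, root, rlen) != 0)
        return NULL;
    if (path[rlen] == '\0')
        return "/";             /* The vnode is the jail root itself. */
    if (path[rlen] != '/')
        return NULL;            /* "/jails/j10" is not under "/jails/j1". */
    return path + rlen;
}
```

For example, with a jail rooted at /jails/j1, the host path /jails/j1/usr/bin maps to /usr/bin. The path /jails/j10/etc is rejected because it lies outside the jail.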
sdt hotpatching was implemented, so a remark about the usefulness of doing
it was removed.
Apart from that, a bunch of explanations were expanded or reworded.
Improvement in terms of the quality of the use of the English language
was a non-goal and was most likely not achieved.
since sdt probes started being hot patched
This eliminates a now spurious branch on fpl.status
1. The definition lists struct nameidata as the type of the first
argument. However, the actual probes always pass a variable of type
struct nameidata* to SDT_PROBE3.
2. The third argument (args[2]) is actually enum cache_fpl_status.
Reviewed by: markj
Approved by: markj (mentor)
Fixes: 07d2145a1717 vfs: add the infrastructure for lockless lookup
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D51315
Add an implementation of inotify_init(), inotify_add_watch(),
inotify_rm_watch(), source-compatible with Linux. This provides
functionality similar to kevent(2)'s EVFILT_VNODE, i.e., it lets
applications monitor filesystem files for accesses. Compared to
inotify, however, EVFILT_VNODE has the limitation of requiring the
application to open the file to be monitored. This means that activity
on a newly created file cannot be monitored reliably, and that a file
descriptor per file in the hierarchy is required.
inotify, on the other hand, allows a directory and its entries to be
monitored at once. It introduces a new file descriptor type to which
"watches" can be attached; a watch is a pseudo-file descriptor
associated with a file or directory and a set of events to watch for.
When a watched vnode is accessed, a description of the event is queued
to the inotify descriptor, readable with read(2). Events for files in a
watched directory include the file name.
A watched vnode has its usecount bumped, so name cache entries
originating from a watched directory are not evicted. Name cache
entries are used to populate inotify events for files with a link in a
watched directory. In particular, if a file is accessed with, say,
read(2), an IN_ACCESS event will be generated for any watched hard link
of the file.
The inotify_add_watch_at() variant is included so that this
functionality is available in capability mode; plain inotify_add_watch()
is disallowed in capability mode.
When a file in a nullfs mount is watched, the watch is attached to the
lower vnode, such that accesses via either layer generate inotify
events.
Many thanks to Gleb Popov for testing this patch and finding lots of
bugs.
PR: 258010, 215011
Reviewed by: kib
Tested by: arrowd
MFC after: 3 months
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D50315
The O_RESOLVE_BENEATH openat(2) flag restricts name lookups such that
they remain under the directory referenced by the dirfd. This commit
introduces an implicit version of the flag, FD_RESOLVE_BENEATH, stored
in the file descriptor entry. When the flag is set, any lookup relative
to that fd automatically has O_RESOLVE_BENEATH semantics. Furthermore,
the flag is sticky, meaning that it cannot be cleared, and it is copied
by dup() and openat().
File descriptors with FD_RESOLVE_BENEATH set may not be passed to
fchdir(2) or fchroot(2). Various fd lookup routines are modified to
return fd flags to the caller.
This flag will be used to address a case where jails with different root
directories and the ability to pass SCM_RIGHTS messages across the jail
boundary can transfer directory fds in such a way as to allow a
filesystem escape.
PR: 262180
Reviewed by: kib
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D50371
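Here is a purely lexical model of which relative paths O_RESOLVE_BENEATH, and therefore FD_RESOLVE_BENEATH, is meant to refuse. It is a sketch under simplifying assumptions and is not the kernel's check. The real enforcement happens during name lookup and must also handle symbolic links, which this model ignores. Rejecting absolute paths outright is likewise an assumption made here for simplicity. The function name stays_beneath() is hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical lexical approximation: returns true if the relative path
 * `path`, resolved against some directory D, never climbs above D through
 * ".." components.  Absolute paths are rejected; symlinks are not considered.
 */
bool stays_beneath(const char *path)
{
    int depth = 0;              /* Levels below the starting directory. */
    const char *p = path;

    if (path[0] == '/')
        return false;
    while (*p != '\0') {
        size_t n = strcspn(p, "/");     /* Length of the current component. */

        if (n == 2 && p[0] == '.' && p[1] == '.') {
            if (--depth < 0)
                return false;   /* ".." would leave the starting directory. */
        } else if (n > 0 && !(n == 1 && p[0] == '.')) {
            depth++;            /* Ordinary name: one level deeper. */
        }
        p += n;
        if (*p == '/')
            p++;
    }
    return true;
}
```

Under this model, "a/b/../c" and "a/.." stay beneath the starting directory. By contrast, "../etc" and "a/../../etc" escape from it.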
This truncation is mostly harmless today, but fix it anyway to avoid
pain later down the road.
Reviewed by: olce, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D50417
Otherwise the lockless name lookup path is inadvertently disabled since
NAMEILOOKUP isn't recognized.
Reviewed by: olce, kib
Fixes: 7587f6d4840f ("namei: Make stackable filesystems check harder for jail roots")
Differential Revision: https://reviews.freebsd.org/D50532
Commit 2ec2ba7e232d added some code to cache_can_fplookup()
which worked (ensuring an abort when OPENNAMED was set),
but showed I didn't understand what
CACHE_FPL_SUPPORTED_CN_FLAGS was used for.
This patch cleans it up.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D50524
Fixes: 2ec2ba7e232d ("vfs: Add VFS/syscall support for Solaris style extended attributes")
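Both of the preceding fixes revolve around CACHE_FPL_SUPPORTED_CN_FLAGS. The minimal model below assumes that the lockless path bails out whenever cn_flags contains a bit outside that mask. Under that assumption, it shows why a missing NAMEILOOKUP disabled the fast path, and why leaving OPENNAMED out of the mask is enough on its own to force an abort. The flag values and the names FPL_SUPPORTED_FLAGS and can_try_lockless() are hypothetical. Only the flag names are borrowed from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values; the real cn_flags bits are defined by the kernel. */
#define LOCKLEAF    0x0004ULL
#define FOLLOW      0x0040ULL
#define NAMEILOOKUP 0x0800ULL
#define OPENNAMED   0x1000ULL

/* Flags the modelled lockless lookup knows how to handle. */
#define FPL_SUPPORTED_FLAGS (LOCKLEAF | FOLLOW | NAMEILOOKUP)

/*
 * Any bit outside the supported mask sends the lookup to the locked path,
 * so a routinely set flag missing from the mask silently disables the
 * fast path for every lookup that carries it.
 */
bool can_try_lockless(uint64_t cn_flags)
{
    return (cn_flags & ~FPL_SUPPORTED_FLAGS) == 0;
}
```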
Reported by: bapt
Fixes: 7587f6d4840f ("namei: Make stackable filesystems check harder for jail roots")
Suppose a process has its cwd pointing to a nullfs directory, where the
lower directory is also visible in the jail's filesystem namespace.
Suppose that the lower directory vnode is moved out from under the
nullfs mount. The nullfs vnode still shadows the lower vnode, and
dotdot lookups relative to that directory will instantiate new nullfs
vnodes outside of the nullfs mountpoint, effectively shadowing the lower
filesystem.
This phenomenon can be abused to escape a chroot, since the nullfs
vnodes instantiated by these dotdot lookups defeat the root vnode check
in vfs_lookup(), which uses vnode pointer equality to test for the
process root.
Fix this by extending nullfs and unionfs to perform the same check,
exploiting the fact that the passed componentname is embedded in a
nameidata structure to avoid changing the VOP_LOOKUP interface. That
is, add a flag to indicate that containerof can be used to get the full
nameidata structure, and perform the root vnode check on the lower vnode
when performing a dotdot lookup.
PR: 262180
Reviewed by: olce, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D50418
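The containerof trick can be modelled in userland. The stand-in structures below are hypothetical and heavily simplified. The NAMEILOOKUP name comes from a later fix in this log, but its value here is made up. The root comparison is also reduced to a single ni_rootdir pointer, whereas the real check may consider more than one root. What the sketch does show is the mechanism. A filesystem that receives only a componentname recovers the enclosing nameidata when the flag says this is safe, then compares the lower vnode against the lookup's root.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Recover a const pointer to the enclosing structure from a member pointer. */
#define CONTAINER_OF(ptr, type, member) \
    ((const type *)((const char *)(ptr) - offsetof(type, member)))

#define NAMEILOOKUP 0x1ULL      /* hypothetical value */

struct vnode { int v_id; };     /* stand-in for the real struct vnode */

struct componentname {
    uint64_t cn_flags;
    const char *cn_nameptr;
};

struct nameidata {
    struct vnode *ni_rootdir;   /* root directory the lookup must respect */
    struct componentname ni_cnd;
};

/*
 * Model of the extra check on a ".." lookup in a stacking filesystem:
 * only if the componentname is known to be embedded in a nameidata,
 * compare the *lower* vnode against the lookup's root directory.
 */
bool dotdot_hits_root(const struct componentname *cnp,
    const struct vnode *lowervp)
{
    const struct nameidata *ndp;

    if ((cnp->cn_flags & NAMEILOOKUP) == 0)
        return false;           /* Not embedded in a nameidata: cannot tell. */
    ndp = CONTAINER_OF(cnp, struct nameidata, ni_cnd);
    return ndp->ni_rootdir == lowervp;
}
```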
Introduce two helpers, the more general SYSCTL_SIZEOF() and
a struct-specific one SYSCTL_SIZEOF_STRUCT() which prepends 'struct' in
the description and in the use of sizeof() but uses the raw structure
name as the knob's name. The size of the object/structure is exported
under 'debug.sizeof'.
Existing knobs under 'debug.sizeof' were all converted to use the
helpers.
Add a note before the helpers discouraging the introduction of new
leaves for ad-hoc reasons. List alternative means for developers to
obtain the size of arbitrary kernel structures easily (thanks to markj@
for providing these).
No functional change (intended).
Reviewed by: kib, markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D50121
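The naming trick behind the two helpers can be reproduced with the C preprocessor. The userland model below is hypothetical. The real macros declare sysctl nodes under debug.sizeof, whereas here each knob is just a table entry, and the description format is an assumption. The model shows how the struct variant reuses the general one. It stringizes "struct name" for the description and the sizeof(), while keeping the bare name for the knob.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a sysctl leaf: name, description, value. */
struct sizeof_knob {
    const char *name;
    const char *descr;
    size_t      size;
};

/* General form: the caller supplies both the knob name and the type. */
#define SIZEOF_KNOB(name, type) \
    { #name, "sizeof(" #type ")", sizeof(type) }

/* Struct form: prepend "struct" to the type but keep the raw knob name. */
#define SIZEOF_STRUCT_KNOB(name) SIZEOF_KNOB(name, struct name)

struct vnode_model { void *v_data; int v_type; };   /* hypothetical */

const struct sizeof_knob knobs[] = {
    SIZEOF_KNOB(long, long),
    SIZEOF_STRUCT_KNOB(vnode_model),
};

/* Look a knob up by its bare name; returns 0 if there is no such knob. */
size_t knob_size(const char *name)
{
    for (size_t i = 0; i < sizeof(knobs) / sizeof(knobs[0]); i++)
        if (strcmp(knobs[i].name, name) == 0)
            return knobs[i].size;
    return 0;
}
```

On a FreeBSD system, the real values can be listed with `sysctl debug.sizeof`.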
Reviewed by: markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D50120
The condition `flag == NFC_ISDOTDOT && vp != NULL && vp->v_type != VDIR`
is never true at this point in the function. This is asserted slightly
earlier. So, remove some dead code and simplify control flow.
N.B. we set v_cache_dd for all vnode types, not just VDIR. This seems
to be intentional, see commit ce575cd0e2f9069. For regular files it
appears to effectively represent the most recently entered cache entry
for the vnode.
No functional change intended.
Reviewed by: olce, kib
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D50107
No functional change intended.
Reviewed by: olce, kib
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D50106
No functional change intended.
Reviewed by: olce, kib
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D50105
Some systems, such as Solaris, represent extended attributes as
a set of files in a directory associated with a file object. This
allows extended attributes to be acquired/modified via regular
file system operations, such as read(2), write(2), lseek(2) and
ftruncate(2).
Since ZFS already has the capability to do this, this patch allows
system calls (and the NFSv4 client/server) such access to extended
attributes.
This permits handling of large extended attributes and allows the NFSv4
server to provide the service to NFSv4 clients that want it, such as
Windows, MacOS and Solaris.
The top level syscall change is a new open(2)/openat(2) flag I called
O_NAMEDATTR that allows the named attribute directory or any attribute
within that directory to be opened.
The patch defines two new v_irflag flags called VIRF_NAMEDDIR and
VIRF_NAMEDATTR to indicate that the vnode is for this alternate name
space and not a normal file object.
The patch also defines flags (OPENNAMED and CREATENAMED) for VOP_LOOKUP()
to pass this new case down into VOP_LOOKUP() and MNT_NAMEDATTR for file
systems that support named attributes.
Most of the code in this patch is to avoid creation of links, symlinks
or non-regular file objects in the named attribute directory.
It also must avoid using the name cache, since the named attribute
directory is associated with the same name as the file object.
Man pages updates will be done as separate commits.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D49583
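Below is a minimal sketch of writing a named attribute from userland. It relies on the commit's description that O_NAMEDATTR opens either the named attribute directory or an attribute inside it. Passing the flag both on the open(2) of the file and on the openat(2) of the attribute is an assumption, as is the helper name write_named_attr(). On FreeBSD this also needs a file system that supports named attributes (MNT_NAMEDATTR), such as ZFS. Where O_NAMEDATTR is not defined at all, the sketch simply fails with ENOTSUP.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Create or replace the named attribute `attr` of file `path` with `len`
 * bytes from `data`.  Returns 0 on success, -1 with errno set on failure.
 */
int write_named_attr(const char *path, const char *attr,
    const void *data, size_t len)
{
#ifdef O_NAMEDATTR
    int adir, afd, saved;
    ssize_t n;

    /* Opening the file itself with O_NAMEDATTR yields its attribute dir. */
    if ((adir = open(path, O_RDONLY | O_NAMEDATTR)) == -1)
        return -1;
    /* Assumption: O_NAMEDATTR is also passed when opening the attribute. */
    afd = openat(adir, attr, O_WRONLY | O_CREAT | O_TRUNC | O_NAMEDATTR, 0600);
    saved = errno;
    close(adir);
    if (afd == -1) {
        errno = saved;
        return -1;
    }
    /* From here on the attribute is read and written like a regular file. */
    n = write(afd, data, len);
    close(afd);
    if (n != (ssize_t)len) {
        if (n >= 0)
            errno = EIO;        /* short write */
        return -1;
    }
    return 0;
#else
    (void)path; (void)attr; (void)data; (void)len;
    errno = ENOTSUP;            /* no named attribute support on this system */
    return -1;
#endif
}
```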
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D47739
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D47739
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D47739
Sponsored by: Rubicon Communications, LLC ("Netgate")
Report namei path lookups as CAPFAIL_NAMEI when tracing Capsicum
violations. VFS caching is also bypassed while tracing, to mimic
capability mode behavior.
Reviewed by: markj
Approved by: markj (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D40680
The fd is always obtained from nameidata, so just fetch it from there
instead. No functional change intended.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D43257
Remove ancient SCCS tags from the tree using automated scripting, with two
minor fixups to keep things compiling. All the common forms in the tree
were removed with a perl script.
Sponsored by: Netflix
for non-native ABI
This is a temporary solution to fix the PR before the release.
During the 15.0 cycle, symlink handling between VFS and the namecache
needs to be refactored.
PR: 273414
Reported by: Vincent Milum Jr, Dan Kotowski, glebius
Tested by: Dan Kotowski, glebius
Reviewed by:
Differential Revision: https://reviews.freebsd.org/D41806
MFC after: 3 days
cache_zap_unlocked_bucket is called with a bunch of addresses and
without any locks held, forcing it to revalidate everything from
scratch.
It did not account for a case where the entry is reallocated with
everything the same except for the target vnode.
Should the target use a different lock than the one expected, freeing
would proceed without being properly synchronized.
Note that this is almost impossible to trigger in practice.
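The revalidation rule can be sketched as a predicate evaluated once the locks are held. Everything below is a hypothetical simplification, with no real locking and made-up structure layouts. The point is only that every field observed without locks must be compared again, including the target vnode. A recycled entry may differ in nothing but vp, which is how the bug described above is modelled here.

```c
#include <stdbool.h>
#include <string.h>

struct vnode { int v_id; };     /* stand-in for the real struct vnode */

/* Hypothetical, simplified name cache entry. */
struct cache_entry {
    struct vnode *dvp;          /* directory the name lives in */
    struct vnode *vp;           /* target vnode; NULL for a negative entry */
    char name[32];
};

/* What the unlocked caller observed before taking any locks. */
struct cache_snapshot {
    const struct cache_entry *ncp;
    struct vnode *dvp;
    struct vnode *vp;
    const char *name;
};

/*
 * Called with the relevant locks held: the entry may be zapped only if it
 * still matches every observed field.  Skipping the vp comparison would let
 * a recycled entry with a different target (and possibly a different vnode
 * lock) through.
 */
bool snapshot_still_valid(const struct cache_snapshot *s)
{
    const struct cache_entry *ncp = s->ncp;

    return ncp->dvp == s->dvp &&
        ncp->vp == s->vp &&
        strcmp(ncp->name, s->name) == 0;
}
```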
They are very rarely triggered, so there is no need for per-CPU
distribution. At the same time, the non-per-CPU ones should still use
atomics so as not to lose any updates.
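For a rarely bumped counter, a single shared variable updated atomically is enough. The sketch below uses C11 <stdatomic.h> as a userland stand-in for kernel atomics, and the counter name is hypothetical. A plain `counter++` could lose increments when CPUs race, whereas an atomic add cannot. Relaxed ordering suffices because only the final count matters.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical rarely-bumped statistic shared by all CPUs. */
static _Atomic uint64_t cache_rare_events;

void count_rare_event(void)
{
    /* Atomic read-modify-write: concurrent increments are never lost. */
    atomic_fetch_add_explicit(&cache_rare_events, 1, memory_order_relaxed);
}

uint64_t read_rare_events(void)
{
    return atomic_load_explicit(&cache_rare_events, memory_order_relaxed);
}
```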
While here, report a sample result from running on Sapphire Rapids.
An access(2) loop slapped into will-it-scale, like so:

    while (1) {
        int error = access(tmpfile, R_OK);
        assert(error == 0);
        (*iterations)++;
    }

...operating on /usr/obj/usr/src/amd64.amd64/sys/GENERIC/vnode_if.c
In operations per second:

    lockless: 3462164
    locked:   1362376

While over 3.4 million may seem like a big number, a critical look shows
it should be significantly higher.
A poor man's profiler, counting how many times a given routine was sampled:

    dtrace -w -n 'profile:::profile-4999 /execname == "a.out"/ {
        @[sym(arg0)] = count(); } tick-5s { system("clear"); trunc(@, 40);
        printa("%40a %@16d\n", @); clear(@); }'

    [snip]
    kernel`kern_accessat                 231
    kernel`cpu_fetch_syscall_args        324
    kernel`cache_fplookup_cross_mount    340
    kernel`namei                         346
    kernel`amd64_syscall                 352
    kernel`tmpfs_fplookup_vexec          388
    kernel`vput                          467
    kernel`vget_finish                   499
    kernel`lockmgr_unlock                529
    kernel`lockmgr_slock                 558
    kernel`vget_prep_smr                 571
    kernel`vput_final                    578
    kernel`vdropl                       1070
    kernel`memcmp                       1174
    kernel`0xffffffff80                 2080
    0x0                                 2231
    kernel`copyinstr_smap               2492
    kernel`cache_fplookup               9246
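For readers without will-it-scale at hand, the loop above can be approximated by a standalone, single-process function. This is a sketch, not the harness. It does not spawn multiple workers, the function name is hypothetical, and its numbers are not comparable to the ones quoted above.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/*
 * Call access(path, R_OK) `n` times and return the achieved rate in
 * operations per second, or -1.0 if any call fails.
 */
double access_ops_per_sec(const char *path, uint64_t n)
{
    struct timespec start, end;
    double secs;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < n; i++) {
        if (access(path, R_OK) != 0)
            return -1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    secs = (double)(end.tv_sec - start.tv_sec) +
        (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    return secs > 0.0 ? (double)n / secs : 0.0;
}
```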
They demonstrate nothing, and in the case of dotdot they are not even
hits. This is just a count of lookups with "..", which is not worth
mentioning.
It was not meant to be writable and writes don't work correctly as they
fail to resize the hash.
'ncnegminpct' is to be passed always, so just drop the unneeded parameter.
Sponsored by: The FreeBSD Foundation
Reviewed by: mjg
Differential Revision: https://reviews.freebsd.org/D41763
The conditions it checks cannot legally be true (modulo races against
forced unmount), so assert on it instead.
Overflow in cache_changesize would make the value flip to 0 and stay
there, as 0 << 1 does not do anything.
Note that callers limit the outcome to something below u_int.
Also note that the entire vnode handling, both in the VFS layer as a whole
and in this file, can't decide whether to use long, u_long or u_int.
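The failure mode is easy to reproduce with a doubling helper on unsigned int, assumed here to be 32 bits wide. The commit does not show its code. grow_naive() and grow_checked() are hypothetical, and saturating at the current size is just one possible guard, not necessarily the committed fix.

```c
#include <limits.h>

/* Naive growth: once size reaches 2^31, the next shift wraps to 0 for good. */
unsigned int grow_naive(unsigned int size)
{
    return size << 1;
}

/* Guarded growth: refuse to double when doubling would overflow. */
unsigned int grow_checked(unsigned int size)
{
    if (size > UINT_MAX / 2)
        return size;
    return size << 1;
}
```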
Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
Reviewed by:
Differential Revision: https://reviews.freebsd.org/D41223
MFC after: 1 week
This whacks hackery around only reading v_type once.
Bump __FreeBSD_version to 1400093
before calling vn_fullpath_hardlink(). Otherwise we get random failures
when the len is automatically clipped.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Reported by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
For now, a non-native ABI (i.e., Linux) uses the kern_alternate_path()
facility to dynamically reroot lookups. First, an attempt is made to
look up the file in /compat/linux/original-path. If that fails, the
lookup is done in /original-path. That requires a bit of code in
every ABI syscall implementation where path name translation is needed.
Also, our kern_alternate_path() does not properly look up absolute symlinks
in the second attempt, i.e., it does not prepend the /compat/linux part to
the resolved link.
The change is intended to avoid this by specifying the ABI root directory
for namei(), using one call to pwd_altroot() at exec time into the ABI.
In that case namei() will dynamically reroot lookups as mentioned above.
PR: 72920
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D38933
MFC after: 2 months
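The two-step lookup that kern_alternate_path() performs can be modelled in userland as follows. The helper alt_lookup() is hypothetical and checks for existence with access(2). Treating an empty root or "/" as "no alternate root" is an assumption. Because access(2) resolves symlinks against the real root, an absolute symlink inside the ABI root escapes to the host tree. This is a limitation of the same kind as the one described above, which rerooting inside namei() is meant to avoid.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical model: for an absolute `path`, first try `altroot` + `path`
 * (e.g. /compat/linux/etc/passwd), falling back to `path` itself if that
 * does not exist.  Writes the chosen path to `out`; returns 0, or -1 if
 * `out` is too small.
 */
int alt_lookup(const char *altroot, const char *path, char *out, size_t outlen)
{
    int n;

    if (path[0] == '/' && altroot[0] != '\0' && strcmp(altroot, "/") != 0) {
        n = snprintf(out, outlen, "%s%s", altroot, path);
        if (n < 0 || (size_t)n >= outlen)
            return -1;
        if (access(out, F_OK) == 0)
            return 0;           /* Found under the ABI root. */
    }
    /* Second attempt: the original path, relative to the real root. */
    n = snprintf(out, outlen, "%s", path);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```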
Two vfs.cache.stats names are fixed:
- s/.dotdothis/.dotdothits/
- s/.posszaps/.poszaps/
Signed-off-by: Igor Ostapenko <pm@igoro.pro>
[mjg: massaged the header a little bit]
In order to prevent later surprises.
Reported by: Oliver Kiddle
PR: 270419
MFC: 3 days
Reported by: kib
Sponsored by: Rubicon Communications, LLC ("Netgate")
For file mounts, the directory vnode is not available from namei and this
prevents the use of vn_fullpath_hardlink. In this case, we can use the
vnode which was covered by the file mount with vn_fullpath.
This also disallows file mounts over files with link counts greater than
one to ensure a deterministic path to the mount point.
Reviewed by: mjg, kib
Tested by: pho