path: root/sys
Commit message | Author | Age | Files | Lines
* Fix dst/netmask handling in routing socket code.Alexander V. Chernikov2021-02-161-6/+195
  Traditionally, routing socket code did almost no checks on the input message beyond the most basic size checks. This resulted in an unclear KPI boundary for the routing system code (`rtrequest*` and now `rib_action()`) w.r.t. message validity.

  Multiple potential problems and nuances exist:
  * Host bits in the RTAX_DST sockaddr. Existing applications do send prefixes with host bits uncleared. Even `route(8)` does this, as they hope the kernel will do the job of fixing it. Code inside `rib_action()` needs to handle it on its own (see the `rt_maskedcopy()` ugly hack).
  * There are multiple ways of adding a host route: it can be DST without a netmask or DST with a /32 (/128) netmask. Also, RTF_HOST has to be set correspondingly. Currently, these 2 options create 2 DIFFERENT routes in the kernel.
  * No sockaddr length/content checking exists for the "secondary" fields: nothing stops an rtsock application from sending a sockaddr_in with a length of 25 (instead of 16). The kernel will accept it, install it into the RIB as is and propagate it to all rtsock consumers, potentially triggering bugs in their code. The same goes for sin_port, sin_zero, etc.

  The goal of this change is to make rtsock verify all sockaddr and prefix consistency. Said differently, `rib_action()` or its internals should NOT need to change any of the sockaddrs supplied by the `rt_addrinfo` structure due to incorrectness.

  To be more specific, this change implements the following:
  * A sockaddr cleanup/validation check is added immediately after getting the sockaddrs from the rtm.
  * Per-family dst/netmask checks clear host bits in dst and zero all dst/netmask "secondary" fields.
  * The same netmask checking code converts /32 (/128) netmasks to the "host" route case (NULL netmask, RTF_HOST), removing the dualism.
  * Instead of allowing ANY "known" sockaddr families (0<..<AF_MAX), allow only the actually supported ones (inet, inet6, link).
  * Automatically convert `sockaddr_sdl` (AF_LINK) gateways to `sockaddr_sdl_short`.

  Reported by: Guy Yur <guyyur at gmail.com>
  Reviewed By: donner
  Differential Revision: https://reviews.freebsd.org/D28668
  MFC after: 3 days
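As a rough illustration of the dst/netmask normalization described in the commit above, here is a minimal userland sketch for the IPv4 case. The `struct route_req` type and the `normalize_route()` helper are hypothetical and are not the kernel's actual data structures or functions; addresses are kept in host byte order for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a validated IPv4 route request. */
struct route_req {
	uint32_t dst;      /* destination address, host byte order */
	uint32_t mask;     /* netmask, host byte order */
	bool     has_mask; /* false models a NULL netmask */
	bool     host;     /* models the RTF_HOST flag */
};

/*
 * Normalize a request so that later code never has to fix it up:
 *  - a missing netmask or a /32 netmask both become the single
 *    "host route" form (no netmask, host flag set);
 *  - otherwise the host bits of dst are cleared.
 */
static void
normalize_route(struct route_req *r)
{
	assert(r != NULL);

	if (!r->has_mask || r->mask == UINT32_MAX) {
		r->has_mask = false;
		r->mask = 0;
		r->host = true;
		return;
	}
	r->dst &= r->mask; /* drop host bits, e.g. 192.168.1.23/24 -> 192.168.1.0/24 */
	r->host = false;
}
```

With this sketch, a request for 192.168.1.23 with a /24 mask ends up as 192.168.1.0/24, and a request for 192.168.1.23 with a /32 mask ends up in the same form as one sent with no netmask at all.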
* Add ifa_try_ref() to simplify ifa handling inside epoch.Alexander V. Chernikov2021-02-162-1/+12
  More and more code migrates from lock-based protection to the NET_EPOCH umbrella. It requires some logic changes, including, notably, refcount handling.

  When we have an `ifa` pointer and we're running inside epoch, we're guaranteed that this pointer will not be freed. However, the following case can still happen:
  * in thread 1 we drop the ifa refcount to 0 and schedule its deletion.
  * in thread 2 we use this ifa and reference it
  * destroy callout kicks in
  * unhappy user reports bug

  To address it, a new `ifa_try_ref()` function is added, which returns failure when we try to reference an `ifa` with a 0 refcount. Additionally, the existing `ifa_ref()` is enforced with `KASSERT` to provide a cleaner error in such scenarios.

  Reviewed By: rstone, donner
  Differential Revision: https://reviews.freebsd.org/D28639
  MFC after: 1 week
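The "acquire only if the count is not already zero" pattern described above can be sketched with C11 atomics. This is not the kernel's `ifa_try_ref()` implementation; `struct obj`, `obj_try_ref()` and `obj_ref()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; stands in for an ifa. */
struct obj {
	atomic_uint refcount;
};

/*
 * Take a reference only if the object is still live.
 * Returns false when the count has already dropped to zero,
 * i.e. destruction has been scheduled by another thread.
 */
static bool
obj_try_ref(struct obj *o)
{
	unsigned int old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return (true);
	}
	return (false);
}

/*
 * Unconditional reference: the caller must already hold a reference.
 * The assertion plays the role of the KASSERT mentioned in the commit.
 */
static void
obj_ref(struct obj *o)
{
	unsigned int old = atomic_fetch_add(&o->refcount, 1);

	assert(old != 0);
	(void)old;
}
```

A caller running inside an epoch section would use the try variant and simply skip the object when it returns false, rather than resurrecting an object whose teardown is already pending.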
* Make in_localip_more() fib-aware.Alexander V. Chernikov2021-02-161-12/+12
| | | | | | | | | | It fixes loopback route installation for the interfaces in the different fibs using the same prefix. Reviewed By: donner PR: 189088 Differential Revision: https://reviews.freebsd.org/D28673 MFC after: 1 week
* jail: Handle a possible race between jail_remove(2) and fork(2)Jamie Gritton2021-02-163-3/+28
  jail_remove(2) includes a loop that sends SIGKILL to all processes in a jail, but skips processes in PRS_NEW state. Thus it is possible that a process in mid-fork(2) during jail removal can survive the jail being removed.

  Add a prison flag PR_REMOVE, which is checked before the new process returns. If the jail is being removed, the process will then exit. Also check this flag in jail_attach(2), which has a similar issue.

  Reported by: trasz
  Approved by: kib
  MFC after: 3 days
* Use iflib_if_init_locked() during media change instead of iflib_init_locked().Allan Jude2021-02-161-1/+1
  iflib_init_locked() assumes that iflib_stop() has been called; however, it is not called for media changes. iflib_if_init_locked() calls stop and then init, which fixes the problem.

  PR: 253473
  MFC after: 3 days
  Reviewed by: markj
  Sponsored by: Juniper Networks, Inc., Klara, Inc.
  Differential Revision: https://reviews.freebsd.org/D28667
* linux: Unmap the VDSO page when unloadingMark Johnston2021-02-165-5/+12
| | | | | | | | | | | linux_shared_page_init() creates an object and grabs and maps a single page to back the VDSO. When destroying the VDSO object, we failed to destroy the mapping and free KVA. Fix this. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28696
* xen/efi: introduce a PV interface for EFI run time services for dom0Roger Pau Monné2021-02-162-0/+256
  FreeBSD, when running as a dom0 under Xen, is not supposed to access the run time services directly, and should instead proxy the calls through Xen using a hypercall interface that exposes access to selected run time services.

  Implement the efirt interface on top of the Xen-provided hypercalls.

  Sponsored by: Citrix Systems R&D
  Reviewed by: kib
  Differential revision: https://reviews.freebsd.org/D28621
* efirt: add hooks for diverging EFI implementationsRoger Pau Monné2021-02-162-30/+127
  Introduce a set of hooks for MI EFI public functions, so that a new implementation can be done. This will be used to implement the Xen PV EFI interface that's used when running FreeBSD as a Xen dom0 from UEFI firmware.

  Also make efi_status_to_errno non-static, since it will be used to evaluate status return values from the PV interface.

  No functional change intended.

  Sponsored by: Citrix Systems R&D
  Reviewed by: kib, imp
  Differential revision: https://reviews.freebsd.org/D28620
* xen/boot: allow specifying boot method when booted from XenRoger Pau Monné2021-02-165-8/+16
| | | | | | | | | | | | | Allow setting the bootmethod variable from the Xen PVH entry point, in order to be able to correctly set the underlying firmware mode when booted as a dom0. Move the bootmethod variable to be defined in x86/cpu_machdep.c instead so it can be shared by both i386 and amd64. Sponsored by: Citrix Systems R&D Reviewed by: kib Differential revision: https://reviews.freebsd.org/D28619
* stand/multiboot2: add support for booting a Xen dom0 in UEFI modeRoger Pau Monné2021-02-161-0/+1
| | | | | | | | | | | | | | | | | | | | Add some basic multiboot2 infrastructure to the EFI loader in order to be capable of booting a FreeBSD/Xen dom0 when booted from UEFI. Only a very limited subset of the multiboot2 protocol is implemented in order to support enough to boot into Xen, the implementation doesn't intend to be a full multiboot2 capable implementation. Such multiboot2 functionality is hooked up into the amd64 EFI loader, which is the only architecture that supports Xen dom0 on FreeBSD. The options to boot a FreeBSD/Xen dom0 system are exactly the same as on BIOS, and requires setting the xen_kernel and xen_cmdline options in loader.conf. Sponsored by: Citrix Systems R&D Reviewed by: tsoome, imp Differential revision: https://reviews.freebsd.org/D28497
* update the SACK loss recovery to RFC6675, with the following new features:Richard Scheffenegger2021-02-162-5/+64
  - improved pipe calculation which does not degrade under heavy loss
  - engaging in Loss Recovery earlier under adverse conditions
  - Rescue Retransmission in case some of the trailing packets of a request got lost

  All above changes are toggled with the sysctl "rfc6675_pipe" (disabled by default).

  Reviewers: #transport, tuexen, lstewart, slavash, jtl, hselasky, kib, rgrimes, chengc_netapp.com, thj, #manpages, kbowling, #netapp, rscheff
  Reviewed By: #transport
  Subscribers: imp, melifaro
  MFC after: 2 weeks
  Sponsored by: NetApp, Inc.
  Differential Revision: https://reviews.freebsd.org/D18985
* zfs: change file mode of all merged testsMartin Matuska2021-02-1632-0/+0
| | | | | | | | If the ksh files are not executable then the tests are not run and reported as failed. MFC after: 2 weeks X-MFC-with: 6b52139eb8e8eda0ea263b24735556194f918642
* UFS snapshots: properly set the vm object size.Konstantin Belousov2021-02-161-0/+4
| | | | | | | | | | | | | | | | | | Citing Kirk: The previous code [before 8563de2f2799b2cb -- kib] did not call vnode_pager_setsize() but worked because later in ffs_snapshot() it does a UFS_WRITE() to output the snaplist. Previously the UFS_WRITE() allocated the extra block at the end of the file which caused it to do the needed vnode_pager_setsize(). But the new code had already allocated the extra block, so UFS_WRITE() did not extend the size and thus did not do the vnode_pager_setsize(). PR: 253158 Reported by: Harald Schmalzbauer <bugzilla.freebsd@omnilan.de> Reviewed by: mckusick Tested by: cy Sponsored by: The FreeBSD Foundation MFC after: 1 week
* pgcache read: protect against reads past end of the vm object sizeKonstantin Belousov2021-02-161-0/+4
| | | | | | | | | | | | If uio_offset is past end of the object size, calculated resid is negative. Delegate handling this case to the locked read, as any other non-trivial situation. PR: 253158 Reported by: Harald Schmalzbauer <bugzilla.freebsd@omnilan.de> Tested by: cy Sponsored by: The FreeBSD Foundation MFC after: 1 week
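The failure mode in the commit above is a signed subtraction going negative when the read offset lies beyond the object size. A minimal sketch of the guard, assuming a hypothetical helper `fast_read_len()` that is not part of the actual kernel code, might look like this:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how many bytes a lock-free fast-path read may
 * serve. Returns -1 when the offset is past the end of the object, so
 * the caller can fall back to the ordinary locked read path instead of
 * working with a negative residual.
 */
static int64_t
fast_read_len(int64_t obj_size, int64_t offset, int64_t want)
{
	int64_t avail;

	assert(obj_size >= 0 && offset >= 0 && want >= 0);

	if (offset > obj_size)
		return (-1); /* obj_size - offset would be negative */

	avail = obj_size - offset;
	return (want < avail ? want : avail);
}
```

For a 4096-byte object, a 512-byte read at offset 4000 is clamped to 96 bytes, while a read at offset 8192 is reported as "use the slow path" rather than producing a negative length.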
* zfs: merge OpenZFS master-436ab35a5Martin Matuska2021-02-16229-1778/+3260
| | | | | | | | | | | - speed up writing to ZFS pools without ZIL devices (aa755b3) - speed up importing ZFS pools (2d8f72d, a0e0199, cf0977a) ... MFC after: 2 weeks Reviewed by: mjg (partial) Tested by: pho Differential Revision: https://reviews.freebsd.org/D28677
* Fix fget_only_user() to return ENOTCAPABLE on a failed capsicum checkAlex Richardson2021-02-151-1/+1
  After eaad8d1303da500ed691bd774742a4555a05e729 four additional capsicum-test tests started failing. It turns out this is because fget_only_user() was returning EBADF on a failed capsicum check instead of forwarding the return value of cap_check_inline() like fget_unlocked_seq().

  capsicum-test failures before this:
  ```
  [ FAILED ] 7 tests, listed below:
  [ FAILED ] Capability.OperationsForked
  [ FAILED ] Capability.NoBypassDAC
  [ FAILED ] Pdfork.OtherUserForked
  [ FAILED ] PipePdfork.WildcardWait
  [ FAILED ] OpenatTest.WithFlag
  [ FAILED ] ForkedOpenatTest_WithFlagInCapabilityMode._
  [ FAILED ] Select.LotsOFileDescriptorsForked
  ```
  After:
  ```
  [ FAILED ] 3 tests, listed below:
  [ FAILED ] Capability.NoBypassDAC
  [ FAILED ] Pdfork.OtherUserForked
  [ FAILED ] PipePdfork.WildcardWait
  ```

  Reviewed By: mjg
  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D28691
* Remove per-packet ifa refcounting from IPv6 fast path.Alexander V. Chernikov2021-02-159-39/+22
| | | | | | | | | | | | | | | | | | | Currently ip6_input() calls in6ifa_ifwithaddr() for every local packet, in order to check if the target ip belongs to the local ifa in proper state and increase its counters. in6ifa_ifwithaddr() references found ifa. With epoch changes, both `ip6_input()` and all other current callers of `in6ifa_ifwithaddr()` do not need this reference anymore, as epoch provides stability guarantee. Given that, update `in6ifa_ifwithaddr()` to allow it to return ifa without referencing it, while preserving option for getting referenced ifa if so desired. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28648
* Enforce net epoch in in6_selectsrc().Alexander V. Chernikov2021-02-157-0/+22
  in6_selectsrc() may call fib6_lookup() in some cases, which requires epoch. Wrap in6_selectsrc* calls into epoch inside its users. Mark it as requiring epoch by adding NET_EPOCH_ASSERT().

  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D28647
* Remove now-unused RTF_RNH_LOCKED route flag.Alexander V. Chernikov2021-02-152-3/+1
| | | | MFC after: 1 week
* Fix divide-by-zero panic when ASLR is enabled and superpages disabledJason A. Harmening2021-02-151-2/+3
| | | | | | | | | | | | | | | When locating the anonymous memory region for a vm_map with ASLR enabled, we try to keep the slid base address aligned on a superpage boundary to minimize pagetable fragmentation and maximize the potential usage of superpage mappings. We can't (portably) do this if superpages have been disabled by loader tunable and pagesizes[1] is 0, and it would be less beneficial in that case anyway. PR: 253511 Reported by: johannes@jo-t.de MFC after: 1 week Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D28678
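The guard described above amounts to skipping the superpage alignment step when the superpage size is reported as 0. A minimal sketch follows; `align_anon_base()` is a hypothetical helper and the real kernel code is structured differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: round a randomized base address down to a
 * superpage boundary. superpage_size models pagesizes[1], which is 0
 * when superpages are disabled; in that case the base is returned
 * unchanged instead of dividing by zero.
 */
static uintptr_t
align_anon_base(uintptr_t base, size_t superpage_size)
{
	uintptr_t aligned;

	if (superpage_size == 0)
		return (base);

	aligned = base - (base % superpage_size);
	assert(aligned % superpage_size == 0);
	return (aligned);
}
```

With a 2 MB superpage size (0x200000), a base of 0x201234 is rounded down to 0x200000; with a size of 0 the same base is passed through untouched.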
* lockmgr: shrink struct lock by 8 bytes on LP64Mateusz Guzik2021-02-154-10/+20
| | | | | | | | | | | | | | | | | | Currently the struct has a 4 byte padding stemming from 3 ints. 1. prio comfortably fits in short, unfortunately there is no dedicated type for it and plumbing it throughout the codebase is not worth it right now, instead an assert is added which covers also flags for safety 2. lk_exslpfail can in principle exceed u_short, but the count is already not considered reliable and it only ever gets modified straight to 0. In other words it can be incrementing with an upper bound of USHRT_MAX With these in place struct lock shrinks from 48 to 40 bytes. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D28680
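The padding effect described above can be reproduced with two illustrative layouts. These structs are only an approximation written for this example; the field names, order and the `struct lock_object_sketch` header are assumptions, not the real definitions from the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the common lock header. */
struct lock_object_sketch {
	const char   *name;
	unsigned int  flags;
	unsigned int  data;
	void         *witness;
};

/* Before: three int-sized fields after a pointer-sized word leave tail padding on LP64. */
struct lock_old {
	struct lock_object_sketch obj;
	volatile uintptr_t        lk_lock;
	unsigned int              lk_exslpfail;
	int                       lk_timo;
	int                       lk_pri;
};

/* After: two of the fields are narrowed to short, so they share one 4-byte slot. */
struct lock_new {
	struct lock_object_sketch obj;
	volatile uintptr_t        lk_lock;
	unsigned short            lk_exslpfail;
	short                     lk_pri;
	int                       lk_timo;
};

static_assert(sizeof(struct lock_new) < sizeof(struct lock_old),
    "narrowing the two fields should shrink the struct");

/* Number of bytes saved per lock by the narrower layout. */
static size_t
lock_saved_bytes(void)
{
	return (sizeof(struct lock_old) - sizeof(struct lock_new));
}
```

On a typical LP64 ABI these sketch layouts come out at 48 and 40 bytes, matching the figures quoted in the commit message, although the exact numbers depend on the platform's alignment rules.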
* linux: drop unneeded castsEdward Tomasz Napierala2021-02-151-3/+3
| | | | | | | No functional changes. Sponsored By: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28533
* zfs: Avoid updating the L2ARC device header unnecessarilyMartin Matuska2021-02-151-1/+3
| | | | | | | | | | | | | From openzfs-master 0ae184a6b commit message: If we do not write any buffers to the cache device and the evict hand has not advanced do not update the cache device header. Cherry-picked from openzfs 0ae184a6baaf71e155e9b19af81b75474622ff58 Patch Author: George Amanakis <gamanakis@gmail.com> MFC after: 3 days Reviewed by: delphij Differential Revision: https://reviews.freebsd.org/D28682
* zfs: fix RAIDZ2/3 not healing parity with 2+ bad disksMartin Matuska2021-02-154-12/+202
| | | | | | | | | | | | | | | | | | | | | | From openzfs-master 62d4287f2 commit message: When scrubbing, (non-sequential) resilvering, or correcting a checksum error using RAIDZ parity, ZFS should heal any incorrect RAIDZ parity by overwriting it. For example, if P disks are silently corrupted (P being the number of failures tolerated; e.g. RAIDZ2 has P=2), `zpool scrub` should detect and heal all the bad state on these disks, including parity. This way if there is a subsequent failure we are fully protected. With RAIDZ2 or RAIDZ3, a block can have silent damage to a parity sector, and also damage (silent or known) to a data sector. In this case the parity should be healed but it is not. Cherry-picked from openzfs 62d4287f279a0d184f8f332475f27af58b7aa87e Patch Author: Matthew Ahrens <matthew.ahrens@delphix.com> MFC after: 3 days Reviewed by: delphij Differential Revision: https://reviews.freebsd.org/D28681
* Fix for locking order reversal in USB audio driver, when using mmap().Hans Petter Selasky2021-02-141-6/+17
  Locking the second lock, which causes the LOR, can be skipped because the code updating the shared variables is always executing from the same USB thread.

  lock order reversal:
  1st 0xfffff80005cc3840 pcm7:play:dsp7.p0 (pcm play channel, sleep mutex) @ usb_transfer.c:2342
  2nd 0xfffff80005cc3860 pcm7:record:dsp7.r0 (pcm record channel, sleep mutex) @ uaudio.c:2317
  lock order pcm record channel -> pcm play channel established at:
  witness_checkorder+0x461
  __mtx_lock_flags+0x98
  dsp_mmap_single+0x151
  vm_mmap_cdev+0x65
  devfs_mmap_f+0x143
  kern_mmap_req+0x594
  sys_mmap+0x46
  amd64_syscall+0x12e
  fast_syscall_common+0xf8
  lock order pcm play channel -> pcm record channel attempted at:
  witness_checkorder+0xd82
  __mtx_lock_flags+0x98
  uaudio_chan_play_callback+0xeb
  usbd_callback_wrapper+0x7ec
  usb_command_wrapper+0x7e
  usb_callback_proc+0x8e
  usb_process+0xf3
  fork_exit+0x80
  fork_trampoline+0xe

  Found by: Stefan Ehmann <shoesoft@gmx.net>
  MFC after: 1 week
  Sponsored by: Mellanox Technologies // NVIDIA Networking
* Only require mac_veriexec for verified_execSimon J. Gerraty2021-02-141-1/+1
  The veriexec option is redundant, mac_veriexec is sufficient.

  MFC after: 1 week
* pf: Slightly relax pf_rule_addr validationKristof Provost2021-02-141-17/+30
  Ensure we don't reject no-route / urpf-failed addresses.

  PR: 253479
  Reported by: michal AT microwave.sk
  Reviewed by: donner@
  MFC after: 3 days
  Differential Revision: https://reviews.freebsd.org/D28650
* tcp: improve behaviour when using TCP_NOOPTMichael Tuexen2021-02-141-1/+4
| | | | | | | | | | Use ISS for SEG.SEQ when sending a SYN-ACK segment in response to an SYN segment received in the SYN-SENT state on a socket having the IPPROTO_TCP level socket option TCP_NOOPT enabled. Reviewed by: rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D28656
* Do not reference returned ifa in in6_ifawithifp().Alexander V. Chernikov2021-02-142-12/+3
  The only place where in6_ifawithifp() is used is ip6_output(), which uses the returned ifa to bump traffic counters. Given that ifa stability guarantees are provided by epoch, do not refcount the ifa.

  This eliminates 2 atomic ops from the IPv6 fast path.

  Reviewed By: rstone
  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D28649
* procstat: distinguish vm map guards in procstat vm output.Konstantin Belousov2021-02-142-2/+6
| | | | | | | Requested and reviewed by: rwatson (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28658
* hidraw: Make HIDIOCGRDESCSIZE ioctl return report descriptor sizeVladimir Kondratyev2021-02-131-1/+1
| | | | | | | | defined by hardware rather than cached one to match HIDIOCGRDESC ioctl. This fixes errors reported by hid-tools being run against /dev/hidraw# device node belonging to driver which overloads report descriptor. MFC after: 1 week
* hkbd: Fix handling of keyboard ErrorRollOver reportsVladimir Kondratyev2021-02-131-1/+6
  Ignore phantom keyboard state reports entirely rather than ignoring RollOver states for each key separately. The latter results in spurious release/push pairs of events on each phantom keyboard state report.

  Reported by: Jan Martin Mikkelsen <janm_AT_transactionware_DOT_com>
  Submitted by: Jan Martin Mikkelsen (initial version)
  PR: 253249
  MFC after: 1 week
* ukbd: Fix handling of keyboard ErrorRollOver reportsVladimir Kondratyev2021-02-131-1/+6
  Ignore phantom keyboard state reports entirely rather than ignoring RollOver states for each key separately. The latter results in spurious release/push pairs of events on each phantom keyboard state report.

  Reported by: Jan Martin Mikkelsen <janm_AT_transactionware_DOT_com>
  Submitted by: Jan Martin Mikkelsen (initial version)
  PR: 253249
  MFC after: 1 week
* fusefs: set d_off during VOP_READDIRAlan Somers2021-02-131-6/+7
| | | | | | | | | | | | This allows d_off to be used with lseek to position the file so that getdirentries(2) will return the next entry. It is not used by readdir(3). PR: 253411 Reported by: John Millikin <jmillikin@gmail.com> Reviewed by: cem MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28605
* Fix ifa refcount leak during route addition.Alexander V. Chernikov2021-02-131-4/+2
| | | | | | Reported by: rstone Reviewed by: rstone MFC after: 1 day
* Fix various NOINET* builds broken by 145bf6c0af48.Alexander V. Chernikov2021-02-121-0/+4
| | | | Reported by: mjg, bdragon
* Fix interface route addition with net/bird.Alexander V. Chernikov2021-02-121-24/+26
| | | | | | | | | | The case of adding interface route by specifying interface address as the gateway was missed during code refactoring. Re-add it back by copying non-AF_LINK gateway data when RTF_GATEWAY is not set. Reviewed by: donner MFC after: 3 days
* Fix bug 253158 - Panic: snapacct_ufs2: bad block - mksnap_ffs(8) crashKirk McKusick2021-02-121-67/+70
| | | | | | | | | | | | | | | | | | | | | | | The panic reported in 253158 arises because the /mnt/.snap/.factory snapshot allocated the last block in the filesystem. The snapshot code allocates the last block in the filesystem as a way of setting its length to be the size of the filesystem. Part of taking a snapshot is to remove all the earlier snapshots from the image of the newest snapshot so that newer snapshots will not claim the blocks of the earlier snapshots. The panic occurs when the new snapshot finds that both it and an earlier snapshot claim the same block. The fix is to set the size of the snapshot to be one block after the last block in the filesystem. This block can never be allocated since it is not a valid block in the filesystem. This extra block is used as a place to store the initial list of blocks that the snapshot has already copied and is used to avoid a deadlock in and speed up the ffs_copyonwrite() function. Reported by: Harald Schmalzbauer Tested by: Peter Holm PR: 253158 Sponsored by: Netflix
* fifo: minor comment and assert improvements.Konstantin Belousov2021-02-122-4/+6
| | | | | | | | | | In particular, replace a note that reload through vget() is obsoleted, with explanation why this code is required. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
* ffs_unlock: assert that IN_ENDOFF is not leaked past locked scopeKonstantin Belousov2021-02-121-0/+3
| | | | | | | | | | This catches both missed processing of IN_ENDOFF and missed application of VOP_VPUT_PAIR() after VOP that created an entry in the directory. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
* ffs softdep: Force processing of VI_OWEINACT vnodes when there is inode shortageKonstantin Belousov2021-02-122-0/+63
| | | | | | | | | | Such vnodes prevent inode reuse, and should be force-cleared when ffs_valloc() is unable to find a free inode. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
* softdep_request_cleanup: wait for softdep_request_clean_flush() to passKonstantin Belousov2021-02-121-0/+6
| | | | | | | | | | | if we noted a parallel request is active and declined to overflow the system with parallel redundant sync of the vnodes. But we need to wait for the flush to finish to see if there are any freed resources. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
* ufs_inactive(): stop hiding ERELOOKUP from ffs_truncate(), return it.Konstantin Belousov2021-02-122-6/+5
| | | | | | | | | | VFS should retry inactivation when possible, then. This should provide timely removal of unlinked unreferenced inodes. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
* Stop ignoring ERELOOKUP from VOP_INACTIVE()Konstantin Belousov2021-02-123-16/+42
  When possible, relock the vnode and retry inactivation. Only vunref() is required not to drop the vnode lock, so handle it specially by not retrying.

  This is a part of the effort to ensure that an unlinked, unreferenced vnode does not prevent its inode from being reused.

  Reviewed by: chs, mckusick
  Tested by: pho
  MFC after: 2 weeks
  Sponsored by: The FreeBSD Foundation
* ufs vnops: brace softdep_prelink() with DOINGSUJ instead of DOINGSOFTDEPKonstantin Belousov2021-02-121-6/+6
  because softdep_prelink() is reverted to a NOP for the non-J case. There is no need to do anything before ufs_direnter() in the SU/non-J case; everything required to sync the directory is done in VOP_VPUT_PAIR().

  Suggested by: mckusick
  Reviewed by: chs, mckusick
  Tested by: pho
  MFC after: 2 weeks
  Sponsored by: The FreeBSD Foundation
* ffs softdep: remove will_direnter argument of softdep_prelink()Konstantin Belousov2021-02-123-45/+15
  Originally this was done in 8a1509e442bc9a075 to forcibly cover cases where a hole in the directory could be created by extending into an indirect block, since the dependency of writing out the indirect block is not tracked.

  This results in an excessive amount of directory fsyncing, where every creation of a new entry forces an fsync before it. This is not needed: it is enough to fsync when IN_NEEDSYNC is set, and VOP_VPUT_PAIR() provides the required hook to perform only the required syncing.

  The series of changes culminating in this commit puts the performance of metadata-intensive loads back to what it was before 8a1509e442bc9a075.

  Analyzed by: mckusick
  Reviewed by: chs, mckusick
  Tested by: pho
  MFC after: 2 weeks
  Sponsored by: The FreeBSD Foundation
* ufs_direnter: directory truncation does not need special case for renameKonstantin Belousov2021-02-124-26/+23
| | | | | | | | | | | | | | | | | | | In ufs_rename case, tdvp is locked from the place where ufs_direnter() is done till VOP_VPUT_PAIR(), which means that we no longer need to specially handle rename in ufs_direnter(). Truncation, if possible, is done in the same way in ffs_vput_pair() both for rename and other VOPs calling ufs_direnter(). Remove isrename argument and set IN_ENDOFF if ufs_direnter() succeeded and directory needs truncation. In ffs_vput_pair(), stop verifying the condition that directory needs truncation when IN_ENDOFF is set, instead assert that the condition is true. Suggested by: mckusick Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
* ufs_rename: use VOP_VPUT_PAIR and rely on directory sync/truncation thereKonstantin Belousov2021-02-121-28/+6
| | | | | | | | Suggested by: mckusick Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
* ufs_direnter: move directory truncation to ffs_vput_pair().Konstantin Belousov2021-02-123-25/+46
| | | | | | | | | | | | | | | | VOP_VPUT_PAIR() provides the hook to do the truncation right before unlock, which is required since truncation might need to fsync(), which itself might unlock the directory vnode. Set new flag IN_ENDOFF which indicates that i_endoff is valid and should be checked against inode size. Excessive size is chomped, but this operation is advisory and failure to truncate should not result in the failure of the main VOP. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
* ffs_vput_pair(): try harder to recover from the vnode reclaimKonstantin Belousov2021-02-121-3/+36
| | | | | | | | | | | | | | | | | In particular, if unlock_vp is false, save vp's inode number and generation. If ffs_inotovp() can re-create the vnode with the same number and generation after we finished with handling dvp, then we most likely raced with unmount, and were able to restore atomicity of open. We use FFSV_REPLACE_DOOMED there, to drop the old vnode. This additional recovery is not strictly required, but it improves the quality of the implementation. Suggested by: mckusick Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation