path: root/sys/vm
* vm: Use proper prototype for SYSINIT functions
  Zhenlei Huang, 2026-01-31 (2 files, -3/+3)
  MFC after: 1 week
  (cherry picked from commit a5d5851c86ebba87f580e4f9bada495ebeedc465)
  (cherry picked from commit 27b24359656a3d30828595ade1b824be3fac4f83)
* vm_object.h: tweak OBJ_ONEMAPPING comment even more
  Konstantin Belousov, 2026-01-10 (1 file, -3/+2)
  (cherry picked from commit dcb80621bbf9a733b91f1a011af873318fac2709)
* vm/vm_object.h: clarify the OBJ_ONEMAPPING semantic
  Konstantin Belousov, 2026-01-10 (1 file, -2/+3)
  (cherry picked from commit 9c923575606bbd29dcf0ec3384150d2d67136cbb)
* vm_fault_trap(): fix comments grammar
  Konstantin Belousov, 2025-12-24 (1 file, -6/+6)
  (cherry picked from commit 95788a851deb33242c18beb47f8a79eec320dfa5)
* vm_domainset: Ensure round-robin works properly
  Olivier Certner, 2025-12-19 (1 file, -3/+7)

  All iterators that rely on an object's 'struct domainset_ref' (field 'domain' of
  'struct vm_object'), which is the case for page allocations with objects, are used
  with the corresponding object locked for writing, so they cannot lose concurrent
  increments of the iterator index even if those are made without atomic operations.
  The only offender was thread stack allocation, which has just been fixed in commit
  3b9b64457676 ("vm: Fix iterator usage in vm_thread_stack_create()").

  However, the interleaved policy would still reset the iterator index when
  restarting, losing track of the next domain to allocate from when applying
  round-robin, which all allocation policies do if allocation from the first domain
  fails. Fix this last round-robin problem by not resetting the shared index at the
  iterator's phase init on DOMAINSET_POLICY_INTERLEAVE.

  Add an assertion to check that, when passed, an object is write-locked, in order
  to prevent the problem mentioned in the first paragraph from reappearing.

  Reviewed by: markj
  MFC after: 1 week
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D52733
  (cherry picked from commit 7b0fe2d405ae09b1247bccc6fa45a6d2755cbe4c)

  In stable/14, all page allocations with an object have the latter locked for
  writing. Contrary to what happened in main (and stable/15), the offender mentioned
  in the original commit message never appeared in stable/14, because it was
  introduced in main by commit 7a79d0669761 ("vm: improve kstack_object pindex
  calculation to avoid pindex holes") and later fixed by commit 3b9b64457676 ("vm:
  Fix iterator usage in vm_thread_stack_create()"), neither of which was MFCed. So
  the following part of the original commit message does not apply here: "The only
  offender was thread stack allocation, which has just been fixed in commit
  3b9b64457676 ("vm: Fix iterator usage in vm_thread_stack_create()")."
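The round-robin fix above hinges on the iterator's shared cursor surviving re-initialization. Here is a minimal user-space sketch of that idea (all names are hypothetical stand-ins, not the actual vm_domainset code): the cursor lives outside the iterator, and init deliberately does not reset it, so a restarted allocation resumes the rotation where the previous one left off.

```c
/* Hypothetical model of an interleave-policy domain iterator whose
 * round-robin cursor is shared and preserved across iterator inits,
 * mirroring the commit's "do not reset the shared index" fix. */
struct rr_iter {
	unsigned *shared_idx;	/* persists across iterator inits */
	int ndomains;
	int remaining;		/* domains left to visit this pass */
};

static void
rr_iter_init(struct rr_iter *it, unsigned *shared_idx, int ndomains)
{
	/* Note: *shared_idx is deliberately NOT reset here. */
	it->shared_idx = shared_idx;
	it->ndomains = ndomains;
	it->remaining = ndomains;
}

static int
rr_iter_next(struct rr_iter *it)
{
	if (it->remaining-- == 0)
		return (-1);	/* every domain tried once */
	return ((int)((*it->shared_idx)++ % (unsigned)it->ndomains));
}
```

With three domains, two allocations from a fresh iterator yield domains 0 and 1; re-initializing the iterator and allocating again yields domain 2 rather than starting over at 0.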
* uma_core: Rely on domainset iterator to wait on M_WAITOK
  Olivier Certner, 2025-12-19 (1 file, -16/+6)

  Commit 8b987a77691d ("Use per-domain keg locks.") removed the need to lock the keg
  entirely, replacing it with per-domain keg locks. In particular, it removed the
  need to hold a lock over waiting for a domain to grow free memory.

  Simplify the code of keg_fetch_slab() and uma_prealloc() by removing the
  M_WAITOK -> M_NOWAIT downgrade and the local call to vm_wait_doms() (which used to
  necessitate temporarily dropping the keg lock), which the iterator machinery
  already handles on M_WAITOK (and compatibly with vm_domainset_iter_ignore() at
  that, although that does not matter now).

  Reviewed by: bnovkov, markj
  Tested by: bnovkov
  MFC after: 3 days
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D52441
  (cherry picked from commit 781802df7a2bfe224ef17596d56cf83c49517655)
* vm_fault: only rely on PG_ZERO when the page was newly allocated
  Konstantin Belousov, 2025-12-15 (1 file, -1/+5)
  (cherry picked from commit cff67bc43df14d492ccc08ec92fddceadd069953)
* vm_page.h: remove no longer defined (P) locking annotation
  Konstantin Belousov, 2025-12-15 (1 file, -2/+2)
  (cherry picked from commit 636ee0269db04ce22a0f5e32723bab79be69443d)
* vm_object_page_remove(): clear pager even if there are no resident pages
  Konstantin Belousov, 2025-12-01 (1 file, -1/+2)
  (cherry picked from commit 72a447d0bc768c7fe8a9c972f710c75afebd581b)
* vm_fault_busy_sleep(): update comment after addition of allocflags arg
  Konstantin Belousov, 2025-10-13 (1 file, -5/+4)
  (cherry picked from commit f1b656f14464c2e3ec4ab2eeade3b00dce4bd459)
* vm_fault: assert that first_m is xbusy
  Konstantin Belousov, 2025-10-13 (1 file, -0/+8)
  (cherry picked from commit a38483fa2b3a26414d3409b12dd35ac406c44cea)
* vm_fault: try to only share-busy page for soft faults
  Konstantin Belousov, 2025-10-13 (1 file, -15/+93)
  (cherry picked from commit 149674bbac5842ac883414a6c1e75d829c70d42b)
* vm_fault: add helper vm_fault_can_cow_rename()
  Konstantin Belousov, 2025-10-13 (1 file, -9/+11)
  (cherry picked from commit 3f05bbdbd80f2eefb647e595dc73e80d6186d6a5)
* vm_fault: add vm_fault_might_be_cow() helper
  Konstantin Belousov, 2025-10-13 (1 file, -5/+11)
  (cherry picked from commit 5bd4c04a4e7f7bda657e6027e64675d0caf50715)
* vm_fault_busy_sleep(): pass explicit allocflags for vm_page_busy_sleep()
  Konstantin Belousov, 2025-10-13 (1 file, -3/+3)
  (cherry picked from commit c6b79f587f27649f90e00bc131d37bafa50ffc62)
* vm_fault: drop never-true busy_sleep test
  Doug Moore, 2025-10-13 (1 file, -2/+1)
  (cherry picked from commit 2d6185cf87e815d4951a9ddcf5c535ebd07a8815)
* vm/vm_fault.c: cleanup includes
  Konstantin Belousov, 2025-10-13 (1 file, -2/+0)
  (cherry picked from commit 0854b4f569e1e68032e431b1efb45b9fd9849194)
* vm/vm_fault.c: update and split comments for vm_fault() and vm_fault_trap()
  Konstantin Belousov, 2025-09-23 (1 file, -12/+30)
  (cherry picked from commit 22cce201da76a1916be5c993201f0478f3048292)
* vm_pageout: Scan inactive dirty pages less aggressively
  Mark Johnston, 2025-09-21 (3 files, -15/+49)

  Consider a database workload where the bulk of RAM is used for a fixed-size
  file-backed cache. Any leftover pages are used for filesystem caching or anonymous
  memory. In particular, there is little memory pressure and the inactive queue is
  scanned rarely.

  Once in a while, the free page count dips a bit below the setpoint, triggering an
  inactive queue scan. Since almost all of the memory there is used by the database
  cache, the scan encounters only referenced and/or dirty pages, moving them to the
  active and laundry queues. In particular, it ends up completely depleting the
  inactive queue, even for a small, non-urgent free page shortage. This scan might
  process many gigabytes worth of pages in one go, triggering VM object lock
  contention (on the DB cache file's VM object) and consuming CPU, which can cause
  application latency spikes.

  Given this behaviour, the conclusion is that we should abort scanning once we have
  encountered many dirty pages without meeting the shortage. In general we have
  tried to make the page daemon control loops avoid large bursts of work, and if a
  scan fails to turn up clean pages, there is not much use in moving everything to
  the laundry queue at once. In particular, pacing this work ensures that the page
  daemon is not frequently acquiring and releasing the VM object lock over long
  periods, especially when multiple page daemon threads are in use.

  Modify the inactive scan to abort early if we encounter enough dirty pages without
  meeting the shortage. If the shortage has not been met, this will trigger
  shortfall laundering, wherein the laundry thread will clean as many pages as
  needed to meet the instantaneous shortfall. Laundered pages will be placed near
  the head of the inactive queue, so they will be immediately visible to the page
  daemon during its next scan of the inactive queue.

  Reviewed by: alc, kib
  MFC after: 1 month
  Sponsored by: Modirum MDPay
  Sponsored by: Klara, Inc.
  Differential Revision: https://reviews.freebsd.org/D48337
  (cherry picked from commit 095f6305772be1dae27e7af9d87db0387625440d)
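The early-abort policy in this entry can be sketched in a few lines. This is an illustrative model only (the struct, names, and budget rule are made up, not the actual vm_pageout code): a scan stops once a budget of dirty pages has been seen without satisfying the shortage, rather than depleting the whole queue in one burst.

```c
#include <stdbool.h>

/* Hypothetical sketch of an early-abort inactive-queue scan. */
struct scan_state {
	int shortage;		/* clean pages still needed */
	int dirty_budget;	/* dirty pages tolerated before aborting */
};

/* Returns the number of pages examined before the scan stopped. */
static int
scan_inactive(struct scan_state *ss, const bool *page_is_dirty, int npages)
{
	int examined = 0, dirty_seen = 0;

	for (int i = 0; i < npages && ss->shortage > 0; i++) {
		examined++;
		if (page_is_dirty[i]) {
			/* Would be moved to the laundry queue. */
			if (++dirty_seen >= ss->dirty_budget)
				break;	/* pace the work; retry after laundering */
		} else {
			ss->shortage--;	/* reclaimed a clean page */
		}
	}
	return (examined);
}
```

On an all-dirty queue the scan touches only `dirty_budget` pages and leaves the shortage for shortfall laundering; on an all-clean queue it stops as soon as the shortage is met.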
* vm_domainset: Refactor iterators, multiple fixes
  Olivier Certner, 2025-09-19 (5 files, -112/+170)

  vm_domainset_iter_first() would not check whether the initial domain selected by
  the policy was effectively valid (i.e., allowed by the domainset and not marked as
  ignored by vm_domainset_iter_ignore()). It would just try to skip it if it had
  fewer pages than 'free_min', and would not take into account the possibility of no
  domain being valid.

  Factor out code that logically belongs to the iterator machinery and is not tied
  to how allocations (or the impossibility thereof) are to be handled. This makes it
  possible to remove duplicated code between vm_domainset_iter_page() and
  vm_domainset_iter_policy(), and between vm_domainset_iter_page_init() and
  _vm_domainset_iter_policy_init(). It also allows removing the 'pages' parameter
  from vm_domainset_iter_page_init().

  This also makes the two-phase logic clearer, revealing an inconsistency between
  setting 'di_minskip' to true in vm_domainset_iter_init() (implying that, in the
  case of waiting allocations, further attempts after the first sleep should just
  allocate from the first domain, regardless of its situation with respect to
  'free_min') and trying to skip the first domain if it has too few pages in
  vm_domainset_iter_page_init() and _vm_domainset_iter_policy_init(). Fix this
  inconsistency by resetting 'di_minskip' to 'true' in vm_domainset_iter_first()
  instead, so that, after each vm_wait_doms() (waiting allocations that could not be
  satisfied immediately), we again start with only the domains that have more than
  'free_min' pages.

  While here, fix the minor quirk that the round-robin policy would start with the
  domain after the one pointed to by the initial value of 'di_iter' (this just
  affects the case of resetting '*di_iter', and would not cause domain skips in
  other circumstances, i.e., for waiting allocations that actually wait, or at each
  subsequent new iterator creation with the same iteration index storage).

  PR: 277476
  Tested by: Kenneth Raplee <kenrap_kennethraplee.com>
  Fixes: 7b11a4832691 ("Add files for r327895")
  Fixes: e5818a53dbd2 ("Implement several enhancements to NUMA policies.")
  Fixes: 23984ce5cd24 ("Avoid resource deadlocks when one domain has exhausted its memory."...)
  MFC after: 10 days
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D51251
  (cherry picked from commit 637d9858e6a8b4a8a3ee4dd80743a58bde4cbd68)
* vm_domainset: Simplify vm_domainset_iter_next()
  Olivier Certner, 2025-09-19 (1 file, -30/+2)

  As we are now visiting each domain only once, the test in
  vm_domainset_iter_prefer() about skipping the preferred domain (the one initially
  visited for policy DOMAINSET_POLICY_PREFER) becomes redundant. Removing it makes
  this function essentially the same as vm_domainset_iter_rr(). Thus, remove
  vm_domainset_iter_prefer(). This makes all policies behave the same in
  vm_domainset_iter_next().

  No functional change (intended).

  PR: 277476
  MFC after: 10 days
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D51250
  (cherry picked from commit d0b691a7c1aacf5a3f5ee6fc53f08563744d7203)
* vm_domainset: Only probe domains once when iterating, instead of up to 4 times
  Olivier Certner, 2025-09-19 (2 files, -23/+36)

  Because of the 'di_minskip' logic, which resets the initial domain, an iterator
  starts by considering only domains that have more than 'free_min' pages in a first
  phase, and then all domains in a second one. Non-"underpaged" domains are thus
  examined twice, even if the allocation can't succeed. Re-scanning the same domains
  twice just wastes time, as allocation attempts that must not wait may rely on
  failing sooner, and those that must wait will loop anyway (a domain previously
  scanned twice has more pages than 'free_min', and consequently vm_wait_doms() will
  just return immediately). Additionally, the DOMAINSET_POLICY_FIRSTTOUCH policy
  would aggravate this situation by reexamining the current domain again at the end
  of each phase. In the case of a single domain, this means doubling again the
  number of times domain 0 is probed.

  The implementation consists of adding two 'domainset_t' fields to
  'struct vm_domainset_iter' (and removing the 'di_n' counter). The first,
  'di_remain_mask', contains the domains still to be explored in the current phase,
  the first phase concerning only domains with more pages than 'free_min'
  ('di_minskip' true) and the second one concerning only domains previously under
  'free_min' ('di_minskip' false). The second, 'di_min_mask', holds the domains with
  fewer pages than 'free_min' encountered during the first phase, and serves as the
  reset value for 'di_remain_mask' when transitioning to the second phase.

  PR: 277476
  Fixes: e5818a53dbd2 ("Implement several enhancements to NUMA policies.")
  Fixes: 23984ce5cd24 ("Avoid resource deadlocks when one domain has exhausted its memory."...)
  MFC after: 10 days
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D51249
  (cherry picked from commit d440953942372ca275d0743a6e220631bde440ee)
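The two-mask, two-phase scheme described in this entry can be modelled compactly in user space. In this sketch (plain unsigned masks stand in for 'domainset_t', and all names are hypothetical), each domain is visited at most once: under-threshold domains seen in phase 1 are saved in a second mask, which seeds phase 2.

```c
#include <stdbool.h>

/* Illustrative model of the di_remain_mask / di_min_mask iteration. */
struct two_phase_iter {
	unsigned remain_mask;	/* domains still to visit this phase */
	unsigned min_mask;	/* under-free_min domains saved for phase 2 */
	bool minskip;		/* true while in phase 1 */
};

static int
tp_next(struct two_phase_iter *it, unsigned under_min_mask)
{
	for (;;) {
		while (it->remain_mask != 0) {
			int d = __builtin_ctz(it->remain_mask);

			it->remain_mask &= ~(1u << d);
			if (it->minskip &&
			    (under_min_mask & (1u << d)) != 0) {
				it->min_mask |= 1u << d; /* defer to phase 2 */
				continue;
			}
			return (d);
		}
		if (!it->minskip || it->min_mask == 0)
			return (-1);	/* every domain visited once */
		it->remain_mask = it->min_mask;	/* switch to phase 2 */
		it->min_mask = 0;
		it->minskip = false;
	}
}
```

With domains {0, 1, 2} and domain 1 under the threshold, the visit order is 0, 2 (phase 1), then 1 (phase 2), then exhaustion; no domain is probed twice.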
* Avoid waiting on physical allocations that can't possibly be satisfied
  Jason A. Harmening, 2025-09-19 (5 files, -24/+98)

  - Change vm_page_reclaim_contig[_domain] to return an errno instead of a boolean.
    0 indicates a successful reclaim, ENOMEM indicates lack of available memory to
    reclaim, and any other error (currently only ERANGE) indicates that reclamation
    is impossible for the specified address range. Change all callers to only follow
    up with vm_page_wait* in the ENOMEM case.

  - Introduce vm_domainset_iter_ignore(), which marks the specified domain as
    unavailable for further use by the iterator. Use this function to ignore domains
    that can't possibly satisfy a physical allocation request. Since WAITOK
    allocations run the iterators repeatedly, this avoids the possibility of
    infinitely spinning in domain iteration if no available domain can satisfy the
    allocation request.

  PR: 274252
  Reported by: kevans
  Tested by: kevans
  Reviewed by: markj
  Differential Revision: https://reviews.freebsd.org/D42706
  (cherry picked from commit 2619c5ccfe1f7889f0241916bd17d06340142b05)

  MFCed as a prerequisite for further MFC of VM domainset changes. Based on
  analysis, it would not hurt, and I have been using it in production for months
  now. Resolved the trivial conflict due to commit 718d1928f874 ("LinuxKPI: make
  linux_alloc_pages() honor __GFP_NORETRY") having been MFCed before this one.
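The first bullet in this entry changes a boolean into an errno precisely so callers can distinguish "wait and retry" from "can never succeed". A hedged sketch of that caller-side decision (the helper name is made up; only the errno semantics come from the entry):

```c
#include <errno.h>
#include <stdbool.h>

/* Sketch of the caller pattern: wait only when reclamation failed for
 * lack of free memory (ENOMEM); any other error (e.g. ERANGE, range can
 * never be satisfied) makes waiting pointless. */
static bool
should_wait_and_retry(int reclaim_error)
{
	switch (reclaim_error) {
	case 0:
		return (false);	/* reclaim succeeded; allocation proceeds */
	case ENOMEM:
		return (true);	/* memory may appear later: vm_page_wait*() */
	default:
		return (false);	/* e.g. ERANGE: impossible, fail now */
	}
}
```

The design point is that a boolean return conflates the ENOMEM and ERANGE cases, which is exactly what allowed the infinite spin the entry describes.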
* vm/vm_fault.c: rename vm_fault_hold_pages_e() to vm_fault_hold_pages()
  Konstantin Belousov, 2025-09-04 (2 files, -3/+3)
  (cherry picked from commit 5a308afeaf3d12562c6a61e06f112f730f4d7270)
* vm_fault.c: rename vm_fault_quick_hold_pages_e() to vm_fault_hold_pages_e()
  Konstantin Belousov, 2025-09-04 (2 files, -4/+4)
  (cherry picked from commit c18e41de7436f5f3c2516f48fabe6577e082547f)
* vm_fault: improve interface for vm_fault_quick_hold_pages()
  Konstantin Belousov, 2025-09-04 (2 files, -20/+63)
  (cherry picked from commit 041efb55ec8ba4e379fd1d0a75bd0f637e3d9676)
* vfs: Introduce VN_ISDEV() macro
  Dag-Erling Smørgrav, 2025-08-28 (1 file, -2/+1)
  (cherry picked from commit 567e6250c003eeb251b4bc8dbe60d2adabab2988)
* vm_page: Clear VM_ALLOC_NOCREAT in vm_page_grab_pflags()
  Mark Johnston, 2025-08-25 (1 file, -1/+1)

  Otherwise vm_page_grab_zero_partial() and vm_page_grab_pages() can pass it to
  vm_page_alloc_*(), which results in an assertion failure since that flag is
  meaningless when allocating a page:

  panic: invalid request 0x8400
  cpuid = 0
  time = 1754074745
  KDB: stack backtrace:
  db_trace_self_wrapper() at db_trace_self_wrapper+0x49/frame 0xfffffe00542859c0
  vpanic() at vpanic+0x1ea/frame 0xfffffe0054285b00
  panic() at panic+0x43/frame 0xfffffe0054285b60
  vm_page_alloc_domain_iter() at vm_page_alloc_domain_iter+0x720/frame 0xfffffe0054285be0
  vm_page_grab_zero_partial() at vm_page_grab_zero_partial+0x1d4/frame 0xfffffe0054285c90
  shm_fspacectl() at shm_fspacectl+0x1cd/frame 0xfffffe0054285d30
  kern_fspacectl() at kern_fspacectl+0x49f/frame 0xfffffe0054285db0
  sys_fspacectl() at sys_fspacectl+0x5b/frame 0xfffffe0054285e00
  amd64_syscall() at amd64_syscall+0x29c/frame 0xfffffe0054285f30
  fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0054285f30

  Reported by: syzkaller
  Reviewed by: alc, kib
  MFC after: 2 weeks
  Differential Revision: https://reviews.freebsd.org/D51692
  (cherry picked from commit 9a1b3303352beb44d48b8251b80656a316b7a2e9)
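The shape of this fix is simply masking a grab-layer-only flag out of the request before it reaches the allocator. A hedged illustration (the flag names and values here are invented; only the masking pattern reflects the fix):

```c
/* Hypothetical flag layout: one flag is meaningful only to the grab
 * layer and must be stripped before calling the page allocator, which
 * would otherwise panic on an "invalid request". */
#define ALLOC_WIRED	0x0020u	/* meaningful to the allocator */
#define ALLOC_ZERO	0x0040u	/* meaningful to the allocator */
#define ALLOC_NOCREAT	0x8000u	/* grab-layer only: do not create pages */

static unsigned
grab_to_alloc_flags(unsigned grabflags)
{
	return (grabflags & ~ALLOC_NOCREAT);
}
```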
* vnode_pager: Remove uses of DEBUG_VFS_LOCKS
  Mark Johnston, 2025-08-25 (1 file, -1/+1)

  This assertion can reasonably be checked when plain INVARIANTS is configured;
  there's no need to configure a separate option.

  Reviewed by: kib
  MFC after: 2 weeks
  Differential Revision: https://reviews.freebsd.org/D51699
  (cherry picked from commit d733bca67055b6a49b4f7abe78c39188900a7720)
* vm_page: Fix handling of empty bad memory addresses file
  Romain Tartière, 2025-08-11 (1 file, -3/+7)

  If a file with bad memory addresses is configured but that file is empty (0 lines,
  0 bytes), then when loading it we end up returning an end pointer that is just
  _before_ the start of the (empty) file content.

  Adjust the code to make it clear what pre-conditions are required to set the
  *list / *end pointers correctly, and explicitly set them to NULL when they are not
  met.

  Reported by: marklmi@yahoo.com
  Reviewed by: kib
  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D51717
  (cherry picked from commit f90940ce6eb71df40538c35a65d77ad3093c679a)
* swapongeom: destroy consumer/close vnode in case swaponsomething failed
  Konstantin Belousov, 2025-08-05 (1 file, -5/+12)
  (cherry picked from commit 2e3fa9395fc67e7369fda8d8b5c6613142d2a57d)
* sys_swapon: reject too small devices
  Konstantin Belousov, 2025-08-05 (1 file, -6/+9)
  (cherry picked from commit aa42e4984997c9d3aa5d30534bdaf760e613e97b)
* vm_page: Fix loading bad memory addresses from file
  Romain Tartière, 2025-08-02 (1 file, -1/+1)

  When loading bad memory addresses from a file, we are passed an end pointer that
  points to the first byte after the buffer. We want the buffer to be
  NUL-terminated (by changing the last byte to \0 if it is reasonable to do so), so
  adjust the end pointer to be on that byte.

  Approved by: kib, markj
  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D51433
  (cherry picked from commit 202f8bde836dc86627be2b5b98174d9a0fb2eaba)
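This entry and the related "empty file" fix above both turn on the same end-pointer convention. A small sketch of the intent (this models the convention only, not the actual vm_page parsing code): `end` points one past the last byte, so the byte to overwrite with NUL is `end - 1`, and `end == buf` means there is no content to terminate at all.

```c
#include <stddef.h>

/* Sketch: NUL-terminate a buffer in place given its one-past-the-end
 * pointer; report an empty buffer by returning NULL. */
static char *
terminate_buffer(char *buf, char *end)
{
	if (end == buf)
		return (NULL);	/* empty file: nothing to terminate */
	end[-1] = '\0';		/* last byte of the buffer, not *end */
	return (buf);
}
```

Writing to `*end` instead of `end[-1]`, or terminating an empty buffer, would be exactly the off-by-one/underflow bugs these two commits fix.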
* vm_pageout: Remove a volatile qualifier from some vm_domain members
  Mark Johnston, 2025-07-29 (1 file, -3/+3)

  These are always accessed using atomic(9) intrinsics, so they do not need the
  qualifier. No functional change intended.

  Reviewed by: alc, kib
  MFC after: 2 weeks
  Sponsored by: Modirum MDPay
  Sponsored by: Klara, Inc.
  Differential Revision: https://reviews.freebsd.org/D51322
  (cherry picked from commit fad79db405052f3faad7184ea2c8bfe9f92a700d)
* swap_pager: Convert swap-space-full flags to bools
  Mark Johnston, 2025-07-29 (1 file, -13/+10)

  No functional change intended.

  Reviewed by: alc, kib
  MFC after: 2 weeks
  Sponsored by: Modirum MDPay
  Sponsored by: Klara, Inc.
  Differential Revision: https://reviews.freebsd.org/D51321
  (cherry picked from commit 5c76e9f4579677482b0f96d4b3581f5e1ea2e072)
* vm_domainset: Print correct function in KASSERT()/panic()
  Olivier Certner, 2025-07-28 (1 file, -10/+6)

  Some messages in vm_domainset_iter_next() would wrongly refer to
  vm_domainset_iter_first(). While here, ensure that all assertion/panic messages
  use '__func__' to avoid this discrepancy in the future if code is
  moved/copy-pasted again.

  Reviewed by: markj, alc, kib
  MFC after: 10 days
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D51248
  (cherry picked from commit 1e78a6a6d85702b84f679712aac71f91e481e8f9)
* vm_pageout: Make the OOM killer less aggressive
  Mark Johnston, 2025-07-15 (1 file, -1/+7)

  A problem can arise if we enter a shortfall of clean, inactive pages. The PID
  controller will attempt to overshoot the reclamation target because repeated scans
  of the inactive queue are just moving pages to the laundry queue, so inactive
  queue scans fail to address an instantaneous page shortage.

  The laundry thread will launder pages and move them back to the head of the
  inactive queue to be reclaimed, but this does not happen immediately, so the
  integral term of the PID controller grows and the page daemon tries to reclaim
  pages in excess of the setpoint. However, the laundry thread will only launder
  enough pages to meet the shortfall: vm_laundry_target(), which is the same as the
  setpoint. Once the shortfall is addressed by the laundry thread, no more clean
  pages will appear in the inactive queue, but the page daemon may keep scanning
  dirty pages due to this overshooting. This can result in a spurious OOM kill.

  Thus, reset the sequence counter if we observe that there is no instantaneous free
  page shortage.

  Reviewed by: alc, kib
  MFC after: 2 weeks
  Sponsored by: Klara, Inc.
  Sponsored by: Modirum MDPay
  Differential Revision: https://reviews.freebsd.org/D51015
  (cherry picked from commit 78546fb0e3215c07f970c1bcbf15bba2f5852c76)
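The "reset the sequence counter" fix in this entry can be modelled as pacing logic: unproductive reclaim passes only count toward an OOM decision while a real free-page shortage persists. This sketch is purely illustrative (struct, names, and threshold are invented; only the reset-on-no-shortage rule comes from the entry):

```c
#include <stdbool.h>

/* Hypothetical OOM-vote pacing for a page daemon. */
struct oom_state {
	int seq;		/* consecutive unproductive passes */
	int kill_threshold;	/* passes before declaring OOM */
};

/* Record one reclaim pass; returns true if OOM should be declared. */
static bool
oom_note_pass(struct oom_state *os, int free_shortage, bool pass_reclaimed)
{
	if (free_shortage <= 0) {
		os->seq = 0;	/* no real shortage: forget past failures */
		return (false);
	}
	if (pass_reclaimed)
		os->seq = 0;
	else
		os->seq++;
	return (os->seq >= os->kill_threshold);
}
```

A transient dip that clears before the threshold is reached never triggers a kill, which is the spurious-OOM case the commit eliminates.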
* vm_pageout_scan_inactive: take a lock break
  Ryan Libby, 2025-07-10 (2 files, -1/+21)

  In vm_pageout_scan_inactive, release the object lock when we go to refill the scan
  batch queue so that someone else has a chance to acquire it. This improves access
  latency to the object when the pagedaemon is processing many consecutive pages
  from a single object, and also in any case avoids a hiccup during refill for the
  last touched object.

  Reviewed by: alc, markj (previous version)
  Sponsored by: Dell EMC Isilon
  Differential Revision: https://reviews.freebsd.org/D45288
  (cherry picked from commit a216e311a70cc87a5646f4306e36c60a51706699)
* pmap_growkernel(): do not panic immediately, optionally return the error
  Konstantin Belousov, 2025-06-26 (2 files, -3/+8)
  (cherry picked from commit ef9017aa174db96ee741b936b984f2b5d61dff9f)
* vm_fault: Defer marking COW pages valid
  Mark Johnston, 2025-05-13 (1 file, -3/+16)

  Suppose an object O has two shadow objects S1, S2 mapped into processes P1, P2.
  Suppose a page resident in O is mapped read-only into P1. Now suppose that P1
  writes to the page, triggering a COW fault: it allocates a new page in S1 and
  copies the page, then marks it valid. If the page in O was busy when initially
  looked up, P1 would have to release the map lock and sleep first. Then, after
  handling COW, P1 must re-check the map lookup because locks were dropped.

  Suppose the map indeed changed, so P1 has to retry the fault. At this point, the
  mapped page in O is shadowed by a valid page in S1. If P2 exits, S2 will be
  deallocated, resulting in a collapse of O into S1. In this case, because the
  mapped page is shadowed, P2 will free it, but that is illegal; this triggers a
  "freeing mapped page" assertion in invariants kernels.

  Fix the problem by deferring the vm_page_valid() call which marks the COW copy
  valid: only mark it once we know that the fault handler will succeed. It's okay to
  leave an invalid page in the top-level object; it will be freed when the fault is
  retried, and vm_object_collapse_scan() will similarly free invalid pages in the
  shadow object.

  Reviewed by: kib
  MFC after: 1 month
  Sponsored by: Innovate UK
  Differential Revision: https://reviews.freebsd.org/D49758
  (cherry picked from commit c98367641991019bac0e8cd55b70682171820534)
* vm_pageout: Disallow invalid values for act_scan_laundry_weight
  Mark Johnston, 2025-05-02 (1 file, -2/+17)

  PR: 234167
  MFC after: 2 weeks
  (cherry picked from commit d8b03c5904faff84656d3a84a25c2b37bcbf8075)
* vm_object: Make a comment more clear
  Mark Johnston, 2025-04-24 (1 file, -1/+1)

  Reviewed by: alc, kib
  MFC after: 2 weeks
  Differential Revision: https://reviews.freebsd.org/D49675
  (cherry picked from commit da05ca9ab655272569f4af99c86d2aff97a0d2ab)
* vm_object: Fix handling of wired map entries in vm_object_split()
  Mark Johnston, 2025-04-18 (2 files, -13/+14)

  Suppose a vnode is mapped with PROT_READ and MAP_PRIVATE, mlock() is called on the
  mapping, and then the vnode is truncated such that the last page of the mapping
  becomes invalid. The now-invalid page will be unmapped, but stays resident in the
  VM object to preserve the invariant that a range of pages mapped by a wired map
  entry is always resident. This invariant is checked by vm_object_unwire(), for
  example.

  Then, suppose that the mapping is upgraded to PROT_READ|PROT_WRITE. We will copy
  the invalid page into a new anonymous VM object. If the process then forks,
  vm_object_split() may then be called on the object. Upon encountering an invalid
  page, rather than moving it into the destination object, it is removed. However,
  this is wrong when the entry is wired, since the invalid page's wiring belongs to
  the map entry; this behaviour also violates the invariant mentioned above.

  Fix this by moving invalid pages into the destination object if the map entry is
  wired. In this case we must not dirty the page, so add a flag to
  vm_page_iter_rename() to control this.

  Reported by: syzkaller
  Reviewed by: dougm, kib
  MFC after: 2 weeks
  Differential Revision: https://reviews.freebsd.org/D49443
  (cherry picked from commit 43c1eb894a57ef30562a02708445c512610d4f02)
* vm_page_startup(): Clarify memory lowest, highest and size computation
  Olivier Certner, 2025-04-08 (1 file, -21/+20)

  Change the comment before this block of code, and separate the latter from the
  preceding one by an empty line. Move the loop on phys_avail[] to compute the
  minimum and maximum memory physical addresses closer to the initialization of
  'low_avail' and 'high_avail', so that it's immediately clear why the loop starts
  at 2 (and remove the related comment). While here, fuse the additional loop in the
  VM_PHYSSEG_DENSE case that is used to compute the exact physical memory size.

  This change suppresses one occurrence of detecting whether at least one of
  VM_PHYSSEG_DENSE or VM_PHYSSEG_SPARSE is defined at compile time, but there is
  still another one in PHYS_TO_VM_PAGE().

  Reviewed by: markj
  MFC after: 1 week
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D48632
  (cherry picked from commit 16317a174a5288f0377f8d40421b5c7821d57ac2)
* vm_phys_early_startup(): Panic if phys_avail[] is empty
  Olivier Certner, 2025-04-08 (1 file, -0/+3)

  Reviewed by: markj
  MFC after: 1 week
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D48631
  (cherry picked from commit 32e77bcdec5c034a9252876aa018f0bf34b36dbc)
* vm_phys_avail_split(): Tolerate split requests at boundaries
  Olivier Certner, 2025-04-08 (1 file, -7/+15)

  Previously, such requests would lead to a panic. The only caller so far
  (vm_phys_early_startup()) actually faces the case where some address can be one of
  the chunk's boundaries and has to test it by hand. Moreover, a later commit will
  introduce vm_phys_early_alloc_ex(), which will also have to deal with such
  boundary cases.

  Consequently, make this function handle boundaries by not splitting the chunk and
  returning EJUSTRETURN instead of 0 to distinguish this case from the "was split"
  result. While here, expand the panic message when the address to split is not in
  the passed chunk with available details.

  Reviewed by: markj
  MFC after: 1 week
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D48630
  (cherry picked from commit e1499bfff8b8c128d7b3d330f95e0c67d7c1fa77)
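The return convention this entry introduces can be sketched in isolation. Assumptions: `chunk_split()` is a made-up stand-in for vm_phys_avail_split(), and EJUSTRETURN is defined locally with FreeBSD's kernel value (-2) for portability; only the "boundary is a tolerated no-op, distinct from a real split" rule comes from the entry.

```c
#include <errno.h>

#ifndef EJUSTRETURN
#define EJUSTRETURN (-2)	/* FreeBSD kernel-internal pseudo-errno */
#endif

/* Sketch: splitting [start, end) at one of its own boundaries is a
 * no-op reported as EJUSTRETURN; 0 still means "was split". */
static int
chunk_split(long start, long end, long at)
{
	if (at < start || at > end)
		return (EINVAL);	/* not in this chunk at all */
	if (at == start || at == end)
		return (EJUSTRETURN);	/* boundary: nothing to split */
	/* ... would split [start, end) into [start, at) and [at, end) ... */
	return (0);
}
```

Distinguishing the no-op case lets callers skip the bookkeeping a real split requires, without treating a boundary request as an error.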
* vm_phys_avail_count(): Fix out-of-bounds accesses
  Olivier Certner, 2025-04-08 (1 file, -6/+4)

  On improper termination of phys_avail[] (two consecutive 0 entries starting at an
  even index), this function would (unnecessarily) continue searching for the
  termination markers even when the index was out of bounds.

  Reviewed by: markj
  MFC after: 1 week
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D48629
  (cherry picked from commit 291b7bf071e8b50f2b7877213b2d3307ae5d3e38)
* vm_phys: Check for overlap when adding a segment
  Olivier Certner, 2025-04-08 (1 file, -5/+13)

  Segments are passed by machine-dependent routines, so explicit checks will make
  debugging much easier on very weird machines or when someone is tweaking these
  machine-dependent routines. Additionally, this operation is not
  performance-sensitive.

  For the same reasons, test that we don't reach the maximum number of physical
  segments (the compile-time size of the internal storage) in production kernels
  (replaces the existing KASSERT()).

  Reviewed by: markj
  MFC after: 1 week
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D48628
  (cherry picked from commit 8a14ddcc1d8e4384d8ad77c5536c916c6e9a7d65)
* vm_phys_add_seg(): Check for bad segments, allow empty ones
  Olivier Certner, 2025-04-08 (2 files, -6/+13)

  A bad specification is one where 'start' is strictly greater than 'end', or where
  the bounds are not page aligned. The latter was already tested under INVARIANTS,
  but now will be tested on production kernels as well. The reason is that
  vm_phys_early_startup() pours early segments into the final phys_segs[] array via
  vm_phys_add_seg(), but vm_phys_early_add_seg() did not check their validity.
  Checking segments once and for all in vm_phys_add_seg() avoids duplicating
  validity tests and is possible since early segments are not used before being
  poured into phys_segs[]. Finally, vm_phys_add_seg() is not performance critical.

  Allow empty segments and discard them (silently, unless 'bootverbose' is true), as
  vm_page_startup() was testing for this case before calling vm_phys_add_seg(), and
  we felt the same test in vm_phys_early_startup() was due before calling
  vm_phys_add_seg(). As a consequence, remove the empty segment test from
  vm_page_startup().

  Reviewed by: markj
  MFC after: 1 week
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D48627
  (cherry picked from commit f30309abcce4cec891413da5cba2db92dd6ab0d7)
* vm_phys_avail_check(): Check index parity, fix panic messages
  Olivier Certner, 2025-04-08 (1 file, -4/+6)

  The passed index must be the start of a chunk in phys_avail[], so it must be even.
  Test for that and print a separate panic message.

  While here, fix the panic messages: in one, the wrong chunk boundary was printed,
  and in another, the desired rather than the actual condition was printed, possibly
  leading to confusion.

  Reviewed by: markj
  MFC after: 1 week
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D48626
  (cherry picked from commit 125ef4e041fed40fed2d00b0ddd90fa0eb7b6ac3)