aboutsummaryrefslogtreecommitdiff
path: root/sys/sys/file.h
Commit message (Collapse)AuthorAgeFilesLines
* As a step towards the elimination of PG_CACHED pages, rework the handlingMark Johnston2015-09-301-2/+0
| | | | | | | | | | | | | | | | | | | | | | | of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to the head of the inactive queue instead of being cached. This affects the implementation of POSIX_FADV_NOREUSE as well, since it works by applying POSIX_FADV_DONTNEED to file ranges after they have been read or written. At that point the corresponding buffers may still be dirty, so the previous implementation would coalesce successive ranges and apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the dirty buffers would eventually be cached. To preserve this behaviour in an efficient manner, this change adds a new buf flag, B_NOREUSE, which causes the pages backing a VMIO buf to be placed at the head of the inactive queue when the buf is released. POSIX_FADV_NOREUSE then works by setting this flag in bufs that underlie the specified range. Reviewed by: alc, kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3726 Notes: svn path=/head/; revision=288431
* Add a new file operations hook for mmap operations. File type-specificJohn Baldwin2015-06-041-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | logic is now placed in the mmap hook implementation rather than requiring it to be placed in sys/vm/vm_mmap.c. This hook allows new file types to support mmap() as well as potentially allowing mmap() for existing file types that do not currently support any mapping. The vm_mmap() function is now split up into two functions. A new vm_mmap_object() function handles the "back half" of vm_mmap() and accepts a referenced VM object to map rather than a (handle, handle_type) tuple. vm_mmap() is now reduced to converting a (handle, handle_type) tuple to a a VM object and then calling vm_mmap_object() to handle the actual mapping. The vm_mmap() function remains for use by other parts of the kernel (e.g. device drivers and exec) but now only supports mapping vnodes, character devices, and anonymous memory. The mmap() system call invokes vm_mmap_object() directly with a NULL object for anonymous mappings. For mappings using a file descriptor, the descriptors fo_mmap() hook is invoked instead. The fo_mmap() hook is responsible for performing type-specific checks and adjustments to arguments as well as possibly modifying mapping parameters such as flags or the object offset. The fo_mmap() hook routines then call vm_mmap_object() to handle the actual mapping. The fo_mmap() hook is optional. If it is not set, then fo_mmap() will fail with ENODEV. A fo_mmap() hook is implemented for regular files, character devices, and shared memory objects (created via shm_open()). While here, consistently use the VM_PROT_* constants for the vm_prot_t type for the 'prot' variable passed to vm_mmap() and vm_mmap_object() as well as the vm_mmap_vnode() and vm_mmap_cdev() helper routines. Previously some places were using the mmap()-specific PROT_* constants instead. While this happens to work because PROT_xx == VM_PROT_xx, using VM_PROT_* is more correct. Differential Revision: https://reviews.freebsd.org/D2658 Reviewed by: alc (glanced over), kib MFC after: 1 month Sponsored by: Chelsio Notes: svn path=/head/; revision=283998
* Implement eventfd system call.Dmitry Chagin2015-05-241-0/+1
| | | | | | | | Differential Revision: https://reviews.freebsd.org/D1094 In collaboration with: Jilles Tjoelker Notes: svn path=/head/; revision=283444
* filedesc: simplify fget_unlocked & friendsMateusz Guzik2015-02-171-0/+2
| | | | | | | | | | | | | | | | | Introduce fget_fcntl which performs appropriate checks when needed. This removes a branch from fget_unlocked. Introduce fget_mmap dealing with cap_rights_to_vmprot conversion. This removes a branch from _fget. Modify fget_unlocked to pass sequence counter to interested callers so that they can perform their own checks and make sure the result was otained from stable & current state. Reviewed by: silence on -hackers Notes: svn path=/head/; revision=278930
* Remove SF_KQUEUE code. This code was developed at Netflix, but was notGleb Smirnoff2014-11-111-7/+3
| | | | | | | | | | | | | ever used. It didn't go into stable/10, neither was documented. It might be useful, but we collectively decided to remove it, rather leave it abandoned and unmaintained. It is removed in one single commit, so restoring it should be easy, if anyone wants to reopen this idea. Sponsored by: Netflix Notes: svn path=/head/; revision=274403
* Add a new fo_fill_kinfo fileops method to add type-specific information toJohn Baldwin2014-09-221-0/+14
| | | | | | | | | | | | | | | | | struct kinfo_file. - Move the various fill_*_info() methods out of kern_descrip.c and into the various file type implementations. - Rework the support for kinfo_ofile to generate a suitable kinfo_file object for each file and then convert that to a kinfo_ofile structure rather than keeping a second, different set of code that directly manipulates type-specific file information. - Remove the shm_path() and ksem_info() layering violations. Differential Revision: https://reviews.freebsd.org/D775 Reviewed by: kib, glebius (earlier version) Notes: svn path=/head/; revision=271976
* Fix various issues with invalid file operations:John Baldwin2014-09-121-1/+5
| | | | | | | | | | | | | | | | | | | | | - Add invfo_rdwr() (for read and write), invfo_ioctl(), invfo_poll(), and invfo_kqfilter() for use by file types that do not support the respective operations. Home-grown versions of invfo_poll() were universally broken (they returned an errno value, invfo_poll() uses poll_no_poll() to return an appropriate event mask). Home-grown ioctl routines also tended to return an incorrect errno (invfo_ioctl returns ENOTTY). - Use the invfo_*() functions instead of local versions for unsupported file operations. - Reorder fileops members to match the order in the structure definition to make it easier to spot missing members. - Add several missing methods to linuxfileops used by the OFED shim layer: fo_write(), fo_truncate(), fo_kqfilter(), and fo_stat(). Most of these used invfo_*(), but a dummy fo_stat() implementation was added. Notes: svn path=/head/; revision=271489
* - Remove socket file operations declaration from sys/file.h.Gleb Smirnoff2014-08-261-14/+1
| | | | | | | | | | | - Make them static in sys_socket.c. - Provide generic invfo_truncate() instead of soo_truncate(). Sponsored by: Netflix Sponsored by: Nginx, Inc. Notes: svn path=/head/; revision=270664
* Fix up races with f_seqcount handling.Mateusz Guzik2014-08-261-1/+2
| | | | | | | | | | | | It was possible that the kernel would overwrite user-supplied hint. Abuse vnode lock for this purpose. In collaboration with: kib MFC after: 1 week Notes: svn path=/head/; revision=270648
* Migrate the sendfile_sync structure into a public(ish) API in preparationAdrian Chadd2013-12-011-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | for extending and reusing it. The sendfile_sync wrapper is mostly just a "mbuf transaction" wrapper, used to indicate that the backing store for a group of mbufs has completed. It's only being used by sendfile for now and it's only implementing a sleep/wakeup rendezvous. However, there are other potential signaling paths (kqueue) and other potential uses (socket zero-copy write) where the same mechanism would also be useful. So, with that in mind: * extract the sendfile_sync code out into sf_sync_*() methods * teach the sf_sync_alloc method about the current config flag - it will eventually know about kqueue. * move the sendfile_sync code out of do_sendfile() - the only thing it now knows about is the sfs pointer. The guts of the sync rendezvous (setup, rendezvous/wait, free) is now done in the syscall wrapper. * .. and teach the 32-bit compat sendfile call the same. This should be a no-op. It's primarily preparation work for teaching the sendfile_sync about kqueue notification. Tested: * Peter Holm's sendfile stress / regression scripts Sponsored by: Netflix, Inc. Notes: svn path=/head/; revision=258788
* Revert r255672, it has some serious flaws, leaking file references etc.Roman Divacky2013-09-181-2/+0
| | | | | | | Approved by: re (delphij) Notes: svn path=/head/; revision=255675
* Implement epoll support in Linuxulator. This is a tiny wrapper around kqueueRoman Divacky2013-09-181-0/+2
| | | | | | | | | | | | | | | | | to implement epoll subset of functionality. The kqueue user data are 32bit on i386 which is not enough for epoll user data so this patch overrides kqueue fileops to maintain enough space in struct file. Initial patch developed by me in 2007 and then extended and finished by Yuri Victorovich. Approved by: re (delphij) Sponsored by: Google Summer of Code Submitted by: Yuri Victorovich <yuri at rawbw dot com> Tested by: Yuri Victorovich <yuri at rawbw dot com> Notes: svn path=/head/; revision=255672
* Change the cap_rights_t type from uint64_t to a structure that we can extendPawel Jakub Dawidek2013-09-051-10/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. #define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t *cap_rights_init(cap_rights_t *rights, ...); void cap_rights_set(cap_rights_t *rights, ...); void cap_rights_clear(cap_rights_t *rights, ...); bool cap_rights_is_set(const cap_rights_t *rights, ...); bool cap_rights_is_valid(const cap_rights_t *rights); void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src); void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src); bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL); Providing several rights that belongs to the same array's element this way is correct, but is not advised. It should only be used for aliases definition. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation Notes: svn path=/head/; revision=255219
* Make the seek a method of the struct fileops.Konstantin Belousov2013-08-211-0/+11
| | | | | | | | Tested by: pho Sponsored by: The FreeBSD Foundation Notes: svn path=/head/; revision=254602
* Restore the previous sendfile(2) behaviour on the block devices.Konstantin Belousov2013-08-161-0/+2
| | | | | | | | | | Provide valid .fo_sendfile method for several missed struct fileops. Reviewed by: glebius Sponsored by: The FreeBSD Foundation Notes: svn path=/head/; revision=254415
* Make sendfile() a method in the struct fileops. Currently onlyGleb Smirnoff2013-08-151-0/+16
| | | | | | | | | | | vnode backed file descriptors have this method implemented. Reviewed by: kib Sponsored by: Nginx, Inc. Sponsored by: Netflix Notes: svn path=/head/; revision=254356
* Merge Capsicum overhaul:Pawel Jakub Dawidek2013-03-021-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Capability is no longer separate descriptor type. Now every descriptor has set of its own capability rights. - The cap_new(2) system call is left, but it is no longer documented and should not be used in new code. - The new syscall cap_rights_limit(2) should be used instead of cap_new(2), which limits capability rights of the given descriptor without creating a new one. - The cap_getrights(2) syscall is renamed to cap_rights_get(2). - If CAP_IOCTL capability right is present we can further reduce allowed ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed ioctls can be retrived with cap_ioctls_get(2) syscall. - If CAP_FCNTL capability right is present we can further reduce fcntls that can be used with the new cap_fcntls_limit(2) syscall and retrive them with cap_fcntls_get(2). - To support ioctl and fcntl white-listing the filedesc structure was heavly modified. - The audit subsystem, kdump and procstat tools were updated to recognize new syscalls. - Capability rights were revised and eventhough I tried hard to provide backward API and ABI compatibility there are some incompatible changes that are described in detail below: CAP_CREATE old behaviour: - Allow for openat(2)+O_CREAT. - Allow for linkat(2). - Allow for symlinkat(2). CAP_CREATE new behaviour: - Allow for openat(2)+O_CREAT. Added CAP_LINKAT: - Allow for linkat(2). ABI: Reuses CAP_RMDIR bit. - Allow to be target for renameat(2). Added CAP_SYMLINKAT: - Allow for symlinkat(2). Removed CAP_DELETE. Old behaviour: - Allow for unlinkat(2) when removing non-directory object. - Allow to be source for renameat(2). Removed CAP_RMDIR. Old behaviour: - Allow for unlinkat(2) when removing directory. Added CAP_RENAMEAT: - Required for source directory for the renameat(2) syscall. Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR): - Allow for unlinkat(2) on any object. - Required if target of renameat(2) exists and will be removed by this call. Removed CAP_MAPEXEC. CAP_MMAP old behaviour: - Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and PROT_WRITE. CAP_MMAP new behaviour: - Allow for mmap(2)+PROT_NONE. Added CAP_MMAP_R: - Allow for mmap(PROT_READ). Added CAP_MMAP_W: - Allow for mmap(PROT_WRITE). Added CAP_MMAP_X: - Allow for mmap(PROT_EXEC). Added CAP_MMAP_RW: - Allow for mmap(PROT_READ | PROT_WRITE). Added CAP_MMAP_RX: - Allow for mmap(PROT_READ | PROT_EXEC). Added CAP_MMAP_WX: - Allow for mmap(PROT_WRITE | PROT_EXEC). Added CAP_MMAP_RWX: - Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC). Renamed CAP_MKDIR to CAP_MKDIRAT. Renamed CAP_MKFIFO to CAP_MKFIFOAT. Renamed CAP_MKNODE to CAP_MKNODEAT. CAP_READ old behaviour: - Allow pread(2). - Disallow read(2), readv(2) (if there is no CAP_SEEK). CAP_READ new behaviour: - Allow read(2), readv(2). - Disallow pread(2) (CAP_SEEK was also required). CAP_WRITE old behaviour: - Allow pwrite(2). - Disallow write(2), writev(2) (if there is no CAP_SEEK). CAP_WRITE new behaviour: - Allow write(2), writev(2). - Disallow pwrite(2) (CAP_SEEK was also required). Added convinient defines: #define CAP_PREAD (CAP_SEEK | CAP_READ) #define CAP_PWRITE (CAP_SEEK | CAP_WRITE) #define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ) #define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE) #define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL) #define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W) #define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X) #define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X) #define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X) #define CAP_RECV CAP_READ #define CAP_SEND CAP_WRITE #define CAP_SOCK_CLIENT \ (CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \ CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN) #define CAP_SOCK_SERVER \ (CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \ CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \ CAP_SETSOCKOPT | CAP_SHUTDOWN) Added defines for backward API compatibility: #define CAP_MAPEXEC CAP_MMAP_X #define CAP_DELETE CAP_UNLINKAT #define CAP_MKDIR CAP_MKDIRAT #define CAP_RMDIR CAP_UNLINKAT #define CAP_MKFIFO CAP_MKFIFOAT #define CAP_MKNOD CAP_MKNODAT #define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER) Sponsored by: The FreeBSD Foundation Reviewed by: Christoph Mallon <christoph.mallon@gmx.de> Many aspects discussed with: rwatson, benl, jonathan ABI compatibility discussed with: kib Notes: svn path=/head/; revision=247602
* Do not force a writer to the devfs file to drain the buffer writes.Konstantin Belousov2012-12-231-1/+2
| | | | | | | | Requested and tested by: Ian Lepore <freebsd@damnhippie.dyndns.org> MFC after: 2 weeks Notes: svn path=/head/; revision=244643
* Unbreak handling of descriptors opened with O_EXEC by fexecve(2).Mateusz Guzik2012-07-081-0/+2
| | | | | | | | | | | | | | While here return EBADF for descriptors opened for writing (previously it was ETXTBSY). Add fgetvp_exec function which performs appropriate checks. PR: kern/169651 In collaboration with: kib Approved by: trasz (mentor) MFC after: 1 week Notes: svn path=/head/; revision=238220
* Extend the KPI to lock and unlock f_offset member of struct file. ItKonstantin Belousov2012-07-021-1/+16
| | | | | | | | | | | | | | | | | | | | | now fully encapsulates all accesses to f_offset, and extends f_offset locking to other consumers that need it, in particular, to lseek() and variants of getdirentries(). Ensure that on 32bit architectures f_offset, which is 64bit quantity, always read and written under the mtxpool protection. This fixes apparently easy to trigger race when parallel lseek()s or lseek() and read/write could destroy file offset. The already broken ABI emulations, including iBCS and SysV, are not converted (yet). Tested by: pho No objections from: jhb MFC after: 3 weeks Notes: svn path=/head/; revision=238029
* Further refine the implementation of POSIX_FADV_NOREUSE.John Baldwin2012-06-191-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | First, extend the changes in r230782 to better handle the common case of using NOREUSE with sequential reads. A NOREUSE file descriptor will now track the last implicit DONTNEED request it made as a result of a NOREUSE read. If a subsequent NOREUSE read is adjacent to the previous range, it will apply the DONTNEED request to the entire range of both the previous read and the current read. The effect is that each read of a file accessed sequentially will apply the DONTNEED request to the entire range that has been read. This allows NOREUSE to properly handle misaligned reads by flushing each buffer to cache once it has been completely read. Second, apply the same changes made to read(2) by r230782 and this change to writes. This provides much better performance in the sequential write case as it allows writes to still be clustered. It also provides much better performance for misaligned writes. It does mean that NOREUSE will be generally ineffective for non-sequential writes as the current implementation relies on a future NOREUSE write's implicit DONTNEED request to flush the dirty buffer from the current write. MFC after: 2 weeks Notes: svn path=/head/; revision=237274
* Remove MALLOC_DECLAREs of nonexisting malloc-pools.Ed Schouten2011-11-061-4/+0
| | | | | | | | After careful grepping, it seems none of these pools can be found in our source tree. They are not in use, nor are they defined. Notes: svn path=/head/; revision=227267
* Add the posix_fadvise(2) system call. It is somewhat similar toJohn Baldwin2011-11-041-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | madvise(2) except that it operates on a file descriptor instead of a memory region. It is currently only supported on regular files. Just as with madvise(2), the advice given to posix_fadvise(2) can be divided into two types. The first type provide hints about data access patterns and are used in the file read and write routines to modify the I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are thus filesystem independent. Note that to ease implementation (and since this API is only advisory anyway), only a single non-normal range is allowed per file descriptor. The second type of hints are used to hint to the OS that data will or will not be used. These hints are implemented via a new VOP_ADVISE(). A default implementation is provided which does nothing for the WILLNEED request and attempts to move any clean pages to the cache page queue for the DONTNEED request. This latter case required two other changes. First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests vinvalbuf() to only flush clean buffers for the vnode from the buffer cache and to not remove any backing pages from the vnode. This is used to ensure clean pages are not wired into the buffer cache before attempting to move them to the cache page queue. The second change adds a new vm_object_page_cache() method. This method is somewhat similar to vm_object_page_remove() except that instead of freeing each page in the specified range, it attempts to move clean pages to the cache queue if possible. To preserve the ABI of struct file, the f_cdevpriv pointer is now reused in a union to point to the currently active advice region if one is present for regular files. Reviewed by: jilles, kib, arch@ Approved by: re (kib) MFC after: 1 month Notes: svn path=/head/; revision=227070
* Add experimental support for process descriptorsJonathan Anderson2011-08-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | A "process descriptor" file descriptor is used to manage processes without using the PID namespace. This is required for Capsicum's Capability Mode, where the PID namespace is unavailable. New system calls pdfork(2) and pdkill(2) offer the functional equivalents of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote process for debugging purposes. The currently-unimplemented pdwait(2) will, in the future, allow querying rusage/exit status. In the interim, poll(2) may be used to check (and wait for) process termination. When a process is referenced by a process descriptor, it does not issue SIGCHLD to the parent, making it suitable for use in libraries---a common scenario when using library compartmentalisation from within large applications (such as web browsers). Some observers may note a similarity to Mach task ports; process descriptors provide a subset of this behaviour, but in a UNIX style. This feature is enabled by "options PROCDESC", but as with several other Capsicum kernel features, is not enabled by default in GENERIC 9.0. Reviewed by: jhb, kib Approved by: re (kib), mentor (rwatson) Sponsored by: Google Inc Notes: svn path=/head/; revision=224987
* Add the fo_chown and fo_chmod methods to struct fileops and use themKonstantin Belousov2011-08-161-0/+27
| | | | | | | | | | | | | to implement fchown(2) and fchmod(2) support for several file types that previously lacked it. Add MAC entries for chown/chmod done on posix shared memory and (old) in-kernel posix semaphores. Based on the submission by: glebius Reviewed by: rwatson Approved by: re (bz) Notes: svn path=/head/; revision=224914
* Second-to-last commit implementing Capsicum capabilities in the FreeBSDRobert Watson2011-08-111-8/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kernel for FreeBSD 9.0: Add a new capability mask argument to fget(9) and friends, allowing system call code to declare what capabilities are required when an integer file descriptor is converted into an in-kernel struct file *. With options CAPABILITIES compiled into the kernel, this enforces capability protection; without, this change is effectively a no-op. Some cases require special handling, such as mmap(2), which must preserve information about the maximum rights at the time of mapping in the memory map so that they can later be enforced in mprotect(2) -- this is done by narrowing the rights in the existing max_protection field used for similar purposes with file permissions. In namei(9), we assert that the code is not reached from within capability mode, as we're not yet ready to enforce namespace capabilities there. This will follow in a later commit. Update two capability names: CAP_EVENT and CAP_KEVENT become CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they represent. Approved by: re (bz) Submitted by: jonathan Sponsored by: Google Inc Notes: svn path=/head/; revision=224778
* Rework _fget to accept capability parameters.Jonathan Anderson2011-07-051-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | This new version of _fget() requires new parameters: - cap_rights_t needrights the rights that we expect the capability's rights mask to include (e.g. CAP_READ if we are going to read from the file) - cap_rights_t *haverights used to return the capability's rights mask (ignored if NULL) - u_char *maxprotp the maximum mmap() rights (e.g. VM_PROT_READ) that can be permitted (only used if we are going to mmap the file; ignored if NULL) - int fget_flags FGET_GETCAP if we want to return the capability itself, rather than the underlying object which it wraps Approved by: mentor (rwatson), re (Capsicum blanket) Sponsored by: Google Inc Notes: svn path=/head/; revision=223785
* Define cap_rights_t and DTYPE_CAPABILITY, which are required toJonathan Anderson2011-07-011-0/+1
| | | | | | | | | implement Capsicum capabilities. Approved by: mentor (rwatson), re (bz) Notes: svn path=/head/; revision=223710
* - Merge changes to the base system to support OFED. These includeJeff Roberson2011-03-211-0/+1
| | | | | | | | a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND, and other miscellaneous small features. Notes: svn path=/head/; revision=219819
* Apply band-aid around function-like macro fdrop() without turning it intoJung-uk Kim2010-06-111-1/+8
| | | | | | | | | | | a real (inline) function or applying void casting for all its consumers. In most of places, the "return value" is not checked nor assigned, which causes too many warnings for some smart compilers, i.e., clang. Found by: clang Notes: svn path=/head/; revision=209082
* Use ANSI declarations instead of K&R.Ed Schouten2009-12-281-39/+14
| | | | Notes: svn path=/head/; revision=201149
* style(9)David E. O'Brien2009-01-011-1/+1
| | | | | | | Verfied with: svn diff -x -Bbw file.h showing empty diff Notes: svn path=/head/; revision=186669
* Integrate the new MPSAFE TTY layer to the FreeBSD operating system.Ed Schouten2008-08-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The last half year I've been working on a replacement TTY layer for the FreeBSD kernel. The new TTY layer was designed to improve the following: - Improved driver model: The old TTY layer has a driver model that is not abstract enough to make it friendly to use. A good example is the output path, where the device drivers directly access the output buffers. This means that an in-kernel PPP implementation must always convert network buffers into TTY buffers. If a PPP implementation would be built on top of the new TTY layer (still needs a hooks layer, though), it would allow the PPP implementation to directly hand the data to the TTY driver. - Improved hotplugging: With the old TTY layer, it isn't entirely safe to destroy TTY's from the system. This implementation has a two-step destructing design, where the driver first abandons the TTY. After all threads have left the TTY, the TTY layer calls a routine in the driver, which can be used to free resources (unit numbers, etc). The pts(4) driver also implements this feature, which means posix_openpt() will now return PTY's that are created on the fly. - Improved performance: One of the major improvements is the per-TTY mutex, which is expected to improve scalability when compared to the old Giant locking. Another change is the unbuffered copying to userspace, which is both used on TTY device nodes and PTY masters. Upgrading should be quite straightforward. Unlike previous versions, existing kernel configuration files do not need to be changed, except when they reference device drivers that are listed in UPDATING. Obtained from: //depot/projects/mpsafetty/... Approved by: philip (ex-mentor) Discussed: on the lists, at BSDCan, at the DevSummit Sponsored by: Snow B.V., the Netherlands dcons(4) fixed by: kan Notes: svn path=/head/; revision=181905
* Rework the lifetime management of the kernel implementation of POSIXJohn Baldwin2008-06-271-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | semaphores. Specifically, semaphores are now represented as new file descriptor type that is set to close on exec. This removes the need for all of the manual process reference counting (and fork, exec, and exit event handlers) as the normal file descriptor operations handle all of that for us nicely. It is also suggested as one possible implementation in the spec and at least one other OS (OS X) uses this approach. Some bugs that were fixed as a result include: - References to a named semaphore whose name is removed still work after the sem_unlink() operation. Prior to this patch, if a semaphore's name was removed, valid handles from sem_open() would get EINVAL errors from sem_getvalue(), sem_post(), etc. This fixes that. - Unnamed semaphores created with sem_init() were not cleaned up when a process exited or exec'd. They were only cleaned up if the process did an explicit sem_destroy(). This could result in a leak of semaphore objects that could never be cleaned up. - On the other hand, if another process guessed the id (kernel pointer to 'struct ksem' of an unnamed semaphore (created via sem_init)) and had write access to the semaphore based on UID/GID checks, then that other process could manipulate the semaphore via sem_destroy(), sem_post(), sem_wait(), etc. - As part of the permission check (UID/GID), the umask of the proces creating the semaphore was not honored. Thus if your umask denied group read/write access but the explicit mode in the sem_init() call allowed it, the semaphore would be readable/writable by other users in the same group, for example. This includes access via the previous bug. - If the module refused to unload because there were active semaphores, then it might have deregistered one or more of the semaphore system calls before it noticed that there was a problem. I'm not sure if this actually happened as the order that modules are discovered by the kernel linker depends on how the actual .ko file is linked. One can make the order deterministic by using a single module with a mod_event handler that explicitly registers syscalls (and deregisters during unload after any checks). This also fixes a race where even if the sem_module unloaded first it would have destroyed locks that the syscalls might be trying to access if they are still executing when they are unloaded. XXX: By the way, deregistering system calls doesn't do any blocking to drain any threads from the calls. - Some minor fixes to errno values on error. For example, sem_init() isn't documented to return ENFILE or EMFILE if we run out of semaphores the way that sem_open() can. Instead, it should return ENOSPC in that case. Other changes: - Kernel semaphores now use a hash table to manage the namespace of named semaphores nearly in a similar fashion to the POSIX shared memory object file descriptors. Kernel semaphores can now also have names longer than 14 chars (up to MAXPATHLEN) and can include subdirectories in their pathname. - The UID/GID permission checks for access to a named semaphore are now done via vaccess() rather than a home-rolled set of checks. - Now that kernel semaphores have an associated file object, the various MAC checks for POSIX semaphores accept both a file credential and an active credential. There is also a new posixsem_check_stat() since it is possible to fstat() a semaphore file descriptor. - A small set of regression tests (using the ksem API directly) is present in src/tools/regression/posixsem. Reported by: kris (1) Tested by: kris Reviewed by: rwatson (lightly) MFC after: 1 month Notes: svn path=/head/; revision=180059
* Use _WANT_FILE to make struct file visible from userland. This isPawel Jakub Dawidek2008-05-261-1/+3
| | | | | | | | | similar to _WANT_UCRED and _WANT_PRISON and seems to be much nicer than defining _KERNEL. It is also needed for my sys/refcount.h change going in soon. Notes: svn path=/head/; revision=179320
* Replace direct atomic operation for the file refcount witht theAttilio Rao2008-05-251-3/+5
| | | | | | | | | | refcount interface. It also introduces the correct usage of memory barriers, as sometimes fdrop() and fhold() are used with shared locks, which don't use any release barrier. Notes: svn path=/head/; revision=179305
* Implement the per-open file data for the cdev.Konstantin Belousov2008-05-211-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch does not change the cdevsw KBI. Management of the data is provided by the functions int devfs_set_cdevpriv(void *priv, cdevpriv_dtr_t dtr); int devfs_get_cdevpriv(void **datap); void devfs_clear_cdevpriv(void); All of the functions are supposed to be called from the cdevsw method contexts. - devfs_set_cdevpriv assigns the priv as private data for the file descriptor which is used to initiate currently performed driver operation. dtr is the function that will be called when either the last refernce to the file goes away, the device is destroyed or devfs_clear_cdevpriv is called. - devfs_get_cdevpriv is the obvious accessor. - devfs_clear_cdevpriv allows to clear the private data for the still open file. Implementation keeps the driver-supplied pointers in the struct cdev_privdata, that is referenced both from the struct file and struct cdev, and cannot outlive any of the referee. Man pages will be provided after the KPI stabilizes. Reviewed by: jhb Useful suggestions from: jeff, antoine Debugging help and tested by: pho MFC after: 1 month Notes: svn path=/head/; revision=179175
* Add a new file descriptor type for IPC shared memory objects and use it toJohn Baldwin2008-01-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | implement shm_open(2) and shm_unlink(2) in the kernel: - Each shared memory file descriptor is associated with a swap-backed vm object which provides the backing store. Each descriptor starts off with a size of zero, but the size can be altered via ftruncate(2). The shared memory file descriptors also support fstat(2). read(2), write(2), ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared memory file descriptors. - shm_open(2) and shm_unlink(2) are now implemented as system calls that manage shared memory file descriptors. The virtual namespace that maps pathnames to shared memory file descriptors is implemented as a hash table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash of the pathname. - As an extension, the constant 'SHM_ANON' may be specified in place of the path argument to shm_open(2). In this case, an unnamed shared memory file descriptor will be created similar to the IPC_PRIVATE key for shmget(2). Note that the shared memory object can still be shared among processes by sharing the file descriptor via fork(2) or sendmsg(2), but it is unnamed. This effectively serves to implement the getmemfd() idea bandied about the lists several times over the years. - The backing store for shared memory file descriptors are garbage collected when they are not referenced by any open file descriptors or the shm_open(2) virtual namespace. Submitted by: dillon, peter (previous versions) Submitted by: rwatson (I based this on his version) Reviewed by: alc (suggested converting getmemfd() to shm_open()) Notes: svn path=/head/; revision=175164
* Make ftruncate a 'struct file' operation rather than a vnode operation.John Baldwin2008-01-071-0/+16
| | | | | | | | | | | | | | | | | This makes it possible to support ftruncate() on non-vnode file types in the future. - 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on a given file descriptor. - ftruncate() moves to kern/sys_generic.c and now just fetches a file object and invokes fo_truncate(). - The vnode-specific portions of ftruncate() move to vn_truncate() in vfs_vnops.c which implements fo_truncate() for vnode file types. - Non-vnode file types return EINVAL in their fo_truncate() method. Submitted by: rwatson Notes: svn path=/head/; revision=175140
* Remove explicit locking of struct file.Jeff Roberson2007-12-301-60/+31
| | | | | | | | | | | | | | | | - Introduce a finit() which is used to initailize the fields of struct file in such a way that the ops vector is only valid after the data, type, and flags are valid. - Protect f_flag and f_count with atomic operations. - Remove the global list of all files and associated accounting. - Rewrite the unp garbage collection such that it no longer requires the global list of all files and instead uses a list of all unp sockets. - Mark sockets in the accept queue so we don't incorrectly gc them. Tested by: kris, pho Notes: svn path=/head/; revision=174988
* - Close a race between enumerating UNIX domain socket pcb structures viaJohn Baldwin2007-01-051-0/+1
| | | | | | | | | | | | | | | | | | | | | | sysctl and socket teardown by adding a reference count to the UNIX domain pcb object and fixing the sysctl that enumerates unpcbs to grab a reference on each unpcb while it builds the list to copy out to userland. - Close a race between UNIX domain pcb garbage collection (unp_gc()) and file descriptor teardown (fdrop()) by adding a new garbage collection flag FWAIT. unp_gc() sets FWAIT while it walks the message buffers in a UNIX domain socket looking for nested file descriptor references and clears the flag when it is finished. fdrop() checks to see if the flag is set on a file descriptor whose refcount just dropped to 0 and waits for unp_gc() to clear the flag before completely destroying the file descriptor. MFC after: 1 week Reviewed by: rwatson Submitted by: ups Hopefully makes the panics go away: mx1 Notes: svn path=/head/; revision=165810
* Allow concurrent read(2)/readv(2) access to a file.Paul Saab2006-05-161-1/+7
| | | | | | | | | | | Lock file offset against multiple read calls. Submitted by: ups Obtained from: Yahoo! MFC after: 2 weeks Notes: svn path=/head/; revision=158642
* Bring in experimental kernel support for POSIX message queue.David Xu2005-11-261-0/+1
| | | | Notes: svn path=/head/; revision=152825
* Make some file/filedesc related functions staticPoul-Henning Kamp2005-02-101-1/+0
| | | | Notes: svn path=/head/; revision=141636
* Add a place-holder f_label void * for a future struct label pointerRobert Watson2005-02-021-0/+1
| | | | | | | | required for the port of SELinux FLASK/TE to FreeBSD using the MAC Framework. Notes: svn path=/head/; revision=141137
* /* -> /*- for license, minor formatting changesWarner Losh2005-01-071-1/+1
| | | | Notes: svn path=/head/; revision=139825
* "nfiles" is a bad name for a global variable. Call it "openfiles" insteadPoul-Henning Kamp2004-12-011-2/+2
| | | | | | | as this is more correct and matches the sysctl variable. Notes: svn path=/head/; revision=138261
* Save a pointless FILE_LOCK_ASSERT() call in fhold.Poul-Henning Kamp2004-11-081-1/+1
| | | | Notes: svn path=/head/; revision=137406
* Add the f_vnode pointer to struct xfile, shortly it will no longer bePoul-Henning Kamp2004-06-191-0/+1
| | | | | | | identical to f_data by definition. Notes: svn path=/head/; revision=130717
* Remove advertising clause from University of California Regent's license,Warner Losh2004-04-071-4/+0
| | | | | | | | | per letter dated July 22, 1999. Approved by: core Notes: svn path=/head/; revision=127976