aboutsummaryrefslogtreecommitdiff
path: root/cddl/contrib/opensolaris
Commit message (Collapse)AuthorAgeFilesLines
...
* Fix a bug introduced with parallel mounting of zfsBaptiste Daroussin2019-07-261-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Incorporate a fix from zol: https://github.com/zfsonlinux/zfs/commit/ab5036df1ccbe1b18c1ce6160b5829e8039d94ce commit log from upstream: Fix race in parallel mount's thread dispatching algorithm Strategy of parallel mount is as follows. 1) Initial thread dispatching is to select sets of mount points that don't have dependencies on other sets, hence threads can/should run lock-less and shouldn't race with other threads for other sets. Each thread dispatched corresponds to top level directory which may or may not have datasets to be mounted on sub directories. 2) Subsequent recursive thread dispatching for each thread from 1) is to mount datasets for each set of mount points. The mount points within each set have dependencies (i.e. child directories), so child directories are processed only after parent directory completes. The problem is that the initial thread dispatching in zfs_foreach_mountpoint() can be multi-threaded when it needs to be single-threaded, and this puts threads under race condition. This race appeared as mount/unmount issues on ZoL for ZoL having different timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8). `zfs unmount -a` which expects proper mount order can't unmount if the mounts were reordered by the race condition. There are currently two known patterns of input list `handles` in `zfs_foreach_mountpoint(..,handles,..)` which cause the race condition. 1) #8833 case where input is `/a /a /a/b` after sorting. The problem is that libzfs_path_contains() can't correctly handle an input list with two same top level directories. There is a race between two POSIX threads A and B, * ThreadA for "/a" for test1 and "/a/b" * ThreadB for "/a" for test0/a and in case of #8833, ThreadA won the race. Two threads were created because "/a" wasn't considered as `"/a" contains "/a"`. 2) #8450 case where input is `/ /var/data /var/data/test` after sorting. The problem is that libzfs_path_contains() can't correctly handle an input list containing "/". There is a race between two POSIX threads A and B, * ThreadA for "/" and "/var/data/test" * ThreadB for "/var/data" and in case of #8450, ThreadA won the race. Two threads were created because "/var/data" wasn't considered as `"/" contains "/var/data"`. In other words, if there is (at least one) "/" in the input list, the initial thread dispatching must be single-threaded since every directory is a child of "/", meaning they all directly or indirectly depend on "/". In both cases, the first non_descendant_idx() call fails to correctly determine "path1-contains-path2", and as a result the initial thread dispatching creates another thread when it needs to be single-threaded. Fix a conditional in libzfs_path_contains() to consider above two. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> PR: 237517, 237397, 239243 Submitted by: Matthew D. Fuller <fullermd@over-yonder.net> (by email) MFC after: 3 days Notes: svn path=/head/; revision=350358
* Reference stdint.h types in ctf.5.Mark Johnston2019-07-171-39/+39
| | | | | | | MFC after: 1 week Notes: svn path=/head/; revision=350082
* zpool.8: the comment property is not read-onlyAllan Jude2019-06-061-4/+0
| | | | | | | | | | | | | The comment property was listed in the man page twice, once under the list of read-only properties, and again (correctly), under the list of user editable properties. PR: 238355 Reported by: Michael Zuo <muh.muhten@gmail.com> Sponsored by: Klara Systems Notes: svn path=/head/; revision=348714
* DTrace: create an amd64 test suitMariusz Zaborski2019-06-053-0/+161
| | | | | | | | | | | | | Create two tests checking if we can read urgs registers and if the rax register returns a correct number. Reviewed by: markj Discussed with: lwhsu MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20364 Notes: svn path=/head/; revision=348706
* MFV r348583: 9847 leaking dd_clones (DMU_OT_DSL_CLONES) objectsAlexander Motin2019-06-031-3/+234
| | | | | | | | | | | | | illumos/illumos-gate@17fb938fd6cdce3ff1bb47dafda0774f742249a3 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com> Notes: svn path=/head/; revision=348584
* MFV r348580: 9559 zfs diff handles files on delete queue in fromsnap poorlyAlexander Motin2019-06-031-7/+7
| | | | | | | | | | | illumos/illumos-gate@20633e304b57bc98f70fdb194081b7023adf527b Reviewed by: Joshua M. Clulow <josh@sysmgr.org> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Paul Dagnelie <pcd@delphix.com> Notes: svn path=/head/; revision=348581
* MFV r348578: 9962 zil_commit should omit cache thrashAlexander Motin2019-06-031-3/+1
| | | | | | | | | | | | | | illumos/illumos-gate@cab3a55e158118937e07d059c46f1bc14d1f254d Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Approved by: Joshua M. Clulow <josh@sysmgr.org> Author: Prakash Surya <prakash.surya@delphix.com> Notes: svn path=/head/; revision=348579
* MFV r348568: 9466 add JSON output support to channel programsAlexander Motin2019-06-033-10/+30
| | | | | | | | | | | | | | illumos/illumos-gate@5267591016146502784860802129b16dab6f135c Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Alek Pinchuk <apinchuk@datto.com> Notes: svn path=/head/; revision=348570
* MFV r348553: 9681 ztest failure in spa_history_log_internal due to spa_rename()Alexander Motin2019-06-031-74/+0
| | | | | | | | | | | | illumos/illumos-gate@6aee0ad76969eb0027131b3a338f2d94ae86f728 Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> Notes: svn path=/head/; revision=348565
* MFV r348552: 9682 page fault in dsl_async_clone_destroy() while opening poolAlexander Motin2019-06-031-2/+3
| | | | | | | | | | | | | illumos/illumos-gate@ade2c82828f0dca1f46919aa1bd936ea1a5a0047 Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com> Notes: svn path=/head/; revision=348564
* MFV r348534: 9616 Bogus error when attempting to set property on read-only poolAlexander Motin2019-06-031-3/+8
| | | | | | | | | | | | illumos/illumos-gate@f62db44dbcda5dd786bb821f1e6fd3ca2e6d4391 Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Andrew Stormont <astormont@racktopsystems.com> Notes: svn path=/head/; revision=348557
* [PowerPC64] stand: fix build using clang 8 as compilerLeandro Lupori2019-05-201-3/+3
| | | | | | | | | | | | This change fixes "stand" build issues when using clang 8 as compiler. Submitted by: alfredo.junior_eldorado.org.br Reviewed by: jhibbits Differential Revision: https://reviews.freebsd.org/D20026 Notes: svn path=/head/; revision=348005
* MFV/ZoL: `zfs userspace` ignored all unresolved UIDs after the firstAllan Jude2019-05-181-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | zfsonlinux/zfs@88cfff182432e4d1c24c877f33b47ee6cf109eee zfs_main: fix `zfs userspace` squashing unresolved entries The `zfs userspace` squashes all entries with unresolved numeric values into a single output entry due to the comparsion always made by the string name which is empty in case of unresolved IDs. Fix this by falling to a numerical comparison when either one of string values is not found. This then compares any numerical values after all with a name resolved. Signed-off-by: Pavel Boldin <boldin.pavel@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reported by: clusteradm Obtained from: ZFS-on-Linux MFC after: 3 days Notes: svn path=/head/; revision=347953
* Fix dataset name comparison in zfs_compare().Alexander Motin2019-05-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | The code never returned match comparing two datasets (not snapshots). As result, uu_avl_find(), called from zfs_callback(), never succeeded, allowing to add same dataset into the list multiple times, for example: # zfs get name pers pers pers@z pers@z NAME PROPERTY VALUE SOURCE pers name pers - pers name pers - pers@z name pers@z - With the patch: # zfs get name pers pers pers@z pers@z NAME PROPERTY VALUE SOURCE pers name pers - pers@z name pers@z - MFC after: 1 week Sponsored by: iXsystems, Inc. Notes: svn path=/head/; revision=347240
* Add a trailing empty line to match the test code outputLi-Wen Hsu2019-04-291-0/+1
| | | | | | | | | | | This is added for letting these long failing test case pass, and for consistency. The test code should be fixed later to not output this extra empty line. Sponsored by: The FreeBSD Foundation Notes: svn path=/head/; revision=346873
* Some test scripts use ncat --sctp --listen port to run an SCTP discardMichael Tuexen2019-04-282-36/+68
| | | | | | | | | | | | | | | | | | | server in the background. However, when running in the background, stdin is closed and ncat initiates a graceful shutdown of the SCTP association. This is not expected by the client. Therefore, the ncat-based discard server is replaced by a perl-based one. In addition, to remove the dependency from ncat, which needs to be installed via the nmap port, also the code testing for a free SCTP port is changed to use the perl-based client. Finally, remove some debug output from the report generated. Reviewed by: lwhsu@ Differential Revision: https://reviews.freebsd.org/D20086 Notes: svn path=/head/; revision=346854
* Ensure that we use a 64-bit value for the last mmap() argument.Mark Johnston2019-03-201-1/+2
| | | | | | | | | | | | When using __syscall(2), the offset argument is passed on the stack on amd64. Previously only 32 bits were written, so the upper 32 bits were garbage and could cause the test to fail. MFC after: 3 days Sponsored by: The FreeBSD Foundation Notes: svn path=/head/; revision=345355
* Fix a regression introduced in r344569Baptiste Daroussin2019-02-271-3/+0
| | | | | | | | | | | | | | | Import a fix from illumos (thanks Toomas Soomas for pointing at it) See https://www.illumos.org/issues/10205 for more details Illumos commit: https://github.com/illumos/illumos-gate/commit/247b7da039fd88350c50e3d7fef15bdab6bef215 Submitted by: jack@gandi.net Reported by: cy Reviewed by: tsoome, cy, bapt Obtained from: Illumos Notes: svn path=/head/; revision=344621
* Fix regression introduced in r344569Baptiste Daroussin2019-02-271-2/+2
| | | | | | | | | Reported by: cy Tested by: cy Submitted by: Fatih Acar <fatih@gandi.net> Notes: svn path=/head/; revision=344618
* Set process title during zfs send.Sean Eric Fagan2019-02-264-14/+44
| | | | | | | | | | | | | | | | This adds a '-V' option to 'zfs send', which sets the process title once a second to the progress information. This code has been in FreeNAS for a long time now; this is just upstreaming it here. It was originially written by delphij. Reviewed by: mav Obtained from: iXsystems, Inc Sponsored by: iXsystems, Inc Differential Revision: https://reviews.freebsd.org/D19184 Notes: svn path=/head/; revision=344601
* Implement parallel mounting for ZFS filesystemBaptiste Daroussin2019-02-265-130/+464
| | | | | | | | | | | | | | | | | | | | | | | | | | It was first implemented on Illumos and then ported to ZoL. This patch is a port to FreeBSD of the ZoL version. This patch also includes a fix for a race condition that was amended With such patch Delphix has seen a huge decrease in latency of the mount phase (https://github.com/openzfs/openzfs/commit/a3f0e2b569 for details). With that current change Gandi has measured improvments that are on par with those reported by Delphix. Zol commits incorporated: https://github.com/zfsonlinux/zfs/commit/a10d50f999511d304f910852c7825c70c9c9e303 https://github.com/zfsonlinux/zfs/commit/e63ac16d25fbe991a356489c86d4077567dfea21 Reviewed by: avg, sef Approved by: avg, sef Obtained from: ZoL MFC after: 1 month Relnotes: yes Sponsored by: Gandi.net Differential Revision: https://reviews.freebsd.org/D19098 Notes: svn path=/head/; revision=344569
* Increase ctfconvert buffer sizeLeandro Lupori2019-02-251-1/+1
| | | | | | | | Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D19353 Notes: svn path=/head/; revision=344534
* MFV r344364:Mark Johnston2019-02-201-4/+3
| | | | | | | | | | | 9058 postmortem DTrace frequently broken under vmware illumos/illumos-gate@793bd7e3617ae7d3d24e8c6b7d6befe35f07ec1f MFC after: 1 week Notes: svn path=/head/; revision=344366
* zpool.8: sort zpool status flags in the same order as in illumos manualAndriy Gapon2019-02-201-8/+8
| | | | | | | | | Just in case, while I was here. MFC after: 1 week Notes: svn path=/head/; revision=344361
* zpool.8: document -D flag for zpool statusAndriy Gapon2019-02-201-3/+9
| | | | | | | | | | The description is taken from the illumos manual. Reported by: stilezy@gmail.com MFC after: 1 week Notes: svn path=/head/; revision=344360
* fix userland illumos taskq code to pass relative timeout to cv_timedwaitAndriy Gapon2019-02-201-0/+5
| | | | | | | | | | | | Unlike illumos, FreeBSD cv_timedwait requires a relative timeout. That applies both to the kernel illumos compatibility code and to the userland "fake kernel" code. MFC after: 2 weeks Sponsored by: Panzura Notes: svn path=/head/; revision=344359
* MFV r342532: 5882 Temporary pool namesAndriy Gapon2018-12-261-25/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Note that this commit brings only formatting changes that were done during the final review of the illumos change, because FreeBSD got the main changes before illumos. illumos/illumos-gate@04e56356520b98d5a93c496b10f02530bb6647e0 https://github.com/illumos/illumos-gate/commit/04e56356520b98d5a93c496b10f02530bb6647e0 https://www.illumos.org/issues/5882 This is an import of the temporary pool names functionality from ZoL: https://github.com/zfsonlinux/zfs/commit/e2282ef57edc79cdce2a4b9b7e3333c56494a807 https://github.com/zfsonlinux/zfs/commit/26b42f3f9d03f85cc7966dc2fe4dfe9216601b0e https://github.com/zfsonlinux/zfs/commit/2f3ec9006146844af6763d1fa4e823fd9047fd54 https://github.com/zfsonlinux/zfs/commit/00d2a8c92f614f49d23dea5d73f7ea7eb489ccf1 https://github.com/zfsonlinux/zfs/commit/83e9986f6eefdf0afc387f06407087bba3ead4e9 https://github.com/zfsonlinux/zfs/commit/023bbe6f017380f4a04c5060feb24dd8cdda9fce It is intended to assist the creation and management of virtual machines that have their rootfs on ZFS on hosts that also have their rootfs on ZFS. These situations cause SPA namespace collisions when the standard name rpool is used in both cases. The solution is either to give each guest pool a name unique to the host, which is not always desireable, or boot a VM environment containing an ISO image to install it, which is cumbersome. MFC after: 1 week Sponsored by: Panzura Notes: svn path=/head/; revision=342541
* MFV r342469: 9630 add lzc_rename and lzc_destroy to libzfs_coreAndriy Gapon2018-12-264-32/+58
| | | | | | | | | | | | | | | illumos/illumos-gate@049ba636fa37a2892809192fc671bff9158a01cd https://github.com/illumos/illumos-gate/commit/049ba636fa37a2892809192fc671bff9158a01cd https://www.illumos.org/issues/9630 Rename and destroy are very useful operations that deserve to be in libzfs_core. And they are not hard to implement too. MFC after: 2 weeks Relnotes: maybe Notes: svn path=/head/; revision=342525
* dtrace(1): remove reference to dtruss that was removed from baseYuri Pankov2018-10-311-2/+1
| | | | | | | | | | | | system in r300226. PR: 211618 Reviewed by: gnn, markj, 0mp Approved by: kib (mentor, implicit) Differential Revision: https://reviews.freebsd.org/D17762 Notes: svn path=/head/; revision=339956
* Add support for send, receive and state-change DTrace providers forMichael Tuexen2018-08-229-7/+623
| | | | | | | | | | | | SCTP. They are based on what is specified in the Solaris DTrace manual for Solaris 11.4. Reviewed by: 0mp, dteske, markj Relnotes: yes Differential Revision: https://reviews.freebsd.org/D16839 Notes: svn path=/head/; revision=338213
* Add partial documentation for dtrace(1)'s -x configuration options.Mark Johnston2018-08-161-2/+111
| | | | | | | | | | | | | Some options are still missing descriptions, but they can be filled in over time. Submitted by: raichoo <raichoo@googlemail.com> Reviewed by: 0mp (previous version) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D16671 Notes: svn path=/head/; revision=337926
* MFV/ZoL: Implement large_dnode pool featureMatt Macy2018-08-123-47/+179
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 50c957f702ea6d08a634e42f73e8a49931dd8055 Author: Ned Bass <bass6@llnl.gov> Date: Wed Mar 16 18:25:34 2016 -0700 Implement large_dnode pool feature Justification ------------- This feature adds support for variable length dnodes. Our motivation is to eliminate the overhead associated with using spill blocks. Spill blocks are used to store system attribute data (i.e. file metadata) that does not fit in the dnode's bonus buffer. By allowing a larger bonus buffer area the use of a spill block can be avoided. Spill blocks potentially incur an additional read I/O for every dnode in a dnode block. As a worst case example, reading 32 dnodes from a 16k dnode block and all of the spill blocks could issue 33 separate reads. Now suppose those dnodes have size 1024 and therefore don't need spill blocks. Then the worst case number of blocks read is reduced to from 33 to two--one per dnode block. In practice spill blocks may tend to be co-located on disk with the dnode blocks so the reduction in I/O would not be this drastic. In a badly fragmented pool, however, the improvement could be significant. ZFS-on-Linux systems that make heavy use of extended attributes would benefit from this feature. In particular, ZFS-on-Linux supports the xattr=sa dataset property which allows file extended attribute data to be stored in the dnode bonus buffer as an alternative to the traditional directory-based format. Workloads such as SELinux and the Lustre distributed filesystem often store enough xattr data to force spill bocks when xattr=sa is in effect. Large dnodes may therefore provide a performance benefit to such systems. Other use cases that may benefit from this feature include files with large ACLs and symbolic links with long target names. Furthermore, this feature may be desirable on other platforms in case future applications or features are developed that could make use of a larger bonus buffer area. Implementation -------------- The size of a dnode may be a multiple of 512 bytes up to the size of a dnode block (currently 16384 bytes). A dn_extra_slots field was added to the current on-disk dnode_phys_t structure to describe the size of the physical dnode on disk. The 8 bits for this field were taken from the zero filled dn_pad2 field. The field represents how many "extra" dnode_phys_t slots a dnode consumes in its dnode block. This convention results in a value of 0 for 512 byte dnodes which preserves on-disk format compatibility with older software. Similarly, the in-memory dnode_t structure has a new dn_num_slots field to represent the total number of dnode_phys_t slots consumed on disk. Thus dn->dn_num_slots is 1 greater than the corresponding dnp->dn_extra_slots. This difference in convention was adopted because, unlike on-disk structures, backward compatibility is not a concern for in-memory objects, so we used a more natural way to represent size for a dnode_t. The default size for newly created dnodes is determined by the value of a new "dnodesize" dataset property. By default the property is set to "legacy" which is compatible with older software. Setting the property to "auto" will allow the filesystem to choose the most suitable dnode size. Currently this just sets the default dnode size to 1k, but future code improvements could dynamically choose a size based on observed workload patterns. Dnodes of varying sizes can coexist within the same dataset and even within the same dnode block. For example, to enable automatically-sized dnodes, run # zfs set dnodesize=auto tank/fish The user can also specify literal values for the dnodesize property. These are currently limited to powers of two from 1k to 16k. The power-of-2 limitation is only for simplicity of the user interface. Internally the implementation can handle any multiple of 512 up to 16k, and consumers of the DMU API can specify any legal dnode value. The size of a new dnode is determined at object allocation time and stored as a new field in the znode in-memory structure. New DMU interfaces are added to allow the consumer to specify the dnode size that a newly allocated object should use. Existing interfaces are unchanged to avoid having to update every call site and to preserve compatibility with external consumers such as Lustre. The new interfaces names are given below. The versions of these functions that don't take a dnodesize parameter now just call the _dnsize() versions with a dnodesize of 0, which means use the legacy dnode size. New DMU interfaces: dmu_object_alloc_dnsize() dmu_object_claim_dnsize() dmu_object_reclaim_dnsize() New ZAP interfaces: zap_create_dnsize() zap_create_norm_dnsize() zap_create_flags_dnsize() zap_create_claim_norm_dnsize() zap_create_link_dnsize() The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The spa_maxdnodesize() function should be used to determine the maximum bonus length for a pool. These are a few noteworthy changes to key functions: * The prototype for dnode_hold_impl() now takes a "slots" parameter. When the DNODE_MUST_BE_FREE flag is set, this parameter is used to ensure the hole at the specified object offset is large enough to hold the dnode being created. The slots parameter is also used to ensure a dnode does not span multiple dnode blocks. In both of these cases, if a failure occurs, ENOSPC is returned. Keep in mind, these failure cases are only possible when using DNODE_MUST_BE_FREE. If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0. dnode_hold_impl() will check if the requested dnode is already consumed as an extra dnode slot by an large dnode, in which case it returns ENOENT. * The function dmu_object_alloc() advances to the next dnode block if dnode_hold_impl() returns an error for a requested object. This is because the beginning of the next dnode block is the only location it can safely assume to either be a hole or a valid starting point for a dnode. * dnode_next_offset_level() and other functions that iterate through dnode blocks may no longer use a simple array indexing scheme. These now use the current dnode's dn_num_slots field to advance to the next dnode in the block. This is to ensure we properly skip the current dnode's bonus area and don't interpret it as a valid dnode. zdb --- The zdb command was updated to display a dnode's size under the "dnsize" column when the object is dumped. For ZIL create log records, zdb will now display the slot count for the object. ztest ----- Ztest chooses a random dnodesize for every newly created object. The random distribution is more heavily weighted toward small dnodes to better simulate real-world datasets. Unused bonus buffer space is filled with non-zero values computed from the object number, dataset id, offset, and generation number. This helps ensure that the dnode traversal code properly skips the interior regions of large dnodes, and that these interior regions are not overwritten by data belonging to other dnodes. A new test visits each object in a dataset. It verifies that the actual dnode size matches what was stored in the ztest block tag when it was created. It also verifies that the unused bonus buffer space is filled with the expected data patterns. ZFS Test Suite -------------- Added six new large dnode-specific tests, and integrated the dnodesize property into existing tests for zfs allow and send/recv. Send/Receive ------------ ZFS send streams for datasets containing large dnodes cannot be received on pools that don't support the large_dnode feature. A send stream with large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be unrecognized by an incompatible receiving pool so that the zfs receive will fail gracefully. While not implemented here, it may be possible to generate a backward-compatible send stream from a dataset containing large dnodes. The implementation may be tricky, however, because the send object record for a large dnode would need to be resized to a 512 byte dnode, possibly kicking in a spill block in the process. This means we would need to construct a new SA layout and possibly register it in the SA layout object. The SA layout is normally just sent as an ordinary object record. But if we are constructing new layouts while generating the send stream we'd have to build the SA layout object dynamically and send it at the end of the stream. For sending and receiving between pools that do support large dnodes, the drr_object send record type is extended with a new field to store the dnode slot count. This field was repurposed from unused padding in the structure. ZIL Replay ---------- The dnode slot count is stored in the uppermost 8 bits of the lr_foid field. The bits were unused as the object id is currently capped at 48 bits. Resizing Dnodes --------------- It should be possible to resize a dnode when it is dirtied if the current dnodesize dataset property differs from the dnode's size, but this functionality is not currently implemented. Clearly a dnode can only grow if there are sufficient contiguous unused slots in the dnode block, but it should always be possible to shrink a dnode. Growing dnodes may be useful to reduce fragmentation in a pool with many spill blocks in use. Shrinking dnodes may be useful to allow sending a dataset to a pool that doesn't support the large_dnode feature. Feature Reference Counting -------------------------- The reference count for the large_dnode pool feature tracks the number of datasets that have ever contained a dnode of size larger than 512 bytes. The first time a large dnode is created in a dataset the dataset is converted to an extensible dataset. This is a one-way operation and the only way to decrement the feature count is to destroy the dataset, even if the dataset no longer contains any large dnodes. The complexity of reference counting on a per-dnode basis was too high, so we chose to track it on a per-dataset basis similarly to the large_block feature. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3542 Notes: svn path=/head/; revision=337669
* Extend the info about the limitations of datasets in jails.Alexander Leidinger2018-08-111-3/+6
| | | | | | | | Reviewed by: allanjude Sponsored by: Essen Hackathon Notes: svn path=/head/; revision=337658
* Disable the D subroutines msgsize() and msgdsize().Mark Johnston2018-08-101-1/+1
| | | | | | | | | | They are specific to illumos and the corresponding DIF subroutines are already disabled on FreeBSD. Reported by: gnn Notes: svn path=/head/; revision=337584
* Performance optimization of AVL tree comparator functionsMatt Macy2018-08-104-19/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MFV: commit ee36c709c3d5f7040e1bd11f5c75318aa03e789f Author: Gvozden Neskovic <neskovic@gmail.com> Date: Sat Aug 27 20:12:53 2016 +0200 perf: 2.75x faster ddt_entry_compare() First 256bits of ddt_key_t is a block checksum, which are expected to be close to random data. Hence, on average, comparison only needs to look at first few bytes of the keys. To reduce number of conditional jump instructions, the result is computed as: sign(memcmp(k1, k2)). Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} , which is computed efficiently. Synthetic performance evaluation of original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R) CPU E5-2660 v3: old 6.85789 s new 2.49089 s perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare() Compute the result directly instead of using conditionals perf: zfs_range_compare() Speedup between 1.1x - 2.5x, depending on compiler version and optimization level. perf: spa_error_entry_compare() `bcmp()` is not suitable for comparator use. Use `memcmp()` instead. perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare() perf: 2.8x faster zil_bp_compare() perf: 2.8x faster mze_compare() perf: faster dbuf_compare() perf: faster compares in spa_misc perf: 2.8x faster layout_hash_compare() perf: 2.8x faster space_reftree_compare() perf: libzfs: faster avl tree comparators perf: guid_compare() perf: dsl_deadlist_compare() perf: perm_set_compare() perf: 2x faster range_tree_seg_compare() perf: faster unique_compare() perf: faster vdev_cache _compare() perf: faster vdev_uberblock_compare() perf: faster fuid _compare() perf: faster zfs_znode_hold_compare() Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Richard Elling <richard.elling@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5033 Notes: svn path=/head/; revision=337567
* MFV r337223:Alexander Motin2018-08-031-0/+3
| | | | | | | | | | | | | | 9580 Add a hash-table on top of nvlist to speed-up operations illumos/illumos-gate@2ec7644aab2a726a64681fa66c6db8731b160de1 Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com> Notes: svn path=/head/; revision=337227
* MFV 337214:Alexander Motin2018-08-032-0/+16
| | | | | | | | | | | | | | | | 9621 Make createtxg and guid properties public illumos/illumos-gate@e8d4a73c868afb740396041be80ed2b141065e76 Reviewed by: Andy Stormont <astormont@racktopsystems.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Yuri Pankov <yuripv@yuripv.net> Approved by: Robert Mustacchi <rm@joyent.com> Author: Josh Paetzel <josh@tcbug.org> Notes: svn path=/head/; revision=337215
* MFV r337184: 9457 libzfs_import.c:add_config() has a memory leakAlexander Motin2018-08-021-12/+4
| | | | | | | | | | | | | | | | | A memory leak occurs on lines 209 and 213 because the config is not freed in the error case. The interface to add_config() seems less than ideal - it would be better if it copied any data necessary from the config and the caller freed it. illumos/illumos-gate@ddfe901b12348d31c500fb57f9174e88860a4061 Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: sara hartse <sara.hartse@delphix.com> Notes: svn path=/head/; revision=337185
* MFV r337182: 9330 stack overflow when creating a deeply nested datasetAlexander Motin2018-08-022-1/+23
| | | | | | | | | | | | | | | | Datasets that are deeply nested (~100 levels) are impractical. We just put a limit of 50 levels to newly created datasets. Existing datasets should work without a problem. illumos/illumos-gate@5ac95da7d61660aa299c287a39277cb0372be959 Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Notes: svn path=/head/; revision=337183
* 9523 Large alloc in zdb can cause troubleAlexander Motin2018-08-021-4/+3
| | | | | | | | | | | | | | | | | | | | 16MB alloc in zdb_embedded_block() can cause cores in certain situations (clang, gcc55). OsX commit: https://github.com/openzfsonosx/zfs/commit/ced236a5da6e72ea7bf6d2919fe14e17cffe10f1 FreeBSD commit: https://svnweb.freebsd.org/base?view=revision&revision=326150 illumos/illumos-gate@03a4c2f4bfaca30115963b76445279b36468a614 Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Jorgen Lundman <lundman@lundman.net> This is an update for r326150 (by avg), where this change comes from. Notes: svn path=/head/; revision=337179
* MFV r337161: 9512 zfs remap poolname@snapname coredumpsAlexander Motin2018-08-022-2/+31
| | | | | | | | | | | | | | | | | | Only filesystems and volumes are valid "zfs remap" parameters: when passed a snapshot name zfs_remap_indirects() does not handle the EINVAL returned from libzfs_core, which results in failing an assertion and consequently crashing. illumos/illumos-gate@0b2e8253986c5c761129b58cfdac46d204903de1 Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Wren Kennedy <john.kennedy@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: loli10K <ezomori.nozomu@gmail.com> Notes: svn path=/head/; revision=337163
* Do not blindly include illumos kernel headers instead of user-space.Alexander Motin2018-08-021-0/+2
| | | | | | | | It is not needed now, and I doubt it much helped at all, creating more confusions then good. Notes: svn path=/head/; revision=337160
* MFV r316926:Alexander Motin2018-08-015-15/+103
| | | | | | | | | | | | | | | | | | | | | | | | | | | 7955 libshare needs to initialize only those datasets being modified by the consumer illumos/illumos-gate@8a981c3356b194b3b5c0ae9276a9cc31cd2f93a3 https://github.com/illumos/illumos-gate/commit/8a981c3356b194b3b5c0ae9276a9cc31cd2f93a3 https://www.illumos.org/issues/7955 Libshare currently initializes all available filesystems when doing any libshare operation. This requires iterating through all the filesystem multiple times, which is a huge performance problem for sharing and unsharing operations. Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Author: Daniel Hoffman <dj.hoffman@delphix.com> For FreeBSD this is practically a NOP, just a diff reduction. Notes: svn path=/head/; revision=337063
* Add a dtrace provider for UDP-Lite.Michael Tuexen2018-07-314-0/+250
| | | | | | | | | | | | | | The dtrace provider for UDP-Lite is modeled after the UDP provider. This fixes the bug that UDP-Lite packets were triggering the UDP provider. Thanks to dteske@ for providing the dwatch module. Reviewed by: dteske@, markj@, rrs@ Relnotes: yes Differential Revision: https://reviews.freebsd.org/D16377 Notes: svn path=/head/; revision=337018
* MFV r337014:Alexander Motin2018-07-312-5/+18
| | | | | | | | | | | | | | | 9421 zdb should detect and print out the number of "leaked" objects 9422 zfs diff and zdb should explicitly mark objects that are on the deleted queue illumos/illumos-gate@20b5dafb425396adaebd0267d29e1026fc4dc413 Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Paul Dagnelie <pcd@delphix.com> Notes: svn path=/head/; revision=337017
* MFV r336991, r337001:Alexander Motin2018-07-318-1/+428
| | | | | | | | | | | | | | | | | | | | | | | | 9102 zfs should be able to initialize storage devices The first access to a disk block can incur a performance penalty on some platforms (e.g. AWS's EBS, VMware VMDKs). Therefore it is recommended that volumes be "thick provisioned", where supported by the platform (VMware). Thick provisioning is time consuming and often is ignored. If the thick provision step is omitted, customers will see suboptimal performance until we have written to all parts of the LUN. ZFS should be able to initialize any unused storage to remove any first-write penalty that exists. illumos/illumos-gate@094e47e980b0796b94b1b8f51f462a64d246e516 Reviewed by: John Wren Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: George Wilson <george.wilson@delphix.com> Notes: svn path=/head/; revision=337007
* MFV r336955: 9236 nuke spa_dbgmsgAlexander Motin2018-07-311-4/+1
| | | | | | | | | | | | | | | | | | | We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least, metaslab_condense() should call zfs_dbgmsg because it's important and rare enough to always log. It's possible that the message in zio_dva_allocate() would be too high-frequency for zfs_dbgmsg. illumos/illumos-gate@21f7c81cc1156e9202ce3412d3ecaa697c3b2222 Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com> Notes: svn path=/head/; revision=336956
* MFV r336950: 9290 device removal reduces redundancy of mirrorsAlexander Motin2018-07-313-11/+67
| | | | | | | | | | | | | | | | | | | | | Mirrors are supposed to provide redundancy in the face of whole-disk failure and silent damage (e.g. some data on disk is not right, but ZFS hasn't detected the whole device as being broken). However, the current device removal implementation bypasses some of the mirror's redundancy. illumos/illumos-gate@3a4b1be953ee5601bab748afa07c26ed4996cde6 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Tim Chase <tim@chase2k.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com> Notes: svn path=/head/; revision=336951
* MFV r336946: 9238 ZFS Spacemap Encoding V2Alexander Motin2018-07-303-48/+102
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current space map encoding has the following disadvantages: [1] Assuming 512 sector size each entry can represent at most 16MB for a segment. This makes the encoding very inefficient for large regions of space. [2] As vdev-wide space maps have started to be used by new features (i.e. device removal, zpool checkpoint) we've started imposing limits in the vdevs that can be used with them based on the maximum addressable offset (currently 64PB for a top-level vdev). The new remains backwards compatible with the old one. The introduced two-word entry format, besides extending the limits imposed by the single-entry layout, also includes a vdev field and some extra padding after its prefix. The extra padding after the prefix should is reserved for future usage (e.g. new prefixes for future encodings or new fields for flags). The new vdev field not only makes the space maps more self-descriptive, but also opens the doors for pool-wide space maps. One final important note is that the number of bits used for vdevs is reduced to 24 bits for blkptrs. That was decided as we don't know of any setups that use more than 16M vdevs for the time being and we wanted to fit the vdev field in the space map. In addition that gives us some extra bits in dva_t. illumos/illumos-gate@17f11284b49b98353b5119463254074fd9bc0a28 Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com> Notes: svn path=/head/; revision=336947
* MFV r336944: 9286 want refreservation=autoAlexander Motin2018-07-303-10/+141
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a ZFS volume is created with zfs create -V (but without -s), the refreservation property is set to a value that is volsize plus the maximum size of metadata. If refreservation is ever set to another value, it is impossible to set it back to the automatically determined value. There are other cases where refreservation may be wrong. These include receiving a volume that was sent without properties and zfs clone. We need: zfs set refreservation=auto <volume> zfs clone -o refreservation=auto <volume> Each one would use the same function used by zfs create -V to determine the proper value for refreservation. illumos/illumos-gate@1c10ae76c0cb31326c320e7cef1d3f24a1f47125 Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Mike Gerdts <mike.gerdts@joyent.com> Notes: svn path=/head/; revision=336945