diff options
Diffstat (limited to 'share/man/man4/netlink.4')
-rw-r--r-- | share/man/man4/netlink.4 | 357 |
1 files changed, 357 insertions, 0 deletions
diff --git a/share/man/man4/netlink.4 b/share/man/man4/netlink.4 new file mode 100644 index 000000000000..2fa974df0ddf --- /dev/null +++ b/share/man/man4/netlink.4 @@ -0,0 +1,357 @@ +.\" +.\" Copyright (C) 2022 Alexander Chernikov <melifaro@FreeBSD.org>. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.Dd November 30, 2022 +.Dt NETLINK 4 +.Os +.Sh NAME +.Nm Netlink +.Nd Kernel network configuration protocol +.Sh SYNOPSIS +.In netlink/netlink.h +.In netlink/netlink_route.h +.Ft int +.Fn socket AF_NETLINK SOCK_RAW "int family" +.Sh DESCRIPTION +Netlink is a user-kernel message-based communication protocol primarily used +for network stack configuration. +Netlink is easily extendable and supports large dumps and event +notifications, all via a single socket. +The protocol is fully asynchronous, allowing one to issue and track multiple +requests at once. +Netlink consists of multiple families, which commonly group the commands +belonging to the particular kernel subsystem. +Currently, the supported families are: +.Pp +.Bd -literal -offset indent -compact +NETLINK_ROUTE network configuration, +NETLINK_GENERIC "container" family +.Ed +.Pp +The +.Dv NETLINK_ROUTE +family handles all interfaces, addresses, neighbors, routes, and VNETs +configuration. +More details can be found in +.Xr rtnetlink 4 . +The +.Dv NETLINK_GENERIC +family serves as a +.Do container Dc , +allowing registering other families under the +.Dv NETLINK_GENERIC +umbrella. +This approach allows using a single netlink socket to interact with +multiple netlink families at once. +More details can be found in +.Xr genetlink 4 . +.Pp +Netlink has its own sockaddr structure: +.Bd -literal +struct sockaddr_nl { + uint8_t nl_len; /* sizeof(sockaddr_nl) */ + sa_family_t nl_family; /* netlink family */ + uint16_t nl_pad; /* reserved, set to 0 */ + uint32_t nl_pid; /* automatically selected, set to 0 */ + uint32_t nl_groups; /* multicast groups mask to bind to */ +}; +.Ed +.Pp +Typically, filling this structure is not required for socket operations. +It is presented here for completeness. +.Sh PROTOCOL DESCRIPTION +The protocol is message-based. +Each message starts with the mandatory +.Va nlmsghdr +header, followed by the family-specific header and the list of +type-length-value pairs (TLVs). +TLVs can be nested. +All headers and TLVS are padded to 4-byte boundaries. +Each +.Xr send 2 or +.Xr recv 2 +system call may contain multiple messages. +.Ss BASE HEADER +.Bd -literal +struct nlmsghdr { + uint32_t nlmsg_len; /* Length of message including header */ + uint16_t nlmsg_type; /* Message type identifier */ + uint16_t nlmsg_flags; /* Flags (NLM_F_) */ + uint32_t nlmsg_seq; /* Sequence number */ + uint32_t nlmsg_pid; /* Sending process port ID */ +}; +.Ed +.Pp +The +.Va nlmsg_len +field stores the whole message length, in bytes, including the header. +This length has to be rounded up to the nearest 4-byte boundary when +iterating over messages. +The +.Va nlmsg_type +field represents the command/request type. +This value is family-specific. +The list of supported commands can be found in the relevant family +header file. +.Va nlmsg_seq +is a user-provided request identifier. +An application can track the operation result using the +.Dv NLMSG_ERROR +messages and matching the +.Va nlmsg_seq +. +The +.Va nlmsg_pid +field is the message sender id. +This field is optional for userland. +The kernel sender id is zero. +The +.Va nlmsg_flags +field contains the message-specific flags. +The following generic flags are defined: +.Pp +.Bd -literal -offset indent -compact +NLM_F_REQUEST Indicates that the message is an actual request to the kernel +NLM_F_ACK Request an explicit ACK message with an operation result +.Ed +.Pp +The following generic flags are defined for the "GET" request types: +.Pp +.Bd -literal -offset indent -compact +NLM_F_ROOT Return the whole dataset +NLM_F_MATCH Return all entries matching the criteria +.Ed +These two flags are typically used together, aliased to +.Dv NLM_F_DUMP +.Pp +The following generic flags are defined for the "NEW" request types: +.Pp +.Bd -literal -offset indent -compact +NLM_F_CREATE Create an object if none exists +NLM_F_EXCL Don't replace an object if it exists +NLM_F_REPLACE Replace an existing matching object +NLM_F_APPEND Append to an existing object +.Ed +.Pp +The following generic flags are defined for the replies: +.Pp +.Bd -literal -offset indent -compact +NLM_F_MULTI Indicates that the message is part of the message group +NLM_F_DUMP_INTR Indicates that the state dump was not completed +NLM_F_DUMP_FILTERED Indicates that the dump was filtered per request +NLM_F_CAPPED Indicates the original message was capped to its header +NLM_F_ACK_TLVS Indicates that extended ACK TLVs were included +.Ed +.Ss TLVs +Most messages encode their attributes as type-length-value pairs (TLVs). +The base TLV header: +.Bd -literal +struct nlattr { + uint16_t nla_len; /* Total attribute length */ + uint16_t nla_type; /* Attribute type */ +}; +.Ed +The TLV type +.Pq Va nla_type +scope is typically the message type or group within a family. +For example, the +.Dv RTN_MULTICAST +type value is only valid for +.Dv RTM_NEWROUTE +, +.Dv RTM_DELROUTE +and +.Dv RTM_GETROUTE +messages. +TLVs can be nested; in that case internal TLVs may have their own sub-types. +All TLVs are packed with 4-byte padding. +.Ss CONTROL MESSAGES +A number of generic control messages are reserved in each family. +.Pp +.Dv NLMSG_ERROR +reports the operation result if requested, optionally followed by +the metadata TLVs. +The value of +.Va nlmsg_seq +is set to its value in the original messages, while +.Va nlmsg_pid +is set to the socket pid of the original socket. +The operation result is reported via +.Vt "struct nlmsgerr": +.Bd -literal +struct nlmsgerr { + int error; /* Standard errno */ + struct nlmsghdr msg; /* Original message header */ +}; +.Ed +If the +.Dv NETLINK_CAP_ACK +socket option is not set, the remainder of the original message will follow. +If the +.Dv NETLINK_EXT_ACK +socket option is set, the kernel may add a +.Dv NLMSGERR_ATTR_MSG +string TLV with the textual error description, optionally followed by the +.Dv NLMSGERR_ATTR_OFFS +TLV, indicating the offset from the message start that triggered an error. +Some operations may return additional metadata encapsulated in the +.Dv NLMSGERR_ATTR_COOKIE +TLV. +The metadata format is specific to the operation. +If the operation reply is a multipart message, then no +.Dv NLMSG_ERROR +reply is generated, only a +.Dv NLMSG_DONE +message, closing multipart sequence. +.Pp +.Dv NLMSG_DONE +indicates the end of the message group: typically, the end of the dump. +It contains a single +.Vt int +field, describing the dump result as a standard errno value. +.Sh SOCKET OPTIONS +Netlink supports a number of custom socket options, which can be set with +.Xr setsockopt 2 +with the +.Dv SOL_NETLINK +.Fa level : +.Bl -tag -width indent +.It Dv NETLINK_ADD_MEMBERSHIP +Subscribes to the notifications for the specific group (int). +.It Dv NETLINK_DROP_MEMBERSHIP +Unsubscribes from the notifications for the specific group (int). +.It Dv NETLINK_LIST_MEMBERSHIPS +Lists the memberships as a bitmask. +.It Dv NETLINK_CAP_ACK +Instructs the kernel to send the original message header in the reply +without the message body. +.It Dv NETLINK_EXT_ACK +Acknowledges ability to receive additional TLVs in the ACK message. +.El +.Pp +Additionally, netlink overrides the following socket options from the +.Dv SOL_SOCKET +.Fa level : +.Bl -tag -width indent +.It Dv SO_RCVBUF +Sets the maximum size of the socket receive buffer. +If the caller has +.Dv PRIV_NET_ROUTE +permission, the value can exceed the currently-set +.Va kern.ipc.maxsockbuf +value. +.El +.Sh SYSCTL VARIABLES +A set of +.Xr sysctl 8 +variables is available to tweak run-time parameters: +.Bl -tag -width indent +.It Va net.netlink.sendspace +Default send buffer for the netlink socket. +Note that the socket sendspace has to be at least as long as the longest +message that can be transmitted via this socket. +.El +.Bl -tag -width indent +.It Va net.netlink.recvspace +Default receive buffer for the netlink socket. +Note that the socket recvspace has to be least as long as the longest +message that can be received from this socket. +.El +.Bl -tag -width indent +.It Va net.netlink.nl_maxsockbuf +Maximum receive buffer for the netlink socket that can be set via +.Dv SO_RCVBUF +socket option. +.El +.Sh DEBUGGING +Netlink implements per-functional-unit debugging, with different severities +controllable via the +.Va net.netlink.debug +branch. +These messages are logged in the kernel message buffer and can be seen in +.Xr dmesg 8 +. +The following severity levels are defined: +.Bl -tag -width indent +.It Dv LOG_DEBUG(7) +Rare events or per-socket errors are reported here. +This is the default level, not impacting production performance. +.It Dv LOG_DEBUG2(8) +Socket events such as groups memberships, privilege checks, commands and dumps +are logged. +This level does not incur significant performance overhead. +.It Dv LOG_DEBUG3(9) +All socket events, each dumped or modified entities are logged. +Turning it on may result in significant performance overhead. +.El +.Sh ERRORS +Netlink reports operation results, including errors and error metadata, by +sending a +.Dv NLMSG_ERROR +message for each request message. +The following errors can be returned: +.Bl -tag -width Er +.It Bq Er EPERM +when the current privileges are insufficient to perform the required operation; +.It Bo Er ENOBUFS Bc or Bo Er ENOMEM Bc +when the system runs out of memory for +an internal data structure; +.It Bq Er ENOTSUP +when the requested command is not supported by the family or +the family is not supported; +.It Bq Er EINVAL +when some necessary TLVs are missing or invalid, detailed info +may be provided in NLMSGERR_ATTR_MSG and NLMSGERR_ATTR_OFFS TLVs; +.It Bq Er ENOENT +when trying to delete a non-existent object. +.Pp +Additionally, a socket operation itself may fail with one of the errors +specified in +.Xr socket 2 +, +.Xr recv 2 +or +.Xr send 2 +. +.El +.Sh SEE ALSO +.Xr genetlink 4 , +.Xr rtnetlink 4 +.Rs +.%A "J. Salim" +.%A "H. Khosravi" +.%A "A. Kleen" +.%A "A. Kuznetsov" +.%T "Linux Netlink as an IP Services Protocol" +.%O "RFC 3549" +.Re +.Sh HISTORY +The netlink protocol appeared in +.Fx 13.2 . +.Sh AUTHORS +The netlink was implemented by +.An -nosplit +.An Alexander Chernikov Aq Mt melifaro@FreeBSD.org . +It was derived from the Google Summer of Code 2021 project by +.An Ng Peng Nam Sean . |