NETMAP(4) | Device Drivers Manual | NETMAP(4)
NAME
netmap — a framework for fast packet I/O

SYNOPSIS
device netmap

DESCRIPTION
netmap
is a framework for extremely fast
and efficient packet I/O for userspace and kernel clients, and for Virtual
Machines. It runs on FreeBSD Linux and some versions
of Windows, and supports a variety of netmap ports
,
including
physical NIC ports
host ports
VALE ports
netmap pipes
netmap monitors
All these netmap ports
are accessed
interchangeably with the same API, and are at least one order of magnitude
faster than standard OS mechanisms (sockets, bpf, tun/tap interfaces, native
switches, pipes). With suitably fast hardware (NICs, PCIe buses, CPUs),
packet I/O using netmap
on supported NICs reaches
14.88 million packets per second (Mpps) with much less than one core on 10
Gbit/s NICs; 35-40 Mpps on 40 Gbit/s NICs (limited by the hardware); about
20 Mpps per core for VALE ports; and over 100 Mpps for
netmap pipes
. NICs without native
netmap
support can still use the API in emulated
mode, which uses unmodified device drivers and is 3-5 times faster than
bpf(4) or raw sockets.
Userspace clients can dynamically switch NICs into
netmap
mode and send and receive raw packets through
memory mapped buffers. Similarly, VALE
switch
instances and ports, netmap pipes
and
netmap monitors
can be created dynamically,
providing high speed packet I/O between processes, virtual machines, NICs
and the host stack.
netmap
supports both non-blocking I/O through ioctl(2), and synchronization and blocking I/O through
a file descriptor and standard OS mechanisms such as
select(2), poll(2),
kqueue(2) and epoll(7). All types of
netmap ports
and the VALE
switch
are implemented by a single kernel module, which also emulates
the netmap
API over standard drivers. For best
performance, netmap
requires native support in
device drivers. A list of such devices is at the end of this document.
In the rest of this (long) manual page we document various aspects
of the netmap
and VALE
architecture, features and usage.
netmap
supports raw packet I/O through a
port,
which can be connected to a physical interface
(NIC), to
the host stack, or to a VALE
switch. Ports use
preallocated circular queues of buffers
(rings)
residing in an mmapped region. There is one ring for each transmit/receive
queue of a NIC or virtual port. An additional ring pair connects to the host
stack.
After binding a file descriptor to a port, a
netmap
client can send or receive packets in batches
through the rings, and possibly implement zero-copy forwarding between
ports.
All NICs operating in netmap
mode use the
same memory region, accessible to all processes that own /dev/netmap file descriptors bound to NICs.
Independent VALE
and netmap
pipe
ports by default use separate memory regions, but can be
independently configured to share memory.
The following section describes the system calls to create and
control netmap
ports (including
VALE
and netmap pipe
ports).
Simpler, higher level functions are described in the
LIBRARIES section.
Ports and rings are created and controlled through a file descriptor, created by opening a special device
    fd = open("/dev/netmap", O_RDWR);
    ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
netmap
has multiple modes of operation
controlled by the struct nmreq argument.
arg.nr_name specifies the netmap port name, as
follows:
OS network interface name (e.g., 'em0', 'eth1', ...)
   the data path of the corresponding NIC.

valeSSS:PPP
   the port PPP of the VALE switch SSS. Both SSS and PPP have the form
   [0-9a-zA-Z_]+; the string cannot exceed IFNAMSIZ characters, and PPP
   cannot be the name of any existing OS network interface.
On return, arg indicates the size of the
shared memory region, and the number, size and location of all the
netmap
data structures, which can be accessed by
mmapping the memory
    char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
Non-blocking I/O is done with special ioctl(2) calls; select(2) and poll(2) on the file descriptor permit blocking I/O.
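For example, the two synchronization ioctls, which take no argument (see IOCTLS below), can be issued at any time on a bound file descriptor:

    ioctl(fd, NIOCTXSYNC, NULL); /* push out any pending transmissions */
    ioctl(fd, NIOCRXSYNC, NULL); /* collect newly received packets */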
While a NIC is in netmap
mode, the OS will
still believe the interface is up and running. OS-generated packets for that NIC end up in a netmap ring, and another ring is
used to send packets into the OS network stack. A close(2)
on the file descriptor removes the binding, and returns the NIC to normal
mode (reconnecting the data path to the host stack), or destroys the virtual
port.
DATA STRUCTURES
The data structures in the mmapped memory region are detailed in <sys/net/netmap.h>, which is
the ultimate reference for the netmap
API. The main
structures and fields are indicated below:
struct netmap_if (one per interface)

    struct netmap_if {
        ...
        const uint32_t ni_flags;     /* properties              */
        ...
        const uint32_t ni_tx_rings;  /* NIC tx rings            */
        const uint32_t ni_rx_rings;  /* NIC rx rings            */
        uint32_t       ni_bufs_head; /* head of extra bufs list */
        ...
    };
Indicates the number of available rings
(struct netmap_rings) and their position in the
mmapped region. The number of tx and rx rings
(ni_tx_rings,
ni_rx_rings) normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
NIOCREGIF can also request additional unbound buffers
in the same memory space, to be used as temporary storage for packets.
The number of extra buffers is specified in the
arg.nr_arg3 field. On success, the kernel writes
back to arg.nr_arg3 the number of extra buffers actually allocated (this may be fewer than the amount requested if the memory space runs out of buffers). ni_bufs_head contains the index of the first of these extra buffers, which are connected in a list (the first uint32_t of each buffer being the index of the next buffer in the list). A 0 indicates the end of the list. The application is free to modify this list and use the buffers (i.e., binding them to the slots of a netmap ring). When closing the netmap file descriptor, the kernel frees the buffers contained in the list pointed to by ni_bufs_head, irrespective of the buffers originally provided by the kernel on NIOCREGIF.
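As an illustration, the extra buffer list can be walked as in the following sketch (assuming a registered port with nifp pointing to its struct netmap_if; any ring of the port can be used to resolve buffer indexes):

    struct netmap_ring *ring = NETMAP_TXRING(nifp, 0);
    uint32_t scan;

    for (scan = nifp->ni_bufs_head; scan != 0; ) {
            char *buf = NETMAP_BUF(ring, scan);

            scan = *(uint32_t *)buf; /* read the link before using the buffer */
            /* ... use buf as temporary packet storage ... */
    }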
struct netmap_ring (one per ring)

    struct netmap_ring {
        ...
        const uint32_t num_slots;   /* slots in each ring            */
        const uint32_t nr_buf_size; /* size of each buffer           */
        ...
        uint32_t       head;        /* (u) first buf owned by user   */
        uint32_t       cur;         /* (u) wakeup position           */
        const uint32_t tail;        /* (k) first buf owned by kernel */
        ...
        uint32_t       flags;
        struct timeval ts;          /* (k) time of last rxsync()     */
        ...
        struct netmap_slot slot[0]; /* array of slots                */
    };
Implements transmit and receive rings, with read/write pointers, metadata and an array of slots describing the buffers.
struct netmap_slot (one per buffer)

    struct netmap_slot {
        uint32_t buf_idx; /* buffer index                 */
        uint16_t len;     /* packet length                */
        uint16_t flags;   /* buf changed, etc.            */
        uint64_t ptr;     /* address for indirect buffers */
    };
Describes a packet buffer, which normally is identified by an index and resides in the mmapped region.
packet buffers

Fixed-size (normally 2 KB) packet buffers allocated by the kernel.
The offset of the struct netmap_if in the
mmapped region is indicated by the nr_offset field
in the structure returned by NIOCREGIF
. From there,
all other objects are reachable through relative references (offsets or
indexes). Macros and functions in
<net/netmap_user.h>
help convert them into actual pointers:
    struct netmap_if   *nifp = NETMAP_IF(mem, arg.nr_offset);
    struct netmap_ring *txr  = NETMAP_TXRING(nifp, ring_index);
    struct netmap_ring *rxr  = NETMAP_RXRING(nifp, ring_index);
    char               *buf  = NETMAP_BUF(ring, buffer_index);
RINGS
Rings are circular queues of packets with three indexes/pointers (head, cur, tail); one slot is always kept empty. The ring size (num_slots) should not be assumed to be a power of two.
head is the first slot available to userspace;
cur is the wakeup point: select/poll will unblock when tail passes cur;
tail is the first slot reserved to the kernel.
Slot indexes must only move forward; for convenience, the function nm_ring_next(ring, index) returns the next index along the ring, wrapping around at the end.
head and cur are only modified by the user program; tail is only modified by the kernel. The kernel only reads/writes the struct netmap_ring slots and buffers during the execution of a netmap-related system call. The only exception are slots (and buffers) in the range tail ... head-1, which are explicitly assigned to the kernel.
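The nm_ring_next() helper behaves as in the following equivalent sketch:

    static inline uint32_t
    nm_ring_next(struct netmap_ring *r, uint32_t i)
    {
            return (i + 1 == r->num_slots) ? 0 : i + 1;
    }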
TRANSMIT RINGS
On transmit rings, after a netmap
system
call, slots in the range head ...
tail-1 are available for transmission. User code
should fill the slots sequentially and advance head
and cur past slots ready to transmit.
cur may be moved further ahead if the user code needs
more slots before further transmissions (see
SCATTER GATHER I/O).
At the next NIOCTXSYNC/select()/poll(), slots up to head-1 are pushed to the port, and tail may advance if further slots have become available. Below is an example of the evolution of a TX ring:
   after the syscall, slots between cur and tail are (a)vailable

            head=cur   tail
            |          |
            v          v
  TX  [.....aaaaaaaaaaa.............]

   user creates new packets to (T)ransmit

                 head=cur tail
                 |     |
                 v     v
  TX  [.....TTTTTaaaaaa.............]

   NIOCTXSYNC/poll()/select() sends packets and reports new slots

                 head=cur   tail
                 |          |
                 v          v
  TX  [..........aaaaaaaaaaa........]
select(2) and poll(2) will block if there is no space in the ring, i.e., when

    ring->cur == ring->tail
High speed applications may want to amortize the cost of system calls by preparing as many packets as possible before issuing them.
A transmit ring with pending transmissions has
ring->head != ring->tail + 1
(modulo the ring size).
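A sketch of this pattern, filling every available slot before issuing a single synchronization system call (build_packet() is a hypothetical helper that writes a frame into the buffer and returns its length):

    while (!nm_ring_empty(ring)) {
            uint32_t i = ring->cur;
            char *buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);

            ring->slot[i].len = build_packet(buf, ring->nr_buf_size);
            ring->head = ring->cur = nm_ring_next(ring, i);
    }
    ioctl(fd, NIOCTXSYNC, NULL); /* push the whole batch at once */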
RECEIVE RINGS
On receive rings, after a netmap
system
call, the slots in the range head...
tail-1 contain received packets. User code should
process them and advance head and
cur past slots it wants to return to the kernel.
cur may be moved further ahead if the user code wants
to wait for more packets without returning all the previous slots to the
kernel.
At the next NIOCRXSYNC/select()/poll(), slots up to head-1 are returned to the kernel for further receives, and tail may advance to report new incoming packets.
Below is an example of the evolution of an RX ring:
   after the syscall, there are some (h)eld and some (R)eceived slots

         head  cur     tail
         |     |       |
         v     v       v
  RX  [..hhhhhhRRRRRRRR..........]

   user advances head and cur, releasing some slots and holding others

              head cur  tail
              |  |     |
              v  v     v
  RX  [..*****hhhRRRRRR...........]

   NIOCRXSYNC/poll()/select() recovers slots and reports new packets

              head cur       tail
              |  |          |
              v  v          v
  RX  [.......hhhRRRRRRRRRRRR....]
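A minimal receive loop following this scheme might look as in the sketch below (consume_pkt() is a hypothetical consumer, as in the EXAMPLES section):

    while (!nm_ring_empty(rxr)) {
            uint32_t i = rxr->cur;
            char *buf = NETMAP_BUF(rxr, rxr->slot[i].buf_idx);

            consume_pkt(buf, rxr->slot[i].len);
            rxr->head = rxr->cur = nm_ring_next(rxr, i); /* release the slot */
    }
    ioctl(fd, NIOCRXSYNC, NULL); /* return released slots to the kernel */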
SLOTS
Normally, packets should be stored in the netmap-allocated buffers assigned to slots when ports are bound to a file descriptor. One packet is fully contained in a single buffer.
The following flags affect slot and buffer processing:

NS_BUF_CHANGED
   must be set whenever the buf_idx in the slot is changed.

NS_REPORT
   reports when this buffer has been transmitted. Normally, netmap
   notifies transmit completions in batches, hence signals can be delayed
   indefinitely. This flag helps detect when packets have been sent and a
   file descriptor can be closed.

NS_INDIRECT
   indicates that the packet payload is in a user-supplied buffer whose
   user virtual address is in the ptr field of the slot. This is only
   supported on the transmit ring of VALE ports, and it helps reduce data
   copies in the interconnection of virtual machines.

NS_MOREFRAG
   indicates that the packet continues in the next slot (see SCATTER
   GATHER I/O below).
SCATTER GATHER I/O
Packets can span multiple slots if the NS_MOREFRAG flag is set in all but the last slot. The maximum length of a chain is 64 buffers. This is normally used with
VALE
ports when connecting virtual machines, as they
generate large TSO segments that are not split unless they reach a physical
device.
NOTE: The len field always refers to the individual fragment; no field reports the total length of the packet.
On receive rings the macro NS_RFRAGS(slot) indicates the remaining number of slots for this packet, including the current one. Slots with a value greater than 1 also have NS_MOREFRAG set.
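For example, the total length of a packet starting at the current receive slot can be computed as in this sketch (NS_RFRAGS() takes a pointer to the slot):

    uint32_t i = rxr->cur;
    unsigned frags = NS_RFRAGS(&rxr->slot[i]); /* slots left, this one included */
    size_t total = 0;

    while (frags-- > 0) {
            total += rxr->slot[i].len; /* len covers one fragment only */
            i = nm_ring_next(rxr, i);
    }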
IOCTLS
netmap
uses two ioctls (NIOCTXSYNC,
NIOCRXSYNC) for non-blocking I/O. They take no argument. Two more ioctls
(NIOCGINFO, NIOCREGIF) are used to query and configure ports, with the
following argument:
    struct nmreq {
        char     nr_name[IFNAMSIZ]; /* (i) port name                    */
        uint32_t nr_version;        /* (i) API version                  */
        uint32_t nr_offset;         /* (o) nifp offset in mmap region   */
        uint32_t nr_memsize;        /* (o) size of the mmap region      */
        uint32_t nr_tx_slots;       /* (i/o) slots in tx rings          */
        uint32_t nr_rx_slots;       /* (i/o) slots in rx rings          */
        uint16_t nr_tx_rings;       /* (i/o) number of tx rings         */
        uint16_t nr_rx_rings;       /* (i/o) number of rx rings         */
        uint16_t nr_ringid;         /* (i/o) ring(s) we care about      */
        uint16_t nr_cmd;            /* (i) special command              */
        uint16_t nr_arg1;           /* (i/o) extra arguments            */
        uint16_t nr_arg2;           /* (i/o) extra arguments            */
        uint32_t nr_arg3;           /* (i/o) extra arguments            */
        uint32_t nr_flags;          /* (i/o) open mode                  */
        ...
    };
A file descriptor obtained through /dev/netmap also supports the ioctls supported by network devices; see netintro(4).
NIOCGINFO
   returns (advisory) information about a port, including the size of the
   netmap memory region. NICs in netmap mode all share the same memory
   region, whereas VALE ports have independent regions for each port.

NIOCREGIF
   binds the file descriptor to the specified port, putting it in netmap
   mode and disconnecting it from the host stack. Multiple file
   descriptors can be bound to the same port, with proper synchronization
   left to the user.
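A port can be queried as in the following sketch (error checking omitted):

    struct nmreq nmr;

    memset(&nmr, 0, sizeof(nmr));
    strncpy(nmr.nr_name, "em0", sizeof(nmr.nr_name) - 1);
    nmr.nr_version = NETMAP_API;
    ioctl(fd, NIOCGINFO, &nmr);
    printf("memsize %u tx_rings %u rx_rings %u\n",
        nmr.nr_memsize, nmr.nr_tx_rings, nmr.nr_rx_rings);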
The recommended way to bind a file descriptor to a port is to use function nm_open(..) (see LIBRARIES) which parses names to access specific port types and enable features. In the following we document the main features.
NIOCREGIF can also bind a file
descriptor to one endpoint of a
netmap pipe,
consisting of two netmap ports with a crossover connection. A netmap
pipe shares the same memory space as the parent port, and is meant to enable configurations where a master process acts as a dispatcher towards slave processes.
To enable this function, the nr_arg1 field of the structure can be used as a hint to the kernel to indicate how many pipes we expect to use, and reserve extra space in the memory region.
On return, it gives the same info as NIOCGINFO, with nr_ringid and nr_flags indicating the identity of the rings controlled through the file descriptor.
The nr_flags and nr_ringid fields select which rings are controlled through this file descriptor. Possible values of nr_flags are indicated below, together with the naming schemes that application libraries (such as the nm_open indicated below) can use to indicate the specific set of rings. In the examples below, "netmap:foo" is any valid netmap port name.

NR_REG_ALL_NIC      "netmap:foo"    all hardware ring pairs of foo.
NR_REG_SW           "netmap:foo^"   the ring pair connecting foo to the host stack.
NR_REG_NIC_SW       "netmap:foo+"   all hardware ring pairs and the host stack ring pair.
NR_REG_ONE_NIC      "netmap:foo-i"  only the i-th hardware ring pair.
NR_REG_PIPE_MASTER  "netmap:foo{i"  the master side of the netmap pipe with identifier i on port foo.
NR_REG_PIPE_SLAVE   "netmap:foo}i"  the slave side of the netmap pipe with identifier i on port foo.

The identifier of a pipe must be thought of as part of the pipe name, and does not need to be sequential. On return the pipe will only have a single ring pair with index 0, irrespective of the value of i.
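For example, using the nm_open() naming scheme above, the two endpoints of pipe 2 on port foo could be opened as follows (typically from two different processes):

    struct nm_desc *master = nm_open("netmap:foo{2", NULL, 0, 0);
    struct nm_desc *slave  = nm_open("netmap:foo}2", NULL, 0, 0);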
By default, a poll(2) or select(2) call pushes out any pending packets on the transmit ring, even if no write events are specified. This can be disabled by or-ing NETMAP_NO_TX_POLL into the value written to nr_ringid. When this flag is set, packets are transmitted only when ioctl(NIOCTXSYNC) is called, or when select()/poll() are called with a write event (POLLOUT/wfdset) or on a full ring.
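A sketch of a registration that disables this transmit-on-poll behavior:

    struct nmreq nmr;

    memset(&nmr, 0, sizeof(nmr));
    strncpy(nmr.nr_name, "em0", sizeof(nmr.nr_name) - 1);
    nmr.nr_version = NETMAP_API;
    nmr.nr_ringid |= NETMAP_NO_TX_POLL; /* transmit only on explicit requests */
    ioctl(fd, NIOCREGIF, &nmr);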
When registering a virtual interface that is dynamically created and attached to a VALE switch, the desired number of rings (1 by default, and currently up to 16) can be specified using the nr_tx_rings and nr_rx_rings fields.
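For example, a VALE port with four ring pairs could be requested as in this sketch (the port name is an example):

    struct nmreq nmr;

    memset(&nmr, 0, sizeof(nmr));
    strncpy(nmr.nr_name, "vale0:p0", sizeof(nmr.nr_name) - 1);
    nmr.nr_version = NETMAP_API;
    nmr.nr_tx_rings = nmr.nr_rx_rings = 4; /* 1 by default, currently up to 16 */
    ioctl(fd, NIOCREGIF, &nmr);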
NIOCTXSYNC
   tells the hardware of new packets to transmit, and updates the number
   of slots available for transmission.

NIOCRXSYNC
   tells the hardware of consumed packets, and asks for newly available
   packets.
SELECT, POLL, EPOLL, KQUEUE
select(2) and poll(2) on a
netmap
file descriptor process rings as indicated in
TRANSMIT RINGS and
RECEIVE RINGS, respectively when
write (POLLOUT) and read (POLLIN) events are requested. Both block if no
slots are available in the ring (ring->cur ==
ring->tail). Depending on the platform, epoll(7)
and kqueue(2) are supported too.
Packets in transmit rings are normally pushed out (and buffers
reclaimed) even without requesting write events. Passing the
NETMAP_NO_TX_POLL
flag to
NIOCREGIF disables this feature. By default, receive rings
are processed only if read events are requested. Passing the
NETMAP_DO_RX_POLL
flag to NIOCREGIF
updates receive rings even without read events. Note that on
epoll(7) and kqueue(2),
NETMAP_NO_TX_POLL
and
NETMAP_DO_RX_POLL
only have an effect when some
event is posted for the file descriptor.
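A typical event loop serving both directions might be structured as in the following sketch (ring processing omitted):

    struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLOUT };

    for (;;) {
            poll(&pfd, 1, -1);
            if (pfd.revents & POLLIN) {
                    /* drain receive rings as described in RECEIVE RINGS */
            }
            if (pfd.revents & POLLOUT) {
                    /* fill transmit rings as described in TRANSMIT RINGS */
            }
    }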
LIBRARIES
The netmap
API is supposed to be used
directly, both because of its simplicity and for efficient integration with
applications.
For convenience, the
<net/netmap_user.h>
header
provides a few macros and functions to ease creating a file descriptor and
doing I/O with a netmap
port. These are loosely
modeled after the pcap(3) API, to ease porting of
libpcap-based applications to netmap
. To use these
extra functions, programs should
#define NETMAP_WITH_LIBS
#include
<net/netmap_user.h>
The following functions are available:
nm_open()
   similar to pcap_open(3pcap), binds a file descriptor to a netmap port,
   which can be a NIC, the host stack, or a VALE port.

nm_close()
   closes the file descriptor and releases netmap resources.

nm_inject()
   similar to pcap_inject(3pcap), sends a raw packet through the port.

nm_dispatch()
   similar to pcap_dispatch(3pcap), applies a callback to incoming packets.

nm_nextpkt()
   similar to pcap_next(3pcap), fetches the next incoming packet.
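As an illustration, a callback-based receiver built on these helpers might look as follows (a sketch; cb() and the port name are examples):

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <poll.h>

    static void
    cb(u_char *arg, const struct nm_pkthdr *h, const u_char *buf)
    {
            /* process one packet of h->len bytes starting at buf */
    }

    void
    rx_loop(void)
    {
            struct nm_desc *d = nm_open("netmap:em0", NULL, 0, 0);
            struct pollfd pfd = { .fd = NETMAP_FD(d), .events = POLLIN };

            for (;;) {
                    poll(&pfd, 1, -1);
                    nm_dispatch(d, -1, cb, NULL); /* -1: all available packets */
            }
            nm_close(d);
    }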
SUPPORTED DEVICES
netmap natively supports the following devices:
On FreeBSD: cxgbe(4), em(4), iflib(4) (providing igb, em and lem), ixgbe(4), ixl(4), re(4), vtnet(4).
On Linux: e1000, e1000e, i40e, igb, ixgbe, ixgbevf, r8169, virtio_net, vmxnet3.
NICs without native support can still be used in
netmap
mode through emulation. Performance is
inferior to native netmap mode but still significantly higher than various
raw socket types (bpf, PF_PACKET, etc.). Note that for slow devices (such as
1 Gbit/s and slower NICs, or several 10 Gbit/s NICs whose hardware is unable
to sustain line rate), emulated and native modes will likely achieve similar or identical throughput.
When emulation is in use, packet sniffer programs such as tcpdump could see received packets before they are diverted by netmap. This behaviour is not intentional, being just an artifact of the implementation of emulation. Note that in case the netmap application subsequently moves packets received from the emulated adapter onto the host RX ring, the sniffer will intercept those packets again, since the packets are injected into the host stack as if they had been received by the network interface.
Emulation is also available for devices with native netmap support, which can be used for testing or performance comparison. The sysctl variable dev.netmap.admode globally controls how netmap mode is implemented.
SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of netmap
and VALE
are controlled through sysctl variables on
FreeBSD
(dev.netmap.*)
and module parameters on Linux
(/sys/module/netmap/parameters/*):
dev.netmap.admode
   Controls the use of native or emulated adapter mode:
   0 uses the best available option;
   1 forces native mode and fails if not available;
   2 forces emulated mode, hence it never fails.
dev.netmap.txsync_retry
   Number of txsync loops in the VALE flush function.

dev.netmap.priv_if_num, dev.netmap.priv_ring_num, dev.netmap.priv_buf_num
(and the corresponding *_size parameters)
   Sizes and numbers of objects for the private memory regions; one region
   is used by each VALE port and each pair of netmap pipes.

dev.netmap.bridge_batch
   Batch size used to forward packets across a VALE switch. Values above
   64 generally guarantee good performance.

SYSTEM CALLS
netmap
uses select(2),
poll(2), epoll(7) and
kqueue(2) to wake up processes when significant events
occur, and mmap(2) to map memory.
ioctl(2) is used to configure ports and
VALE switches.
Applications may need to create threads and bind them to specific cores to improve performance, using standard OS primitives, see pthread(3). In particular, pthread_setaffinity_np(3) may be of use.
EXAMPLES
netmap
comes with a few programs that can
be used for testing or simple applications. See the
examples/ directory in
netmap
distributions, or
tools/tools/netmap/ directory in
FreeBSD distributions.
pkt-gen(8) is a general purpose traffic source/sink.
As an example:

    pkt-gen -i ix0 -f tx -l 60
    pkt-gen -i ix0 -f rx
pkt-gen(8) has many options that can be used to set packet sizes, addresses, rates, and to use multiple send/receive threads and cores.
bridge(8) is another test program which
interconnects two netmap
ports. It can be used for
transparent forwarding between interfaces, as in:

    bridge -i netmap:ix0 -i netmap:ix1

or to connect a NIC to the host stack:

    bridge -i netmap:ix0
The following code implements a traffic generator:

    #include <net/netmap_user.h>
    ...
    void sender(void)
    {
        struct netmap_if *nifp;
        struct netmap_ring *ring;
        struct nmreq nmr;
        struct pollfd fds;

        fd = open("/dev/netmap", O_RDWR);
        bzero(&nmr, sizeof(nmr));
        strcpy(nmr.nr_name, "ix0");
        nmr.nr_version = NETMAP_API;
        ioctl(fd, NIOCREGIF, &nmr);
        p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
                 MAP_SHARED, fd, 0);
        nifp = NETMAP_IF(p, nmr.nr_offset);
        ring = NETMAP_TXRING(nifp, 0);
        fds.fd = fd;
        fds.events = POLLOUT;
        for (;;) {
            poll(&fds, 1, -1);
            while (!nm_ring_empty(ring)) {
                i = ring->cur;
                buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
                ... prepare packet in buf ...
                ring->slot[i].len = ... packet length ...
                ring->head = ring->cur = nm_ring_next(ring, i);
            }
        }
    }
A simple receiver can be implemented using the helper functions:

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    ...
    void receiver(void)
    {
        struct nm_desc *d;
        struct pollfd fds;
        u_char *buf;
        struct nm_pkthdr h;
        ...
        d = nm_open("netmap:ix0", NULL, 0, 0);
        fds.fd = NETMAP_FD(d);
        fds.events = POLLIN;
        for (;;) {
            poll(&fds, 1, -1);
            while ( (buf = nm_nextpkt(d, &h)) )
                consume_pkt(buf, h.len);
        }
        nm_close(d);
    }
Since physical interfaces share the same memory region, it is possible to do packet forwarding between ports by swapping buffers. The buffer from the transmit ring is used to replenish the receive ring:

    uint32_t tmp;
    struct netmap_slot *src, *dst;
    ...
    src = &rxr->slot[rxr->cur];
    dst = &txr->slot[txr->cur];
    tmp = dst->buf_idx;
    dst->buf_idx = src->buf_idx;
    dst->len = src->len;
    dst->flags = NS_BUF_CHANGED;
    src->buf_idx = tmp;
    src->flags = NS_BUF_CHANGED;
    rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
    txr->head = txr->cur = nm_ring_next(txr, txr->cur);
    ...
The host stack is for all practical purposes just a regular ring pair that you can access with the netmap API (e.g., with nm_open("netmap:eth0^", ...)). All packets that the host stack would send out through eth0 while it is in netmap mode end up in the RX ring of the host port, whereas all packets queued to its TX ring are sent up to the host stack.
A simple way to test the performance of a
VALE
switch is to attach a sender and a receiver to
it, e.g., running the following in two different terminals:
    pkt-gen -i vale1:a -f rx    # receiver
    pkt-gen -i vale1:b -f tx    # sender

The same example can be used to test netmap pipes, by simply changing the port names, e.g.:

    pkt-gen -i vale2:x{3 -f rx  # receiver on the master side
    pkt-gen -i vale2:x}3 -f tx  # sender on the slave side
The following command attaches an interface and the host stack to a switch:
valectl -h vale2:em0
netmap
clients attached to the same switch can now
communicate with the network card or the host.
SEE ALSO
vale(4), valectl(8), bridge(8), lb(8), nmreplay(8), pkt-gen(8)
http://info.iet.unipi.it/~luigi/netmap/
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework, Communications of the ACM, 55 (3), pp.45-51, March 2012
Luigi Rizzo, netmap: a novel framework for fast packet I/O, Usenix ATC'12, June 2012, Boston
Luigi Rizzo, Giuseppe Lettieri, VALE, a switched ethernet for virtual machines, ACM CoNEXT'12, December 2012, Nice
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione, Speeding up packet I/O in virtual machines, ACM/IEEE ANCS'13, October 2013, San Jose
AUTHORS
The netmap framework was originally designed and implemented at the Università di Pisa in 2011 by
Luigi Rizzo, and further extended with help from
Matteo Landi, Gaetano
Catalli, Giuseppe Lettieri, and
Vincenzo Maffione.
netmap
and VALE
have been funded by the European Commission within FP7 Projects CHANGE
(257422) and OPENLAB (287581).
CAVEATS
No matter how fast the CPU and OS are, achieving line rate on 10G and faster interfaces requires hardware with sufficient performance. Several NICs are unable to sustain line rate with small packet sizes. Insufficient PCIe or memory bandwidth can also cause reduced performance.
Another frequent reason for low performance is the use of flow control on the link: a slow receiver can limit the transmit speed. Be sure to disable flow control when running high speed experiments.
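On Linux, for example, flow control can typically be turned off with ethtool(8) (the interface name is an example):

    ethtool -A eth0 autoneg off rx off tx off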
netmap is orthogonal to some NIC features such as multiqueue, schedulers, and packet filters. Multiple transmit and receive rings are supported natively and can be configured with ordinary OS tools, such as ethtool(8) or device-specific sysctl variables. The same goes for Receive Packet Steering (RPS) and filtering of incoming traffic.
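For instance, on Linux the number of ring pairs of a NIC can usually be changed with ethtool(8):

    ethtool -L eth0 combined 4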
netmap does not use features such as checksum offloading, TCP segmentation offloading, encryption, VLAN encapsulation/decapsulation, etc. When using netmap to exchange packets with the host stack, make sure to disable these features.
February 6, 2020 | Debian