1. Control Interfaces

The interfaces for receiving network packet timestamps are:

* SO_TIMESTAMP
  Generates a timestamp for each incoming packet in (not necessarily
  monotonic) system time. Reports the timestamp via recvmsg() in a
  control message as struct timeval (usec resolution).

* SO_TIMESTAMPNS
  Same timestamping mechanism as SO_TIMESTAMP, but reports the
  timestamp as struct timespec (nsec resolution).

* IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
  Only for multicast: approximate transmit timestamp obtained by
  reading the looped packet receive timestamp.

* SO_TIMESTAMPING
  Generates timestamps on reception, transmission or both. Supports
  multiple timestamp sources, including hardware. Supports generating
  timestamps for stream sockets.


1.1 SO_TIMESTAMP:

This socket option enables timestamping of datagrams on the reception
path. Because the destination socket, if any, is not known early in
the network stack, the feature has to be enabled for all packets. The
same is true for all early receive timestamp options.

For interface details, see `man 7 socket`.


1.2 SO_TIMESTAMPNS:

This option is identical to SO_TIMESTAMP except for the returned data type.
Its struct timespec allows for higher resolution (ns) timestamps than the
timeval of SO_TIMESTAMP (us).


1.3 SO_TIMESTAMPING:

Supports multiple types of timestamp requests. As a result, this
socket option takes a bitmap of flags, not a boolean. In

  err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));

val is an integer with any of the following bits set. Setting any other
bit returns EINVAL and does not change the current state.


1.3.1 Timestamp Generation

Some bits are requests to the stack to try to generate timestamps. Any
combination of them is valid. Changes to these bits apply to newly
created packets, not to packets already in the stack. As a result, it
is possible to selectively request timestamps for a subset of packets
(e.g., for sampling) by embedding a send() call within two setsockopt
calls, one to enable timestamp generation and one to disable it.
Timestamps may also be generated for reasons other than being
requested by a particular socket, such as when receive timestamping is
enabled system-wide, as explained earlier.

SOF_TIMESTAMPING_RX_HARDWARE:
  Request rx timestamps generated by the network adapter.

SOF_TIMESTAMPING_RX_SOFTWARE:
  Request rx timestamps when data enters the kernel. These timestamps
  are generated just after a device driver hands a packet to the
  kernel receive stack.

SOF_TIMESTAMPING_TX_HARDWARE:
  Request tx timestamps generated by the network adapter.

SOF_TIMESTAMPING_TX_SOFTWARE:
  Request tx timestamps when data leaves the kernel. These timestamps
  are generated in the device driver as close as possible, but always
  prior to, passing the packet to the network interface. Hence, they
  require driver support and may not be available for all devices.

SOF_TIMESTAMPING_TX_SCHED:
  Request tx timestamps prior to entering the packet scheduler. Kernel
  transmit latency is, if long, often dominated by queuing delay. The
  difference between this timestamp and one taken at
  SOF_TIMESTAMPING_TX_SOFTWARE will expose this latency independent
  of protocol processing. The latency incurred in protocol
  processing, if any, can be computed by subtracting a userspace
  timestamp taken immediately before send() from this timestamp. On
  machines with virtual devices where a transmitted packet travels
  through multiple devices and, hence, multiple packet schedulers,
  a timestamp is generated at each layer. This allows for fine
  grained measurement of queuing delay.

SOF_TIMESTAMPING_TX_ACK:
  Request tx timestamps when all data in the send buffer has been
  acknowledged. This only makes sense for reliable protocols. It is
  currently only implemented for TCP. For that protocol, it may
  over-report measurement, because the timestamp is generated when all
  data up to and including the buffer at send() was acknowledged: the
  cumulative acknowledgment. The mechanism ignores SACK and FACK.


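The sampling approach described at the start of this subsection, one
setsockopt() call before send() and one after, might look roughly as
follows. This is a sketch only: send_sampled() and its `base` parameter
(the flag value the socket normally runs with) are hypothetical names
chosen for the example.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>

/*
 * Send one buffer with tx software timestamp generation enabled, then
 * restore the socket's usual flag set. Returns the result of send(),
 * or -1 if either setsockopt() call fails.
 */
ssize_t send_sampled(int fd, const void *buf, size_t len, int base)
{
    int val = base | SOF_TIMESTAMPING_TX_SOFTWARE;
    ssize_t ret;

    /* Enable generation: applies to packets created from now on. */
    if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)))
        return -1;

    ret = send(fd, buf, len, 0);

    /* Disable generation again so later packets are not timestamped. */
    val = base;
    if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)))
        return -1;

    return ret;
}
```

Here `base` would typically contain only reporting bits such as
SOF_TIMESTAMPING_SOFTWARE, so that the timestamp of the sampled packet
is still reported after generation has been switched off again.
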
1.3.2 Timestamp Reporting

The other three bits control which timestamps will be reported in a
generated control message. Changes to the bits take immediate
effect at the timestamp reporting locations in the stack. Timestamps
are only reported for packets that also have the relevant timestamp
generation request set.

SOF_TIMESTAMPING_SOFTWARE:
  Report any software timestamps when available.

SOF_TIMESTAMPING_SYS_HARDWARE:
  This option is deprecated and ignored.

SOF_TIMESTAMPING_RAW_HARDWARE:
  Report hardware timestamps as generated by
  SOF_TIMESTAMPING_TX_HARDWARE or SOF_TIMESTAMPING_RX_HARDWARE when
  available.


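Because a timestamp is only delivered when both a generation bit and
the matching reporting bit are set, a request usually combines the two
groups. The sketch below shows one way to build such a value for
transmit timestamps; tx_tstamp_flags() and enable_tx_tstamps() are
hypothetical helper names.

```c
#include <sys/socket.h>
#include <linux/net_tstamp.h>

/* Combine tx generation bits with the reporting bits that expose them. */
int tx_tstamp_flags(int want_hardware)
{
    int val = SOF_TIMESTAMPING_TX_SOFTWARE |    /* generate */
              SOF_TIMESTAMPING_SOFTWARE;        /* report */

    if (want_hardware)
        val |= SOF_TIMESTAMPING_TX_HARDWARE |   /* generate */
               SOF_TIMESTAMPING_RAW_HARDWARE;   /* report */
    return val;
}

/* Apply the flags to a socket; returns 0 on success, -1 on error. */
int enable_tx_tstamps(int fd, int want_hardware)
{
    int val = tx_tstamp_flags(want_hardware);

    return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
}
```
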
1.3.3 Timestamp Options

The interface supports one option:

SOF_TIMESTAMPING_OPT_ID:

  Generate a unique identifier along with each packet. A process can
  have multiple concurrent timestamping requests outstanding. Packets
  can be reordered in the transmit path, for instance in the packet
  scheduler. In that case timestamps will be queued onto the error
  queue out of order from the original send() calls. This option
  embeds a counter that is incremented at send() time, to order
  timestamps within a flow.

  This option is implemented only for transmit timestamps. There, the
  timestamp is always looped along with a struct sock_extended_err.
  The option modifies field ee_data to pass an id that is unique
  among all possibly concurrently outstanding timestamp requests for
  that socket. In practice, it is a monotonically increasing u32
  (that wraps).

  In datagram sockets, the counter increments on each send call. In
  stream sockets, it increments with every byte.


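Since the id is a u32 that wraps, a plain `<` comparison of two ee_data
values gives the wrong order across the wrap point. One common
technique, shown here as a sketch and not taken from the kernel
sources, is to compare the wrapped difference as a signed value. The
name tskey_before() is hypothetical.

```c
#include <stdint.h>

/*
 * Returns nonzero if id `a` was issued before id `b`, treating the
 * u32 counter as a sequence number that may wrap. The result is only
 * meaningful while the two ids are less than 2^31 apart. The cast
 * relies on the usual two's complement conversion.
 */
int tskey_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}
```

For example, an id of 0xfffffffe is ordered before an id of 1, because
the counter wrapped between the two send() calls.
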
1.4 Bytestream Timestamps

The SO_TIMESTAMPING interface supports timestamping of bytes in a
bytestream. Each request is interpreted as a request for when the
entire contents of the buffer have passed a timestamping point. That
is, for streams option SOF_TIMESTAMPING_TX_SOFTWARE will record
when all bytes have reached the device driver, regardless of how
many packets the data has been converted into.

In general, bytestreams have no natural delimiters and therefore
correlating a timestamp with data is non-trivial. A range of bytes
may be split across segments, and any segments may be merged (possibly
coalescing sections of previously segmented buffers associated with
independent send() calls). Segments can be reordered and the same
byte range can coexist in multiple segments for protocols that
implement retransmissions.

It is essential that all timestamps implement the same semantics,
regardless of these possible transformations, as otherwise they are
incomparable. Handling "rare" corner cases differently from the
simple case (a 1:1 mapping from buffer to skb) is insufficient
because performance debugging often needs to focus on such outliers.

In practice, timestamps can be correlated with segments of a
bytestream consistently, if both semantics of the timestamp and the
timing of measurement are chosen correctly. This challenge is no
different from deciding on a strategy for IP fragmentation. There, the
definition is that only the first fragment is timestamped. For
bytestreams, we chose that a timestamp is generated only when all
bytes have passed a point. SOF_TIMESTAMPING_TX_ACK as defined is easy to
implement and reason about. An implementation that has to take into
account SACK would be more complex due to possible transmission holes
and out of order arrival.

On the host, TCP can also break the simple 1:1 mapping from buffer to
skbuff as a result of Nagle, cork, autocork, segmentation and GSO. The
implementation ensures correctness in all cases by tracking the
individual last byte passed to send(), even if it is no longer the
last byte after an skbuff extend or merge operation. It stores the
relevant sequence number in skb_shinfo(skb)->tskey. Because an skbuff
has only one such field, only one timestamp can be generated.

In rare cases, a timestamp request can be missed if two requests are
collapsed onto the same skb. A process can detect this situation by
enabling SOF_TIMESTAMPING_OPT_ID and comparing the byte offset at
send time with the value returned for each timestamp. It can prevent
the situation by always flushing the TCP stack in between requests,
for instance by enabling TCP_NODELAY and disabling TCP_CORK and
autocork.

These precautions ensure that the timestamp is generated only when all
bytes have passed a timestamp point, assuming that the network stack
itself does not reorder the segments. The stack indeed tries to avoid
reordering. The one exception is under administrator control: it is
possible to construct a packet scheduler configuration that delays
segments from the same stream differently. Such a setup would be
unusual.


2 Data Interfaces

Timestamps are read using the ancillary data feature of recvmsg().
See `man 3 cmsg` for details of this interface. The socket manual
page (`man 7 socket`) describes how timestamps generated with
SO_TIMESTAMP and SO_TIMESTAMPNS can be retrieved.


2.1 SCM_TIMESTAMPING records

These timestamps are returned in a control message with cmsg_level
SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type

struct scm_timestamping {
    struct timespec ts[3];
};

The structure can return up to three timestamps. This is a legacy
feature. Only one field is non-zero at any time. Most timestamps
are passed in ts[0]. Hardware timestamps are passed in ts[2].

ts[1] used to hold hardware timestamps converted to system time.
Instead, expose the hardware clock device on the NIC directly as
a HW PTP clock source, to allow time conversion in userspace and
optionally synchronize system time with a userspace PTP stack such
as linuxptp. For the PTP clock API, see Documentation/ptp/ptp.txt.

2.1.1 Transmit timestamps with MSG_ERRQUEUE

For transmit timestamps the outgoing packet is looped back to the
socket's error queue with the send timestamp(s) attached. A process
receives the timestamps by calling recvmsg() with flag MSG_ERRQUEUE
set and with a msg_control buffer sufficiently large to receive the
relevant metadata structures. The recvmsg call returns the original
outgoing data packet with two ancillary messages attached.

A message of cmsg_level SOL_IP(V6) and cmsg_type IP(V6)_RECVERR
embeds a struct sock_extended_err. This defines the error type. For
timestamps, the ee_errno field is ENOMSG. The other ancillary message
will have cmsg_level SOL_SOCKET and cmsg_type SCM_TIMESTAMPING. This
embeds the struct scm_timestamping.


2.1.1.2 Timestamp types

The semantics of the three struct timespec are defined by field
ee_info in the extended error structure. It contains a value of
type SCM_TSTAMP_* to define the actual timestamp passed in
scm_timestamping.

The SCM_TSTAMP_* types are 1:1 matches to the SOF_TIMESTAMPING_*
control fields discussed previously, with one exception. For legacy
reasons, SCM_TSTAMP_SND is equal to zero and can be set for both
SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE. It
is the first if ts[2] is non-zero, the second otherwise, in which
case the timestamp is stored in ts[0].


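The rule for telling SCM_TSTAMP_SND hardware and software timestamps
apart can be written down as a small classifier. This is a sketch; the
enum tx_kind and the function tx_tstamp_kind() are hypothetical names,
while the SCM_TSTAMP_* constants come from <linux/errqueue.h>.

```c
#include <stdint.h>
#include <time.h>
#include <linux/errqueue.h>

enum tx_kind {
    TX_KIND_SOFTWARE,
    TX_KIND_HARDWARE,
    TX_KIND_SCHED,
    TX_KIND_ACK,
    TX_KIND_UNKNOWN,
};

/*
 * Classify a transmit timestamp from ee_info and point *ts at the slot
 * that holds it: ts[2] if it is non-zero (hardware), ts[0] otherwise.
 */
enum tx_kind tx_tstamp_kind(uint32_t ee_info,
                            const struct scm_timestamping *tss,
                            const struct timespec **ts)
{
    int hw = tss->ts[2].tv_sec || tss->ts[2].tv_nsec;

    *ts = hw ? &tss->ts[2] : &tss->ts[0];
    switch (ee_info) {
    case SCM_TSTAMP_SND:
        /* Shared by TX_HARDWARE and TX_SOFTWARE for legacy reasons. */
        return hw ? TX_KIND_HARDWARE : TX_KIND_SOFTWARE;
    case SCM_TSTAMP_SCHED:
        return TX_KIND_SCHED;
    case SCM_TSTAMP_ACK:
        return TX_KIND_ACK;
    default:
        return TX_KIND_UNKNOWN;
    }
}
```
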
2.1.1.3 Fragmentation

Fragmentation of outgoing datagrams is rare, but is possible, e.g., by
explicitly disabling PMTU discovery. If an outgoing packet is fragmented,
then only the first fragment is timestamped and returned to the sending
socket.


2.1.1.4 Packet Payload

The calling application is often not interested in receiving the whole
packet payload that it passed to the stack originally: the socket
error queue mechanism is just a method to piggyback the timestamp on.
In this case, the application can choose to read datagrams with a
smaller buffer, possibly even of length 0. The payload is truncated
accordingly. Until the process calls recvmsg() on the error queue,
however, the full packet is queued, taking up budget from SO_RCVBUF.


2.1.1.5 Blocking Read

Reading from the error queue is always a non-blocking operation. To
block waiting on a timestamp, use poll or select. poll() will return
POLLERR in pollfd.revents if any data is ready on the error queue.
There is no need to pass this flag in pollfd.events, because it is
ignored in the request. See also `man 2 poll`.


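A blocking wait on the error queue can therefore be built from poll()
with an empty event mask. The sketch below illustrates this;
wait_errqueue() is a hypothetical helper name.

```c
#include <poll.h>

/*
 * Block until POLLERR is raised on the socket or the timeout expires.
 * Returns 1 if POLLERR is set, 0 on timeout, -1 on poll() failure.
 * Note that POLLERR is also raised for real socket errors, so the
 * caller still has to inspect what recvmsg(MSG_ERRQUEUE) returns.
 */
int wait_errqueue(int fd, int timeout_ms)
{
    /* events stays 0: POLLERR is reported without being requested. */
    struct pollfd pfd = { .fd = fd, .events = 0 };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret <= 0)
        return ret;
    return (pfd.revents & POLLERR) ? 1 : 0;
}
```
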
2.1.2 Receive timestamps

On reception, there is no reason to read from the socket error queue.
The SCM_TIMESTAMPING ancillary data is sent along with the packet data
on a normal recvmsg(). Since this is not a socket error, it is not
accompanied by a message SOL_IP(V6)/IP(V6)_RECVERR. In this case,
the meaning of the three fields in struct scm_timestamping is
implicitly defined. ts[0] holds a software timestamp if set, ts[1]
is again deprecated and ts[2] holds a hardware timestamp if set.


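On the receive side, selecting a timestamp thus reduces to checking
which slot is set. The sketch below prefers the hardware slot when both
could be present; that preference is a choice made for this example,
not something the interface prescribes, and rx_tstamp_pick() is a
hypothetical name.

```c
#include <stddef.h>
#include <time.h>
#include <linux/errqueue.h>

/*
 * Pick the timestamp from a receive-side SCM_TIMESTAMPING record.
 * Returns ts[2] (hardware) if set, else ts[0] (software) if set,
 * else NULL. ts[1] is deprecated and never consulted.
 */
const struct timespec *rx_tstamp_pick(const struct scm_timestamping *tss)
{
    if (tss->ts[2].tv_sec || tss->ts[2].tv_nsec)
        return &tss->ts[2];
    if (tss->ts[0].tv_sec || tss->ts[0].tv_nsec)
        return &tss->ts[0];
    return NULL;
}
```
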
3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP

Hardware time stamping must also be initialized for each device driver
that is expected to do hardware time stamping. The parameter is defined in
/include/linux/net_tstamp.h as:

struct hwtstamp_config {
    int flags;      /* no flags defined right now, must be zero */
    int tx_type;    /* HWTSTAMP_TX_* */
    int rx_filter;  /* HWTSTAMP_FILTER_* */
};

Desired behavior is passed into the kernel and to a specific device by
calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose
ifr_data points to a struct hwtstamp_config. The tx_type and
rx_filter are hints to the driver what it is expected to do. If
the requested fine-grained filtering for incoming packets is not
supported, the driver may time stamp more than just the requested types
of packets.

A driver which supports hardware time stamping shall update the struct
with the actual, possibly more permissive configuration. If the
requested packets cannot be time stamped, then nothing should be
changed and ERANGE shall be returned (in contrast to EINVAL, which
indicates that SIOCSHWTSTAMP is not supported at all).

Only a process with admin rights may change the configuration. User
space is responsible for ensuring that multiple processes don't interfere
with each other and that the settings are reset.

Any process can read the actual configuration by passing this
structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has
not been implemented in all drivers.

/* possible values for hwtstamp_config->tx_type */
enum {
    /*
     * no outgoing packet will need hardware time stamping;
     * should a packet arrive which asks for it, no hardware
     * time stamping will be done
     */
    HWTSTAMP_TX_OFF,

    /*
     * enables hardware time stamping for outgoing packets;
     * the sender of the packet decides which are to be
     * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
     * before sending the packet
     */
    HWTSTAMP_TX_ON,
};

/* possible values for hwtstamp_config->rx_filter */
enum {
    /* time stamp no incoming packet at all */
    HWTSTAMP_FILTER_NONE,

    /* time stamp any incoming packet */
    HWTSTAMP_FILTER_ALL,

    /* return value: time stamp all packets requested plus some others */
    HWTSTAMP_FILTER_SOME,

    /* PTP v1, UDP, any kind of event packet */
    HWTSTAMP_FILTER_PTP_V1_L4_EVENT,

    /* for the complete list of values, please check
     * the include file /include/linux/net_tstamp.h
     */
};

3.1 Hardware Timestamping Implementation: Device Drivers

A driver which supports hardware time stamping must support the
SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with
the actual values as described in the section on SIOCSHWTSTAMP. It
should also support SIOCGHWTSTAMP.

Time stamps for received packets must be stored in the skb. To get a pointer
to the shared time stamp structure of the skb, call skb_hwtstamps(). Then
set the time stamps in the structure:

struct skb_shared_hwtstamps {
    /* hardware time stamp transformed into duration
     * since arbitrary point in time
     */
    ktime_t hwtstamp;
};

Time stamps for outgoing packets are to be generated as follows:
- In hard_start_xmit(), check if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)
  is non-zero. If yes, then the driver is expected to do hardware time
  stamping.
- If this is possible for the skb and requested, then declare
  that the driver is doing the time stamping by setting the flag
  SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags, e.g. with

      skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;

  You might want to keep a pointer to the associated skb for the next step
  and not free the skb. A driver not supporting hardware time stamping doesn't
  do that. A driver must never touch sk_buff::tstamp! It is used to store
  software generated time stamps by the network subsystem.
- The driver should call skb_tx_timestamp() as close to passing the sk_buff
  to hardware as possible. skb_tx_timestamp() provides a software time stamp
  if requested and hardware timestamping is not possible (SKBTX_IN_PROGRESS
  not set).
- As soon as the driver has sent the packet and/or obtained a
  hardware time stamp for it, it passes the time stamp back by
  calling skb_tstamp_tx() with the original skb and the raw
  hardware time stamp. skb_tstamp_tx() clones the original skb and
  adds the timestamps, therefore the original skb has to be freed now.
  If obtaining the hardware time stamp somehow fails, then the driver
  should not fall back to software time stamping. The rationale is that
  this would occur at a later time in the processing pipeline than other
  software time stamping and therefore could lead to unexpected deltas
  between time stamps.