## @file
#
# Technical notes for the virtio-net driver.
#
# Copyright (C) 2013, Red Hat, Inc.
#
# This program and the accompanying materials are licensed and made available
# under the terms and conditions of the BSD License which accompanies this
# distribution. The full text of the license may be found at
# http://opensource.org/licenses/bsd-license.php
#
# THE PROGRAM IS DISTRIBUTED UNDER THE BSD LICENSE ON AN "AS IS" BASIS, WITHOUT
# WARRANTIES OR REPRESENTATIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED.
#
##

Disclaimer
----------

All statements concerning standards and specifications are informative and not
normative. They are made in good faith. Corrections are most welcome on the
edk2-devel mailing list.

The following documents were consulted while writing the driver and this
document:
- Unified Extensible Firmware Interface Specification, Version 2.3.1, Errata C,
  June 27, 2012;
- Driver Writer's Guide for UEFI 2.3.1, 03/08/2012, Version 1.01;
- Virtio PCI Card Specification, v0.9.5 DRAFT, 2012 May 7.


Summary
-------

The VirtioNetDxe UEFI_DRIVER implements the Simple Network Protocol for
virtio-net devices. Higher level protocols are automatically installed on top
of it by the DXE Core / the ConnectController() boot service. For virtio-net
devices this enables, e.g., DHCP configuration, TCP transfers with edk2 StdLib
applications, and PXE booting in OVMF.


UEFI driver structure
---------------------

A driver instance, belonging to a given virtio-net device, can be in one of
four states at any time. The states stack up as shown below. Each state
transition is labeled with the primary function that implements it; the
important callees of that function are indented beneath it.

                               |  ^
                               |  |
[DriverBinding.c]              |  | [DriverBinding.c]
VirtioNetDriverBindingStart    |  | VirtioNetDriverBindingStop
  VirtioNetSnpPopulate         |  |   VirtioNetSnpEvacuate
    VirtioNetGetFeatures       |  |
                               v  |
                   +-------------------------+
                   | EfiSimpleNetworkStopped |
                   +-------------------------+
                               |  ^
[SnpStart.c]                   |  | [SnpStop.c]
VirtioNetStart                 |  | VirtioNetStop
                               |  |
                               v  |
                   +-------------------------+
                   | EfiSimpleNetworkStarted |
                   +-------------------------+
                               |  ^
[SnpInitialize.c]              |  | [SnpShutdown.c]
VirtioNetInitialize            |  | VirtioNetShutdown
  VirtioNetInitRing {Rx, Tx}   |  |   VirtioNetShutdownRx [SnpSharedHelpers.c]
    VirtioRingInit             |  |     VirtIo->UnmapSharedBuffer
    VirtioRingMap              |  |     VirtIo->FreeSharedPages
  VirtioNetInitTx              |  |   VirtioNetShutdownTx [SnpSharedHelpers.c]
    VirtIo->AllocateShare...   |  |     VirtIo->UnmapSharedBuffer
    VirtioMapAllBytesInSh...   |  |     VirtIo->FreeSharedPages
  VirtioNetInitRx              |  |   VirtioNetUninitRing [SnpSharedHelpers.c]
    VirtIo->AllocateShare...   |  |              {Tx, Rx}
    VirtioMapAllBytesInSh...   |  |     VirtIo->UnmapSharedBuffer
                               |  |     VirtioRingUninit
                               v  |
                 +-----------------------------+
                 | EfiSimpleNetworkInitialized |
                 +-----------------------------+

The state at the top means "nonexistent" and is hence unnamed on the diagram --
a driver instance actually doesn't exist at that point. The transition
functions out of and into that state implement the Driver Binding Protocol.

The lower three states characterize an existent driver instance and are all
states defined by the Simple Network Protocol. The transition functions between
them are member functions of the Simple Network Protocol.

Each transition function validates its expected source state and its
parameters. For example, VirtioNetDriverBindingStop will refuse to disconnect
from the controller unless the driver instance is in EfiSimpleNetworkStopped.
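
The source-state validation described above can be modeled as a small
transition table. The following is only a sketch, not the driver's code: the
MODEL_* names and ModelTransition are hypothetical, parameter validation is
omitted, and the real functions return EFI_STATUS values rather than a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the four driver instance states. */
typedef enum {
  StateNonexistent,  /* unnamed top state: no driver instance exists */
  StateStopped,      /* EfiSimpleNetworkStopped */
  StateStarted,      /* EfiSimpleNetworkStarted */
  StateInitialized   /* EfiSimpleNetworkInitialized */
} MODEL_STATE;

/* One operation per transition arrow in the diagram. */
typedef enum {
  OpBindingStart,
  OpBindingStop,
  OpStart,
  OpStop,
  OpInitialize,
  OpShutdown
} MODEL_OP;

/*
 * Apply Op to *State. The state changes, and true is returned, only if
 * *State is the expected source state of Op; otherwise nothing changes.
 */
bool
ModelTransition (MODEL_STATE *State, MODEL_OP Op)
{
  static const struct {
    MODEL_STATE From;
    MODEL_STATE To;
  } Table[] = {
    [OpBindingStart] = { StateNonexistent, StateStopped     },
    [OpBindingStop]  = { StateStopped,     StateNonexistent },
    [OpStart]        = { StateStopped,     StateStarted     },
    [OpStop]         = { StateStarted,     StateStopped     },
    [OpInitialize]   = { StateStarted,     StateInitialized },
    [OpShutdown]     = { StateInitialized, StateStarted     }
  };

  assert (State != NULL);
  if (*State != Table[Op].From) {
    return false;  /* wrong source state: refuse the transition */
  }
  *State = Table[Op].To;
  return true;
}
```

In this model, OpBindingStop is refused while the instance is in
StateInitialized, which mirrors the VirtioNetDriverBindingStop example given
in the text.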


Driver instance states (Simple Network Protocol)
------------------------------------------------

In the EfiSimpleNetworkStopped state, the virtio-net device has been reset. No
resources are allocated for networking / traffic purposes. The MAC address and
other device attributes have been retrieved from the device (this is necessary
for completing the VirtioNetDriverBindingStart transition).

For virtio-net, the EfiSimpleNetworkStarted state is identical to the
EfiSimpleNetworkStopped state in both the functional and the resource-usage
sense. This state is mandated / provided by the Simple Network Protocol for
flexibility that the virtio-net driver doesn't exploit.

In particular, the EfiSimpleNetworkStarted state is the target of the Shutdown
SNP member function, and must therefore correspond to a hardware configuration
where "[it] is safe for another driver to initialize". (Clearly another UEFI
driver could not do that due to the exclusivity of the driver binding that
VirtioNetDriverBindingStart() installs, but a later OS driver might qualify.)

The EfiSimpleNetworkInitialized state is the live state of the virtio NIC / the
driver instance. Virtio and other resources required for network traffic have
been allocated, and the following SNP member functions are available (in
addition to VirtioNetShutdown, which leaves the state):

- VirtioNetReceive [SnpReceive.c]: poll the virtio NIC for an Rx packet that
  may have arrived asynchronously;

- VirtioNetTransmit [SnpTransmit.c]: queue a Tx packet for asynchronous
  transmission (meant to be used together with VirtioNetGetStatus);

- VirtioNetGetStatus [SnpGetStatus.c]: query link status and status of pending
  Tx packets;

- VirtioNetMcastIpToMac [SnpMcastIpToMac.c]: transform a multicast IPv4/IPv6
  address into a multicast MAC address;

- VirtioNetReceiveFilters [SnpReceiveFilters.c]: emulate unicast / multicast /
  broadcast filter configuration only, not the actual filtering (the UEFI
  specification allows a more liberal filter setting than requested).
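
To illustrate the kind of transformation VirtioNetMcastIpToMac performs, here
is a sketch of the standard multicast mappings (RFC 1112 for IPv4, RFC 2464 for
IPv6). It is an assumption that the driver follows exactly these mappings; the
function names McastIp4ToMac and McastIp6ToMac are hypothetical, and the real
member function takes UEFI address types and validates its input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * IPv4 (RFC 1112): fixed prefix 01:00:5E, followed by the low 23 bits of the
 * multicast group address.
 */
void
McastIp4ToMac (const uint8_t Ip[4], uint8_t Mac[6])
{
  assert (Ip != NULL && Mac != NULL);
  Mac[0] = 0x01;
  Mac[1] = 0x00;
  Mac[2] = 0x5E;
  Mac[3] = Ip[1] & 0x7F;  /* drop the top bit: only 23 address bits are kept */
  Mac[4] = Ip[2];
  Mac[5] = Ip[3];
}

/*
 * IPv6 (RFC 2464): fixed prefix 33:33, followed by the last 32 bits of the
 * multicast address.
 */
void
McastIp6ToMac (const uint8_t Ip[16], uint8_t Mac[6])
{
  assert (Ip != NULL && Mac != NULL);
  Mac[0] = 0x33;
  Mac[1] = 0x33;
  Mac[2] = Ip[12];
  Mac[3] = Ip[13];
  Mac[4] = Ip[14];
  Mac[5] = Ip[15];
}
```

For example, under these mappings the IPv4 group 224.0.0.251 corresponds to
the MAC address 01:00:5E:00:00:FB.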

The following SNP member functions are not supported [SnpUnsupported.c]:

- VirtioNetReset: reinitialize the virtio NIC without shutting it down (a loop
  from/to EfiSimpleNetworkInitialized);

- VirtioNetStationAddress: assign a new MAC address to the virtio NIC;

- VirtioNetStatistics: collect statistics;

- VirtioNetNvData: access non-volatile data on the virtio NIC.

Missing support for these functions is allowed by the UEFI specification and
doesn't seem to trip up higher level protocols.


Events and task priority levels
-------------------------------

The UEFI specification defines a sophisticated mechanism for asynchronous
events / callbacks (see "6.1 Event, Timer, and Task Priority Services" for
details). Such callbacks work like software interrupts, and some notion of
locking / masking is needed to implement critical sections (atomic or
exclusive access to data or a device). Task Priority Levels (TPLs) provide
this notion.

The virtio-net driver for OVMF must concern itself with events for two reasons:

- The Simple Network Protocol provides its clients with a (non-optional) WAIT
  type event called WaitForPacket: it allows them to check or wait for Rx
  packets by polling or blocking on this event. (This functionality overlaps
  with the Receive member function.) The event is available to clients starting
  with EfiSimpleNetworkStopped (inclusive).

  The virtio-net driver is informed about such client polling or blocking by
  receiving an asynchronous callback (a software interrupt). In the callback
  function the driver must interrogate the driver instance state, and if it is
  EfiSimpleNetworkInitialized, access the Rx queue and see if any packets are
  available for consumption. If so, it must signal the WaitForPacket WAIT type
  event, waking the client.

  For simplicity and safety, all parts of the virtio-net driver that access any
  bit of the driver instance (data or device) run at the TPL_CALLBACK level.
  This is the highest level allowed for an SNP implementation, and all code
  protected in this manner satisfies even stricter non-blocking requirements
  than what's documented for TPL_CALLBACK.

  The driver also sets the task priority level of the WaitForPacket callback;
  the choice is TPL_CALLBACK again. This in effect serializes the WaitForPacket
  callback (VirtioNetIsPacketAvailable [Events.c]) with "normal" parts of the
  driver.

- According to the Driver Writer's Guide, a network driver should install a
  callback function for the global EXIT_BOOT_SERVICES event (a special NOTIFY
  type event). When the ExitBootServices() boot service has cleaned up internal
  firmware state and is about to pass control to the OS, EXIT_BOOT_SERVICES is
  signaled, and every network driver must abort its in-flight DMA transfers,
  lest they corrupt OS memory.

  This callback (VirtioNetExitBoot) is synchronized with the rest of the driver
  code just the same as explained for WaitForPacket. In
  EfiSimpleNetworkInitialized state it resets the virtio NIC, halting all data
  transfer. After the callback returns, no further driver code is expected to
  be scheduled.
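
The decision logic of the two callbacks can be sketched as a toy model. The
structure MODEL_DEV, its fields, and the Model* function names are hypothetical
stand-ins. The sketch deliberately leaves out the UEFI event and TPL services,
so it shows only what each callback decides, not how it is registered or
serialized at TPL_CALLBACK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
  ModelStopped,      /* EfiSimpleNetworkStopped */
  ModelStarted,      /* EfiSimpleNetworkStarted */
  ModelInitialized   /* EfiSimpleNetworkInitialized */
} MODEL_SNP_STATE;

/* Hypothetical slice of the driver instance, enough for the two callbacks. */
typedef struct {
  MODEL_SNP_STATE State;
  uint16_t        RxUsedIdx;              /* Used Index incremented by host */
  uint16_t        RxLastUsedIdx;          /* last processed by the guest */
  bool            WaitForPacketSignaled;  /* stands in for signaling the event */
  bool            DeviceReset;            /* stands in for resetting the NIC */
} MODEL_DEV;

/* Models the WaitForPacket callback: signal only if an Rx packet is pending. */
void
ModelIsPacketAvailable (MODEL_DEV *Dev)
{
  assert (Dev != NULL);
  if (Dev->State != ModelInitialized) {
    return;  /* no Rx queue exists in the Stopped and Started states */
  }
  if (Dev->RxUsedIdx != Dev->RxLastUsedIdx) {
    Dev->WaitForPacketSignaled = true;
  }
}

/* Models the EXIT_BOOT_SERVICES callback: reset a live NIC to halt DMA. */
void
ModelExitBoot (MODEL_DEV *Dev)
{
  assert (Dev != NULL);
  if (Dev->State == ModelInitialized) {
    Dev->DeviceReset = true;
  }
}
```

Comparing the two 16-bit indices for inequality keeps working across index
wraparound, which is why the model does not use a "greater than" test.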


Virtio internals -- Rx
----------------------

Requests (Rx and Tx alike) are always submitted by the guest and processed by
the host. For Tx, processing means transmission. For Rx, processing means
filling in the request with an incoming packet. Submitted requests exist on the
"Available Ring", and answered (processed) requests show up on the "Used Ring".

Packet data includes the media (Ethernet) header: destination MAC, source MAC,
and Ethertype (14 bytes total).

The following structures implement packet reception. Most of them are defined
in the Virtio specification; the only driver-specific trait here is the static
pre-configuration of the two-part descriptor chains, in VirtioNetInitRx. The
diagram is simplified.

                    Available Index Available Index
                    last processed    incremented
                      by the host    by the guest
                           v   ------->    v
Available  +-------+-------+-------+-------+-------+
Ring       |DescIdx|DescIdx|DescIdx|DescIdx|DescIdx|
           +-------+-------+-------+-------+-------+
                              =D6     =D2

              D2         D3          D4         D5          D6         D7
Descr.   +----------+----------++----------+----------++----------+----------+
Table    |Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx|
         +----------+----------++----------+----------++----------+----------+
          =A2    =D3 =A3         =A4    =D5 =A5         =A6    =D7 =A7


              A2       A3     A4       A5     A6       A7
Receive      +---------------+---------------+---------------+
Destination  |vnet hdr:packet|vnet hdr:packet|vnet hdr:packet|
Area         +---------------+---------------+---------------+

                       Used Index           Used Index incremented
      last processed by the guest           by the host
                                v ------->  v
Used    +-----------+-----------+-----------+-----------+-----------+
Ring    |DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len|
        +-----------+-----------+-----------+-----------+-----------+
                                     =D4

In VirtioNetInitRx, the guest allocates the fixed size Receive Destination
Area, which accommodates all packets delivered asynchronously by the host. A
slice of this area is dedicated to each packet; each slice is further
subdivided into virtio-net request header and network packet data. The
(device-physical) addresses of these sub-slices are denoted with A2, A3, A4 and
so on. Importantly, an even-subscript "A" always belongs to a virtio-net
request header, while an odd-subscript "A" always belongs to a packet
sub-slice.

Furthermore, the guest lays out a static pattern in the Descriptor Table. For
each packet that can be in-flight or already arrived from the host,
VirtioNetInitRx sets up a separate, two-part descriptor chain. For packet N,
the Nth descriptor chain is set up as follows:

- the first (=head) descriptor, with even index, points to the fixed-size
  sub-slice receiving the virtio-net request header,

- the second descriptor (with odd index) points to the fixed (1514 byte) size
  sub-slice receiving the packet data,

- a link from the first (head) descriptor in the chain is established to the
  second (tail) descriptor in the chain.

Finally, the guest populates the Available Ring with the indices of the head
descriptors. All descriptor indices on both the Available Ring and the Used
Ring are even.
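
The static layout described above can be sketched as follows. This is a
simplified model and not the code of VirtioNetInitRx: the MODEL_* names are
hypothetical, the number of chains is fixed arbitrarily at four, and the
10-byte request header size is an assumption (the basic virtio-net header of
the 0.9.5 draft without mergeable Rx buffers). The flag values follow the
Virtio specification's NEXT (1) and WRITE (2) descriptor flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_RX_PACKETS    4u     /* hypothetical number of descriptor chains */
#define MODEL_HDR_SIZE      10u    /* assumed virtio-net request header size */
#define MODEL_PKT_SIZE      1514u  /* packet sub-slice size named in the text */
#define MODEL_DESC_F_NEXT   1u     /* chain continues at the Next index */
#define MODEL_DESC_F_WRITE  2u     /* the host writes into this buffer */

typedef struct {
  uint64_t Addr;   /* device-physical address of the sub-slice ("Adr") */
  uint32_t Len;    /* size of the sub-slice in bytes ("Len") */
  uint16_t Flags;
  uint16_t Next;   /* index of the next descriptor in the chain ("Nx") */
} MODEL_DESC;

typedef struct {
  MODEL_DESC Desc[2 * MODEL_RX_PACKETS];
  uint16_t   Avail[2 * MODEL_RX_PACKETS];
  uint16_t   AvailIdx;  /* incremented by the guest */
} MODEL_RX_RING;

/* Lay out one two-part chain per packet and make every chain available. */
void
ModelInitRx (MODEL_RX_RING *Ring, uint64_t AreaBase)
{
  uint16_t N;

  assert (Ring != NULL);
  Ring->AvailIdx = 0;
  for (N = 0; N < MODEL_RX_PACKETS; N++) {
    uint16_t Head  = (uint16_t)(2 * N);     /* even index: request header */
    uint16_t Tail  = (uint16_t)(Head + 1);  /* odd index: packet data */
    uint64_t Slice = AreaBase +
                     (uint64_t)N * (MODEL_HDR_SIZE + MODEL_PKT_SIZE);

    Ring->Desc[Head].Addr  = Slice;
    Ring->Desc[Head].Len   = MODEL_HDR_SIZE;
    Ring->Desc[Head].Flags = MODEL_DESC_F_NEXT | MODEL_DESC_F_WRITE;
    Ring->Desc[Head].Next  = Tail;

    Ring->Desc[Tail].Addr  = Slice + MODEL_HDR_SIZE;
    Ring->Desc[Tail].Len   = MODEL_PKT_SIZE;
    Ring->Desc[Tail].Flags = MODEL_DESC_F_WRITE;
    Ring->Desc[Tail].Next  = 0;

    /* Only head indices ever appear on the rings, so they are all even. */
    Ring->Avail[Ring->AvailIdx++] = Head;
  }
}
```

With these assumed sizes, each slice is 10 + 1514 = 1524 bytes, so for packet
N the header sub-slice starts at AreaBase + 1524 * N and the packet sub-slice
10 bytes after it.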

Packet reception occurs as follows:

- The host consumes a descriptor index off the Available Ring. This index is
  even (=2*N), and identifies the head descriptor of the chain belonging to
  packet N.

- The host reads the descriptors D(2*N) and, following the Next link there,
  D(2*N+1), and stores the virtio-net request header at A(2*N), and the
  packet data at A(2*N+1).

- The host places the index of the head descriptor, 2*N, onto the Used Ring,
  and sets the Len field in the same Used Ring Element to the total number of
  bytes transferred for the entire descriptor chain. This enables the guest to
  identify the length of Rx packets.

- VirtioNetReceive polls the Used Ring. If a new Used Ring Element shows up, it
  copies the data out to the caller, and recycles the index of the head
  descriptor (i.e. 2*N) to the Available Ring.

- Because the host can theoretically process (answer) Rx requests in any order,
  the order of head descriptor indices on each of the Available Ring and the
  Used Ring is effectively arbitrary. (Except right after the initial
  population in VirtioNetInitRx, when the Available Ring is full and
  increasing, and the Used Ring is empty.)

- If the Available Ring is empty, the host is forced to drop packets. If the
  Used Ring is empty, VirtioNetReceive returns EFI_NOT_READY (no packet
  available).
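
The guest side of the steps above can be sketched as a polling function over a
simplified queue. This is an illustrative model, not VirtioNetReceive itself:
the MODEL_* names, the queue size of 8, and the 10-byte header size are
assumptions, and memory barriers, device notification, and Ethernet header
parsing are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_QUEUE_SIZE  8u     /* hypothetical ring size */
#define MODEL_HDR_SIZE    10u    /* assumed virtio-net request header size */
#define MODEL_PKT_SIZE    1514u
#define MODEL_SLICE_SIZE  (MODEL_HDR_SIZE + MODEL_PKT_SIZE)

typedef enum {
  ModelSuccess,
  ModelNotReady,        /* stands in for EFI_NOT_READY */
  ModelBufferTooSmall,
  ModelDeviceError
} MODEL_STATUS;

typedef struct {
  uint32_t Id;   /* index of the head descriptor of the answered chain */
  uint32_t Len;  /* total bytes written by the host for the chain */
} MODEL_USED_ELEM;

typedef struct {
  uint8_t         Area[(MODEL_QUEUE_SIZE / 2) * MODEL_SLICE_SIZE];
  MODEL_USED_ELEM Used[MODEL_QUEUE_SIZE];
  uint16_t        UsedIdx;      /* incremented by the host */
  uint16_t        LastUsedIdx;  /* last processed by the guest */
  uint16_t        Avail[MODEL_QUEUE_SIZE];
  uint16_t        AvailIdx;     /* incremented by the guest */
} MODEL_RXQ;

/* Poll the Used Ring once; on success *BufSize receives the packet length. */
MODEL_STATUS
ModelReceive (MODEL_RXQ *Q, uint8_t *Buf, size_t *BufSize)
{
  const MODEL_USED_ELEM *Elem;
  size_t                PktLen;

  assert (Q != NULL && Buf != NULL && BufSize != NULL);
  if (Q->LastUsedIdx == Q->UsedIdx) {
    return ModelNotReady;  /* Used Ring is empty */
  }

  Elem = &Q->Used[Q->LastUsedIdx % MODEL_QUEUE_SIZE];
  if (Elem->Id % 2 != 0 || Elem->Id >= MODEL_QUEUE_SIZE ||
      Elem->Len < MODEL_HDR_SIZE || Elem->Len > MODEL_SLICE_SIZE) {
    return ModelDeviceError;  /* not a head index, or an impossible length */
  }

  /* Len covers the whole chain, so subtract the request header. */
  PktLen = Elem->Len - MODEL_HDR_SIZE;
  if (PktLen > *BufSize) {
    *BufSize = PktLen;
    return ModelBufferTooSmall;
  }
  memcpy (Buf,
          &Q->Area[(Elem->Id / 2) * MODEL_SLICE_SIZE + MODEL_HDR_SIZE],
          PktLen);
  *BufSize = PktLen;

  /* Consume the Used Ring Element and recycle the head to the Available Ring. */
  Q->LastUsedIdx++;
  Q->Avail[Q->AvailIdx % MODEL_QUEUE_SIZE] = (uint16_t)Elem->Id;
  Q->AvailIdx++;
  return ModelSuccess;
}
```

As a worked example under the assumed header size: a Used Ring Element with
Len = 70 describes a packet of 70 - 10 = 60 bytes.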


Virtio internals -- Tx
----------------------

The transmission structure erected by VirtioNetInitTx is similar; it differs
as follows:

- There is no Receive Destination Area.

- Each head descriptor, D(2*N), points to a read-only virtio-net request header
  that is shared by all of the head descriptors. This virtio-net request header
  is never modified by the host.

- Each tail descriptor is re-pointed to the device-mapped address of the
  caller-supplied packet buffer whenever VirtioNetTransmit places the
  corresponding head descriptor on the Available Ring. A reverse mapping, from
  the device-mapped address to the caller-supplied packet address, is saved in
  an associative data structure that belongs to the driver instance.

- Per spec, the caller must retain the unmodified packet buffer until it is
  reported transmitted by VirtioNetGetStatus.

Steps of packet transmission:

- Client code calls VirtioNetTransmit. VirtioNetTransmit tracks free descriptor
  chains by keeping the indices of their head descriptors in a stack that is
  private to the driver instance. All elements of the stack are even.

  - If the stack is empty (that is, each descriptor chain, in isolation, is
    either pending transmission, or has been processed by the host but not
    yet recycled by a VirtioNetGetStatus call), then VirtioNetTransmit returns
    EFI_NOT_READY.

  - Otherwise the index of a free chain's head descriptor is popped from the
    stack. The linked tail descriptor is re-pointed as discussed above. The
    head descriptor's index is pushed on the Available Ring.

- The host moves the head descriptor index from the Available Ring to the Used
  Ring when it transmits the packet.

- Client code calls VirtioNetGetStatus. In case the Used Ring is empty, the
  function reports no Tx completion. Otherwise, a head descriptor's index is
  consumed from the Used Ring and recycled to the private stack. The client
  code's original packet buffer address is calculated by fetching the
  device-mapped address from the tail descriptor (where it has been stored at
  VirtioNetTransmit time), and by looking up the device-mapped address in the
  associative data structure. The reverse-mapped packet buffer address is
  returned to the caller.

- The Len field of the Used Ring Element is not checked. The host is assumed to
  have transmitted the entire packet, because VirtioNetTransmit has limited its
  size to at most 1514 bytes. The Virtio specification suggests this packet
  size is always accepted (and a lower MTU could be encountered on any later
  hop as well). Additionally, there's no good way to report a short transmit
  via VirtioNetGetStatus; judging by the specification, EFI_DEVICE_ERROR seems
  too serious, and higher level protocols could interpret it as a fatal
  condition.

- The host can theoretically reorder head descriptor indices when moving them
  from the Available Ring to the Used Ring (out of order transmission). Because
  of this (and the choice of a stack over a list for free descriptor chain
  tracking) the order of head descriptor indices on either Ring is
  unpredictable.
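
The bookkeeping in the steps above (free-chain stack, re-pointed tail
descriptors, reverse mapping) can be sketched with a small model. All MODEL_*
names are hypothetical, the sizes are arbitrary, and a linear array stands in
for the driver's associative data structure purely for brevity; the host side
is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_TX_CHAINS   4u  /* hypothetical number of descriptor chains */
#define MODEL_QUEUE_SIZE  8u  /* two descriptors per chain */

typedef enum { ModelSuccess, ModelNotReady } MODEL_STATUS;

typedef struct {
  uint16_t Stack[MODEL_TX_CHAINS];     /* head indices of free chains */
  uint16_t StackTop;                   /* number of entries on the stack */
  uint64_t TailAddr[MODEL_TX_CHAINS];  /* Addr field of each tail descriptor */
  struct {
    bool     InUse;
    uint64_t DeviceAddr;
    void     *HostBuf;
  } Map[MODEL_TX_CHAINS];              /* stand-in for the reverse mapping */
  uint16_t Avail[MODEL_QUEUE_SIZE];
  uint16_t AvailIdx;                   /* incremented by the guest */
  uint16_t Used[MODEL_QUEUE_SIZE];     /* head indices only; Len is ignored */
  uint16_t UsedIdx;                    /* incremented by the host */
  uint16_t LastUsedIdx;                /* last processed by the guest */
} MODEL_TXQ;

/* Start with every chain free; all stacked head indices are even. */
void
ModelTxInit (MODEL_TXQ *Q)
{
  uint16_t N;

  assert (Q != NULL);
  memset (Q, 0, sizeof *Q);
  for (N = 0; N < MODEL_TX_CHAINS; N++) {
    Q->Stack[Q->StackTop++] = (uint16_t)(2 * N);
  }
}

/* Queue one packet; DeviceAddr is the device-mapped address of HostBuf. */
MODEL_STATUS
ModelTransmit (MODEL_TXQ *Q, void *HostBuf, uint64_t DeviceAddr)
{
  uint16_t Head;
  size_t   I;

  assert (Q != NULL);
  if (Q->StackTop == 0) {
    return ModelNotReady;  /* every chain is pending or not yet recycled */
  }
  Head = Q->Stack[--Q->StackTop];
  Q->TailAddr[Head / 2] = DeviceAddr;  /* re-point the tail descriptor */
  for (I = 0; I < MODEL_TX_CHAINS; I++) {
    if (!Q->Map[I].InUse) {            /* remember DeviceAddr -> HostBuf */
      Q->Map[I].InUse      = true;
      Q->Map[I].DeviceAddr = DeviceAddr;
      Q->Map[I].HostBuf    = HostBuf;
      break;
    }
  }
  Q->Avail[Q->AvailIdx++ % MODEL_QUEUE_SIZE] = Head;
  return ModelSuccess;
}

/* Report one Tx completion, if any; *TxBuf receives the caller's buffer. */
bool
ModelGetStatus (MODEL_TXQ *Q, void **TxBuf)
{
  uint16_t Head;
  uint64_t DeviceAddr;
  size_t   I;

  assert (Q != NULL && TxBuf != NULL);
  *TxBuf = NULL;
  if (Q->LastUsedIdx == Q->UsedIdx) {
    return false;  /* Used Ring is empty: no Tx completion */
  }
  Head       = Q->Used[Q->LastUsedIdx++ % MODEL_QUEUE_SIZE];
  DeviceAddr = Q->TailAddr[Head / 2];
  for (I = 0; I < MODEL_TX_CHAINS; I++) {
    if (Q->Map[I].InUse && Q->Map[I].DeviceAddr == DeviceAddr) {
      *TxBuf          = Q->Map[I].HostBuf;
      Q->Map[I].InUse = false;
      break;
    }
  }
  Q->Stack[Q->StackTop++] = Head;  /* recycle the chain */
  return true;
}
```

Because completions are looked up by device-mapped address rather than by
position, the model returns the right caller buffer even if the host completes
the chains out of order.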