QUIC I/O Architecture
=====================

This document discusses possible implementation options for the I/O architecture
internal to the libssl QUIC implementation, discusses the underlying design
constraints driving this decision and introduces the resulting I/O architecture.
It also identifies potential hazards to existing applications, and identifies
how those hazards are mitigated.

Objectives
----------

The [requirements for QUIC](./quic-requirements.md) which have formed the basis
for implementation include the following requirements:

- The application must have the ability to be in control of the event loop
  without requiring callbacks to process the various events. An application must
  also have the ability to operate in “blocking” mode.

- High performance applications (primarily server based) using existing libssl
  APIs; using custom network interaction BIOs in order to get the best
  performance at a network level as well as OS interactions (IO handling, thread
  handling, using fibres). Would prefer to use the existing APIs - they don’t
  want to throw away what they’ve got. Where QUIC necessitates a change they
  would be willing to make minor changes.

As such, there are several objectives for the I/O architecture of the QUIC
implementation:

  - We want to support both blocking and non-blocking semantics for application
    use of the libssl APIs.

  - In the case of non-blocking applications, it must be possible for an
    application to do its own polling and run its own event loop.

  - We want to support custom BIOs on the network side and, to the extent
    feasible, minimise the level of adaptation needed for any custom BIOs
    already in use on the network side. More generally, the integrity of the
    BIO abstraction layer should be preserved.

QUIC-Related Requirements
-------------------------

Note that implementation of QUIC will require that the underlying network BIO
passed to the QUIC implementation be configured to support datagram semantics
instead of bytestream semantics as has been the case with traditional TLS over
TCP. This will require applications using custom BIOs on the network side to
make substantial changes to the implementation of those custom BIOs to model
datagram semantics. These changes are not minor, but there is no way around this
requirement.

It should also be noted that implementation of QUIC requires handling of timer
events as well as the circumstances where a network socket becomes readable or
writable. In many cases we need to handle these events simultaneously (e.g. wait
until a socket becomes readable, or writable, or a timeout expires, whichever
comes first).

Note that the discussion in this document primarily concerns usage of blocking
vs. non-blocking I/O in the interface between the QUIC implementation and an
underlying BIO provided to the QUIC implementation to give it access to the
network. This is independent of, and orthogonal to, the application interface to
libssl, which will support both blocking and non-blocking I/O.

Blocking vs. Non-Blocking Modes in Underlying Network BIOs
----------------------------------------------------------

The above constraints make it effectively a requirement that non-blocking I/O
be used for the calls to the underlying network BIOs. To illustrate this point,
we first consider how QUIC might be implemented using blocking network I/O
internally.

To function correctly and provide blocking semantics at the application level,
our QUIC implementation must be able to block in such a way that it can respond
immediately to any of the following events for the underlying network read and
write BIOs:

- The underlying network write BIO becomes writeable;
- The underlying network read BIO becomes readable;
- A timeout expires.

### Blocking sockets and select(3)

Firstly, consider how this might be accomplished using the Berkeley sockets
API. Blocking on all three wakeup conditions listed above would require use of
an API such as select(3) or poll(3), regardless of whether the network socket
is configured in blocking mode or not.

While in principle APIs such as select(3) can be used with a socket in blocking
mode, this is not an advisable usage mode. If a socket is in blocking mode,
calls to send(3) or recv(3) may block for an arbitrary period of time, meaning
that our QUIC implementation cannot handle incoming data (if we are blocked on
send), send outgoing data (if we are blocked on receive), or handle timeout
events.

Though it can be argued that a select(3) call indicating readability or
writeability should guarantee that a subsequent send(3) or recv(3) call will
not block, there are several reasons why this is an extremely undesirable
solution:

- It is quite likely that there are buggy OSes out there which perform spurious
  wakeups from select(3).

- The fact that a socket is writeable does not necessarily mean that a datagram
  of the size we wish to send can be written without blocking, so a send(3)
  call could block anyway.

- This usage pattern precludes multithreaded use, barring some locking scheme,
  due to the possibility of other threads racing between the call to select(3)
  and the subsequent I/O call. This undermines our intention to support
  multi-threaded network I/O on the backend.

Moreover, our QUIC implementation will not drive the Berkeley sockets API
directly but uses the BIO abstraction to access the network, so these issues
are then compounded by the limitations of our existing BIO interfaces. We do
not have a BIO interface which provides for select(3)-like functionality or
which can implement the required semantics above.

Further, even if we used select(3) directly, select(3) only gives us a
guarantee (under a non-buggy OS) that a single syscall will not block; however,
we have no guarantee in the API contract of BIO_read(3) or BIO_write(3) that
any given BIO implementation maps such a call to only a single system call (or
to any system call at all), so this does not work either. Therefore, trying to
implement QUIC on top of blocking I/O in this way would require violating the
BIO abstraction layer, and would not work with custom BIOs (even if the poll
descriptor concept discussed below were adopted).

### Blocking sockets and threads

Another conceptual possibility is that blocking calls could be kept ongoing in
parallel threads. Under this model, there would be three threads:

- a thread which exists solely to execute blocking calls to the `BIO_write` of
  an underlying network BIO,
- a thread which exists solely to execute blocking calls to the `BIO_read` of
  an underlying network BIO,
- a thread which exists solely to wait for and dispatch timeout events.

This could potentially be reduced to two threads if it is assumed that
`BIO_write` calls do not take an excessive amount of time.

The premise here is that the front-end I/O API (`SSL_read`, `SSL_write`, etc.)
would coordinate and synchronise with these background worker threads via
threading primitives such as condition variables.

This has a large number of disadvantages:

- There is a hard requirement for threading functionality in order to be able
  to support blocking semantics at the application level. Applications which
  require blocking semantics would only be able to function in thread assisted
  mode. In environments where threading support is not available or desired,
  our APIs would only be usable in a non-blocking fashion.

- Several threads are spawned which the application is not in control of. This
  undermines our general approach of providing the application with control
  over OpenSSL's use of resources, such as allowing the application to do its
  own polling or provide its own allocators.

  At a minimum for a client, there must be two threads per connection. This
  means that if an application opens many outgoing connections, `2n` extra
  threads will need to be spawned.

- Blocking in `BIO_write` calls precludes correct implementation of QUIC.
  Unlike any analogue in TLS, QUIC packets are time sensitive and intended to
  be transmitted as soon as they are generated. QUIC packets contain fields
  such as the ACK Delay value, which is intended to describe the time between
  a packet being received and a return packet being generated. Correct
  calculation of this field is necessary for correct calculation of a
  connection's RTT. It is therefore important to only generate packets when
  they are ready to be sent; otherwise, suboptimal performance will result.
  This is a usage model which aligns optimally with non-blocking I/O and which
  cannot be accommodated by blocking I/O.

- Since existing custom BIOs will not be expecting concurrent `BIO_read` and
  `BIO_write` calls, they will need to be adapted to support this, which is
  likely to require substantial rework of those custom BIOs (trivial locking
  of calls obviously does not work, since both of these calls must be able to
  block on network I/O simultaneously).

Moreover, this does not appear to be a realistically implementable approach:

- The question is posed of how to handle connection teardown, which does not
  seem to be solvable. If parallel threads are blocked in blocking `BIO_read`
  and `BIO_write` calls on some underlying network BIO, there needs to be some
  way to force these calls to return once `SSL_free` is called and we need to
  tear down the connection. However, the BIO interface does not provide any
  way to do this. *At best* we might assume the BIO is a `BIO_s_dgram` (but
  cannot assume this in the general case), but even then we can only
  accomplish teardown by violating the BIO abstraction and closing the
  underlying socket.

  This is the only portable way to ensure that a recv(3) call on the same
  socket returns. This obviously is a highly application-visible change (and
  is likely to be far more disruptive than configuring the socket into
  non-blocking mode).

  Moreover, it is not workable anyway, because it only works for a
  socket-based BIO and violates the BIO abstraction. For BIOs in general,
  there does not appear to be any viable solution to the teardown issue.

Even if this approach were successfully implemented, applications would still
need to change to using network BIOs with datagram semantics. For applications
using custom BIOs, this is likely to require substantial rework of those BIOs.
There is no possible way around this.
Thus, even if this solution were adopted (notwithstanding the issues which
preclude this noted above) for the purposes of accommodating applications using
custom network BIOs in a blocking mode, those applications would still have to
completely rework their implementation of those BIOs. In any case, it is
expected to be comparatively rare that sophisticated applications implementing
their own custom BIOs will do so in a blocking mode.

### Use of non-blocking I/O

By comparison, the use of non-blocking I/O and select(3) or similar APIs on the
network side makes satisfying our requirements for QUIC easy, and also allows
our internal approach to I/O to be flexibly adapted in the future as
requirements may evolve.

This is also the approach used by all other known QUIC implementations; it is
highly unlikely that any QUIC implementation exists which uses blocking network
I/O, as (as mentioned above) it would lead to suboptimal performance due to the
ACK Delay issue.

Note that this is orthogonal to whether we provide blocking I/O semantics to
the application. We can use non-blocking I/O internally while using this to
provide either blocking or non-blocking semantics to the application, based on
what the application requests.

This approach in general requires that a network socket be configured in
non-blocking mode. Though some OSes support a `MSG_DONTWAIT` flag which allows
a single I/O operation to be made non-blocking, not all OSes support this (e.g.
Windows), so it cannot be relied on. As such, we need to configure any socket
FD we use into non-blocking mode.

Of the approaches outlined in this document, the use of non-blocking I/O has
the fewest disadvantages and is the only approach which appears to actually be
implementable in practice.
Moreover, most of the disadvantages can be readily mitigated:

  - We rely on having a select(3) or poll(3) like function available from the
    OS.

    However:

      - Firstly, we already rely on select(3) in our code, at least in
        non-`no-sock` builds, so this does not appear to raise any portability
        issues;

      - Secondly, we have the option of providing a custom poller interface
        which allows an application to provide its own implementation of a
        select(3)-like function. In fact, this has the potential to be quite
        powerful and would allow the application to implement its own pollable
        BIOs, and therefore perform blocking I/O on top of any custom BIO.

        For example, while historically none of our own memory-based BIOs have
        supported blocking semantics, a sophisticated application could if it
        wished choose to implement a custom blocking memory BIO and implement
        a custom poller which synchronises using a custom poll descriptor
        based around condition variables rather than sockets. Thus this scheme
        is highly flexible.

        (It is worth noting also that the implementation of blocking semantics
        at the application level does not rely on any privileged access to the
        internals of the QUIC implementation, and an application could if it
        wished build blocking semantics out of a non-blocking QUIC instance;
        this is not particularly difficult, though the provision of custom
        pollers should mean there is no need for an application to do so.)

  - Configuring a socket into non-blocking mode might confuse an application.

    However:

      - Applications will already have to make changes to any network-side
        BIOs, for example switching from a `BIO_s_socket` to a `BIO_s_dgram`,
        or from a BIO pair to a `BIO_s_dgram_pair`. Custom BIOs will need to
        be substantially reworked to switch from bytestream semantics to
        datagram semantics.
        Such applications will already need substantial changes, and this is
        unavoidable.

        Of course, application impacts and migration guidance can (and will)
        all be documented.

      - In order for an application to be confused by us putting a socket into
        non-blocking mode, it would need to be trying to use the socket in
        some way. But it is not possible for an application to pass a socket
        to our QUIC implementation, also try to use the socket directly, and
        have QUIC still work. Using QUIC necessarily requires that an
        application not also be trying to make use of the same socket.

      - There are some circumstances where an application might want to
        multiplex other protocols onto the same UDP socket, for example with
        protocols like RTP/RTCP or STUN; this can be facilitated using the
        QUIC fixed bit. However, these use cases cannot be supported without
        explicit assistance from a QUIC implementation, and this use case
        cannot be facilitated by simply sharing a network socket, as incoming
        datagrams would not be routed correctly. (We may offer some
        functionality in the future to allow this to be coordinated, but this
        is not for MVP.) Thus this also is not a concern. Moreover, it is
        extremely unlikely that any such applications are using sockets in
        blocking mode anyway.

  - The poll descriptor interface adds complexity to the BIO interface.

Advantages:

  - An application retains full control of its event loop in non-blocking
    mode.

    When using libssl in application-level blocking mode via a custom poller
    interface, the application would actually be able to exercise more control
    over I/O than is currently possible when using libssl in blocking mode.

  - Feasible to implement and already working in tests. Minimises further
    development needed to ship.

  - Does not rely on creating threads and can support blocking I/O at the
    application level without relying on thread assisted mode.

  - Does not require an application-provided network-side custom BIO to be
    reworked to support concurrent calls to it.

  - The poll descriptor interface will allow applications to implement custom
    modes of polling in the future (e.g. an application could even build
    blocking application-level I/O on top of a custom memory-based BIO using
    condition variables, if it wished). This is actually more flexible than
    the current TLS stack, which cannot be used in blocking mode when used
    with a memory-based BIO.

  - Allows a performance-optimal implementation of QUIC RFC requirements.

  - Ensures our internal I/O architecture remains flexible for future
    evolution without breaking compatibility.

Use of Internal Non-Blocking I/O
--------------------------------

Based on the above evaluation, implementation has been undertaken using
non-blocking I/O internally. Applications can use blocking or non-blocking I/O
at the libssl API level. Network-level BIOs must operate in a non-blocking mode
or be configurable by QUIC to this end.

![Block Diagram](images/quic-io-arch-1.png "Block Diagram")

### Support of arbitrary BIOs

We need to support not just socket FDs but arbitrary BIOs as the basis for the
use of QUIC. The use of QUIC with e.g. `BIO_s_dgram_pair`, a bidirectional
memory buffer with datagram semantics, is to be supported as part of MVP. This
must be reconciled with the desire to support application-managed event loops.

Broadly, the intention so far has been to enable the use of QUIC with an
application event loop in application-level non-blocking mode by exposing an
appropriate OS-level synchronisation primitive to the application.
On \*NIX platforms, this essentially means we provide the application with:

  - An FD which should be polled for readability, writability, or both; and
  - A deadline (if any is currently applicable).

Once either of these conditions is met, the QUIC state machine can
(potentially) be advanced meaningfully, and the application is expected to
reenter the QUIC state machine by calling `SSL_tick()` (or `SSL_read()` or
`SSL_write()`).

This model is readily supported when the read and write BIOs we are provided
with are socket BIOs:

  - The read-pollable FD is the FD of the read BIO.
  - The write-pollable FD is the FD of the write BIO.

However, things become more complex when we are dealing with memory-based BIOs
such as `BIO_s_dgram_pair`, which do not naturally correspond to any OS
primitive which can be used for synchronisation, or when we are dealing with an
application-provided custom BIO.

### Pollable and Non-Pollable BIOs

In order to accommodate these various cases, we draw a distinction between
pollable and non-pollable BIOs.

  - A pollable BIO is a BIO which can provide some kind of OS-level
    synchronisation primitive which can be used to determine when the BIO
    might be able to do useful work once more.

  - A non-pollable BIO has no naturally associated OS-level synchronisation
    primitive; its state changes only in response to calls made to it (or to a
    related BIO, such as the other end of a pair).

#### Supporting Pollable BIOs

“OS-level synchronisation primitive” is deliberately vague. Most modern OSes
use unified handle spaces (UNIX, Windows), though it is likely there are more
obscure APIs on these platforms which have other handle spaces. However, this
unification is not necessarily significant.

For example, Windows sockets are kernel handles and thus, like any other
object, they can be used with the generic Win32 `WaitForSingleObject()` API,
but not in a useful manner; the generic readiness mechanism for Windows handles
is not plumbed in for socket handles, so sockets are simply never considered
ready for the purposes of this API, and such a wait will never return. Instead,
the WinSock-specific `select()` call must be used. On the other hand, other
kinds of synchronisation primitive, such as a Win32 Event, must use
`WaitForSingleObject()`.

Thus, while in theory most modern operating systems have unified handle spaces,
in practice there are substantial usage differences between different handle
types. As such, an API to expose a synchronisation primitive should be of a
tagged union design which supports possible variation.

A BIO object will provide methods to retrieve a pollable OS-level
synchronisation primitive which can be used to determine when the QUIC state
machine can (potentially) do more work. This maintains the integrity of the BIO
abstraction layer. Equivalent SSL object API calls which forward to the
corresponding calls on the underlying network BIO will also be provided.

The core mechanic is as follows:

```c
#define BIO_POLL_DESCRIPTOR_TYPE_NONE       0
#define BIO_POLL_DESCRIPTOR_TYPE_SOCK_FD    1
#define BIO_POLL_DESCRIPTOR_CUSTOM_START    8192

#define BIO_POLL_DESCRIPTOR_NUM_CUSTOM      4

typedef struct bio_poll_descriptor_st {
    int type;
    union {
        int fd;
        union {
            void    *ptr;
            uint64_t u64;
        } custom[BIO_POLL_DESCRIPTOR_NUM_CUSTOM];
    } value;
} BIO_POLL_DESCRIPTOR;

int BIO_get_rpoll_descriptor(BIO *b, BIO_POLL_DESCRIPTOR *desc);
int BIO_get_wpoll_descriptor(BIO *b, BIO_POLL_DESCRIPTOR *desc);

int SSL_get_rpoll_descriptor(SSL *ssl, BIO_POLL_DESCRIPTOR *desc);
int SSL_get_wpoll_descriptor(SSL *ssl, BIO_POLL_DESCRIPTOR *desc);
```

Currently only a single descriptor type is defined, which is an FD on \*NIX and
a Winsock socket handle on Windows. These use the same type to minimise the
code changes needed on different platforms in the common case of an OS network
socket. (Use of an `int` here is strictly incorrect for Windows; however, this
style of usage is prevalent in the OpenSSL codebase, so for consistency we
continue the pattern here.)

Poll descriptor types at or above `BIO_POLL_DESCRIPTOR_CUSTOM_START` are
reserved for application-defined use. The `value.custom` field of the
`BIO_POLL_DESCRIPTOR` structure is provided for applications to store values of
their choice in; an application is free to define its semantics.

libssl will not know how to poll custom poll descriptors itself; thus, these
are only useful when the application provides a custom poller function, which
performs polling on behalf of libssl and which implements support for those
custom poll descriptors.

For `BIO_s_ssl`, the `BIO_get_[rw]poll_descriptor` functions are equivalent to
the `SSL_get_[rw]poll_descriptor` functions.
The `SSL_get_[rw]poll_descriptor` functions are equivalent to calling
`BIO_get_[rw]poll_descriptor` on the underlying BIOs provided to the SSL
object. For a socket BIO, this will likely just yield the socket's FD. For
memory-based BIOs, see below.

#### Supporting Non-Pollable BIOs

Where we are provided with a non-pollable BIO, we cannot provide the
application with any primitive used for synchronisation, and it is assumed
that the application will handle its own network I/O, for example via a
`BIO_s_dgram_pair`.

When libssl calls `BIO_get_[rw]poll_descriptor` on the underlying BIO, the
call fails, indicating that a non-pollable BIO is being used; thus, if an
application calls `SSL_get_[rw]poll_descriptor`, that call also fails.

There are various circumstances which need to be handled:

  - The QUIC implementation wants to write data to the network but is
    currently unable to (e.g. the `BIO_s_dgram_pair` is full).

    This is not hard, as our internal TX record layer allows arbitrary
    buffering. The only limit comes when QUIC flow control (which applies only
    to application stream data) imposes a limit; at that point, calls to e.g.
    `SSL_write` must fail with `SSL_ERROR_WANT_WRITE`.

  - The QUIC implementation wants to read data from the network but is
    currently unable to (e.g. the `BIO_s_dgram_pair` is empty).

    Here, calls such as `SSL_read` need to fail with `SSL_ERROR_WANT_READ`; we
    thereby support libssl's classic non-blocking I/O interface.

It is worth noting that a memory-based BIO could theoretically be implemented
in a pollable fashion, for example using condition variables. An application
could implement a custom BIO, custom poll descriptor and custom poller to
facilitate this.

### Configuration of Blocking vs. Non-Blocking Mode

Traditionally, an SSL object has operated either in blocking mode or
non-blocking mode without requiring explicit configuration; if a socket
returns EWOULDBLOCK or similar, it is handled appropriately, and if a socket
call blocks, there is no issue. Since the QUIC implementation is built on
non-blocking I/O, this implicit configuration of non-blocking mode is not
feasible.

Note that Windows does not have an API for determining whether a socket is in
blocking mode, so it is not possible to use the initial state of an underlying
socket to determine whether the application wants to use non-blocking I/O.
Moreover, doing so would undermine the BIO abstraction.

As such, an explicit call is introduced to configure an SSL (QUIC) object into
non-blocking mode:

```c
int SSL_set_blocking_mode(SSL *s, int blocking);
int SSL_get_blocking_mode(SSL *s);
```

Applications desiring non-blocking operation will need to call this API to
configure a new QUIC connection accordingly. Blocking mode is chosen as the
default for parity with traditional Berkeley sockets APIs and to make things
simpler for blocking applications, which are likely to be seeking a simpler
solution. However, blocking mode cannot be supported with a non-pollable BIO,
and thus blocking mode defaults to off when used with such a BIO.

A method is also needed for the QUIC implementation to inform an underlying
BIO that it must not block. The SSL object will call this function when it is
provided with an underlying BIO. For a socket BIO, this can set the socket
into non-blocking mode; for a memory-based BIO it is a no-op; for `BIO_s_ssl`
it is equivalent to a call to `SSL_set_blocking_mode()`.

### Internal Polling

When blocking mode is configured, the QUIC implementation will call
`BIO_get_[rw]poll_descriptor` on the underlying BIOs and use a suitable OS
function (e.g.
`select()`) or, if configured, a custom poller function, to block. This will be
implemented by an internal function which can accept up to two poll descriptors
(one for the read BIO, one for the write BIO), which might be identical.

Blocking mode cannot be used with a non-pollable underlying BIO. If
`BIO_get_[rw]poll_descriptor` is not implemented for either of the underlying
read and write BIOs, blocking mode cannot be enabled and blocking mode defaults
to off.