QUIC I/O Architecture
=====================

This document discusses possible implementation options for the I/O architecture
internal to the libssl QUIC implementation, examines the underlying design
constraints driving this decision and introduces the resulting I/O architecture.
It also identifies potential hazards to existing applications and explains how
those hazards are mitigated.

Objectives
----------

The [requirements for QUIC](./quic-requirements.md) which have formed the basis
for implementation include the following:

- The application must have the ability to be in control of the event loop
  without requiring callbacks to process the various events. An application must
  also have the ability to operate in “blocking” mode.

- High performance applications (primarily server based) using existing libssl
  APIs; using custom network interaction BIOs in order to get the best
  performance at a network level as well as OS interactions (IO handling, thread
  handling, using fibres). Would prefer to use the existing APIs - they don’t
  want to throw away what they’ve got. Where QUIC necessitates a change they
  would be willing to make minor changes.

As such, there are several objectives for the I/O architecture of the QUIC
implementation:

 - We want to support both blocking and non-blocking semantics
   for application use of the libssl APIs.

 - In the case of non-blocking applications, it must be possible
   for an application to do its own polling and make its own event
   loop.

 - We want to support custom BIOs on the network side and to the extent
   feasible, minimise the level of adaptation needed for any custom BIOs already
   in use on the network side. More generally, the integrity of the BIO
   abstraction layer should be preserved.

QUIC-Related Requirements
-------------------------

Note that implementation of QUIC will require that the underlying network BIO
passed to the QUIC implementation be configured to support datagram semantics
instead of bytestream semantics as has been the case with traditional TLS
over TCP. This will require applications using custom BIOs on the network side
to make substantial changes to the implementation of those custom BIOs to model
datagram semantics. These changes are not minor, but there is no way around this
requirement.

It should also be noted that implementation of QUIC requires handling of timer
events as well as the circumstances where a network socket becomes readable or
writable. In many cases we need to handle these events simultaneously (e.g. wait
until a socket becomes readable, or writable, or a timeout expires, whichever
comes first).
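
For illustration only, a minimal sketch of such a combined wait on a \*NIX
platform using poll(3) might look like the following; the `timeout_ms` value is
assumed to be derived from the QUIC implementation's next timer deadline:

```c
#include <poll.h>

/*
 * Illustrative sketch only: wait until the socket is readable, writable,
 * or a timeout expires, whichever comes first. timeout_ms would be
 * derived from the QUIC implementation's next timer deadline.
 */
static int wait_for_events(int fd, int want_read, int want_write,
                           int timeout_ms)
{
    struct pollfd pfd;

    pfd.fd      = fd;
    pfd.events  = (want_read ? POLLIN : 0) | (want_write ? POLLOUT : 0);
    pfd.revents = 0;

    /* Returns >0 if the socket is ready, 0 on timeout, <0 on error. */
    return poll(&pfd, 1, timeout_ms);
}
```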

Note that the discussion in this document primarily concerns usage of blocking
vs. non-blocking I/O in the interface between the QUIC implementation and an
underlying BIO provided to the QUIC implementation to give it access to the
network. This is independent of and orthogonal to the application interface to
libssl, which will support both blocking and non-blocking I/O.

Blocking vs. Non-Blocking Modes in Underlying Network BIOs
----------------------------------------------------------

The above constraints make it effectively a requirement that non-blocking I/O be
used for the calls to the underlying network BIOs. To illustrate this point, we
first consider how QUIC might be implemented using blocking network I/O
internally.

To function correctly and provide blocking semantics at the application level,
our QUIC implementation must be able to block such that it can respond to any of
the following events for the underlying network read and write BIOs immediately:

- The underlying network write BIO becomes writeable;
- The underlying network read BIO becomes readable;
- A timeout expires.

### Blocking sockets and select(3)

Firstly, consider how this might be accomplished using the Berkeley sockets API.
Blocking on all three wakeup conditions listed above would require use of an API
such as select(3) or poll(3), regardless of whether the network socket is
configured in blocking mode or not.

While in principle APIs such as select(3) can be used with a socket in blocking
mode, this is not an advisable usage mode. If a socket is in blocking mode,
calls to send(3) or recv(3) may block for some arbitrary period of time, meaning
that our QUIC implementation cannot handle incoming data (if we are blocked on
send), send outgoing data (if we are blocked on receive), or handle timeout
events.

Though it can be argued that a select(3) call indicating readability or
writeability should guarantee that a subsequent send(3) or recv(3) call will not
block, there are several reasons why this is an extremely undesirable solution:

- It is quite likely that there are buggy OSes out there which perform spurious
  wakeups from select(3).

- The fact that a socket is writeable does not necessarily mean that a datagram
  of the size we wish to send can be written, so a send(3) call could block
  anyway.

- This usage pattern precludes multithreaded use barring some locking scheme
  due to the possibility of other threads racing between the call to select(3)
  and the subsequent I/O call. This undermines our intentions to support
  multi-threaded network I/O on the backend.

Moreover, our QUIC implementation will not drive the Berkeley sockets API
directly but will instead use the BIO abstraction to access the network, so
these issues are then compounded by the limitations of our existing BIO
interfaces. We do not have a BIO interface which provides for select(3)-like
functionality or which can implement the required semantics above.

Furthermore, even if we used select(3) directly, select(3) only gives us a
guarantee (under a non-buggy OS) that a single syscall will not block; however,
we have no guarantee in the API contract for BIO_read(3) or BIO_write(3) that
any given BIO implementation maps such a call to only a single system call (or
to any system call at all), so this does not work either. Therefore, trying to
implement QUIC on top of blocking I/O in this way would require violating the
BIO abstraction layer, and would not work with custom BIOs (even if the poll
descriptor concept discussed below were adopted).

### Blocking sockets and threads

Another conceptual possibility is that blocking calls could be kept ongoing in
parallel threads. Under this model, there would be three threads:

- a thread which exists solely to execute blocking calls to the `BIO_write` of
  an underlying network BIO,
- a thread which exists solely to execute blocking calls to the `BIO_read` of an
  underlying network BIO,
- a thread which exists solely to wait for and dispatch timeout events.

This could potentially be reduced to two threads if it is assumed that
`BIO_write` calls do not take an excessive amount of time.

The premise here is that the front-end I/O API (`SSL_read`, `SSL_write`, etc.)
would coordinate and synchronise with these background worker threads via
threading primitives such as condition variables, etc.

This has a large number of disadvantages:

- There is a hard requirement for threading functionality in order to be
  able to support blocking semantics at the application level. Applications
  which require blocking semantics would only be able to function in thread
  assisted mode. In environments where threading support is not available or
  desired, our APIs would only be usable in a non-blocking fashion.

- Several threads are spawned which the application is not in control of.
  This undermines our general approach of providing the application with control
  over OpenSSL's use of resources, such as allowing the application to do its
  own polling or provide its own allocators.

  At a minimum for a client, there must be two threads per connection. This
  means if an application opens many outgoing connections, there will need
  to be `2n` extra threads spawned.

- By blocking in `BIO_write` calls, this precludes correct implementation of
  QUIC. Unlike any analogue in TLS, QUIC packets are time sensitive and intended
  to be transmitted as soon as they are generated. QUIC packets contain fields
  such as the ACK Delay value, which is intended to describe the time between a
  packet being received and a return packet being generated. Correct calculation
  of this field is necessary for correct calculation of the connection RTT. It
  is therefore important to only generate packets when they are ready to be
  sent, otherwise suboptimal performance will result. This is a usage model
  which aligns optimally with non-blocking I/O and which cannot be accommodated
  by blocking I/O.

- Since existing custom BIOs will not be expecting concurrent `BIO_read` and
  `BIO_write` calls, they will need to be adapted to support this, which is
  likely to require substantial rework of those custom BIOs (trivial locking of
  calls obviously does not work since both of these calls must be able to block
  on network I/O simultaneously).

Moreover, this does not appear to be a realistically implementable approach:

- The question is posed of how to handle connection teardown, which does not
  seem to be solvable. If parallel threads are blocked in blocking `BIO_read`
  and `BIO_write` calls on some underlying network BIO, there needs to be some
  way to force these calls to return once `SSL_free` is called and we need to
  tear down the connection. However, the BIO interface does not provide
  any way to do this. *At best* we might assume the BIO is a `BIO_s_dgram`
  (but cannot assume this in the general case), but even then we can only
  accomplish teardown by violating the BIO abstraction and closing the
  underlying socket.

  This is the only portable way to ensure that a recv(3) call to the same socket
  returns. This obviously is a highly application-visible change (and is likely
  to be far more disruptive than configuring the socket into non-blocking mode).

  Moreover, it is not workable anyway because it only works for a socket-based
  BIO and violates the BIO abstraction. For BIOs in general, there does not
  appear to be any viable solution to the teardown issue.

Even if this approach were successfully implemented, applications would still
need to change to using network BIOs with datagram semantics. For applications
using custom BIOs, this is likely to require substantial rework of those BIOs.
There is no possible way around this. Thus, even if this solution were adopted
(notwithstanding the issues noted above which preclude this) for the purposes of
accommodating applications using custom network BIOs in a blocking mode, these
applications would still have to completely rework their implementation of those
BIOs. In any case, it is expected to be comparatively rare that sophisticated
applications implementing their own custom BIOs will do so in a blocking mode.

### Use of non-blocking I/O

By comparison, use of non-blocking I/O and select(3) or similar APIs on the
network side makes satisfying our requirements for QUIC easy, and also allows
our internal approach to I/O to be flexibly adapted in the future as
requirements may evolve.

This is also the approach used by all other known QUIC implementations; it is
highly unlikely that any QUIC implementations exist which use blocking network
I/O, since, as mentioned above, it would lead to suboptimal performance due to
the ACK delay issue.

Note that this is orthogonal to whether we provide blocking I/O semantics to the
application. We can use non-blocking I/O internally while providing either
blocking or non-blocking semantics to the application, based on what the
application requests.

This approach in general requires that a network socket be configured in
non-blocking mode. Though some OSes support a `MSG_DONTWAIT` flag which allows a
single I/O operation to be made non-blocking, not all OSes support this (e.g.
Windows), thus this cannot be relied on. As such, we need to configure any
socket FD we use into non-blocking mode.
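
As a sketch, such configuration differs by platform; it might look something
like the following (error handling simplified; the helper name is invented for
illustration):

```c
#ifdef _WIN32
# include <winsock2.h>
#else
# include <fcntl.h>
#endif

/*
 * Illustrative sketch: put a socket FD into non-blocking mode. Windows
 * has no MSG_DONTWAIT, so the socket itself must be switched into
 * non-blocking mode using ioctlsocket().
 */
static int set_nonblocking(int fd)
{
#ifdef _WIN32
    u_long mode = 1;

    return ioctlsocket((SOCKET)fd, FIONBIO, &mode) == 0;
#else
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return 0;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == 0;
#endif
}
```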

Of the approaches outlined in this document, the use of non-blocking I/O has the
fewest disadvantages and is the only approach which appears to actually be
implementable in practice. Moreover, most of the disadvantages can be readily
mitigated:

  - We rely on having a select(3) or poll(3) like function available from the
    OS.

    However:

    - Firstly, we already rely on select(3) in our code, at least in
      non-`no-sock` builds, so this does not appear to raise any portability
      issues;

    - Secondly, we have the option of providing a custom poller interface which
      allows an application to provide its own implementation of a
      select(3)-like function. In fact, this has the potential to be quite
      powerful and would allow the application to implement its own pollable
      BIOs, and therefore perform blocking I/O on top of any custom BIO.

      For example, while historically none of our own memory-based BIOs have
      supported blocking semantics, a sophisticated application could if it
      wished choose to implement a custom blocking memory BIO and implement a
      custom poller which synchronises using a custom poll descriptor based
      around condition variables rather than sockets. Thus this scheme is
      highly flexible.

      (It is worth noting also that the implementation of blocking semantics at
      the application level does not rely on any privileged access to the
      internals of the QUIC implementation and an application could if it wished
      build blocking semantics out of a non-blocking QUIC instance; this is not
      particularly difficult, though providing custom pollers here would mean
      there should be no need for an application to do so.)

  - Configuring a socket into non-blocking mode might confuse an application.

    However:

    - Applications will already have to make changes to any network-side BIOs,
      for example switching from a `BIO_s_socket` to a `BIO_s_dgram`, or from a
      BIO pair to a `BIO_s_dgram_pair`. Custom BIOs will need to be
      substantially reworked to switch from bytestream semantics to datagram
      semantics. Such applications will already need substantial changes, and
      this is unavoidable.

      Of course, application impacts and migration guidance can (and will) all
      be documented.

    - In order for an application to be confused by us putting a socket into
      non-blocking mode, it would need to be trying to use the socket in some
      way. But it is not possible for an application to pass a socket to our
      QUIC implementation, and also try to use the socket directly, and have
      QUIC still work. Using QUIC necessarily requires that an application not
      also be trying to make use of the same socket.

    - There are some circumstances where an application might want to multiplex
      other protocols onto the same UDP socket, for example with protocols like
      RTP/RTCP or STUN; this can be facilitated using the QUIC fixed bit.
      However, these use cases cannot be supported without explicit assistance
      from a QUIC implementation and this use case cannot be facilitated by
      simply sharing a network socket, as incoming datagrams will not be routed
      correctly. (We may offer some functionality in future to allow this to be
      coordinated but this is not for MVP.) Thus this also is not a concern.
      Moreover, it is extremely unlikely that any such applications are using
      sockets in blocking mode anyway.

  - The poll descriptor interface adds complexity to the BIO interface.

Advantages:

  - An application retains full control of its event loop in non-blocking mode.

    When using libssl in application-level blocking mode via a custom poller
    interface, the application would actually be able to exercise more control
    over I/O than it can at present when using libssl in blocking mode.

  - Feasible to implement and already working in tests.
    Minimises further development needed to ship.

  - Does not rely on creating threads and can support blocking I/O at the
    application level without relying on thread assisted mode.

  - Does not require an application-provided network-side custom BIO to be
    reworked to support concurrent calls to it.

  - The poll descriptor interface will allow applications to implement custom
    modes of polling in the future (e.g. an application could even build
    blocking application-level I/O on top of a custom memory-based BIO
    using condition variables, if it wished). This is actually more flexible
    than the current TLS stack, which cannot be used in blocking mode when used
    with a memory-based BIO.

  - Allows performance-optimal implementation of QUIC RFC requirements.

  - Ensures our internal I/O architecture remains flexible for future evolution
    without breaking compatibility.

Use of Internal Non-Blocking I/O
--------------------------------

Based on the above evaluation, implementation has been undertaken using
non-blocking I/O internally. Applications can use blocking or non-blocking I/O
at the libssl API level. Network-level BIOs must operate in a non-blocking mode
or be configurable by QUIC to this end.

![Block Diagram](images/quic-io-arch-1.png "Block Diagram")

### Support of arbitrary BIOs

We need to support not just socket FDs but arbitrary BIOs as the basis for the
use of QUIC. The use of QUIC with e.g. `BIO_s_dgram_pair`, a bidirectional
memory buffer with datagram semantics, is to be supported as part of MVP. This
must be reconciled with the desire to support application-managed event loops.

Broadly, the intention so far has been to enable the use of QUIC with an
application event loop in application-level non-blocking mode by exposing an
appropriate OS-level synchronisation primitive to the application. On \*NIX
platforms, this essentially means we provide the application with:

  - An FD which should be polled for readability, writability, or both; and
  - A deadline (if any is currently applicable).

Once either of these conditions is met, the QUIC state machine can be
(potentially) advanced meaningfully, and the application is expected to reenter
the QUIC state machine by calling `SSL_tick()` (or `SSL_read()` or
`SSL_write()`).
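
As a purely illustrative sketch, one iteration of such an application-managed
event loop on a \*NIX platform might look like the following; how the FD and
the timeout are obtained is covered by the poll descriptor APIs discussed
below:

```c
#include <sys/select.h>
#include <openssl/ssl.h>

/*
 * Illustrative sketch of one iteration of an application-managed event
 * loop. fd is the FD exposed by the QUIC implementation for polling;
 * timeout is the interval remaining until the QUIC timer deadline, or
 * NULL if no deadline is currently applicable.
 */
static void event_loop_iteration(SSL *ssl, int fd, struct timeval *timeout)
{
    fd_set rfds, wfds;

    FD_ZERO(&rfds);
    FD_ZERO(&wfds);
    FD_SET(fd, &rfds);
    FD_SET(fd, &wfds);

    /* Wait until the FD is ready or the deadline expires. */
    select(fd + 1, &rfds, &wfds, NULL, timeout);

    /*
     * Reenter the QUIC state machine. SSL_read() or SSL_write() would
     * have the same effect.
     */
    SSL_tick(ssl);
}
```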

This model is readily supported when the read and write BIOs we are provided
with are socket BIOs:

  - The read-pollable FD is the FD of the read BIO.
  - The write-pollable FD is the FD of the write BIO.

However, things become more complex when we are dealing with memory-based BIOs
such as `BIO_s_dgram_pair` which do not naturally correspond to any OS primitive
which can be used for synchronisation, or when we are dealing with an
application-provided custom BIO.

### Pollable and Non-Pollable BIOs

In order to accommodate these various cases, we draw a distinction between
pollable and non-pollable BIOs.

  - A pollable BIO is a BIO which can provide some kind of OS-level
    synchronisation primitive, which can be used to determine when
    the BIO might be able to do useful work once more.

  - A non-pollable BIO has no naturally associated OS-level synchronisation
    primitive, but its state only changes in response to calls made to it (or to
    a related BIO, such as the other end of a pair).

#### Supporting Pollable BIOs

“OS-level synchronisation primitive” is deliberately vague. Most modern OSes use
unified handle spaces (UNIX, Windows) though it is likely there are more obscure
APIs on these platforms which have other handle spaces. However, this
unification is not necessarily significant.

For example, Windows sockets are kernel handles and thus, like any other kernel
object, they can be used with the generic Win32 `WaitForSingleObject()` API, but
not in a useful manner; the generic readiness mechanism for Windows handles is
not plumbed in for socket handles, and so sockets are simply never considered
ready for the purposes of this API, meaning such a wait will never return.
Instead, the WinSock-specific `select()` call must be used. On the other hand,
other kinds of synchronisation primitive, such as a Win32 Event, must use
`WaitForSingleObject()`.

Thus, while in theory most modern operating systems have unified handle spaces,
in practice there are substantial usage differences between different handle
types. As such, an API to expose a synchronisation primitive should be of a
tagged union design supporting possible variation.

A BIO object will provide methods to retrieve a pollable OS-level
synchronisation primitive which can be used to determine when the QUIC state
machine can (potentially) do more work. This maintains the integrity of the BIO
abstraction layer. Equivalent SSL object API calls which forward to the
equivalent calls of the underlying network BIO will also be provided.

The core mechanic is as follows:

```c
#define BIO_POLL_DESCRIPTOR_TYPE_NONE        0
#define BIO_POLL_DESCRIPTOR_TYPE_SOCK_FD     1
#define BIO_POLL_DESCRIPTOR_CUSTOM_START     8192

#define BIO_POLL_DESCRIPTOR_NUM_CUSTOM       4

typedef struct bio_poll_descriptor_st {
    int type;
    union {
        int fd;
        union {
            void        *ptr;
            uint64_t    u64;
        } custom[BIO_POLL_DESCRIPTOR_NUM_CUSTOM];
    } value;
} BIO_POLL_DESCRIPTOR;

int BIO_get_rpoll_descriptor(BIO *b, BIO_POLL_DESCRIPTOR *desc);
int BIO_get_wpoll_descriptor(BIO *b, BIO_POLL_DESCRIPTOR *desc);

int SSL_get_rpoll_descriptor(SSL *ssl, BIO_POLL_DESCRIPTOR *desc);
int SSL_get_wpoll_descriptor(SSL *ssl, BIO_POLL_DESCRIPTOR *desc);
```

Currently only a single descriptor type is defined, which is an FD on \*NIX and
a Winsock socket handle on Windows. These use the same type to minimise code
changes needed on different platforms in the common case of an OS network
socket. (Use of an `int` here is strictly incorrect for Windows; however, this
style of usage is prevalent in the OpenSSL codebase, so for consistency we
continue the pattern here.)

Poll descriptor types at or above `BIO_POLL_DESCRIPTOR_CUSTOM_START` are
reserved for application-defined use. The `value.custom` field of the
`BIO_POLL_DESCRIPTOR` structure is provided for applications to store values of
their choice in; an application is free to define the semantics.

libssl itself does not know how to poll custom poll descriptors, thus these are
only useful when the application provides a custom poller function which
performs polling on behalf of libssl and implements support for those custom
poll descriptors.
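
As a purely hypothetical illustration building on the declarations above, an
application pairing a custom poller with condition-variable-based BIOs might
define a descriptor type like this (the type name, `struct app_cond` and the
use of `custom[0]` are all invented for this example):

```c
/*
 * Hypothetical application-defined poll descriptor type. Type values at
 * or above BIO_POLL_DESCRIPTOR_CUSTOM_START are reserved for
 * application use; the semantics of value.custom are entirely
 * application-defined.
 */
#define APP_POLL_DESCRIPTOR_TYPE_COND  (BIO_POLL_DESCRIPTOR_CUSTOM_START + 0)

struct app_cond; /* the application's condition variable wrapper */

static void app_init_poll_descriptor(BIO_POLL_DESCRIPTOR *desc,
                                     struct app_cond *cond)
{
    desc->type                = APP_POLL_DESCRIPTOR_TYPE_COND;
    desc->value.custom[0].ptr = cond;
}
```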

For `BIO_s_ssl`, the `BIO_get_[rw]poll_descriptor` functions are equivalent to
the `SSL_get_[rw]poll_descriptor` functions. The `SSL_get_[rw]poll_descriptor`
functions are equivalent to calling `BIO_get_[rw]poll_descriptor` on the
underlying BIOs provided to the SSL object. For a socket BIO, this will likely
just yield the socket's FD. For memory-based BIOs, see below.
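
For illustration, a sketch of how an application might obtain the FD to poll
for readability, building on the declarations above (the error handling
strategy is the application's choice):

```c
/*
 * Illustrative sketch: obtain the FD which should be polled for
 * readability from a QUIC SSL object backed by a socket BIO.
 */
static int get_read_fd(SSL *ssl, int *fd)
{
    BIO_POLL_DESCRIPTOR desc;

    if (!SSL_get_rpoll_descriptor(ssl, &desc))
        return 0; /* fails if the underlying BIO is non-pollable */

    if (desc.type != BIO_POLL_DESCRIPTOR_TYPE_SOCK_FD)
        return 0; /* a custom descriptor requires a custom poller */

    *fd = desc.value.fd;
    return 1;
}
```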

#### Supporting Non-Pollable BIOs

Where we are provided with a non-pollable BIO, we cannot provide the application
with any primitive used for synchronisation and it is assumed that the
application will handle its own network I/O, for example via a
`BIO_s_dgram_pair`.

When libssl calls `BIO_get_[rw]poll_descriptor` on the underlying BIO, the call
fails, indicating that a non-pollable BIO is being used. Thus, if an application
calls `SSL_get_[rw]poll_descriptor`, that call also fails.

There are various circumstances which need to be handled:

  - The QUIC implementation wants to write data to the network but
    is currently unable to (e.g. `BIO_s_dgram_pair` is full).

    This is not hard, as our internal TX record layer allows arbitrary
    buffering. The only limit comes when QUIC flow control (which only applies
    to application stream data) applies a limit; then calls to e.g. `SSL_write`
    must fail with `SSL_ERROR_WANT_WRITE`.

  - The QUIC implementation wants to read data from the network
    but is currently unable to (e.g. `BIO_s_dgram_pair` is empty).

    Here calls like `SSL_read` need to fail with `SSL_ERROR_WANT_READ`; we
    thereby support libssl's classic non-blocking I/O interface (a sketch of
    this usage pattern follows below).
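
A minimal sketch of how a non-blocking application might drive this interface
over a `BIO_s_dgram_pair` (assuming the pair has already been created and
attached to the SSL object; `SSL_read_ex` is used here, though `SSL_read`
behaves equivalently for this purpose):

```c
#include <openssl/ssl.h>

/*
 * Illustrative sketch: classic non-blocking read. When the network side
 * of the BIO_s_dgram_pair has no datagrams available, SSL_read_ex()
 * fails with SSL_ERROR_WANT_READ and the application must feed more
 * datagrams into the pair before retrying.
 */
static int try_read(SSL *ssl, void *buf, size_t len, size_t *bytes_read)
{
    int ret = SSL_read_ex(ssl, buf, len, bytes_read);

    if (ret <= 0) {
        switch (SSL_get_error(ssl, ret)) {
        case SSL_ERROR_WANT_READ:
            /* Feed more incoming datagrams into the pair, then retry. */
            return 0;
        case SSL_ERROR_WANT_WRITE:
            /* Drain outgoing datagrams from the pair, then retry. */
            return 0;
        default:
            return -1; /* fatal error */
        }
    }

    return 1; /* *bytes_read bytes were read */
}
```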

It is worth noting that theoretically a memory-based BIO could be implemented
which is pollable, for example using condition variables. An application could
implement a custom BIO, custom poll descriptor and custom poller to facilitate
this.

### Configuration of Blocking vs. Non-Blocking Mode

Traditionally an SSL object has operated either in blocking mode or non-blocking
mode without requiring explicit configuration; if a socket returns EWOULDBLOCK
or similar, it is handled appropriately, and if a socket call blocks, there is
no issue. Since the QUIC implementation is building on non-blocking I/O, this
implicit configuration of non-blocking mode is not feasible.

Note that Windows does not have an API for determining whether a socket is in
blocking mode, so it is not possible to use the initial state of an underlying
socket to determine if the application wants to use non-blocking I/O or not.
Moreover, doing so would undermine the BIO abstraction.

As such, an explicit call is introduced to configure an SSL (QUIC) object into
non-blocking mode:

```c
int SSL_set_blocking_mode(SSL *s, int blocking);
int SSL_get_blocking_mode(SSL *s);
```

Applications desiring non-blocking operation will need to call this API to
configure a new QUIC connection accordingly. Blocking mode is chosen as the
default for parity with traditional Berkeley sockets APIs and to make things
simpler for blocking applications, which are likely to be seeking a simpler
solution. However, blocking mode cannot be supported with a non-pollable BIO,
and thus blocking mode defaults to off when used with such a BIO.
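
For example, a non-blocking application might configure this immediately after
creating a connection (a sketch only; how the QUIC `SSL_CTX` is created is
covered by the broader design and assumed here):

```c
/*
 * Illustrative sketch: opt a newly created QUIC connection into
 * non-blocking mode. ctx is assumed to have been created with a QUIC
 * client method as described in the broader design documents.
 */
SSL *ssl = SSL_new(ctx);

if (ssl == NULL || !SSL_set_blocking_mode(ssl, 0)) {
    /* handle error, e.g. the object does not support this call */
}
```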

A method is also needed for the QUIC implementation to inform an underlying BIO
that it must not block. The SSL object will call this function when it is
provided with an underlying BIO. For a socket BIO this can set the socket as
non-blocking; for a memory-based BIO it is a no-op; for `BIO_s_ssl` it is
equivalent to a call to `SSL_set_blocking_mode()`.

### Internal Polling

When blocking mode is configured, the QUIC implementation will call
`BIO_get_[rw]poll_descriptor` on the underlying BIOs and use a suitable OS
function (e.g. `select()`) or, if configured, a custom poller function, to
block. This will be implemented by an internal function which can accept up to
two poll descriptors (one for the read BIO, one for the write BIO), which might
be identical.

Blocking mode cannot be used with a non-pollable underlying BIO. If
`BIO_get_[rw]poll_descriptor` is not implemented for either of the underlying
read and write BIOs, blocking mode cannot be enabled and defaults to off.