Datagram BIO API revisions for sendmmsg/recvmmsg
================================================

We need to evolve the API surface of BIO which is relevant to BIO_dgram (and the
eventual BIO_dgram_mem) to support APIs which allow multiple datagrams to be
sent or received simultaneously, such as sendmmsg(2)/recvmmsg(2).

The adopted design
------------------

### Design decisions

The adopted design makes the following design decisions:

- We use a sendmmsg/recvmmsg-like API. The alternative API was not considered
  for adoption because it is an explicit goal that the adopted API be suitable
  for concurrent use on the same BIO.

- We define our own structures rather than using the OS's `struct mmsghdr`.
  The motivations for this are:

  - It ensures portability between OSes and allows the API to be used
    on OSes which do not support `sendmmsg` or `sendmsg`.

  - It allows us to use structures in keeping with OpenSSL's existing
    abstraction layers (e.g. `BIO_ADDR` rather than `struct sockaddr`).

  - We do not have to expose functionality which we cannot guarantee
    we can support on all platforms (for example, arbitrary control messages).

  - It avoids the need to include OS headers in our own public headers,
    which would pollute the environment of applications which include
    our headers, potentially undesirably.

- For OSes which do not support `sendmmsg`, we emulate it using repeated
  calls to `sendmsg`. For OSes which do not support `sendmsg`, we emulate it
  using `sendto` to the extent feasible. This avoids the need for code consuming
  these new APIs to define a fallback code path.
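
  As a minimal sketch of the emulation strategy (not the actual
  implementation; `translate_and_sendmsg` is a hypothetical helper standing in
  for the per-message translation to `struct msghdr` and the `sendmsg(2)`
  call):

  ```c
  static ossl_ssize_t emulated_sendmmsg(int fd, BIO_MSG *msg, size_t num_msg)
  {
      size_t i;

      for (i = 0; i < num_msg; ++i)
          if (!translate_and_sendmsg(fd, &msg[i]))
              /* report partial success, or failure if nothing was sent */
              return i > 0 ? (ossl_ssize_t)i : -1;

      return (ossl_ssize_t)num_msg;
  }
  ```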

- We do not define any flags at this time, as the flags previously considered
  for adoption cannot be supported on all platforms (Win32 does not have
  `MSG_DONTWAIT`).

- We ensure the extensibility of our `BIO_MSG` structure in a way that preserves
  ABI compatibility using a `stride` argument which callers must set to
  `sizeof(BIO_MSG)`. Implementations can examine the stride field to determine
  whether a given field is part of a `BIO_MSG`. This allows us to add optional
  fields to `BIO_MSG` at a later time without breaking ABI. All new fields must
  be added to the end of the structure.
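
  For illustration, suppose a hypothetical field `hyp_field` were added to the
  end of `BIO_MSG` in a future release. An implementation could check the
  caller-supplied stride to determine whether the caller's `BIO_MSG` includes
  it (a sketch; `offsetof` requires `<stddef.h>`):

  ```c
  int caller_has_hyp_field(size_t stride)
  {
      /* true iff the caller's BIO_MSG is large enough to contain hyp_field */
      return stride >= offsetof(BIO_MSG, hyp_field) + sizeof(uint64_t);
  }
  ```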

- The BIO methods are designed to support stateless operation in which they
  are simply calls to the equivalent system calls, where supported, without
  changing BIO state. In particular, this means that things like retry flags are
  not set or cleared by `BIO_sendmmsg` or `BIO_recvmmsg`.

  The motivation for this is that these functions are intended to support
  concurrent use on the same BIO. If they read or modify BIO state, they would
  need to be synchronised with a lock, undermining performance on what (for
  `BIO_dgram`) would otherwise be a straight system call.

- We do not support iovecs. The motivations for this are:

  - Not all platforms can support iovecs (e.g. Windows).

  - The only way we could emulate iovecs on platforms which don't support
    them is by copying the data to be sent into a staging buffer. This would
    defeat all of the advantages of iovecs and prevent us from meeting our
    zero/single-copy requirements. Moreover, it would lead to extremely
    surprising performance variations for consumers of the API.

  - We do not believe iovecs are needed to meet our performance requirements
    for QUIC. The reason for this is that aside from a minimal packet header,
    all data in QUIC is encrypted, so all data sent via QUIC must pass through
    an encrypt step anyway, meaning that all data sent will already be copied
    and there is not going to be any issue depositing the ciphertext in a
    staging buffer together with the frame header.

  - Even if we did support iovecs, we would have to impose a limit
    on the number of iovecs supported, because we translate from our own
    structures (as discussed above) and also intend these functions to be
    stateless and not require locking. Therefore the OS-native iovec structures
    would need to be allocated on the stack.

- Sometimes, an application may wish to learn the local interface address
  associated with a receive operation or specify the local interface address to
  be used for a send operation. We support this, but require this functionality
  to be explicitly enabled before use.

  The reason for this is that enabling this functionality requires that the
  socket be reconfigured using `setsockopt` on most platforms. Doing this
  on-demand would require state in the BIO to determine whether this
  functionality is currently switched on, which would require otherwise
  unnecessary locking, undermining performance in concurrent usage of this API
  on a given BIO. Requiring this functionality to be enabled explicitly before
  use allows this initialisation to be done up front without performance cost.
  It also helps users of the API understand that this functionality is not
  always available, and to detect in advance whether it is.

### Design

The adopted design is as follows:

```c
typedef struct bio_msg_st {
    void *data;
    size_t data_len;
    BIO_ADDR *peer, *local;
    uint64_t flags;
} BIO_MSG;

#define BIO_UNPACK_ERRNO(e)     /*...*/
#define BIO_IS_ERRNO(e)         /*...*/

ossl_ssize_t BIO_sendmmsg(BIO *b, BIO_MSG *msg, size_t stride,
                          size_t num_msg, uint64_t flags);
ossl_ssize_t BIO_recvmmsg(BIO *b, BIO_MSG *msg, size_t stride,
                          size_t num_msg, uint64_t flags);
```

The API is used as follows:

- `msg` points to an array of `num_msg` `BIO_MSG` structures.

- Both functions have identical prototypes, and return the number of messages
  processed in the array. If no messages were processed due to an error, `-1`
  is returned. If an OS-level socket error occurs, a negative value `v` is
  returned. The caller should determine that `v` is an OS-level socket error by
  calling `BIO_IS_ERRNO(v)` and may obtain the OS-level socket error code by
  calling `BIO_UNPACK_ERRNO(v)`.

- `stride` must be set to `sizeof(BIO_MSG)`.

- `data` points to the buffer of data to be sent or to be filled with received
  data. `data_len` is the size of the buffer in bytes on call. If the
  given message in the array is processed (i.e., if the return value
  exceeds the index of that message in the array), `data_len` is updated
  to the actual amount of data sent or received at return time.

- `flags` in the `BIO_MSG` structure provides per-message flags to
  the `BIO_sendmmsg` or `BIO_recvmmsg` call. If the given message in the array
  is processed, `flags` is written with zero or more result flags at return
  time. The `flags` argument to the call itself provides for global flags
  affecting all messages in the array. Currently, no per-message or global flags
  are defined and all of these fields are set to zero on call and on return.

- `peer` and `local` are optional pointers to `BIO_ADDR` structures into
  which the remote and local addresses are to be filled. If either of these
  is NULL, the given addressing information is not requested. Local address
  support may not be available in all circumstances, in which case processing of
  the message fails. (This means that the function returns the number of
  messages processed so far, or -1 if the failing message is the first in the
  array.)

  Support for `local` must be explicitly enabled before use, otherwise
  attempts to use it fail.
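
For illustration, a usage sketch follows; `b` is assumed to be an established
datagram BIO and `handle_os_error` a hypothetical application function:

```c
#define NUM_MSG 4

BIO_MSG msgs[NUM_MSG];
unsigned char bufs[NUM_MSG][1500];
ossl_ssize_t n;
size_t i;

for (i = 0; i < NUM_MSG; ++i) {
    msgs[i].data     = bufs[i];
    msgs[i].data_len = sizeof(bufs[i]);
    msgs[i].peer     = NULL; /* use the BIO's connected peer */
    msgs[i].local    = NULL; /* local address information not requested */
    msgs[i].flags    = 0;    /* no per-message flags are currently defined */
}

n = BIO_sendmmsg(b, msgs, sizeof(BIO_MSG), NUM_MSG, 0);
if (n < 0) {
    if (BIO_IS_ERRNO(n))
        handle_os_error(BIO_UNPACK_ERRNO(n));
    /* otherwise, no messages were processed due to another error */
} else {
    /* msgs[0..n-1] were sent; each data_len holds the bytes actually sent */
}
```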

Local address support is enabled as follows:

```c
int BIO_dgram_set_local_addr_enable(BIO *b, int enable);
int BIO_dgram_get_local_addr_enable(BIO *b);
int BIO_dgram_get_local_addr_cap(BIO *b);
```

`BIO_dgram_get_local_addr_cap()` returns 1 if local address support is
available. It is then enabled using `BIO_dgram_set_local_addr_enable()`, which
fails if support is not available.
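
For example, a caller might probe for and enable local address support before
requesting local addresses (a sketch; `b`, `buf` and `buf_len` are assumed to
exist, and error handling is abbreviated):

```c
BIO_ADDR *peer = BIO_ADDR_new(), *local = BIO_ADDR_new();
BIO_MSG msg = {0};

if (BIO_dgram_get_local_addr_cap(b) == 1
        && BIO_dgram_set_local_addr_enable(b, 1) == 1) {
    msg.data     = buf;
    msg.data_len = buf_len;
    msg.peer     = peer;  /* request the sender's address */
    msg.local    = local; /* request the local interface address */

    if (BIO_recvmmsg(b, &msg, sizeof(BIO_MSG), 1, 0) == 1) {
        /* msg.local now holds the address the datagram arrived on */
    }
}
```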

Options which were considered
-----------------------------

Options for the API surface which were considered included:

### sendmmsg/recvmmsg-like API

This design was chosen to form the basis of the adopted design, which is
described above.

```c
int BIO_readm(BIO *b, BIO_mmsghdr *msgvec,
              unsigned len, int flags, struct timespec *timeout);
int BIO_writem(BIO *b, BIO_mmsghdr *msgvec,
               unsigned len, int flags, struct timespec *timeout);
```

We can either define `BIO_mmsghdr` as a typedef of `struct mmsghdr` or redefine
an equivalent structure. The former has the advantage that we can just pass the
structures through to the syscall without copying them.

Note that in `BIO_dgram_mem` we will have to process and therefore understand
the contents of `struct mmsghdr` ourselves. Therefore, initially we define a
subset of `struct mmsghdr` as being supported: specifically, no control
messages; `msg_name` and `msg_iov` only.

The flags argument is defined by us. Initially we can support something like
`MSG_DONTWAIT` (say, `BIO_DONTWAIT`).

#### Implementation Questions

If we go with this, there are some issues that arise:

- Are `BIO_mmsghdr`, `BIO_msghdr` and `BIO_iovec` simple typedefs
  for OS-provided structures, or our own independent structure
  definitions?

  - If we use OS-provided structures:

    - We would need to include the OS headers which provide these
      structures in our public API headers.

    - If we choose to support these functions when OS support is not available
      (see discussion below), we would need to define our own structures in this
      case (a “polyfill” approach).

  - If we use our own structures:

    - We would need to translate these structures during every call.

      But we would need to have storage inside the BIO_dgram for *m* `struct
      msghdr`, *m\*v* iovecs, etc. Since we want to support multithreaded use,
      these allocations will probably need to be on the stack, and therefore
      must be limited.

      Limiting *m* isn't a problem, because `sendmmsg` returns the number
      of messages sent, so the existing semantics we are trying to match
      let us just send or receive fewer messages than we were asked to.

      However, it does seem like we will need to limit *v*, the number of
      iovecs per message. So what limit should we give to *v*? We will need
      a fixed stack allocation of OS iovec structures and we can allocate
      from this stack allocation as we iterate through the `BIO_msghdr` we
      have been given. So in practice we could simply send messages until we
      reach our iovec limit, and then return.

      For example, suppose we allocate 64 iovecs internally:

      ```c
      struct iovec vecs[64];
      ```

      If the first message passed to a call to `BIO_writem` has 64 iovecs
      attached to it, no further messages can be sent and `BIO_writem`
      returns 1.

      If three messages are passed, with 32, 32, and 1 iovecs respectively,
      the first two messages are sent and `BIO_writem` returns 2.

      So the only important thing we would need to document in this API
      is the limit of iovecs on a single message; in other words, the
      number of iovecs which must not be exceeded if a forward progress
      guarantee is to be made. For example, if we allocate 64 iovecs
      internally, `BIO_writem` with a single message with 65 iovecs will
      never work, and this becomes part of the API contract.

      Obviously these quantities of iovecs are unrealistically large.
      iovecs are small, so we can afford to set the limit high enough
      that it shouldn't cause any problems in practice. We can increase
      the limit later without a breaking API change, but we cannot decrease
      it later. So we might want to start with something small, like 8.
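
      A sketch of this allocation strategy follows; `BIO_msghdr` and its
      `num_iov` field are the hypothetical structures under discussion:

      ```c
      struct iovec vecs[64];
      size_t m, used = 0;

      for (m = 0; m < len; ++m) {
          if (used + msgvec[m].num_iov > 64)
              break; /* pool exhausted; stop before this message */
          /* translate msgvec[m]'s iovecs into vecs[used..] and build an
             OS-native struct msghdr referencing them */
          used += msgvec[m].num_iov;
      }
      /* pass the first m messages to sendmmsg(2) and return m */
      ```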

- We also need to decide what to do for OSes which don't support at least
  `sendmsg`/`recvmsg`.

  - Don't provide these functions and require all users of these functions to
    have an alternate code path which doesn't rely on them?

    - Not providing these functions on OSes that don't support
      at least sendmsg/recvmsg is a simple solution, but adds
      complexity to code using BIO_dgram. (Though it does communicate
      more realistic performance expectations to calling code, since
      that code knows when these functions are actually available.)

  - Provide these functions and emulate the functionality:

    - However, there is a question here as to how we implement
      the iovec arguments on platforms without `sendmsg`/`recvmsg`. (We cannot
      use `writev`/`readv` because we need peer address information.) Logically,
      implementing these would then have to be done by copying buffers around
      internally before calling `sendto`/`recvfrom`, defeating the point of
      iovecs and providing a performance profile which is surprising to code
      using BIO_dgram.

    - Another option could be a variable limit on the number of iovecs,
      which can be queried from BIO_dgram. This would be a constant set
      when libcrypto is compiled. It would be 1 for platforms not supporting
      `sendmsg`/`recvmsg`. This again adds burdens on the code using
      BIO_dgram, but it seems the only way to avoid the surprising performance
      pitfall of buffer copying to emulate iovec support. There is a fair risk
      of code being written which accidentally works on one platform but not
      another, because the author didn't realise the iovec limit is 1 on some
      platforms. Possibly we could have an “iovec limit” variable in the
      BIO_dgram which is 1 by default and which can be increased by a call to
      a function BIO_set_iovec_limit, but not beyond the fixed size discussed
      above. It would return failure if this is not possible, giving client
      code a clear way to determine whether its expectations are met.

### Alternate API

Could we use a simplified API? For example, could we have an API that returns
one datagram, where BIO_dgram uses `recvmmsg` internally and queues the
returned datagrams, thereby still avoiding extra syscalls but offering a
simpler API?

The problem here is that we want to support “single-copy” (where the data is
only copied as it is decrypted). Thus BIO_dgram needs to know the final resting
place of encrypted data at the time it makes the `recvmmsg` call.

One option would be to allow the user to set a callback on BIO_dgram which it
can use to request a new buffer, then have an API which returns the buffer:

```c
int BIO_dgram_set_read_callback(BIO *b,
                                void *(*cb)(size_t len, void *arg),
                                void *arg);
int BIO_dgram_set_read_free_callback(BIO *b,
                                     void (*cb)(void *buf,
                                                size_t buf_len,
                                                void *arg),
                                     void *arg);
int BIO_read_dequeue(BIO *b, void **buf, size_t *buf_len);
```

The BIO_dgram calls the specified callback when it needs to generate internal
iovecs for its `recvmmsg` call, and the received datagrams can then be popped
by the application and freed as it likes. (The read free callback above is only
used in rare circumstances, such as when calls to `BIO_read` and
`BIO_read_dequeue` are alternated, or when the BIO_dgram is destroyed prior to
all read buffers being dequeued; see below.) For convenience we could have an
extra call to allow a buffer to be pushed back into the BIO_dgram's internal
queue of unused read buffers, which avoids the need for the application to do
its own management of such recycled buffers:

```c
int BIO_dgram_push_read_buffer(BIO *b, void *buf, size_t buf_len);
```

On the write side, the application provides buffers and can get a callback when
they are freed. BIO_write_queue just queues for transmission, and the `sendmmsg`
call is made when calling `BIO_flush`. (TBD: whether it is reasonable to
overload the semantics of BIO_flush in this way.)

```c
int BIO_dgram_set_write_done_callback(BIO *b,
                                      void (*cb)(const void *buf,
                                                 size_t buf_len,
                                                 int status,
                                                 void *arg),
                                      void *arg);
int BIO_write_queue(BIO *b, const void *buf, size_t buf_len);
int BIO_flush(BIO *b);
```
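
Were this design adopted, usage might look like the following sketch
(`alloc_buf` and `free_buf` are hypothetical application callbacks; `b`, `pkt`
and `pkt_len` are assumed to exist, and the return-value conventions shown are
assumptions):

```c
/* Hypothetical application allocator callbacks. */
static void *alloc_buf(size_t len, void *arg)
{
    return OPENSSL_malloc(len);
}

static void free_buf(void *buf, size_t buf_len, void *arg)
{
    OPENSSL_free(buf);
}

/* ... */

void *buf;
size_t buf_len;

BIO_dgram_set_read_callback(b, alloc_buf, NULL);
BIO_dgram_set_read_free_callback(b, free_buf, NULL);

/* Pop one received datagram; the BIO may have filled many buffers in a
 * single recvmmsg(2) call behind the scenes. */
if (BIO_read_dequeue(b, &buf, &buf_len) == 1) {
    /* ... process the datagram ... */
    /* recycle the buffer rather than freeing it ourselves */
    BIO_dgram_push_read_buffer(b, buf, buf_len);
}

/* Queue datagrams for transmission; one sendmmsg(2) call flushes them. */
BIO_write_queue(b, pkt, pkt_len);
BIO_flush(b);
```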

The status argument to the write done callback will be 1 on success, some
negative value on failure, and some special negative value if the BIO_dgram is
being freed before the write could be completed.

For send/receive addresses, we import the `BIO_(set|get)_dgram_(origin|dest)`
APIs proposed in the sendmsg/recvmsg PR (#5257). `BIO_get_dgram_(origin|dest)`
should be called immediately after `BIO_read_dequeue` and
`BIO_set_dgram_(origin|dest)` should be called immediately before
`BIO_write_queue`.

This approach allows `BIO_dgram` to support myriad options via composition of
successive function calls in a “builder” style rather than via a single function
call with an excessive number of arguments or pointers to unwieldy ever-growing
argument structures, requiring constant revision of the central read/write
functions of the BIO API.

Note that since `BIO_set_dgram_(origin|dest)` sets data on outgoing packets and
`BIO_get_dgram_(origin|dest)` gets data on incoming packets, it doesn't follow
that these access the same data (they are not setters and getters of variables
called "dgram origin" and "dgram destination", even though their names make
them look like setters and getters of the same variables). We probably want to
separate these, as there is no need for a getter for outgoing packet
destination, for example, and by separating these we allow the possibility of
multithreaded use (one thread reads, one thread writes) in the future. Possibly
we should choose less confusing names for these functions, such as
`BIO_set_outgoing_dgram_(origin|dest)` and
`BIO_get_incoming_dgram_(origin|dest)`.

Pros of this approach:

  - Application can generate one datagram at a time and still get the advantages
    of sendmmsg/recvmmsg (fewer syscalls, etc.)

    We probably want this for our own QUIC implementation built on top of this
    anyway. Otherwise we will need another piece to do basically the same thing
    and agglomerate multiple datagrams into a single BIO call. Unless we only
    want to use `sendmmsg` constructively in trivial cases (e.g. where we send
    two datagrams from the same function immediately after one another, which
    doesn't seem like a common use case).

  - Flexible support for single-copy (zero-copy).

Cons of this approach:

  - Very different way of doing reads/writes might be strange to existing
    applications. *But* the primary consumer of this new API will be our own
    QUIC implementation, so this is probably not a big deal. We can always
    support `BIO_read`/`BIO_write` as a less efficient fallback for existing
    third-party users of BIO_dgram.

#### Compatibility interop

Suppose the following sequence happens:

1. BIO_read (legacy call path)
2. BIO_read_dequeue (`recvmmsg` based call path with callback-allocated buffer)
3. BIO_read (legacy call path)

For (1) we have two options:

a. Use `recvmmsg` and add the received datagrams to an RX queue just as for the
   `BIO_read_dequeue` path. We use an OpenSSL-provided default allocator
   (`OPENSSL_malloc`) and flag these datagrams as needing to be freed by OpenSSL,
   not the application.

   When the application calls `BIO_read`, a copy is performed and the internal
   buffer is freed.

b. Use `recvfrom` directly. This means we have a `recvmmsg` path and a
   `recvfrom` path depending on what API is being used.

The disadvantage of (a) is that it yields an extra copy relative to what we
have now, whereas with (b) the buffer passed to `BIO_read` gets passed through
to the syscall and we do not have to copy anything.

Since we will probably need to support platforms without `sendmmsg`/`recvmmsg`
support anyway, (b) seems like the better option.

For (2) the new API is used. Since the previous call to BIO_read is essentially
“stateless” (it's just a simple call to `recvfrom` and doesn't require mutation
of any internal BIO state, other than perhaps the last datagram
source/destination address fields), BIO_dgram can go ahead and start using the
`recvmmsg` code path. Since the RX queue will obviously be empty at this point,
it is initialised and filled using `recvmmsg`, then one datagram is popped from
it.

For (3) we have a legacy `BIO_read` but we have several datagrams still in the
RX queue. In this case we do have to copy; we have no choice. However, this
only happens in circumstances where a user of BIO_dgram alternates between old
and new APIs, which should be very unusual.

Subsequently for (3) we have to free the buffer using the free callback. This
is an unusual case where BIO_dgram is responsible for freeing read buffers and
not the application (the only other case being premature destruction; see
below). But since this seems a very strange API usage pattern, we may just want
to fail in this case.

This is probably not worth supporting, so we can have the following rule:

- After the first call to `BIO_read_dequeue` is made on a BIO_dgram, all
  subsequent calls to ordinary `BIO_read` will fail.

Of course, all of the above applies analogously to the TX side.

#### BIO_dgram_pair

We will also implement a BIO_dgram_pair from scratch. This will be provided as
a BIO pair with identical semantics to the BIO_dgram above, both for the legacy
and zero-copy code paths.

#### Thread safety

It is a functional assumption of the above design that we would never want to
have more than one thread doing TX on the same BIO, and never more than one
thread doing RX on the same BIO.

If we did ever want to do this, multiple BIOs on the same FD is one possibility
(for the BIO_dgram case, at least). But I don't believe there is any general
intention to support multithreaded use of a single BIO at this time (unless I
am mistaken), so this seems like it isn't an issue.

If we wanted to support multithreaded use of the same FD using the same BIO, we
would need to revisit the set-call-then-execute-call API approach above
(`BIO_(set|get)_dgram_(origin|dest)`), as this would pose a problem. But I
mention this mainly for completeness. The lessons we have recently learnt about
cache contention suggest that this probably wouldn't be a good idea anyway.

#### Other questions

BIO_dgram will call the allocation function to get buffers for `recvmmsg` to
fill. We might want to have a way to specify how many buffers it should offer
to `recvmmsg`, and thus how many buffers it allocates in advance.

#### Premature destruction

If BIO_dgram is freed before all datagrams are read, the read buffer free
callback is used to free any unreturned read buffers.