Datagram BIO API revisions for sendmmsg/recvmmsg
================================================

We need to evolve the API surface of BIO which is relevant to BIO_dgram (and the
eventual BIO_dgram_mem) to support APIs which allow multiple datagrams to be
sent or received simultaneously, such as sendmmsg(2)/recvmmsg(2).

The adopted design
------------------

### Design decisions

The adopted design makes the following design decisions:

- We use a sendmmsg/recvmmsg-like API. The alternative API was not considered
  for adoption because it is an explicit goal that the adopted API be suitable
  for concurrent use on the same BIO.

- We define our own structures rather than using the OS's `struct mmsghdr`.
  The motivations for this are:

  - It ensures portability between OSes and allows the API to be used
    on OSes which do not support `sendmmsg` or `sendmsg`.

  - It allows us to use structures in keeping with OpenSSL's existing
    abstraction layers (e.g. `BIO_ADDR` rather than `struct sockaddr`).

  - We do not have to expose functionality which we cannot guarantee
    we can support on all platforms (for example, arbitrary control messages).

  - It avoids the need to include OS headers in our own public headers,
    which would pollute the environment of applications which include
    our headers, potentially undesirably.

- For OSes which do not support `sendmmsg`, we emulate it using repeated
  calls to `sendmsg`. For OSes which do not support `sendmsg`, we emulate it
  using `sendto` to the extent feasible. This avoids the need for code consuming
  these new APIs to define a fallback code path.

- We do not define any flags at this time, as the flags previously considered
  for adoption cannot be supported on all platforms (Win32 does not have
  `MSG_DONTWAIT`).

- We ensure the extensibility of our `BIO_MSG` structure in a way that preserves
  ABI compatibility using a `stride` argument which callers must set to
  `sizeof(BIO_MSG)`. Implementations can examine the stride field to determine
  whether a given field is part of a `BIO_MSG`. This allows us to add optional
  fields to `BIO_MSG` at a later time without breaking ABI. All new fields must
  be added to the end of the structure. (See the sketch after this list for an
  illustration of how such a stride check might work.)

- The BIO methods are designed to support stateless operation in which they
  are simply calls to the equivalent system calls, where supported, without
  changing BIO state. In particular, this means that things like retry flags are
  not set or cleared by `BIO_sendmmsg` or `BIO_recvmmsg`.

  The motivation for this is that these functions are intended to support
  concurrent use on the same BIO. If they read or modify BIO state, they would
  need to be synchronised with a lock, undermining performance on what (for
  `BIO_dgram`) would otherwise be a straight system call.

- We do not support iovecs. The motivations for this are:

  - Not all platforms can support iovecs (e.g. Windows).

  - The only way we could emulate iovecs on platforms which don't support
    them is by copying the data to be sent into a staging buffer. This would
    defeat all of the advantages of iovecs and prevent us from meeting our
    zero/single-copy requirements. Moreover, it would lead to extremely
    surprising performance variations for consumers of the API.

  - We do not believe iovecs are needed to meet our performance requirements
    for QUIC.
    The reason for this is that, aside from a minimal packet header, all data
    in QUIC is encrypted, so all data sent via QUIC must pass through an
    encrypt step anyway, meaning that all data sent will already be copied
    and there is not going to be any issue depositing the ciphertext in a
    staging buffer together with the frame header.

  - Even if we did support iovecs, we would have to impose a limit
    on the number of iovecs supported, because we translate from our own
    structures (as discussed above) and also intend these functions to be
    stateless and not require locking. Therefore the OS-native iovec structures
    would need to be allocated on the stack.

- Sometimes, an application may wish to learn the local interface address
  associated with a receive operation or specify the local interface address to
  be used for a send operation. We support this, but require this functionality
  to be explicitly enabled before use.

  The reason for this is that enabling this functionality generally requires
  that the socket be reconfigured using `setsockopt` on most platforms. Doing
  this on demand would require state in the BIO to determine whether this
  functionality is currently switched on, which would require otherwise
  unnecessary locking, undermining performance in concurrent usage of this API
  on a given BIO. Requiring this functionality to be enabled explicitly before
  use allows this initialization to be done up front without performance cost.
  It also helps users of the API understand that this functionality is not
  always available, and to detect in advance whether it is available.
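To illustrate the stride mechanism referenced in the design decisions above,
the following sketch shows how an implementation might test whether a
hypothetical future field is present in the caller's structure. `BIO_MSG_V2`
and `new_field` are invented for this illustration and are not part of the
adopted design.

```c
#include <stddef.h>
#include <stdint.h>
#include <openssl/bio.h>

/*
 * Hypothetical future revision of BIO_MSG with a new field appended at
 * the end, as the extensibility rule requires.
 */
typedef struct bio_msg_v2_st {
    void *data;
    size_t data_len;
    BIO_ADDR *peer, *local;
    uint64_t flags;
    uint64_t new_field; /* hypothetical later addition */
} BIO_MSG_V2;

/*
 * Callers compiled against the original structure pass a smaller stride,
 * so the implementation can infer whether new_field is present.
 */
static int msg_has_new_field(size_t stride)
{
    return stride >= offsetof(BIO_MSG_V2, new_field) + sizeof(uint64_t);
}
```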
### Design

The currently proposed design is as follows:

```c
typedef struct bio_msg_st {
    void *data;
    size_t data_len;
    BIO_ADDR *peer, *local;
    uint64_t flags;
} BIO_MSG;

#define BIO_UNPACK_ERRNO(e) /*...*/
#define BIO_IS_ERRNO(e) /*...*/

ossl_ssize_t BIO_sendmmsg(BIO *b, BIO_MSG *msg, size_t stride,
                          size_t num_msg, uint64_t flags);
ossl_ssize_t BIO_recvmmsg(BIO *b, BIO_MSG *msg, size_t stride,
                          size_t num_msg, uint64_t flags);
```

The API is used as follows:

- `msg` points to an array of `num_msg` `BIO_MSG` structures.

- Both functions have identical prototypes and return the number of messages
  processed in the array. If no messages were processed due to an error, `-1`
  is returned. If an OS-level socket error occurs, a negative value `v` is
  returned. The caller should determine that `v` is an OS-level socket error by
  calling `BIO_IS_ERRNO(v)` and may obtain the OS-level socket error code by
  calling `BIO_UNPACK_ERRNO(v)`.

- `stride` must be set to `sizeof(BIO_MSG)`.

- `data` points to the buffer of data to be sent or to be filled with received
  data. `data_len` is the size of the buffer in bytes on call. If the
  given message in the array is processed (i.e., if the return value
  exceeds the index of that message in the array), `data_len` is updated
  to the actual amount of data sent or received at return time.

- `flags` in the `BIO_MSG` structure provides per-message flags to
  the `BIO_sendmmsg` or `BIO_recvmmsg` call. If the given message in the array
  is processed, `flags` is written with zero or more result flags at return
  time. The `flags` argument to the call itself provides for global flags
  affecting all messages in the array. Currently, no per-message or global
  flags are defined and all of these fields are set to zero on call and on
  return.

- `peer` and `local` are optional pointers to `BIO_ADDR` structures into
  which the remote and local addresses are to be filled. If either of these
  is NULL, the given addressing information is not requested. Local address
  support may not be available in all circumstances, in which case processing
  of the message fails. (This means that the function returns the number of
  messages processed, or -1 if the message in question is the first message.)

  Support for `local` must be explicitly enabled before use, otherwise
  attempts to use it fail.

Local address support is enabled as follows:

```c
int BIO_dgram_set_local_addr_enable(BIO *b, int enable);
int BIO_dgram_get_local_addr_enable(BIO *b);
int BIO_dgram_get_local_addr_cap(BIO *b);
```

`BIO_dgram_get_local_addr_cap()` returns 1 if local address support is
available. It is then enabled using `BIO_dgram_set_local_addr_enable()`, which
fails if support is not available.
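As a usage illustration of the adopted API, here is a minimal sketch of a
batched receive. The batch size, buffer size and helper function are arbitrary
choices made for this example, and the error handling is illustrative only.

```c
#include <stdio.h>
#include <openssl/bio.h>

#define NUM_MSG 32
#define BUF_LEN 1500

/* Receive up to NUM_MSG datagrams from a datagram BIO in one call. */
static ossl_ssize_t recv_batch(BIO *b, unsigned char bufs[NUM_MSG][BUF_LEN],
                               BIO_MSG msgs[NUM_MSG])
{
    ossl_ssize_t nproc;
    size_t i;

    for (i = 0; i < NUM_MSG; ++i) {
        msgs[i].data     = bufs[i];
        msgs[i].data_len = BUF_LEN;
        msgs[i].peer     = NULL; /* peer address not requested */
        msgs[i].local    = NULL; /* local address not requested */
        msgs[i].flags    = 0;    /* no per-message flags are defined yet */
    }

    nproc = BIO_recvmmsg(b, msgs, sizeof(BIO_MSG), NUM_MSG, 0);
    if (nproc < 0) {
        if (BIO_IS_ERRNO(nproc))
            fprintf(stderr, "OS socket error: %d\n",
                    (int)BIO_UNPACK_ERRNO(nproc));
        return nproc;
    }

    /* For each processed message i < nproc, msgs[i].data_len now holds
     * the number of bytes actually received into bufs[i]. */
    return nproc;
}
```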
Options which were considered
-----------------------------

Options for the API surface which were considered included:

### sendmmsg/recvmmsg-like API

This design was chosen to form the basis of the adopted design, which is
described above.

```c
int BIO_readm(BIO *b, BIO_mmsghdr *msgvec,
              unsigned len, int flags, struct timespec *timeout);
int BIO_writem(BIO *b, BIO_mmsghdr *msgvec,
               unsigned len, int flags, struct timespec *timeout);
```

We can either define `BIO_mmsghdr` as a typedef of `struct mmsghdr` or redefine
an equivalent structure. The former has the advantage that we can just pass the
structures through to the syscall without copying them.

Note that in `BIO_mem_dgram` we will have to process and therefore understand
the contents of `struct mmsghdr` ourselves. Therefore, initially we define a
subset of `struct mmsghdr` as being supported: specifically, no control
messages; `msg_name` and `msg_iov` only.

The flags argument is defined by us. Initially we can support something like
`MSG_DONTWAIT` (say, `BIO_DONTWAIT`).

#### Implementation Questions

If we go with this, there are some issues that arise:

- Are `BIO_mmsghdr`, `BIO_msghdr` and `BIO_iovec` simple typedefs
  for OS-provided structures, or our own independent structure
  definitions?

  - If we use OS-provided structures:

    - We would need to include the OS headers which provide these
      structures in our public API headers.

    - If we choose to support these functions when OS support is not available
      (see discussion below), we would need to define our own structures in
      this case (a “polyfill” approach).

  - If we use our own structures:

    - We would need to translate these structures during every call.

      We would also need to have storage inside the BIO_dgram for *m* `struct
      msghdr`, *m\*v* iovecs, etc. Since we want to support multithreaded use,
      these allocations will probably need to be on the stack, and therefore
      must be limited.

      Limiting *m* isn't a problem, because `sendmmsg` returns the number
      of messages sent, so the existing semantics we are trying to match
      let us just send or receive fewer messages than we were asked to.

      However, it does seem like we will need to limit *v*, the number of
      iovecs per message. So what limit should we give to *v*? We will need a
      fixed stack allocation of OS iovec structures, and we can allocate from
      this stack allocation as we iterate through the `BIO_msghdr` structures
      we have been given. So in practice we could simply send messages until
      we reach our iovec limit, and then return (a sketch of this appears
      after this list).

      For example, suppose we allocate 64 iovecs internally:

      ```c
      struct iovec vecs[64];
      ```

      If the first message passed to a call to `BIO_writem` has 64 iovecs
      attached to it, no further messages can be sent and `BIO_writem`
      returns 1.

      If three messages are sent, with 32, 32, and 1 iovecs respectively,
      the first two messages are sent and `BIO_writem` returns 2.

      So the only important thing we would need to document in this API
      is the limit of iovecs on a single message; in other words, the
      number of iovecs which must not be exceeded if a forward progress
      guarantee is to be made. For example, if we allocate 64 iovecs
      internally, a `BIO_writem` call with a single message with 65 iovecs
      will never work, and this becomes part of the API contract.

      Obviously these quantities of iovecs are unrealistically large. Since
      iovecs are small, we can afford to set the limit high enough that it
      shouldn't cause any problems in practice. We can increase the limit
      later without a breaking API change, but we cannot decrease it later.
      So we might want to start with something small, like 8.

- We also need to decide what to do for OSes which don't support at least
  `sendmsg`/`recvmsg`.

  - Don't provide these functions and require all users of these functions to
    have an alternate code path which doesn't rely on them?

    - Not providing these functions on OSes that don't support at least
      `sendmsg`/`recvmsg` is a simple solution, but adds complexity to code
      using BIO_dgram. (Though it does communicate more realistic performance
      expectations to calling code, since that code knows when these functions
      are actually available.)

  - Provide these functions and emulate the functionality:

    - However, there is a question here as to how we implement
      the iovec arguments on platforms without `sendmsg`/`recvmsg`. (We cannot
      use `writev`/`readv` because we need peer address information.) Logically
      implementing these would then have to be done by copying buffers around
      internally before calling `sendto`/`recvfrom`, defeating the point of
      iovecs and providing a performance profile which is surprising to code
      using BIO_dgram.

    - Another option could be a variable limit on the number of iovecs,
      which can be queried from BIO_dgram. This would be a constant set
      when libcrypto is compiled. It would be 1 for platforms not supporting
      `sendmsg`/`recvmsg`. This again adds burdens on the code using
      BIO_dgram, but it seems the only way to avoid the surprising performance
      pitfall of buffer copying to emulate iovec support. There is a fair risk
      of code being written which accidentally works on one platform but not
      another, because the author didn't realise the iovec limit is 1 on some
      platforms. Possibly we could have an “iovec limit” variable in the
      BIO_dgram which is 1 by default, and which can be increased by a call to
      a function `BIO_set_iovec_limit`, but not beyond the fixed size discussed
      above. It would return failure if not possible, and this would give
      client code a clear way to determine whether its expectations are met.
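The sketch referenced above shows how a translated `BIO_writem` might draw OS
iovecs from a fixed stack allocation and stop early when the limit is reached.
The `BIO_iovec`/`BIO_msghdr` definitions and field names here are invented for
illustration, as these structures were never finalised.

```c
#define _GNU_SOURCE /* for sendmmsg on Linux */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define OSSL_IOV_MAX 64
#define OSSL_MSG_MAX 32

/* Invented structures for illustration. */
typedef struct bio_iovec_st {
    void *buf;
    size_t buf_len;
} BIO_iovec;

typedef struct bio_msghdr_st {
    BIO_iovec *iov;
    unsigned num_iov;
    /* addressing fields omitted for brevity */
} BIO_msghdr;

static int writem_sketch(int fd, const BIO_msghdr *msgs, unsigned num_msg)
{
    struct iovec vecs[OSSL_IOV_MAX];  /* fixed stack allocation */
    struct mmsghdr mm[OSSL_MSG_MAX];
    unsigned i, used = 0;

    if (num_msg > OSSL_MSG_MAX)
        num_msg = OSSL_MSG_MAX;       /* fewer messages may be processed */

    for (i = 0; i < num_msg; ++i) {
        unsigned j;

        /* Stop before this message if its iovecs do not fit in what
         * remains of the stack allocation. A message with more than
         * OSSL_IOV_MAX iovecs can therefore never be sent. */
        if (msgs[i].num_iov > OSSL_IOV_MAX - used)
            break;

        for (j = 0; j < msgs[i].num_iov; ++j) {
            vecs[used + j].iov_base = msgs[i].iov[j].buf;
            vecs[used + j].iov_len  = msgs[i].iov[j].buf_len;
        }

        memset(&mm[i], 0, sizeof(mm[i]));
        mm[i].msg_hdr.msg_iov    = &vecs[used];
        mm[i].msg_hdr.msg_iovlen = msgs[i].num_iov;
        used += msgs[i].num_iov;
    }

    if (i == 0)
        return 0; /* no forward progress possible for the first message */

    return sendmmsg(fd, mm, i, 0); /* returns the number of messages sent */
}
```

With 64 internal iovecs this matches the behaviour described above: a first
message carrying 64 iovecs is sent alone, and a 32/32/1 sequence sends only
the first two messages.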
### Alternate API

Could we use a simplified API? For example, could we have an API that returns
one datagram where BIO_dgram uses `readmmsg` internally and queues the returned
datagrams, thereby still avoiding extra syscalls but offering a simple API?

The problem here is we want to support “single-copy” (where the data is only
copied as it is decrypted). Thus BIO_dgram needs to know the final resting place
of encrypted data at the time it makes the `readmmsg` call.

One option would be to allow the user to set a callback on BIO_dgram which it
can use to request a new buffer, and then have an API which returns the buffer:

```c
int BIO_dgram_set_read_callback(BIO *b,
                                void *(*cb)(size_t len, void *arg),
                                void *arg);
int BIO_dgram_set_read_free_callback(BIO *b,
                                     void (*cb)(void *buf,
                                                size_t buf_len,
                                                void *arg),
                                     void *arg);
int BIO_read_dequeue(BIO *b, void **buf, size_t *buf_len);
```

The BIO_dgram calls the specified callback when it needs to generate internal
iovecs for its `readmmsg` call, and the received datagrams can then be popped by
the application and freed as it likes. (The read free callback above is only
used in rare circumstances, such as when calls to `BIO_read` and
`BIO_read_dequeue` are alternated, or when the BIO_dgram is destroyed prior to
all read buffers being dequeued; see below.) For convenience we could have an
extra call to allow a buffer to be pushed back into the BIO_dgram's internal
queue of unused read buffers, which avoids the need for the application to do
its own management of such recycled buffers:

```c
int BIO_dgram_push_read_buffer(BIO *b, void *buf, size_t buf_len);
```

On the write side, the application provides buffers and can get a callback when
they are freed. BIO_write_queue just queues for transmission, and the `sendmmsg`
call is made when calling `BIO_flush`. (TBD: whether it is reasonable to
overload the semantics of BIO_flush in this way.)

```c
int BIO_dgram_set_write_done_callback(BIO *b,
                                      void (*cb)(const void *buf,
                                                 size_t buf_len,
                                                 int status,
                                                 void *arg),
                                      void *arg);
int BIO_write_queue(BIO *b, const void *buf, size_t buf_len);
int BIO_flush(BIO *b);
```

The status argument to the write done callback will be 1 on success, some
negative value on failure, and some special negative value if the BIO_dgram is
being freed before the write could be completed.

For send/receive addresses, we import the `BIO_(set|get)_dgram_(origin|dest)`
APIs proposed in the sendmsg/recvmsg PR (#5257). `BIO_get_dgram_(origin|dest)`
should be called immediately after `BIO_read_dequeue` and
`BIO_set_dgram_(origin|dest)` should be called immediately before
`BIO_write_queue`.

This approach allows `BIO_dgram` to support myriad options via composition of
successive function calls in a “builder” style rather than via a single function
call with an excessive number of arguments or pointers to unwieldy ever-growing
argument structures, requiring constant revision of the central read/write
functions of the BIO API.
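A sketch of how an application might drive this considered API, assuming (for
illustration) that these calls return 1 on success; the `app_alloc`/`app_free`
helpers and the `pump` function are invented:

```c
#include <stdlib.h>
#include <openssl/bio.h>

/* Invented allocation callbacks: the BIO requests RX buffers from us. */
static void *app_alloc(size_t len, void *arg)
{
    (void)arg;
    return malloc(len);
}

static void app_free(void *buf, size_t buf_len, void *arg)
{
    (void)buf_len;
    (void)arg;
    free(buf);
}

static int pump(BIO *b, const void *out, size_t out_len)
{
    void *buf;
    size_t buf_len;

    BIO_dgram_set_read_callback(b, app_alloc, NULL);
    BIO_dgram_set_read_free_callback(b, app_free, NULL);

    /* RX: the BIO fills app_alloc-provided buffers via its internal
     * readmmsg call; we pop them one datagram at a time and own them
     * once dequeued. */
    while (BIO_read_dequeue(b, &buf, &buf_len) == 1) {
        /* ... process the datagram in buf ... */
        free(buf);
    }

    /* TX: queue one datagram, then flush to trigger a single sendmmsg. */
    if (BIO_write_queue(b, out, out_len) != 1)
        return 0;

    return BIO_flush(b) == 1;
}
```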
Note that since `BIO_set_dgram_(origin|dest)` sets data on outgoing packets and
`BIO_get_dgram_(origin|dest)` gets data on incoming packets, it doesn't follow
that these are accessing the same data (they are not setters and getters of
variables called "dgram origin" and "dgram destination", even though the names
make them look like setters and getters of the same variables). We probably
want to separate these, as there is no need for a getter for outgoing packet
destination, for example, and by separating these we allow the possibility of
multithreaded use (one thread reads, one thread writes) in the future. Possibly
we should choose less confusing names for these functions; maybe
`BIO_set_outgoing_dgram_(origin|dest)` and
`BIO_get_incoming_dgram_(origin|dest)`.

Pros of this approach:

  - Application can generate one datagram at a time and still get the advantages
    of sendmmsg/recvmmsg (fewer syscalls, etc.)

    We probably want this for our own QUIC implementation built on top of this
    anyway. Otherwise we will need another piece to do basically the same thing
    and agglomerate multiple datagrams into a single BIO call, unless we only
    want to use `sendmmsg` constructively in trivial cases (e.g. where we send
    two datagrams from the same function immediately after one another, which
    doesn't seem like a common use case).

  - Flexible support for single-copy (zero-copy).

Cons of this approach:

  - Very different way of doing reads/writes might be strange to existing
    applications. *But* the primary consumer of this new API will be our own
    QUIC implementation, so this is probably not a big deal. We can always
    support `BIO_read`/`BIO_write` as a less efficient fallback for existing
    third-party users of BIO_dgram.

#### Compatibility interop

Suppose the following sequence happens:

1. BIO_read (legacy call path)
2. BIO_read_dequeue (`recvmmsg`-based call path with callback-allocated buffer)
3. BIO_read (legacy call path)

For (1) we have two options:

a. Use `recvmmsg` and add the received datagrams to an RX queue just as for the
   `BIO_read_dequeue` path. We use an OpenSSL-provided default allocator
   (`OPENSSL_malloc`) and flag these datagrams as needing to be freed by
   OpenSSL, not the application.

   When the application calls `BIO_read`, a copy is performed and the internal
   buffer is freed.

b. Use `recvfrom` directly. This means we have a `recvmmsg` path and a
   `recvfrom` path depending on which API is being used.

   The disadvantage of (a) is that it yields an extra copy relative to what we
   have now, whereas with (b) the buffer passed to `BIO_read` gets passed
   through to the syscall and we do not have to copy anything.

   Since we will probably need to support platforms without
   `sendmmsg`/`recvmmsg` support anyway, (b) seems like the better option.

For (2) the new API is used. Since the previous call to BIO_read is essentially
“stateless” (it's just a simple call to `recvfrom`, and doesn't require mutation
of any internal BIO state other than maybe the last datagram source/destination
address fields), BIO_dgram can go ahead and start using the `recvmmsg` code
path. Since the RX queue will obviously be empty at this point, it is
initialised and filled using `recvmmsg`, then one datagram is popped from it.

For (3) we have a legacy `BIO_read` but we have several datagrams still in the
RX queue. In this case we do have to copy; we have no choice. However, this only
happens in circumstances where a user of BIO_dgram alternates between old and
new APIs, which should be very unusual.

Subsequently for (3) we have to free the buffer using the free callback. This is
an unusual case where BIO_dgram is responsible for freeing read buffers and not
the application (the only other case being premature destruction, see below).
But since this seems a very strange API usage pattern, we may just want to fail
in this case.

This is probably not worth supporting. So we can have the following rule:

- After the first call to `BIO_read_dequeue` is made on a BIO_dgram, all
  subsequent calls to ordinary `BIO_read` will fail.
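A minimal sketch of how this rule might be enforced inside the legacy read
path, assuming an invented internal flag `dequeue_used` which is set on the
first `BIO_read_dequeue` call:

```c
#include <openssl/bio.h>

/* Hypothetical internal BIO_dgram state; names invented. */
struct dgram_data_st {
    int dequeue_used; /* set on the first BIO_read_dequeue call */
    /* ... */
};

/* Legacy recvfrom-based path (not shown). */
static int dgram_read_recvfrom(struct dgram_data_st *d, char *out, int outl);

static int dgram_read(BIO *b, char *out, int outl)
{
    struct dgram_data_st *d = BIO_get_data(b);

    /* Legacy BIO_read fails once the dequeue API has been used. */
    if (d->dequeue_used)
        return -1;

    return dgram_read_recvfrom(d, out, outl);
}
```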
Of course, all of the above applies analogously to the TX side.

#### BIO_dgram_pair

We will also implement from scratch a BIO_dgram_pair. This will be provided as a
BIO pair which provides identical semantics to the BIO_dgram above, both for the
legacy and zero-copy code paths.

#### Thread safety

It is a functional assumption of the above design that we would never want to
have more than one thread doing TX on the same BIO, and never more than one
thread doing RX on the same BIO.

If we did ever want to do this, multiple BIOs on the same FD is one possibility
(for the BIO_dgram case at least). But I don't believe there is any general
intention to support multithreaded use of a single BIO at this time (unless I am
mistaken), so this seems like it isn't an issue.

If we wanted to support multithreaded use of the same FD using the same BIO, we
would need to revisit the set-call-then-execute-call API approach above
(`BIO_(set|get)_dgram_(origin|dest)`), as this would pose a problem. But I
mention this mainly for completeness. Our recently learnt lessons on cache
contention suggest that this probably wouldn't be a good idea anyway.

#### Other questions

BIO_dgram will call the allocation function to get buffers for `recvmmsg` to
fill. We might want to have a way to specify how many buffers it should offer to
`recvmmsg`, and thus how many buffers it allocates in advance.

#### Premature destruction

If BIO_dgram is freed before all datagrams are read, the read buffer free
callback is used to free any unreturned read buffers.
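For illustration, the cleanup this implies on free might look like the
following sketch, with an invented singly-linked RX queue of buffers that were
allocated via the read callback but never dequeued:

```c
#include <stddef.h>
#include <openssl/crypto.h>

/* Invented internal RX queue entry; names are illustrative only. */
struct rx_entry_st {
    struct rx_entry_st *next;
    void *buf;
    size_t buf_len;
};

/* Called from the BIO_dgram destructor: return every unreturned read
 * buffer to the application via the read free callback. */
static void dgram_free_pending_rx(struct rx_entry_st *head,
                                  void (*read_free_cb)(void *, size_t, void *),
                                  void *read_free_arg)
{
    struct rx_entry_st *e, *next;

    for (e = head; e != NULL; e = next) {
        next = e->next;
        read_free_cb(e->buf, e->buf_len, read_free_arg);
        OPENSSL_free(e);
    }
}
```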