<!--
Copyright (C) Daniel Stenberg, <daniel@haxx.se>, et al.

SPDX-License-Identifier: curl
-->

# URL syntax and their use in curl

## Specifications

The official "URL syntax" is primarily defined in these two different
specifications:

 - [RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986) (although URL is called
   "URI" in there)
 - [The WHATWG URL Specification](https://url.spec.whatwg.org/)

RFC 3986 is the earlier one, and curl has always tried to adhere to that one
(since it shipped in January 2005).

The WHATWG URL spec was written later, is incompatible with RFC 3986 and
changes over time.

## Variations

URL parsers as implemented in browsers, libraries and tools usually opt to
support one of the mentioned specifications. Bugs, differences in
interpretation and the moving nature of the WHATWG spec do however make it
unlikely that multiple parsers treat URLs the same way.

## Security

Due to the inherent differences between URL parser implementations, it is
considered a security risk to mix different implementations and assume the
same behavior.

For example, if you use one parser to check if a URL uses a good hostname or
the correct auth field, and then pass on that same URL to a *second* parser,
there is always a risk it treats the same URL differently. There is no right
and wrong in URL land, only differences of opinions.

libcurl offers a separate API to its URL parser for this reason, among others.

Applications may at times find it convenient to allow users to specify URLs
for various purposes and that string would then end up fed to curl. Getting a
URL from an external untrusted party and using it with curl brings several
security concerns:

1. If you have an application that runs as or in a server application, getting
   an unfiltered URL can trick your application into accessing a local resource
   instead of a remote resource. Protecting yourself against localhost accesses
   is hard when accepting user-provided URLs.

2. Such custom URLs can access other ports than you planned as port numbers
   are part of the regular URL format. The combination of a local host and a
   custom port number can allow external users to play tricks with your local
   services.

3. Such a URL might use other schemes than you thought of or planned for.

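The three concerns listed above can be sketched as a small pre-flight check.
This is only an illustration written with Python's standard library: the
function name, the allowed schemes and the allowed ports are assumptions made
for the example, not anything curl provides. Python's parser is also not
curl's parser, so in a real deployment the same checks would be better done
with the parser that performs the transfer, as this section advises.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumption: only web transfers are wanted
ALLOWED_PORTS = {80, 443}            # assumption: only the default web ports


def check_untrusted_url(url: str) -> bool:
    """Return True only if the URL passes all three illustrative checks."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError for a malformed or too large port
    except ValueError:
        return False

    # Concern 3: unexpected schemes.
    scheme = parts.scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        return False

    # Concern 2: unexpected ports.
    if port is None:
        port = 443 if scheme == "https" else 80
    if port not in ALLOWED_PORTS:
        return False

    # Concern 1: local resources.
    host = (parts.hostname or "").rstrip(".")
    if not host or host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return True  # a name, not an IP literal; DNS may still point anywhere
    return not (addr.is_loopback or addr.is_private or addr.is_link_local)
```

A hostname that passes this check can still resolve to a local address, which
is one reason protecting against localhost accesses is hard.
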
## "RFC 3986 plus"

curl recognizes a URL syntax that we call "RFC 3986 plus". It is based on
the well-established RFC 3986 so that previously written command lines and
scripts using curl keep working.

curl's URL parser allows a few deviations from the spec in order to
inter-operate better with URLs that appear in the wild.

### Spaces

A URL provided to curl cannot contain spaces. They need to be provided URL
encoded to be accepted in a URL by curl.

An exception to this rule: `Location:` response headers that indicate to a
client where a resource has been redirected to, sometimes contain spaces. This
is a violation of RFC 3986 but is fine in the WHATWG spec. curl handles these
by re-encoding them to `%20`.

### Non-ASCII

Byte values in a provided URL that are outside of the printable ASCII range
are percent-encoded by curl.

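As a rough illustration of this kind of percent-encoding, the sketch below
encodes every byte outside `0x21` to `0x7E`. The function name, the choice of
UTF-8 for the input and the decision to treat the space and control bytes the
same way as high bytes are assumptions made for the example. It does not
reproduce curl's exact rules, and hostnames are a separate matter (see IDNA).

```python
def percent_encode_non_printable(url: str) -> str:
    """Percent-encode bytes outside the visible ASCII range 0x21-0x7E."""
    out = []
    for byte in url.encode("utf-8"):  # assumption: the input is UTF-8 text
        if 0x21 <= byte <= 0x7E:
            out.append(chr(byte))        # visible ASCII is kept as-is
        else:
            out.append(f"%{byte:02X}")   # everything else becomes %XX
    return "".join(out)
```

Existing `%XX` sequences pass through untouched because `%` is itself a
visible ASCII character.
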
### Multiple slashes

An absolute URL always starts with a "scheme" followed by a colon. For all the
schemes curl supports, the colon must be followed by two slashes according to
RFC 3986 but not according to the WHATWG spec, which allows anything from one
slash to an unlimited number of them.

curl allows one, two or three slashes after the colon to still be considered a
valid URL.

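The rule can be pictured with a small normalizer that accepts one to three
slashes and rewrites them to two. This is a hedged sketch, not curl's code:
the function name and regular expression are made up for the example, and it
ignores `file:` URLs, where a third slash starts the path.

```python
import re
from typing import Optional

# A scheme, a colon, then however many slashes follow it.
_SCHEME_SLASHES = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):(/*)")


def normalize_slashes(url: str) -> Optional[str]:
    """Return the URL with exactly two slashes after the scheme, or None."""
    match = _SCHEME_SLASHES.match(url)
    if match is None:
        return None  # no scheme found
    scheme, slashes = match.group(1), match.group(2)
    if not 1 <= len(slashes) <= 3:
        return None  # zero, or four and more, slashes are rejected here
    return f"{scheme}://{url[match.end():]}"
```
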
### "scheme-less"

curl supports "URLs" that do not start with a scheme. This is not supported by
any of the specifications. This is a shortcut to entering URLs that was
supported by browsers early on and has been mimicked by curl.

Based on what the hostname starts with, curl "guesses" what protocol to use:

 - `ftp.` means FTP
 - `dict.` means DICT
 - `ldap.` means LDAP
 - `imap.` means IMAP
 - `smtp.` means SMTP
 - `pop3.` means POP3
 - everything else means HTTP

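The guessing table above amounts to a prefix lookup, which could be sketched
like this. The function name is hypothetical and the case-insensitive
comparison is an assumption of the example.

```python
# Hostname prefixes and the scheme each one implies, in the order listed above.
_PREFIX_TO_SCHEME = (
    ("ftp.", "ftp"),
    ("dict.", "dict"),
    ("ldap.", "ldap"),
    ("imap.", "imap"),
    ("smtp.", "smtp"),
    ("pop3.", "pop3"),
)


def guess_scheme(hostname: str) -> str:
    """Guess a scheme from the start of a hostname, falling back to HTTP."""
    lowered = hostname.lower()  # assumption: the prefix match ignores case
    for prefix, scheme in _PREFIX_TO_SCHEME:
        if lowered.startswith(prefix):
            return scheme
    return "http"
```

Note that the dot is part of the prefix, so `ftpserver.example.com` still
falls through to HTTP in this sketch.
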
### Globbing letters

The curl command line tool supports "globbing" of URLs. It means that you can
create ranges and lists using `[N-M]` and `{one,two,three}` sequences. The
characters used for this (`[]{}`) are reserved in RFC 3986 and can therefore
not legitimately be part of such a URL.

They are however not reserved or special in the WHATWG specification, so
globbing can mess up such URLs. Globbing can be turned off for such occasions
(using `--globoff`).

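To show what such an expansion produces, here is a simplified expander for
numeric `[N-M]` ranges and `{a,b,c}` lists. It is a sketch under stated
assumptions: the function name is invented, only plain numeric ranges are
handled, and the order of the results is simply whatever
`itertools.product` yields, which is not claimed to match the curl tool.

```python
import itertools
import re

# Either a {comma,separated,list} or a numeric [N-M] range.
_GLOB_TOKEN = re.compile(r"\{([^{}]*)\}|\[(\d+)-(\d+)\]")


def expand_glob(url: str) -> list:
    """Expand list and numeric-range patterns into the full set of URLs."""
    pieces = []  # each entry is the list of alternatives for one slot
    pos = 0
    for match in _GLOB_TOKEN.finditer(url):
        pieces.append([url[pos:match.start()]])  # literal text before the token
        if match.group(1) is not None:
            pieces.append(match.group(1).split(","))
        else:
            low, high = int(match.group(2)), int(match.group(3))
            pieces.append([str(n) for n in range(low, high + 1)])
        pos = match.end()
    pieces.append([url[pos:]])  # literal text after the last token
    return ["".join(combo) for combo in itertools.product(*pieces)]
```

For example, `expand_glob("http://example.com/{a,b}/img[1-3].png")` yields six
URLs. A bracketed IPv6 address does not match the range pattern and is left
alone by this sketch.
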
# URL syntax details

A URL may consist of the following components - many of them are optional:

    [scheme][divider][userinfo][hostname][port number][path][query][fragment]

Each component is separated from the following component with a divider
character or string.

For example, this could look like:

    http://user:password@www.example.com:80/index.html?foo=bar#top

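To make the components concrete, the example URL above can be taken apart
with a generic parser. The sketch uses Python's `urllib.parse`, which is a
different parser than curl's, so it only illustrates the naming of the parts
for this simple, unambiguous URL.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://user:password@www.example.com:80/index.html?foo=bar#top")

# Map the parsed pieces onto the component names used in this document.
components = {
    "scheme": parts.scheme,
    "userinfo": f"{parts.username}:{parts.password}",
    "hostname": parts.hostname,
    "port number": parts.port,
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}
```
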
## Scheme

The scheme specifies the protocol to use. A curl build can support a few or
many different schemes. You can limit what schemes curl should accept.

curl supports the following schemes on URLs specified to transfer. They are
matched case insensitively:

`dict`, `file`, `ftp`, `ftps`, `gopher`, `gophers`, `http`, `https`, `imap`,
`imaps`, `ldap`, `ldaps`, `mqtt`, `pop3`, `pop3s`, `rtmp`, `rtmpe`, `rtmps`,
`rtmpt`, `rtmpte`, `rtmpts`, `rtsp`, `smb`, `smbs`, `smtp`, `smtps`, `telnet`,
`tftp`

When the URL is specified to identify a proxy, curl recognizes the following
schemes:

`http`, `https`, `socks4`, `socks4a`, `socks5`, `socks5h`, `socks`

## Userinfo

The userinfo field can be used to set username and password for
authentication purposes in this transfer. The use of this field is discouraged
since it often means passing around the password in plain text and is thus a
security risk.

URLs for IMAP, POP3 and SMTP also support *login options* as part of the
userinfo field. They are provided after the password, separated from it by a
semicolon.

## Hostname

The hostname part of the URL contains the address of the server that you want
to connect to. This can be the fully qualified domain name of the server, the
local network name of the machine on your network or the IP address of the
server or machine represented by either an IPv4 or IPv6 address (within
brackets). For example:

    http://www.example.com/

    http://hostname/

    http://192.168.0.1/

    http://[2001:1890:1112:1::20]/

### "localhost"

Starting in curl 7.77.0, curl uses loopback IP addresses for the name
`localhost`: `127.0.0.1` and `::1`. It does not resolve the name using the
resolver functions.

This is done to make sure the host accessed is truly the localhost - the local
machine.

### IDNA

If curl was built with International Domain Name (IDN) support, it can also
handle hostnames using non-ASCII characters.

When built with libidn2, curl uses the IDNA 2008 standard. This is equivalent
to the WHATWG URL spec, but differs from certain browsers that use IDNA 2003
Transitional Processing. The two standards have a huge overlap but differ
slightly, perhaps most famously in how they deal with the German "double s"
(`ß`).

When WinIDN is used, curl uses IDNA 2003 Transitional Processing, like the rest
of Windows.

## Port number

If there is a colon after the hostname, that should be followed by the port
number to use, in the range 1 - 65535. curl also supports a blank port number
field - but only if the URL starts with a scheme.

If the port number is not specified in the URL, curl uses a default port
number based on the provided scheme:

DICT 2628, FTP 21, FTPS 990, GOPHER 70, GOPHERS 70, HTTP 80, HTTPS 443,
IMAP 143, IMAPS 993, LDAP 389, LDAPS 636, MQTT 1883, POP3 110, POP3S 995,
RTMP 1935, RTMPS 443, RTMPT 80, RTSP 554, SCP 22, SFTP 22, SMB 445, SMBS 445,
SMTP 25, SMTPS 465, TELNET 23, TFTP 69

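The default-port rule can be sketched as a table lookup. The dictionary below
repeats the defaults for the listed schemes, using the registered ports 143
for IMAP and 389 for LDAP. The function name and its error handling are
assumptions of this example.

```python
from typing import Optional

DEFAULT_PORTS = {
    "dict": 2628, "ftp": 21, "ftps": 990, "gopher": 70, "gophers": 70,
    "http": 80, "https": 443, "imap": 143, "imaps": 993, "ldap": 389,
    "ldaps": 636, "mqtt": 1883, "pop3": 110, "pop3s": 995, "rtmp": 1935,
    "rtmps": 443, "rtmpt": 80, "rtsp": 554, "scp": 22, "sftp": 22,
    "smb": 445, "smbs": 445, "smtp": 25, "smtps": 465, "telnet": 23,
    "tftp": 69,
}


def effective_port(scheme: str, explicit_port: Optional[int] = None) -> int:
    """Return the explicit port if one was given, else the scheme default."""
    if explicit_port is not None:
        if not 1 <= explicit_port <= 65535:
            raise ValueError("port number out of range")
        return explicit_port
    # Schemes are matched case insensitively; unknown ones raise KeyError.
    return DEFAULT_PORTS[scheme.lower()]
```
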
# Scheme specific behaviors

## FTP

The path part of an FTP request specifies the file to retrieve and from which
directory. If the file part is omitted then libcurl downloads the directory
listing for the directory specified. If the directory is omitted then the
directory listing for the root / home directory is returned.

FTP servers typically put the user in its "home directory" after login, which
then differs between users. To explicitly specify the root directory of an FTP
server, start the path with double slash `//` or `/%2f` (2F is the hexadecimal
value of the ASCII code for the slash).

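The difference between a path relative to the login directory and one that
starts from the server root can be sketched with a small URL builder. The
function and parameter names are hypothetical, and the sketch uses the `/%2f`
form described above.

```python
from urllib.parse import quote


def ftp_url(host: str, path: str, from_root: bool = False) -> str:
    """Build an FTP URL relative to the login directory or to the root."""
    relative = quote(path.lstrip("/"))       # keeps "/" between directories
    prefix = "/%2f" if from_root else "/"    # "/%2f" means the server root
    return f"ftp://{host}{prefix}{relative}"
```

A path ending in a slash, such as `ftp_url("ftp.example.com", "pub/")`, names
a directory and therefore asks for its listing.
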
## FILE

When a `FILE://` URL is accessed on Windows systems, it can be crafted in a
way so that Windows attempts to connect to a (remote) machine when curl wants
to read or write such a path.

curl only allows the hostname part of a FILE URL to be one out of these three
alternatives: `localhost`, `127.0.0.1` or blank ("", zero characters).
Anything else makes curl fail to parse the URL.

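The hostname restriction can be expressed as a simple membership test. The
sketch below uses Python's generic parser rather than curl's, and the function
name is invented for the example, so it illustrates the rule and is not a
substitute for letting curl parse the URL.

```python
from urllib.parse import urlsplit

# The three hostname alternatives named above; "" stands for a blank host.
_ALLOWED_FILE_HOSTS = {"", "localhost", "127.0.0.1"}


def file_host_allowed(url: str) -> bool:
    """Check that a file:// URL uses one of the three permitted hostnames."""
    parts = urlsplit(url)
    if parts.scheme.lower() != "file":
        raise ValueError("not a file:// URL")
    return (parts.hostname or "") in _ALLOWED_FILE_HOSTS
```
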
### Windows-specific FILE details

curl accepts that the FILE URL's path starts with a "drive letter". That is a
single letter `a` to `z` followed by a colon or a pipe character (`|`).

The Windows operating system itself converts some file accesses to perform
network accesses over SMB/CIFS, through several different file path patterns.
This way, a `file://` URL passed to curl *might* be converted into a network
access inadvertently and unknowingly to curl. This is a Windows feature curl
cannot control or disable.

## IMAP

The path part of an IMAP request not only specifies the mailbox to list or
select, but can also be used to check the `UIDVALIDITY` of the mailbox, to
specify the `UID`, `SECTION` and `PARTIAL` octets of the message to fetch and
to specify what messages to search for.

A top level folder list:

    imap://user:password@mail.example.com

A folder list on the user's inbox:

    imap://user:password@mail.example.com/INBOX

Select the user's inbox and fetch message with `uid = 1`:

    imap://user:password@mail.example.com/INBOX/;UID=1

Select the user's inbox and fetch the first message in the mail box:

    imap://user:password@mail.example.com/INBOX/;MAILINDEX=1

Select the user's inbox, check the `UIDVALIDITY` of the mailbox is 50 and
fetch message 2 if it is:

    imap://user:password@mail.example.com/INBOX;UIDVALIDITY=50/;UID=2

Select the user's inbox and fetch the text portion of message 3:

    imap://user:password@mail.example.com/INBOX/;UID=3/;SECTION=TEXT

Select the user's inbox and fetch the first 1024 octets of message 4:

    imap://user:password@mail.example.com/INBOX/;UID=4/;PARTIAL=0.1024

Select the user's inbox and check for NEW messages:

    imap://user:password@mail.example.com/INBOX?NEW

Select the user's inbox and search for messages containing "shadows" in the
subject line:

    imap://user:password@mail.example.com/INBOX?SUBJECT%20shadows

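The URL shapes shown above can be assembled programmatically. The following
builder is a minimal sketch: its name and parameters are invented for this
example, it only covers the forms listed above, and it deliberately leaves
out the userinfo part so that credentials can be supplied separately.

```python
from typing import Optional
from urllib.parse import quote


def imap_url(host: str, mailbox: str = "", *,
             uidvalidity: Optional[int] = None,
             uid: Optional[int] = None,
             mailindex: Optional[int] = None,
             section: Optional[str] = None,
             partial: Optional[str] = None,
             search: Optional[str] = None) -> str:
    """Assemble an IMAP URL from the pieces described in this section."""
    url = f"imap://{host}"
    if mailbox:
        url += "/" + quote(mailbox)
    if uidvalidity is not None:
        url += f";UIDVALIDITY={uidvalidity}"   # attaches to the mailbox
    if uid is not None:
        url += f"/;UID={uid}"
    elif mailindex is not None:
        url += f"/;MAILINDEX={mailindex}"
    if section is not None:
        url += f"/;SECTION={quote(section, safe='')}"
    if partial is not None:
        url += f"/;PARTIAL={partial}"
    if search is not None:
        url += "?" + quote(search, safe="")    # spaces become %20
    return url
```

For instance, `imap_url("mail.example.com", "INBOX", search="SUBJECT shadows")`
produces the subject search URL shown above, minus the credentials.
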
Searching via the query part of the URL `?` is a search request for the
results to be returned as message sequence numbers (`MAILINDEX`). It is
possible to make a search request for results to be returned as unique ID
numbers (`UID`) by using a custom curl request via `-X`. `UID` numbers are
unique per session (and multiple sessions when `UIDVALIDITY` is the same). For
example, if you are searching for `"foo bar"` in header+body (`TEXT`) and you
want the matching `MAILINDEX` numbers returned then you could search via URL:

    imap://user:password@mail.example.com/INBOX?TEXT%20%22foo%20bar%22

If you want matching `UID` numbers you have to use a custom request:

    imap://user:password@mail.example.com/INBOX -X "UID SEARCH TEXT \"foo bar\""

For more information about IMAP commands please see RFC 9051. For more
information about the individual components of an IMAP URL please see RFC 5092.

* Note: old curl versions would `FETCH` by message sequence number when `UID`
was specified in the URL. That was a bug fixed in 7.62.0, which added
`MAILINDEX` to `FETCH` by mail sequence number.

## LDAP

The path part of an LDAP request can be used to specify the Distinguished
Name, Attributes, Scope, Filter and Extension for an LDAP search. Each field is
separated by a question mark and when that field is not required an empty
string with the question mark separator should be included.

Search for the `DN` as `My Organization`:

    ldap://ldap.example.com/o=My%20Organization

The same search, but only returning `postalAddress` attributes:

    ldap://ldap.example.com/o=My%20Organization?postalAddress

Search for an empty `DN` and request information about the
`rootDomainNamingContext` attribute for an Active Directory server:

    ldap://ldap.example.com/?rootDomainNamingContext

For more information about the individual components of an LDAP URL please
see [RFC 4516](https://datatracker.ietf.org/doc/html/rfc4516).

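The field layout, including the empty placeholders for skipped fields, can be
sketched with a small builder. The function and parameter names are
hypothetical, and the set of characters left unescaped is a choice made for
this example.

```python
from typing import Sequence
from urllib.parse import quote


def ldap_url(host: str, dn: str = "", attributes: Sequence[str] = (),
             scope: str = "", filter_: str = "", extensions: str = "") -> str:
    """Build an LDAP URL: DN, then ?attributes?scope?filter?extensions."""
    fields = [",".join(attributes), scope, filter_, extensions]
    # Trailing empty fields can be dropped; empty fields in the middle must
    # stay so that the later fields keep their position.
    while fields and not fields[-1]:
        fields.pop()
    url = f"ldap://{host}/{quote(dn, safe='=,')}"
    for field in fields:
        url += "?" + quote(field, safe="=,()&|!*")
    return url
```

Asking for a scope without listing attributes keeps the empty attribute
field, so the result contains two consecutive question marks.
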
## POP3

The path part of a POP3 request specifies the message ID to retrieve. If the
ID is not specified then a list of waiting messages is returned instead.

## SCP

The path part of an SCP URL specifies the path and file to retrieve or
upload. The file is taken as an absolute path from the root directory on the
server.

To specify a path relative to the user's home directory on the server, prepend
`~/` to the path portion.

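The two path forms can be illustrated with a tiny builder. The function and
parameter names are assumptions of this sketch.

```python
from urllib.parse import quote


def scp_url(host: str, path: str, relative_to_home: bool = False) -> str:
    """Build an SCP URL with an absolute or a home-relative path."""
    cleaned = quote(path.lstrip("/"))
    prefix = "/~/" if relative_to_home else "/"  # "~/" is prepended to the path
    return f"scp://{host}{prefix}{cleaned}"
```
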
## SFTP

The path part of an SFTP URL specifies the file to retrieve or upload. If the
path ends with a slash (`/`) then a directory listing is returned instead of a
file. If the path is omitted entirely then the directory listing for the root
/ home directory is returned.

## SMB

The path part of a SMB request specifies the file to retrieve and from what
share and directory or the share to upload to and as such, may not be omitted.
If the username is embedded in the URL then it must contain the domain name
and as such, the backslash must be URL encoded as %2f.

When uploading to SMB, the size of the file needs to be known ahead of time,
meaning that you cannot upload a file passed to curl over a pipe like stdin.

curl supports SMB version 1 (only).

## SMTP

The path part of a SMTP request specifies the hostname to present during
communication with the mail server. If the path is omitted, then libcurl
attempts to resolve the local computer's hostname. However, this may not
return the fully qualified domain name that is required by some mail servers
and specifying this path allows you to set an alternative name, such as your
machine's fully qualified domain name, which you might have obtained from an
external function such as `gethostname` or `getaddrinfo`.

The default SMTP port is 25. Some servers use port 587 as an alternative.

## RTMP

There is no official URL spec for RTMP so libcurl uses the URL syntax supported
by the underlying librtmp library. It has a syntax where it wants a
traditional URL, followed by a space and a series of space-separated
`name=value` pairs.

While space is not typically a "legal" character, libcurl accepts them. When a
user wants to pass in a `#` (hash) character it is treated as a fragment and
it gets cut off by libcurl if provided literally. You have to escape it by
providing it as backslash and its ASCII value in hexadecimal: `\23`.

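The escaping described above can be sketched as a small helper that builds
the URL plus its `name=value` pairs. The function names are hypothetical and
`playpath` is used only as an illustrative option name. Escaping the space and
the backslash with the same backslash-plus-hex scheme is an assumption of this
example that goes beyond what this document states.

```python
def escape_rtmp_value(value: str) -> str:
    """Escape characters that would otherwise be misread in an option value."""
    # The backslash is handled first so that the escapes added afterwards
    # are not escaped a second time.
    return (value.replace("\\", "\\5c")
                 .replace(" ", "\\20")
                 .replace("#", "\\23"))


def rtmp_url(base: str, options: dict) -> str:
    """Append space-separated name=value pairs to a traditional URL."""
    pairs = " ".join(f"{name}={escape_rtmp_value(str(value))}"
                     for name, value in options.items())
    return f"{base} {pairs}" if pairs else base
```
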