URI and Paths in Java - A painful lesson

Not too long ago, a path-handling bug came across my desk (i.e., a JIRA ticket was assigned to me). I could have fixed that specific incarnation of the bug with a couple of lines of code. Instead, I ended up changing almost 2000 lines of code across 96 files.

URIs are complicated, as can be inferred from the 50+ page RFC. The relevant Java standard library APIs for working with files, paths, URLs and URIs are also complicated, frequently just misdesigned, and home to gnarly (WONTFIX) bugs. Fixing this bug properly involved scouring the specs and the APIs, and digging through obscure blog posts and Stack Overflow answers along with the bug reports they mention in passing. AI didn't help, believe it or not.

Join me down the rabbit hole as I justify why you shouldn’t be manipulating anything that might resemble a path or a URL using strings and show you how to use the relevant Java standard library APIs correctly.

What's Wrong?

In our stack we have a CLI tool in Java (don't ask) with parameters that accept both paths and URIs (that is, once the JVM finally starts up, of course). The particular bug I was working on involved a UNC path that lost the UNC server name:

$ clitool '\\myserver\data\unload\dropme'
[INFO] Using output locations: [file://myserver/data/unload/dropme]
[ERROR] Failed: Cannot create path to: /data/unload/dropme; Cannot create path to: /data/unload/dropme

In this case, the UNC path was converted to a URI internally, but then the UNC server name seemingly just disappeared...

The main concern when reading the code was that, across the stack, there were various conversions between the String, URI, Path and File classes, all of which have gnarly behaviours and gotchas. Each conversion risks losing relevant information (such as UNC server names) or mishandling special characters if done incorrectly. Similarly, ad hoc path manipulation and construction on strings is difficult to get right in a cross-platform way (let alone one that also works with “remote” URIs). Hence, both should be left to the relevant methods provided by URI/Path/File as appropriate.

URI, URL, UNC - What's the Difference?

Now, I’ve thrown a few abbreviations around, such as UNC (Universal Naming Convention), URI (Uniform Resource Identifier) and URL (Uniform Resource Locator), and although they contain words such as “Uniform” and “Universal”, these acronyms are anything but (in terms of implementation). However, going by the specifications, both URLs and UNC paths can be treated as subsets of URIs. That partly feeds into why it is recommended to use java.net.URI instead of java.net.URL, although that isn’t the only reason. It turns out that both URL.equals() and URL.hashCode() perform DNS name resolution; as the saying goes, trust but verify! Not to worry, depending on your version of Java, you might be caching DNS entries forever...
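To make the practical difference concrete, here is a minimal sketch of using java.net.URI as a set element. URI compares and hashes its parsed components only, so nothing here touches the network. The class name UriKeys and the example.com addresses are made up for illustration.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: de-duplicates locations using URI rather than URL.
class UriKeys {
    // URI.equals()/hashCode() work on the parsed components, so no DNS lookup happens.
    static Set<URI> distinct(String... locations) {
        Set<URI> seen = new HashSet<>();
        for (String location : locations) {
            seen.add(URI.create(location));
        }
        return seen;
    }

    public static void main(String[] args) {
        // Scheme and host are compared without regard to case; the path is case-sensitive.
        Set<URI> uris = distinct(
                "http://example.com/a",
                "HTTP://EXAMPLE.com/a",
                "http://example.com/A");
        System.out.println(uris.size()); // prints 2
    }
}
```

The same loop over java.net.URL objects would compile just as happily, which is exactly why the blocking behaviour is so easy to miss.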

It's a Bird... It's a Plane... It's a URI

As the command line tool accepts both URIs and paths, can we simply pass the string directly to the URI constructor? Well, not exactly. If you, for example, pass C:\Program Files along to the constructor, it will raise a URISyntaxException. The space is one problem: it must be percent-encoded as %20 to have any chance of parsing. However, your troubles don't end there; the Windows path uses \ as a path separator whilst URIs (and Linux and everything else...) use /. Hence, the parser gets confused: it takes C as the scheme and thinks it's parsing an opaque (we'll get onto that concept a bit later) URI:

Illegal character in opaque part at index 2: C:\Program%20Files

The URI representation of this Windows path is file:///C:/Program%20Files, but we can't expect users to convert local paths into valid URIs (not to mention that requiring it would break backwards compatibility).
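If you want to see this for yourself, the following sketch feeds the strings discussed above to the URI constructor and reports where parsing gives up. The class and method names (RawPathAsUri, failureIndex) are my own, not part of any API.

```java
import java.net.URI;
import java.net.URISyntaxException;

class RawPathAsUri {
    // Returns the index at which URI parsing failed, or -1 if the string parsed.
    static int failureIndex(String input) {
        try {
            new URI(input);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        // The raw Windows path is rejected outright.
        System.out.println(failureIndex("C:\\Program Files") >= 0);   // true
        // Encoding the space is not enough: the backslash at index 2 is still illegal.
        System.out.println(failureIndex("C:\\Program%20Files"));      // 2
        // The proper file URI parses, and getPath() decodes the %20 again.
        URI ok = URI.create("file:///C:/Program%20Files");
        System.out.println(ok.getPath());                             // /C:/Program Files
    }
}
```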

So, with an API that allows both paths and URIs, which constructor should we pass the string to? Simply put, URIs should be passed to the URI constructor and paths to either the Path.of() static method or the File constructor. To decide between them we use a simple heuristic: because we only support a limited subset of URI schemes, we can match the start of the string against the regex ^[a-zA-Z][a-zA-Z0-9+.-]*:// (note that the hyphen goes last in the character class; written as +-. it would be a range that also matches a comma). If it matches, we try to parse it as a URI, otherwise we assume it's a path.
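As a rough sketch, the heuristic boils down to a precompiled pattern and a single check. The names LocationKind and looksLikeUri are hypothetical, and the character class is written with the hyphen last so that it is a literal rather than a range.

```java
import java.util.regex.Pattern;

class LocationKind {
    // Scheme syntax (letter, then letters/digits/+/./-) followed by "://".
    private static final Pattern URI_PREFIX =
            Pattern.compile("^[a-zA-Z][a-zA-Z0-9+.-]*://");

    // True if the value should be handed to the URI constructor, false if it is treated as a path.
    static boolean looksLikeUri(String value) {
        return URI_PREFIX.matcher(value).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeUri("file:///C:/Program%20Files"));            // true
        System.out.println(looksLikeUri("C:\\Program Files"));                     // false
        System.out.println(looksLikeUri("\\\\myserver\\data\\unload\\dropme"));    // false
    }
}
```

Requiring the "://" is what keeps a Windows drive letter such as C: from being mistaken for a scheme. It also means schemes without a double slash are treated as paths, which is only acceptable because the set of supported schemes is limited.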

As paths are a subset of URIs, we can convert the path into a URI as follows:

URI uri = Path.of(val).toAbsolutePath().toUri();

You'll note two things from the above snippet: firstly, we convert the Path to an absolute path (we accept both relative and absolute paths in our API), and secondly, we call Path.of() instead of the File constructor. The reason we convert the path to an absolute path is simply that it is more readable in the logs.
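Wrapped in a method, the conversion looks something like the sketch below. The class name PathToUri and the sample path are invented for the example; the resulting URI depends on your working directory, so no exact output is shown.

```java
import java.net.URI;
import java.nio.file.Path;

class PathToUri {
    // Relative or absolute path string in, absolute file: URI out.
    static URI toUri(String pathString) {
        return Path.of(pathString).toAbsolutePath().toUri();
    }

    public static void main(String[] args) {
        URI uri = toUri("my reports/out.csv");
        System.out.println(uri);              // a file: URI with the space encoded as %20
        System.out.println(uri.getScheme());  // file
        System.out.println(uri.getPath());    // decoded path, space restored
    }
}
```

Note that the percent-encoding is done for you, which is one less thing to get wrong with string concatenation.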

Why do we use the Path class over the File class; are they interchangeable? Indeed, conversion functions exist between all three classes, but there are subtleties that mean Path is preferable. The main reason is given in this note from the Javadoc for File.toURI(), which was added as the result of a bug report:

Note that when this abstract pathname represents a UNC pathname then all components of the UNC (including the server name component) are encoded in the URI path. The authority component is undefined, meaning that it is represented as null. The Path class defines the toUri method to encode the server name in the authority component of the resulting URI. The toPath method may be used to obtain a Path representing this abstract pathname.

What this is politely saying is: definitely don't use this method. Jokes (I'm not joking) aside, the most important point is that File.toURI() puts the UNC server name in the path component of the URI whereas Path.toUri() puts it in the authority component. The former can lead to bugs when using other URI methods such as URI.normalize(), and those bugs are not going to be fixed.
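Since UNC paths only exist on Windows, the sketch below sidesteps the platform question by parsing the two URI shapes directly. The first string is the URI from the log output earlier; the second is my assumption of what the Javadoc describes (server name inside the path, authority undefined). The class name UncUriShapes is made up.

```java
import java.net.URI;

class UncUriShapes {
    // Shape produced via Path.toUri(): server name in the authority component.
    static final URI FROM_PATH = URI.create("file://myserver/data/unload/dropme");
    // Assumed shape for File.toURI(): server name encoded in the path, no authority.
    static final URI FROM_FILE = URI.create("file:////myserver/data/unload/dropme");

    public static void main(String[] args) {
        System.out.println(FROM_PATH.getAuthority()); // myserver
        System.out.println(FROM_PATH.getPath());      // /data/unload/dropme
        System.out.println(FROM_FILE.getAuthority()); // null
        System.out.println(FROM_FILE.getPath());      // //myserver/data/unload/dropme
    }
}
```

It also hints at one way a server name can go missing: code that reads only getPath() from the first URI ends up with /data/unload/dropme, the same shape as the path in the error message at the top of this post.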

So for simplicity, for paths we always use Path.of() followed by Path.toUri() to convert to a URI, and we always convert back to a path with Path.of(uri) as well. Only then do we convert to a File (via Path.of(uri).toFile()) if we need to (say, we actually want to open the file). Note that we can't directly convert to a File from a URI constructed by Path.toUri(), because the File(URI) constructor explicitly expects the authority component to be undefined, which of course will not be the case for UNC paths.
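A minimal sketch of that round trip, plus a check showing the File(URI) constructor refusing a URI with an authority. RoundTrip and its method names are hypothetical; the myserver URI is the one from the log output above.

```java
import java.io.File;
import java.net.URI;
import java.nio.file.Path;

class RoundTrip {
    // URI -> Path -> File, never going through File(URI).
    static File toFile(URI fileUri) {
        return Path.of(fileUri).toFile();
    }

    // True if the File(URI) constructor refuses the given URI.
    static boolean fileConstructorRejects(URI uri) {
        try {
            new File(uri);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Path original = Path.of("data", "out.txt").toAbsolutePath();
        File file = toFile(original.toUri());
        System.out.println(file.toPath().equals(original)); // true

        URI unc = URI.create("file://myserver/data/unload/dropme");
        System.out.println(fileConstructorRejects(unc));    // true
    }
}
```

The rejection happens on every platform, because File(URI) checks for an authority before it ever looks at the file system.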

Through the Looking-Glass

In our code, we explicitly enforce that all URIs must be absolute and hierarchical. What does that mean and why would we ever enforce it? A URI can either be absolute or relative according to the following definition from the java.net.URI javadoc:

An absolute URI specifies a scheme; a URI that is not absolute is said to be relative.

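In other words, anything with a scheme is absolute: http and file URLs, but also mailto and about URIs. A bare reference with no scheme, such as docs/guide.html, is relative. From this definition, and because our heuristic only treats strings that start with a scheme as URIs, we know that every valid URI we accept will be absolute, so we can assert that.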

Absolute URIs are either hierarchical or opaque: hierarchical if the scheme-specific part (everything after the scheme and its colon) begins with a slash character, opaque otherwise. In our case all URIs we accept are hierarchical and so we can assert on that too.
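The two assertions can be sketched as a small guard method. UriChecks and requireAbsoluteHierarchical are names I've invented here; isAbsolute() and isOpaque() are the real java.net.URI methods. The sample inputs are illustrative only.

```java
import java.net.URI;

class UriChecks {
    // Rejects relative URIs (no scheme) and opaque URIs (scheme-specific part without a leading slash).
    static URI requireAbsoluteHierarchical(URI uri) {
        if (!uri.isAbsolute()) {
            throw new IllegalArgumentException("URI has no scheme: " + uri);
        }
        if (uri.isOpaque()) {
            throw new IllegalArgumentException("URI is opaque: " + uri);
        }
        return uri;
    }

    public static void main(String[] args) {
        // Accepted: has a scheme, and the scheme-specific part starts with a slash.
        System.out.println(requireAbsoluteHierarchical(URI.create("file:///data/unload")));
        // "docs/guide.html" (no scheme) and "news:comp.lang.java" (opaque) would both throw.
    }
}
```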

Why do we care to classify and enforce these syntactic constraints on the URIs that we accept? Firstly, this gives us a quick validation at the syntactic level which is quite "cheap". Secondly, as you will realise if you ever browse the java.net.URI documentation, the behaviour of its methods heavily depends on these properties of the URI, so knowing them makes it a lot easier to parse the documentation and to understand how certain methods will behave.

That brings us naturally to an interesting, non-obvious behaviour of URI.resolve(). One would think, at first guess, that it would behave similarly to Path.resolve(), but this is not the case at all. The difference is well summarised in this blog. The TL;DR of that blog is that, to get behaviour similar to Path.resolve(), you need to ensure that the URI has a trailing slash:

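// Appends "/" so that URI.resolve() treats the last segment as a directory.
// Assumes the URI has no query or fragment, since the slash is appended to the whole string.
private static URI ensureTrailingSlash(URI uri) {
    return (uri != null && !uri.toString().endsWith("/")) ? URI.create(uri.toString() + "/") : uri;
}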
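Here is a small side-by-side sketch of the difference, using a made-up base location. ResolveDemo is a hypothetical class, and it carries its own copy of the helper so that it runs on its own. I compare getPath() rather than the full string to keep the focus on the path segments.

```java
import java.net.URI;
import java.nio.file.Path;

class ResolveDemo {
    // Same idea as the helper above; assumes no query or fragment on the URI.
    static URI ensureTrailingSlash(URI uri) {
        return (uri != null && !uri.toString().endsWith("/")) ? URI.create(uri.toString() + "/") : uri;
    }

    public static void main(String[] args) {
        URI base = URI.create("file:///data/unload");

        // Without a trailing slash the last segment is replaced, as with a relative link on a web page.
        System.out.println(base.resolve("dropme").getPath());                      // /data/dropme

        // With a trailing slash the child is appended.
        System.out.println(ensureTrailingSlash(base).resolve("dropme").getPath()); // /data/unload/dropme

        // Path.resolve() always appends; the trailing-slash distinction does not exist here.
        System.out.println(Path.of("/data/unload").resolve("dropme"));
    }
}
```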

URI, URL, UR Right

Hopefully by this point you're armed with enough knowledge to wield these APIs correctly, and you know what to do and what not to do.

In particular, you'll know to avoid java.net.URL in general and to reach for java.net.URI. You'll know that converting paths to URIs can be lossy, so you should always prefer converting to and from Path instead of File if you want your UNC server names to stay intact. You'll know that URI.resolve() has a complicated algorithm for something that you thought would be simple. Most importantly, you'll know that you're guaranteed to forget all these nuances almost immediately and so you should bookmark this page and stay tuned for more posts.