Normalizing Text in Java

Once in a while I see misguided attempts at normalizing text to make it suitable for use in URLs, file names, or other situations where a plain ASCII representation is desired. This looks tricky, but with Java's excellent Unicode support and some background knowledge it is pretty easy to implement. At least if your input uses the Latin alphabet; otherwise you're out of luck.

Let's first have a look at a surprising special case. Depending on where you got your data from, even innocent-looking code using String.replace() does not work reliably:

value = value.replace("ö", "o");

You may encounter cases where you still have an "ö" in the resulting string. The reason: in Unicode there are two representations for some characters. The German "ö", for example, can be represented as the single code point 246 (U+00F6, "LATIN SMALL LETTER O WITH DIAERESIS") or using a so-called combining character. In that case, "ö" is represented by the good old character "o" (code point 111, U+006F, "LATIN SMALL LETTER O") followed by a combining character (code point 776, U+0308, "COMBINING DIAERESIS"). The two forms look identical on screen but are different strings, so a replace() written against one form does not match the other. By the way, this is one of the reasons why sorting Unicode strings correctly is more difficult than it looks at first glance, even when ignoring locale-dependent collation rules.
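To make the difference visible, here is a small sketch that spells both forms with Unicode escapes, so the source file's encoding doesn't matter. The class name DiaeresisForms and the helper naiveReplace are made up for this illustration; the only library call is String.replace().

```java
final class DiaeresisForms {
    // Precomposed form: U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
    static final String COMPOSED = "\u00F6";
    // Decomposed form: U+006F LATIN SMALL LETTER O + U+0308 COMBINING DIAERESIS
    static final String DECOMPOSED = "o\u0308";

    // Same idea as the replace() call above, targeting the precomposed form only.
    static String naiveReplace(String value) {
        return value.replace(COMPOSED, "o");
    }

    public static void main(String[] args) {
        System.out.println(COMPOSED.equals(DECOMPOSED)); // false
        System.out.println(COMPOSED.length());           // 1
        System.out.println(DECOMPOSED.length());         // 2

        // The precomposed input is handled as expected ...
        System.out.println(naiveReplace("sch" + COMPOSED + "n")); // schon
        // ... but the decomposed input passes through untouched (still 6 chars).
        System.out.println(naiveReplace("sch" + DECOMPOSED + "n").length()); // 6
    }
}
```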

If our text uses this "combining" representation, we can just use a whitelist to pick the characters we want, ignoring all those pesky decorations. The question is, how can we transcode our text into this representation?

In Unicode terms, the representation we want is "Normalization Form D" (NFD, "Canonical Decomposition"). Converting a string to it is easy in Java using the java.text.Normalizer class:

value = Normalizer.normalize(value, Normalizer.Form.NFD);
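As a quick check that NFD really brings both representations of "ö" to the same string, here is a minimal sketch. NfdDemo and decompose are names I picked for the example; Normalizer.normalize() and Normalizer.isNormalized() are the actual java.text.Normalizer methods.

```java
import java.text.Normalizer;

final class NfdDemo {
    // Converts any input to Normalization Form D (canonical decomposition).
    static String decompose(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00F6";    // single code point U+00F6
        String decomposed = "o\u0308"; // "o" followed by U+0308

        // The precomposed form is not in NFD yet, the decomposed one already is.
        System.out.println(Normalizer.isNormalized(composed, Normalizer.Form.NFD));   // false
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFD)); // true

        // After normalizing, both inputs are the same two-char string.
        System.out.println(decompose(composed).equals(decompose(decomposed))); // true
        System.out.println(decompose(composed).length());                      // 2
    }
}
```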

After normalizing we can apply a filter, for example one that drops all non-ASCII characters using a clever little regex:

value = value.replaceAll("[^\\p{ASCII}]", "");
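Putting the two steps together might look like the sketch below (AsciiFilter and toAscii are hypothetical names). It also shows a limitation worth knowing: characters that have no canonical decomposition, such as the German "ß" (U+00DF), are not turned into an ASCII look-alike by NFD, so the filter simply drops them.

```java
import java.text.Normalizer;

final class AsciiFilter {
    // Decomposes to NFD, then removes everything outside the ASCII range.
    static String toAscii(String value) {
        String decomposed = Normalizer.normalize(value, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // Accented letters lose their combining marks and keep the base letter.
        System.out.println(toAscii("Cr\u00E8me br\u00FBl\u00E9e")); // Creme brulee
        // U+00DF has no decomposition, so it disappears entirely.
        System.out.println(toAscii("Stra\u00DFe"));                  // Strae
    }
}
```

If losing such characters is not acceptable, one option is to replace them explicitly (for example "ß" with "ss") before applying the filter.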

Alternatively, the following code replaces each run of whitespace characters with a single "-" and drops everything that isn't a hyphen, a digit, or an unaccented Latin letter (useful for URLs):

value = value.replaceAll("\\s+", "-")
             .replaceAll("[^-a-zA-Z0-9]", "");
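For completeness, here is how the whole thing could be wrapped into one helper. This is only a sketch: TextSlugs and toSlug are names invented for the example, and the regexes are exactly the ones shown above.

```java
import java.text.Normalizer;

final class TextSlugs {
    // NFD-normalizes the input, turns whitespace runs into "-",
    // and keeps only hyphens, ASCII letters, and digits.
    static String toSlug(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFD)
                .replaceAll("\\s+", "-")
                .replaceAll("[^-a-zA-Z0-9]", "");
    }

    public static void main(String[] args) {
        // Two spaces collapse into one hyphen; U+00DF and "!" are dropped.
        System.out.println(toSlug("Sch\u00F6ne  Gr\u00FC\u00DFe!")); // Schone-Grue
        System.out.println(toSlug("Hello World 42"));                 // Hello-World-42
    }
}
```

Note that leading or trailing whitespace turns into a leading or trailing "-", so you may want to call trim() on the input first.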

For more information on how Unicode works in Java, check Oracle's Working With Text tutorial.