Normalizing Text in Java

Once in a while I see misguided attempts at normalizing text to make it suitable for use in URLs, file names, or other situations where a plain ASCII representation is desired. This can be tricky but with Java’s excellent Unicode support and some background knowledge it is pretty easy to implement. At least if your input uses the Latin alphabet – otherwise you’re out of luck.

Let’s first have a look at a surprising special case. Depending on where you got your data from, even innocent-looking code using String.replace() does not work reliably:

value = value.replace("ö", "o");

You may encounter cases where you still have an “ö” in the resulting string. The reason: in Unicode there are two representations for some characters. The German “ö”, for example, can be represented as the single Unicode code point 246 (U+00F6, “LATIN SMALL LETTER O WITH DIAERESIS”) or using a so-called combining character. In the latter case, “ö” is represented by the good old character “o” (code point 111, U+006F, “LATIN SMALL LETTER O”) followed by a combining character (code point 776, U+0308, “COMBINING DIAERESIS”). By the way, this is one of the reasons why sorting Unicode strings correctly is more difficult than it looks at first glance, even when ignoring locale-dependent collation rules.
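To make the two representations concrete, here is a small sketch that spells both forms out with explicit escapes, so the result does not depend on how the source file is encoded. The class and method names (CombiningDemo, naiveReplace) are made up for illustration.

```java
public class CombiningDemo {
    // Precomposed form: a single code point, U+00F6 (decimal 246).
    static final String PRECOMPOSED = "\u00F6";
    // Decomposed form: 'o' (decimal 111) followed by the combining diaeresis (decimal 776).
    static final String DECOMPOSED = "o\u0308";

    /** The naive replacement, with the search string written as the precomposed code point. */
    static String naiveReplace(String value) {
        return value.replace("\u00F6", "o");
    }

    public static void main(String[] args) {
        // Both strings render as the same letter, but they are different char sequences.
        System.out.println(PRECOMPOSED.equals(DECOMPOSED));
        // The precomposed input is replaced: one char remains.
        System.out.println(naiveReplace(PRECOMPOSED).length());
        // The decomposed input is left untouched: still two chars.
        System.out.println(naiveReplace(DECOMPOSED).length());
    }
}
```

The replacement only matches the representation that happens to be in the search string, which is why the result can still contain an “ö”.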

If our text uses this “combining” representation, we can just use a whitelist to pick the characters we want, ignoring all those pesky decorations. The question is, how can we transcode our text into this representation?

In Unicode terms, the representation we want is “Normalization Form D” (NFD, “Canonical Decomposition”). This is easily done in Java using the java.text.Normalizer class:

value = Normalizer.normalize(value, Normalizer.Form.NFD);
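As a quick check of what the normalizer does, the following sketch decomposes a precomposed “ö” and prints the resulting code points. The class name NfdDemo and the helper decompose are illustrative only; the Normalizer call is the one shown above.

```java
import java.text.Normalizer;

public class NfdDemo {
    /** Returns the canonical decomposition (NFD) of the given text. */
    static String decompose(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String decomposed = decompose("\u00F6");
        // Print every code point in hex; printing the string itself would
        // just show the letter again, because a console renders both forms alike.
        decomposed.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Text that is already decomposed passes through unchanged, so it should be safe to normalize input whose form you don’t know.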

After normalizing we can apply a filter, for example one that drops all non-ASCII characters using a clever little regex:

value = value.replaceAll("[^\\p{ASCII}]", "");
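Putting normalization and the ASCII filter together might look like the sketch below. The names AsciiFold and toAscii are hypothetical. Note one caveat: letters without a canonical decomposition, such as “ß” or “ø”, are not turned into a base letter plus decoration, so this filter simply drops them.

```java
import java.text.Normalizer;

public class AsciiFold {
    /** Decomposes the text, then drops every character outside the ASCII range. */
    static String toAscii(String value) {
        String decomposed = Normalizer.normalize(value, Normalizer.Form.NFD);
        // After NFD the accents are separate combining characters, all non-ASCII.
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // Accented letters keep their base letter.
        System.out.println(toAscii("Cr\u00E8me br\u00FBl\u00E9e"));
        // The sharp s has no decomposition and disappears entirely.
        System.out.println(toAscii("Stra\u00DFe"));
    }
}
```

If such letters matter for your data, you would probably want to replace them explicitly (for example “ß” with “ss”) before filtering.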

Alternatively, the following code replaces each run of whitespace characters with a single “-” and drops everything that isn’t an ASCII letter, a digit, or a hyphen (useful for URLs):

value = value.replaceAll("\\s+", "-")
             .replaceAll("[^-a-zA-Z0-9]", "");
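For URLs, the whole pipeline could be wrapped in one helper, roughly like this. This is a minimal sketch; the names SlugDemo and toSlug are invented here, and the method only chains the NFD normalization with the two replaceAll calls from above.

```java
import java.text.Normalizer;

public class SlugDemo {
    /** Turns free text into a URL-friendly string of ASCII letters, digits and hyphens. */
    static String toSlug(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFD)
                .replaceAll("\\s+", "-")            // each whitespace run becomes one hyphen
                .replaceAll("[^-a-zA-Z0-9]", "");   // drop accents, punctuation and the rest
    }

    public static void main(String[] args) {
        System.out.println(toSlug("K\u00F6ln ist sch\u00F6n")); // prints "Koln-ist-schon"
    }
}
```

Be aware that dropped punctuation between two spaces leaves adjacent hyphens behind, and that leading or trailing whitespace turns into a leading or trailing hyphen. Depending on your needs you may want to trim the input first and collapse repeated hyphens afterwards.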

For more information on how Unicode works in Java, check Oracle’s Working With Text tutorial.


2 Responses to Normalizing Text in Java

  1. I might be clueless when it comes to encodings, but if I run this code in my unit test:
    @RunWith(AndroidJUnit4.class)
    public class PlayerNameInputFilterTest {
        @Test
        public void testFilter() {
            String typedOrPastedValue = "ö";
            CharSequence normalizedSource = Normalizer.normalize(typedOrPastedValue, Normalizer.Form.NFD);
            System.out.println(normalizedSource);
        }
    }

    My Android test shows the following in the console:
    03-03 21:11:59.030 20155-20193/? I/System.out: ö

    • Matthias says:

      I don’t have an Android environment to run this on, but what your code does is the following: It takes a string that may or may not be in normalized form and normalizes it. When inspecting the result using normalizedSource.toString().toCharArray(), you should see two characters in that array. My guess is that your console simply renders the two characters as one letter. Does it work for you in a non-Android environment?
