Multiline Regular Expression

Mats Stijlaart

Published: 15 April, 2013

This week I had to help a colleague out with a regular expression (or regex). We came up with an interesting regex but ran into an unexpected Java problem, so I thought I'd share the results with you.


The Regex

The regex had to match all sub-strings in a bigger string that:

A) Started at the beginning of a line with a hash symbol ('#')
B) Contained the letters 'U' and 'C'
C) Ended with a number (optionally with sub-numbering)

Examples of valid matches would be:

#UC1.1
#UC21.12
#uC1.1
#Uc1.1.1
#uc1.1
#UC1

Examples of invalid matches would be:

#UC1.
#UC.12

The first part of the regex should focus on the 'Start at the beginning of a line'. Therefore we use the '^' character to indicate that our regular expression should match the beginning of a line. We also know that the first character must always be '#'. So to match this, we use '^#'.

The letters 'U' and 'C' should also be matched, so we extend our regex to '^#uc'. Because the letters may appear in upper or lower case, the 'case insensitive' option is relevant here. In languages with regex literals, such as JavaScript, Perl and Ruby, you make the regex case insensitive by writing '/^#uc/i'. The 'i' option indicates the case insensitivity.

The last part of the regex focuses on the number and the optional sub-numbering. First we need to match one or more digits in the range 0 to 9. This is easily done using '[0-9]+'. The '+' indicates one or more of the preceding element. The preceding element is the character class in square brackets, which matches any single digit from zero to nine.

This doesn't allow for the sub-numbering, however. The sub-numbering consists of a dot followed by one or more digits, and this combination can be repeated zero or more times. In '1.11.12' the '.11' and the '.12' are the sub-numbering parts. We built up the regex for this part in the following steps:

Regex          Explanation
\.             We match a literal dot. An unescaped '.' matches any character, so we have to escape it with a backslash.
\.[0-9]        After the dot we match a single digit.
\.[0-9]+       We require one or more digits after the dot.
(\.[0-9]+)*    We wrap this part in a group and repeat that group zero or more times.
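
To see why the escape matters, here is a small sketch in Java (the language used later in this post). The class name DotEscapeDemo and the helper matchesWhole are made up for this illustration. It only looks at the numbering part of the regex and compares an unescaped dot with an escaped one.

```java
import java.util.regex.Pattern;

class DotEscapeDemo {
    // Unescaped dot: '.' matches any character (line terminators excluded by default).
    static final Pattern UNESCAPED = Pattern.compile("[0-9]+(.[0-9]+)*");

    // Escaped dot: "\\." in a Java string literal is the regex \. (a literal dot).
    static final Pattern ESCAPED = Pattern.compile("[0-9]+(\\.[0-9]+)*");

    // Hypothetical helper: true only if the pattern matches the whole input.
    static boolean matchesWhole(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(UNESCAPED, "1.11.12")); // true
        System.out.println(matchesWhole(UNESCAPED, "1x11"));    // true, 'x' is accepted as the "dot"
        System.out.println(matchesWhole(ESCAPED, "1.11.12"));   // true
        System.out.println(matchesWhole(ESCAPED, "1x11"));      // false
    }
}
```

With the unescaped dot, a string such as '1x11' slips through, which is not what we want for sub-numbering.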

Result

When we combine all this, the resulting regex is the following:

/^#uc[0-9]+(\.[0-9]+)*/i
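
As a quick sanity check, the example strings from the start of this post can be run against the combined regex. The sketch below uses Java's java.util.regex package. The class name ExampleCheck and the method isValid are invented for this illustration, and it tests each string as a whole, which is an assumption about how you would want to validate a single candidate.

```java
import java.util.regex.Pattern;

class ExampleCheck {
    // The combined regex; CASE_INSENSITIVE plays the role of the 'i' option.
    static final Pattern UC =
            Pattern.compile("^#uc[0-9]+(\\.[0-9]+)*", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true only when the whole candidate is matched.
    static boolean isValid(String candidate) {
        return UC.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {
            "#UC1.1", "#UC21.12", "#uC1.1", "#Uc1.1.1", "#uc1.1", "#UC1", // expected valid
            "#UC1.", "#UC.12"                                             // expected invalid
        };
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + isValid(candidate));
        }
    }
}
```

The first six candidates print true and the last two print false. Note that this checks the whole string: when searching inside a larger text, the regex would still find '#UC1' at the start of '#UC1.' and simply leave the trailing dot out of the match.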

Embedded in Java

To embed our regex in Java we use the following code:

String text = ...
Pattern pattern = Pattern.compile("^#uc[0-9]+(\\.[0-9]+)*",
Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(text);

First you compile the regex into a Pattern, passing any options as flags. The 'Pattern.CASE_INSENSITIVE' value is the equivalent of the 'i' option.
The matcher object is able to determine all matching strings in the text string.
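
Collecting the matches from a Matcher usually comes down to a find() loop. The following is a minimal sketch of what such a method could look like, not the actual code we wrote: the class name PatternExtractor is an assumption, and the method name getMatches is borrowed from the unit tests further down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternExtractor {
    // Same pattern and flag as in the snippet above.
    private static final Pattern PATTERN =
            Pattern.compile("^#uc[0-9]+(\\.[0-9]+)*", Pattern.CASE_INSENSITIVE);

    // Returns every substring of text that the pattern matches, in order.
    List<String> getMatches(String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(text);
        // find() advances to the next match; group() returns the matched text.
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(new PatternExtractor().getMatches("#uc1.1")); // prints [#uc1.1]
    }
}
```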

The above code was used in a Java method that returns a list of matching strings. To verify that the method worked, my colleague wrote some unit tests.

// A single lower-case match.
public void test1() throws Exception {
    List<String> result = patternExtractor.getMatches("#uc1.1");
    assertThat(result.size(), is(1));
    assertThat(result.get(0), is("#uc1.1"));
}

// A single match without sub-numbering.
public void test2() throws Exception {
    List<String> result = patternExtractor.getMatches("#UC2");
    assertThat(result.size(), is(1));
    assertThat(result.get(0), is("#UC2"));
}

// Two matches, each on its own line.
public void test3() throws Exception {
    List<String> result = patternExtractor.getMatches("#UC1.1\n#UC2.2");
    assertThat(result.size(), is(2));
    assertThat(result.get(0), is("#UC1.1"));
    assertThat(result.get(1), is("#UC2.2"));
}

We expected all tests to pass, but they didn't: the third test failed. It expects two matches, each on its own line, but the method returned only one result ('#UC1.1'). We found this strange, since the regular expression states that a match starts at the beginning of a line and '#UC2.2' starts on the next line.

After a bit of Googling we found out that the Java Pattern class has another flag, 'Pattern.MULTILINE'. Apparently the default behaviour of '^' and '$' in Java regular expressions is to only match the beginning and end of the entire input. Passing the 'Pattern.MULTILINE' flag changes this default to the behaviour we expected: matching the beginning and end of every line.

Our resulting Java code was:

Pattern pattern = Pattern.compile("^#uc[0-9]+(\\.[0-9]+)*", 
Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);

We combined the two flags with the bitwise OR operator ('|').
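
To make the difference visible, here is a small sketch that runs the same regex over the two-line input from the third test, once without and once with 'Pattern.MULTILINE'. The class name MultilineDemo and the helper findAll are made up for this example. It also shows the embedded flag syntax '(?im)', which Java accepts as an alternative to passing flag constants.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    static final String REGEX = "^#uc[0-9]+(\\.[0-9]+)*";

    // Hypothetical helper: collects every match of pattern in text.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "#UC1.1\n#UC2.2";

        // Without MULTILINE, '^' only matches at the start of the input.
        Pattern singleLine = Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE);
        // With MULTILINE, '^' also matches right after each line terminator.
        Pattern multiLine =
                Pattern.compile(REGEX, Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
        // Embedded flags: (?i) is CASE_INSENSITIVE, (?m) is MULTILINE.
        Pattern embedded = Pattern.compile("(?im)" + REGEX);

        System.out.println(findAll(singleLine, text)); // prints [#UC1.1]
        System.out.println(findAll(multiLine, text));  // prints [#UC1.1, #UC2.2]
        System.out.println(findAll(embedded, text));   // prints [#UC1.1, #UC2.2]
    }
}
```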
