Pages

Tuesday 6 May 2014

Regular expressions in java

Hello!

I've been thinking what to blog about next time. I chose to investigate regular expressions. This seems to be very useful because they are frequently needed and can block you for a long time. 


Must confess that I have very little knowledge on regex. But as far as this article grows, our knowledge of regex grows.

Regular expression is a pattern describing some text. 
Regular expression engine is software that processes regular expressions by matching it to provided String.
The most basic regex is "a". It matches only the first occurrence of letter a.
Regex like "cat" matches c followed by a followed by t
But such regexes allow to match literal string only. What if we want to match characters that belong to some group(e.g. digits, spaces, uppercase letters)?
In such case we use special characters(called metacharacters). Let's see 12 metacharacters and their special meanings in regexes:
\ - to escape special meaning of metacharacters in regex.
^ - matches start of line or new-line character(\n).
$ - matches the end of line or new-line character(\n).
. - matches any character except new-line(\n).
| - matches as logical OR.
? - matches 0 or 1 occurrence of previous token.
* - matches 0 or any number of occurrences of previous token.
+ - matches 1 or more occurrences of previous token.
- indicates start of group.
- indicates end of group.
- indicates start of character class.
- indicates start of repetition operator.

For instance, to match string 1+1=2 the regex should be "1\+1=2". Note that we use \ to escape special meaning of +.

But backslash \ is also used to create shorthand character classes like \d or \w.  The former matches digits and the latter word characters.
The regex "X|Y" matches either X or Y.
Repetition operator indicates minimum or maximum quantity of occurrences.
"a{1,3}" matches string a or aa or aaa

Let's see how regex engine works internally.

Regex engine always returns the leftmost match. When applying a regex to string, the engine starts with the first character in string and tries to match it to all possible permutations. Only if all possibilities failed the engine goes to second character of string.
For example regex :
"cat" 
is applied to string :
He captured a catfish for his cat
Regex engine first tries to match first token of regex c to first letter of string H. This fails. Next engine matches c to e, then to space. Then to c in word captured. This match succeeds, thus engine takes next token of regex a and matches it to next letter a. This match succeeds, therefore engine matches t to p. This match fails. At this point engine knows that regex "cat" cannot be matched starting with forth letter c. That's why engine takes fifth letter a and matches it to regex "cat". And so on untill fifteenth letter c of word catfish. Here engines finds a match. At this point engine stops(!) because the first match is found!

Let's learn what character classes are.
Character class(or character set) allows to match one character out of range. If you want to match one of, you should place the whole range into square brackets like regex :
"gr[ae]y"
Such regex matches both gray and grey, but not graey or gry.
So, the character class matches only single character and the order of characters in class doesn't matter.
You can also use range in character class. The range is set by hyphen.
"[0-9]" matches single digit from 0 to 9.
"[0-9a-fX]" matches single character that can be either digit from 0 to 9 or letter from a to f or uppercase letter X.
With character classes it's easy to find misspelled word like with "sep[ae]r[ae]te".
It is common to use negated character class. For that you need to use ^. For instance :
"[^0-9]" matches any character except digit from 0 to 9. It also matches line-break. In order not to match line-break character and digit, the regex should be:
"[^0-9\r\n]".
Regex "q[^u]" has a match in "Iraq i". Because space is not a u.
There are several metacharacters inside character classes. They are ]\^-.
To use them literally you must escape them by backslash \.
There is no need to escape anything else. To match either star or asterisk, you use
"[+*]".
Hyphen inside character class indicates range only if it's set between characters.
If hyphen is first or last, it is matched literally. Regex "[k-]" matches either k or hyphen. Regex "[^k-]" matches any character that is not k or hyphen.
By using repetition operator after character class you repeat the character class but not matched character.
Regex "[0-9]+" means one or more digits and matches 2 and 22 and 345 and 3483893123.
Suppose we want to find all tokens separated by comma. The regex is "[^,]+". This regex means any one or more characters that is not a comma. But in java it is easier to use String's split method to split string by any regex.
Let's create a simple regex for date. The only problem is that separator might be space or slash or hyphen or dot. 
But we can handle this range of possible options with character class. The regex is 
\d\d[ /-.]\d\d[ /-.]\d\d
Such regex matches 13/12/22 and 13-12-22 and 13.12.22.
Another tricky task. Create regex that matches the text between double quotes(").
The regex is
"[^"]*"
This regex means that after first " we match any char(or no char) except " until second " is found. Note that ^ means that we match also line-break character.

To repeat the matched character you should use backreferences
Regex "([0-9])\\1+" means that we match single digit([0-9]) followed by the same digit because of \\1 and + means that digit matched by \\1 should be at least once or more.
It matches string "555" or "22" but not "456". It also matches 3333 in string "733337".
So, \\1 means the same token as matched by first group enclosed by round brackets - ().
Regex "([0-9])([,=-])\\2\\1" means find token that contains digit followed by , or = or - followed by the same character as matched by second group followed by the same digit as matched by first group. In string "1--13==30==11--0" it matches 1--1 and 3==3 but not 0==1, 1--0.
We can reuse capturing groups multiple times like in this regex:
(\\d)-(\\d) \\1 \\1 \\2 \\2
It matches "3-5 3 3 5 5".

Character class subtraction allows to specify characters which should be excluded from range of characters matched by character class.
Regex "[a-z&&[^bc]]" matches single character from a to z except b and c. The equivalent regex is "[ad-z]"
By using negation ^ in nested character class we excluded b and c from range of matching characters.
Regex "[0-9&&[^345]]" matches any digit except 3,4,5.

Character class intersection allows to match only characters which are present in both outer and nested character classes. It differs from subtraction by absence of negation ^ in nested class.
Regex "[a-z&&[bc]]" matches only b,c because they are present in both classes.
Regex "[0-9&&[345]]" matches only 3,4,5.

Character class union allows to create single character class from two or more character classes.
Regex "[a-z[01]]" matches characters from a to z plus digits 0 and 1. This regex is identical to "[a-z01]".

So, to subtract character class we use && and negation ^ in nested class. To intersect character classes we use &&. To union, we just add nested class.

There are also shorthand character classes:
\\d - is shorthand for [0-9].
\\w - is shorthand for [A-Za-z0-9_]. It's called word character and as you see it
matches all letters, digits and underscore ( _ ).
\\s - is shorthand for [ \t\r\n\f]. It matches any whitespace character.
\\D - is shorthand for [^0-9]. It matches non-digit characters.
\\W - is shorthand for [^A-Za-z0-9_]. It matches any character except letter,digit and underscore.
\\S - is shorthand for [^ \t\r\n\f]. It matches any non-space character.
The last three shorthands are negated versions of first three shorthands.

As we see backslash \ is used for escaping metacharacters and for creating shorthands.
Suppose, we want to match text preceded by backslash. Java requires bakslashes to be doubled - \\. For instance to express \w in java we need to write \\w.
To match backslash as literal we need to escape it. For that one must use regex "\\\\". The first two backslashes indicate escaping and last two indicate backslash. It matches string "\\".
So, in java to match text preceded by backslash(\\) one must write : "\\\\\\w+". First four backslashes match java backslash \\ and last part \\w+ matches word characters.

The dot. Dot or period matches any character except line-break.
To tell regex engine in java that dot should match line-breaks too, you do :
Pattern.compile("myRegex",Pattern.DOTALL);
The dot is not a metacharacter inside character class, so no need to escape it.

Anhors do not match characters. They match position before,after or between characters.
Known anchors are ^ and $. To match start spaces - "^\\s+" and to match trailing spaces - "\\s+$".
Besides DOTALL mode there is also MULTILINE mode. It means that your regex can match start and end of single line with ^ and $. Without MULTILINE mode engine will treat ^ and $ as start and end of entire input.
So the regex :
Pattern.compile("\\s+$",Pattern.MULTILINE);
allows to match whitespaces at the end of each line.

Another anchor is \\b. It matches start and end of word. Start of word is either at the
beginning of string or between non-word(\\W) and word(\\w) character. We know already that word character is any letter, digit and underscore. Anything else is non-word character.
For example, regex 
\\b4\\b
does not have a match in string "44" or "s4", but it has a match in string "s 4 a". Because space is non-word character. 
So, \\b matches between word character and non-word character or void!

Character classes match single character out of several possible characters. Alternation matches single regex out of several possible regexes. Alternation is applied by |.
Regex 
\\b(cat|dog|mouse)\\b
has three matches in string "A dog pursues a cat who pursues a mouse"
How to find all words starting with set or get? The regex is 
\\b(set|get)\\w+\\b
Suppose we need to find all words without digits and with digits only and mix.
The regex is 
(\\b[A-Za-z_-]+\\b)|(\\b\\d+\\b)|(\\b\\w+\\b)
It allows to break each word into group with letters only and with digits only and mix.
Each group is bounded by round brackets. There are 3 groups here. To determine which group next word belongs to, we can use :

Pattern pattern = Pattern.compile("(\\b[A-Za-z_-]+\\b)|(\\b\\d+\\b)|(\\b\\w+\\b)");
Matcher matcher = pattern.matcher("Quick brown fox 2014 jumps over lazy dog_2015");
 while (matcher.find()) {
   String group1 = matcher.group(1);
   if (group1 != null) {    
    // word includes letters and no digits
   }
   String group2 = matcher.group(2);
   if (group2 != null) {
    // word includes digits only
   }
   String group3 = matcher.group(3);
   if (group3 != null) {
    // word includes mix of letters and digits
   }
 }

Optional items. The question mark makes the preceding token optional. Regex 
colou?r
matches both "colour" and "color" because ? implies that u can appear 0 or 1 time.
Regex 
Nov(ember)?
matches both "Nov" and "November".
? is metacharacter that is greedy meaning that it always try to match token before question mark and only if it fails then regex engine omits this token. In string 
"It's Feb 23rd,2003"
the regex
Feb 23(rd)?
matches "Feb 23rd" but not "Feb 23". Because ? is greedy.
To turn off greediness you need to put ? after greedy metacharacter! So, the regex
Feb 23(rd)??
matches "Feb 23" but not "Feb 23rd". Because ? turned off greediness of previous ?.

There are three main repetition operators - ? ,+, *. If you want to set exact number of repetitions you should use curly braces {min,max}. E.g. regex a{1,3}
matches "a","aa","aaa" because a can be repeated 1, 2 or 3 times. Regex 
b{1,}
means that b can be repeated 1 or more times. Therefore 
? is equivalent to {0,1}
+ is equivalent to {1,}
* is equivalent to {0,}
To match number from 100 to 9999 the regex is
\\b[1-9][0-9]{2,3}\\b

Let's note that {},?,+ and * are all greedy operators, that is they try to match as much as possible.
In string 
"This is the first test"
the regex 
.+
will match whole string. But regex 
.+?
will match only first letter Q. Each call of matcher.find(); followed by 
matcher.group(); returns next character of input string.

Suppose we need to match all tokens wrapped by square brackets []. Regex is:
\\[.+\\]
If it's applied to string 
"This was lasting from [Dec,20 2012] to [Jan,20 2013]!"
the result is 
[Dec,20 2012] to [Jan,20 2013]
This happened because dot(.) matches any character and plus(+) tells to match such chars until the end of string. So when end of string is reached regex engine backtracks to the last found closing bracket ] and returns a match.
To avoid this behavior it's needed to turn off greediness by ?. Lazy regex is :
\\[.+?\\]
Result is two separate matches :
[Dec,20 2012]
[Jan,20 2013]
Regex engine now finds only one char by .+? and searches input for next char in regex that is ]. So engine fails to match ] after each character in string "Dec,20 201". 
An alternative approach is to use negated character class as with double qoutes above. Such regex is:
"\\[[^\\]]+\\]"
This option is better than with lazy plus.

Capturing group allows to group part of regex to be used by repetition operators(+,*,?,{}) or alternation(|). They are created by round brackets (). Group number is equal to count of opening bracket from the left. Regex :
((lasting)|(was))
contains three groups. First starts by first (, second - by second ( and third group starts by third (.  In java, match in specific group can be obtained by matcher.split(int groupNo). To understand groups better, run this code :
Pattern pattern = Pattern.compile("((lasting)|(was))");
Matcher matcher = pattern.matcher("This was lasting!");
matcher.find();
String group3 = matcher.group(3);
matcher.find();  
String group2 = matcher.group(2);
System.out.println("group3=" + group3);   
System.out.println("group2=" + group2);
The result is:
group3=was
group2=lasting
But java also provides non-capturing groups. Such group is matched but it doesn't have its number and its match cannot be queried by matcher.group(int groupNo). To create non-capturing group we use (?: as here:
(?:lasting)|(was)
Example is :
Pattern pattern = Pattern.compile("((?:lasting)|(was))");
Matcher matcher = pattern.matcher("This was lasting!");
matcher.find();
String group2 = matcher.group(2);
matcher.find();  
String group1 = matcher.group(1);
System.out.println("group2=" + group2);   
System.out.println("group1=" + group1);
The result is :
group2=was
group1=lasting
Now there are only 2 groups :
1. ((?:lasting)|(was))
2. (was)
(?:lasting) doesn't count as a group!

Backreferences. They allow to match text previously matched by capturing group ().
To match a pair of opening and closing HTML tags we can use backreference to reuse matched name of opening tag. Pair of HTML tags always have the same name. The regex is :
<([a-z][a-z0-9]*)\\b[^>]*>.*?</\\1>
This regex matches tags with attributes in opening tag. Given string:
"This <boo class='booFont'>was <em>lasting</em></boo>!"
Regex matches 
<boo class='booFont'>was <em>lasting</em></boo>

[a-z] ensures that opening tag has at least one letter.
[a-z0-9]* ensures that there are any number of letters or digits in tag name.
([a-z][a-z0-9]*) denotes capturing group whose match is reused by \\1 to have the same name in closing bracket.
\\b is used to eliminate backtracking of [a-z0-9]*. Backtracking is applied when whole match fails. It causes to truncate matched tag name boo to bo at first backtrack and even to b at second backtrack. This truncating can cause to match successfully strings like 
<boo class='booFont'>was <em>lasting</em></bo>
or 
<boo class='booFont'>was <em>lasting</em></b>
They match. Because when \\b is absent backtracking causes [a-z0-9]* to match bo and next part of regex [^>]* matches whatever until closing bracket >. In these strings \\1 match fails at first and then backtracking starts to truncate what was matched by [a-z0-9]* that is from boo to bo and to b.
>.*?< matches whatever after > and before first <. Lazy .* stops to match when first < is found.
/ indicates that it should be closing tag.
\\1 means to match exactly the same text as matched by first capturing group ([a-z][a-z0-9]*). When backtracking first capturing group can match bo but \\b that goes next doesn't match o that goes after bo in input string. This prevents to match opening tag name boo and closing tag name bo or b.


Backreferences use the last saved match of capturing group. Regex 
([abc]+) 
matches "cab" and saves cab into backreference. But regex
([abc])+
matches "cab" and saves b into backreference because b was last matched.

As exercise you can think about regex that matches two the same words separated by space as in string "This was last last time!".

Atomic group is a group to which engine doesn't backtrack. It means that it doesn't do alternation or it doesn't match less characters with greedy quantifiers. Engine doesn't return to atomic group unlike to usual group! If the first match failed engine won't try alternatives. Example will clarify this. Regex with normal group:
([0-9]+?)w
matches string
22w
But the same regex with atomic group
(?>[0-9]+?)w
matches only last two chars
2w
In first regex, engine consumes first digit 2 and goes to match w with next character of input string - 2. This fails. Engine backtracks to group and matches second 2 by [0-9]+. Finally engine matches w to w. Because + is lazy, regex engine consumes only one digit and goes to match next token of regex w with second 2. This fails and engine backtracks to consume second 2.
But in case with atomic group, engine cannot backtrack. That's why, it fails to match from first char and takes next char of input string. So, it tries to match 2w to (?>[0-9]+?)w which succeeds
Another example with atomic group. Regex with normal group:
\\b(Uk|Ukr|Ukraine)\\b
matches string
Ukraine
but regex with atomic group
\\b(?>Uk|Ukr|Ukraine)\\b
doesn't match, because engine doesn't try alternatives. It takes the first option Uk which fails to match "Ukraine" and stops.

Possessive quantifiers work as greedy quantifiers in that they consume as many tokens as possible, but they don't allow engine to backtrack that is truncate matched tokens.
Greedy quantifiers match as many tokens as possible and if overall match fails they truncate(throw away) matched tokens until overall match is found.
Lazy quantifiers match as few tokens as possible and then expand until overall match is found.
Possessive quantifiers match as many tokens as possible, but don't allow to truncate matched tokens. So possessive quantifiers mean either all or nothing. To create possessive quantifier we append + to ?, *, +, {}.
So when we are sure that we need to match as many tokens as possible we can use possessive quantifiers. This regex can be appropriate:
"[^"]*"
It matches whatever between double quotes. [^"] ensures, that we never skip double quote. But if closing double qoute is absent engine starts to backtrack each letter up to opening double quote. To save time we can use such regex:
"[^"]*+"
It ensures that engine never backtracks saving lot's of time.
In previous regex :
<([a-z][a-z0-9]*)\\b[^>]*>.*?</\\1>
we experienced problems with backtracking of [a-z0-9]*. To prevent them we used word boundary \\b.
But more efficient solution is to make [a-z0-9]* possessive:
<([a-z][a-z0-9]*+)[^>]*>.*?</\\1>
This works the same as with \\b but quicker. Now it's impossible for engine to change opening tag name.

You may still wonder how backtracking works. Let's see example of catastrophic backtracking. Regex is:
([a-z]+)([a-z]+)*!
Input string is :
abcdeefghjklmnoprstuwvxyzabcdeefghjklmnoprstuwvxyz
If you try to match this you will have to wait long hours for Matcher to find a match. The reason is backtracking.
([a-z]+) will match all characters in the input string.
([a-z]+)* will match nothing since all characters where matched by previous group. But this is ok because of *.
! will not match since the end of string has been reached. 
Thus whole regex finds no match. 
And here backtracking begins. Engine backtracks first group because second group matched nothing. Engine takes token matched by first group(([a-z]+)) and releases token's last symbol. In our case the last is the last of input string that is z. Next, engine tries to match by second group ([a-z]+)*. It matches released z. Next engine tries to match ! and again fails. Now engine backtracks first group again causing second group to match yz. But it again fails on exclamation mark !. Now engine can backtrack second group, because it matched yz, while it is valid to match only z. So, second group matched only y. But because of asterisk * second group runs again and matches z. Engine again fails on exclamation mark ! and begins to backtrack first group. This is the third time, first group backtracks, thus it releases xyz. xyz is matched by second group. As engine fails on !, it starts to backtrack token xyz of second group. So, it lasts for a very long time for engine to finish backtracking of first group. After 20 minutes of waiting I terminated vm.
The solution to catastrophic backtracking is to use either possessive quantifiers or atomic group. Recall, that they both prohibit backtracking.
To be efficient in regexes you must understand that quantifiers(*,+,?,{}) can backtrack! Lazy quantifiers also backtrack but by expanding their match.

Lookahead and lookbehind are zero-length assertions like start or end of line or word. They allow you to assert that input string contains text matching pattern. They don't consume characters but only return true or false. Lookahead is not treated as capturing group. 
Lookahead and lookbehind can be positive or negative.

Positive lookahead means assert that following(next) text matches pattern. To create it, you use (?=myPattern) construct.
We can check if q is followed by a with positive lookahead. Regex :
q(?=a)[a-z]
matches string "qa". So, (?=a) asserts that text matched by [a-z] is a.
But regex 
q(?=a)
doesn't match "qa". But it matches first char q because it is followed by a which is asserted by (?=a). So, a is not included into match, because lookahead only asserts to true or false, but doesn't consume characters.
What if regex is:
q(?=a)z
and string is:
qaz
Regex fails to match qaz. Because after positive lookahead returned true, engine tries to match z of regex to a of string and fails. a in string wasn't consumed by lookahead.
Suppose, I need to create regex to match password with such rules:
1. at least one capital letter at any position;
2. at least one of [. -_] at any position;
3. at least one digit at any position
4. length is at least 5 characters
The regex is :
^(?=.*?[A-Z])(?=.*?[0-9])(?=.*?[.,_-]).{5,}$
It has 3 positive lookaheads:
(?=.*?[A-Z]) ensures that following string contains one capital letter
(?=.*?[0-9]) ensures that following string contains one digit
(?=.*?[.,_-]) ensures that following string contains one of [.,_-]
.{5,} ensures that string contains any character and at least 5.

Negative lookahead means assert that following text doesn't match pattern. To create it, you use (?!myPattern) construct.
We can check if q is not followed by a with such regex:
q(?!a)[a-z]
matches string "qb".  
To assert that input string doesn't contain line-break characters we use regex:
^(?!.*?\r\n).+$
It is very similar to regex for password. It asserts, that string has at least one char by .+ and doesn't contain line-break at any position - (?!.*?\r\n).

No comments:

Post a Comment