Motivation
How to count blank lines?
Most people consider a line with just spaces and tabs to be blank
Examining characters one by one is painful
Use regular expressions instead
Represent patterns as strings
*.txt is a familiar example (strictly a shell wildcard, a simpler cousin of regular expressions)
Warning: the notation is ugly
Only so many characters on the keyboard
Six Simple Patterns
| Pattern | Matches | Explanation |
| a* | '', 'a', 'aa', ... | Zero or more |
| b+ | 'b', 'bb', ... | One or more |
| ab?c | 'ac', 'abc' | Optional (zero or one) |
| [abc] | 'a', 'b', 'c' | One from a set |
| [a-c] | 'a', 'b', 'c' | Abbreviation |
| [abc]* | '', 'accb', ... | Combination |
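A quick way to check the table is to try each pattern against whole strings. The sketch below assumes re.fullmatch, which these notes do not cover and which succeeds only when the pattern matches the entire string; the helper name matches_all is made up for illustration.

```python
import re

def matches_all(pattern, text):
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

# (pattern, strings that should match, strings that should not)
examples = [
    ("a*",     ["", "a", "aa"],  ["b"]),
    ("b+",     ["b", "bb"],      [""]),
    ("ab?c",   ["ac", "abc"],    ["abbc"]),
    ("[abc]",  ["a", "b", "c"],  ["d", "ab"]),
    ("[a-c]",  ["a", "b", "c"],  ["d"]),
    ("[abc]*", ["", "accb"],     ["abd"]),
]

for pattern, good, bad in examples:
    assert all(matches_all(pattern, s) for s in good)
    assert not any(matches_all(pattern, s) for s in bad)
    print(pattern, "behaves as described")
```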
How to Use in Python
Load the re module
Use re.search(pattern, text)
    import sys, re

    pat = sys.argv[1]                 # first argument is the pattern
    for text in sys.argv[2:]:         # remaining arguments are strings to test
        if re.search(pat, text):
            result = "FOUND"
        else:
            result = "NOT FOUND"
        print(pat, text, result)
$ testMatch "a[bc]*" b ab accb add
a[bc]* b NOT FOUND
a[bc]* ab FOUND
a[bc]* accb FOUND
a[bc]* add FOUND
Note quotes around pattern on command line
Otherwise, shell tries to interpret the '*'
And notice that the pattern matches the last string
A pattern doesn't have to match all of text
In 'add', a matches the first character and [bc]* matches zero characters
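To see which part of each string the pattern actually matched, one option is to print the matched text. This is only a sketch: it uses the group() method of the object returned by re.search, and the test strings are the ones from the run above.

```python
import re

pat = "a[bc]*"
for text in ["b", "ab", "accb", "add"]:
    mo = re.search(pat, text)
    # mo is None when nothing matched; otherwise group() is the matched text
    matched = mo.group() if mo else None
    print(repr(text), "->", repr(matched))

# 'b' -> None
# 'ab' -> 'ab'
# 'accb' -> 'accb'
# 'add' -> 'a'
```

Only the leading a of 'add' is matched, which is enough for the search to succeed.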
Anchoring
Force position of match using anchors
^ matches at the beginning of the text (or of each line in multi-line mode)
$ matches at the end (or just before a trailing newline)
Neither consumes any characters
| Pattern | Text | Result |
| b+ | abbc | Matches |
| ^b+ | abbc | Fails (no b at start) |
| ^a*$ | aabaa | Fails (not all a's) |
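The rows of the table can be checked directly with re.search. The last case is an extra one, not in the table, added to show an anchored pattern that succeeds.

```python
import re

cases = [
    ("b+",   "abbc",  True),   # b's found in the middle
    ("^b+",  "abbc",  False),  # text does not start with b
    ("^a*$", "aabaa", False),  # the b breaks the run of a's
    ("^a*$", "aaaa",  True),   # extra case: nothing but a's
]
for pattern, text, expected in cases:
    found = re.search(pattern, text) is not None
    assert found == expected
    print(pattern, text, "Matches" if found else "Fails")
```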
Escaping
Match actual ^ and $ using escape sequences \^ and \$
Write these in ordinary string literals as "\\^" and "\\$"
Two layers of compilation:
Python/Java turn double backslashes into single backslash character
Regular expression library then compiles single backslash plus something into special operation
Use regular escape sequences for other special characters
"\t" is a tab character
Which matches a tab character
"\\t" is the two-character sequence \t
Which also matches a tab character
| Pattern | Matches |
| \t | Tab |
| \n | Newline |
| \* | Asterisk |
| \\ | Backslash |
| \b | Word boundary (between a word character and a non-word character) |
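The two layers can be seen by printing string lengths. The sketch below also shows Python's raw strings (the r"..." form), which are not mentioned in these notes but are a common way to avoid doubling backslashes; the sample texts are made up.

```python
import re

tab_char = "\t"      # one character: an actual tab
tab_escape = "\\t"   # two characters: backslash, t
tab_raw = r"\t"      # raw string: the same two characters as "\\t"

print(len(tab_char), len(tab_escape), tab_escape == tab_raw)  # 1 2 True

# All three forms match the tab at index 4
text = "name\tvalue"
for pattern in (tab_char, tab_escape, tab_raw):
    assert re.search(pattern, text).start() == 4

# A literal $ or ^ needs the backslash to reach the regex library
price = "cost: $5 (2^3 units)"
print(re.search("\\$", price).start())   # 6
print(re.search(r"\^", price).start())   # 11
```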
Counting Blank Lines
    import sys, re

    # start of line, any number of spaces, tabs, carriage returns,
    # and newlines, end of line
    blank = "^[ \t\r\n]*$"
    count = 0
    for line in sys.stdin:
        if re.search(blank, line):
            count += 1
    print(count)
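The same idea can be wrapped in a function so it can be tried without piping a file into the program. This is a sketch: count_blank is a hypothetical helper name and the sample lines are invented.

```python
import re

BLANK = "^[ \t\r\n]*$"   # same pattern as in the program above

def count_blank(lines):
    """Count the lines that contain only spaces, tabs, CRs, and newlines."""
    return sum(1 for line in lines if re.search(BLANK, line))

sample = ["first\n", "\n", "  \t \n", "last line\n", "\r\n"]
print(count_blank(sample))   # 3
```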
Character Sets
Use escape sequences for common character sets
Remember: double backslash in source becomes single backslash in string
| Pattern | Matches | Equivalent (ASCII) |
| \d | Digits | [0-9] |
| \w | Word characters | [a-zA-Z0-9_] |
| \s | Whitespace | [ \t\r\n\f\v] |
| . | Anything except end-of-line | [^\n] |
Note: the notation [^abc] means "anything except the characters in this set"
Yes, the notation is confusing
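A short illustration of each set, using re.findall (listed under Other Methods in Module) to collect every match. The sample text is made up.

```python
import re

text = "Room 42b,\tfloor_3"

print(re.findall("\\d", text))        # ['4', '2', '3']
print(re.findall("\\w+", text))       # ['Room', '42b', 'floor_3']
print(re.findall("\\s", text))        # [' ', '\t']
print(re.findall("[^abc]", "abcde"))  # ['d', 'e']
```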
Match Objects
Result of re.search() is a match object, or None if there is no match
mo.group() returns string that matched
mo.start() and mo.end() are the match's location
    mo = re.search("b+", "abbcb")
    print(mo.group(), mo.start(), mo.end())

    bb 1 3
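Because re.search returns None when nothing matches, calling group() on the result without checking raises an error. A minimal sketch of a guard, with describe as a hypothetical helper name:

```python
import re

def describe(pattern, text):
    """Return (matched text, start, end), or None when there is no match."""
    mo = re.search(pattern, text)
    if mo is None:
        return None
    return mo.group(), mo.start(), mo.end()

print(describe("b+", "abbcb"))   # ('bb', 1, 3)
print(describe("x+", "abbcb"))   # None
```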
Sub-Matches
All parenthesized sub-patterns are remembered
Text matched by the Nth parenthesized group (counting opening parentheses from the left) is group N; group 0 is the whole match
    import sys, re

    # optional spaces, a number (group 1), optional spaces, a colon
    numbered = "\\s*(\\d+)\\s*:"
    for line in sys.stdin:
        mo = re.search(numbered, line)
        if mo:
            num = mo.group(1)
            print(num)
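Group numbering with nested parentheses is easier to follow on a small case. The pattern and text below are invented for illustration; groups() returns all numbered groups as a tuple.

```python
import re

# Opening parentheses, left to right: group 1, group 2, group 3, group 4
mo = re.search("(\\d+)-((\\d+)-(\\d+))", "item 12-34-56 shipped")

print(mo.group(0))               # 12-34-56 (the whole match)
print(mo.group(1))               # 12
print(mo.group(2))               # 34-56
print(mo.group(3), mo.group(4))  # 34 56
print(mo.groups())               # ('12', '34-56', '34', '56')
```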
Reverse Two Columns of Numbers
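    import sys, re

    # two whitespace-separated numbers; re.match only matches at the start of the line
    cols = "\\s*(\\d+)\\s+(\\d+)\\s*"
    for line in sys.stdin:
        mo = re.match(cols, line)
        if mo:
            a, b = mo.group(1), mo.group(2)
            print("%s\t%s" % (b, a))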
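The program above calls re.match rather than re.search. The difference, sketched on made-up strings: re.search looks for a match anywhere in the text, while re.match only tries at the very beginning.

```python
import re

found_anywhere = re.search("\\d+", "abc 123")   # digits found later in the text
found_at_start = re.match("\\d+", "abc 123")    # text does not start with a digit

print(found_anywhere is not None)   # True
print(found_at_start is not None)   # False
print(re.match("\\d+", "123 abc").group())   # 123
```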
Compiling
Regular expression library compiles each pattern into an internal form that is faster to match
Can improve performance by doing this once, and re-using the compiled form
    import sys, re

    nameCase = "[^A-Z]*([A-Z][a-z]*)(.*)"
    matcher = re.compile(nameCase)    # compile once, outside the loop
    for line in sys.stdin:
        mo = matcher.search(line)
        while mo:
            print(mo.group(1))
            # search again in whatever followed the word just found
            mo = matcher.search(mo.group(2))

Input:

    This is a sample document. It has several words in name case on
    the same line. It was written in August of 2003.

Output:

    This
    It
    It
    August
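A compiled pattern offers the same methods as the module itself. As a sketch of a different technique from the loop above, the version below compiles a simpler pattern once and lets findall collect every capitalized word on each line; the names name_case and words are illustrative.

```python
import re

name_case = re.compile("[A-Z][a-z]*")   # compile once, reuse for every line

lines = [
    "This is a sample document. It has several words in name case on",
    "the same line. It was written in August of 2003.",
]
words = []
for line in lines:
    words.extend(name_case.findall(line))
print(words)   # ['This', 'It', 'It', 'August']
```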
How to Use in Java
The java.util.regex package contains:
Pattern: a compiled regular expression
Matcher: applies a Pattern to a particular input and holds the result of the match
Typical usage:
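    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public static String matchMiddle(String data) {
        String result = null;
        Pattern p = Pattern.compile("a(b|c)d");
        Matcher m = p.matcher(data);
        // matches() succeeds only if the whole of data matches the pattern
        if (m.matches()) {
            result = m.group(1);
        }
        return result;
    }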
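For comparison, a rough Python counterpart of matchMiddle. Java's matches() requires the whole input to match, so this sketch assumes re.fullmatch is the closest equivalent; match_middle and MIDDLE are illustrative names.

```python
import re

MIDDLE = re.compile("a(b|c)d")

def match_middle(data):
    """Return the middle letter if data is exactly 'abd' or 'acd', else None."""
    mo = MIDDLE.fullmatch(data)
    return mo.group(1) if mo else None

print(match_middle("abd"))     # b
print(match_middle("xacdx"))   # None
```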
Other Patterns
| Pattern | Matches |
| a|b | 'a', 'b' |
| ab|cd | 'ab', 'cd' |
| a(b|c)d | 'abd', 'acd' |
| a{2,3} | 'aa', 'aaa' |
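The table can be checked with whole-string matching. The sketch below assumes re.fullmatch, and whole is a made-up helper; it also shows that | binds more loosely than sequencing, which is why parentheses are needed in a(b|c)d.

```python
import re

def whole(pattern, text):
    """Return True if pattern matches all of text."""
    return re.fullmatch(pattern, text) is not None

# ab|cd means (ab) or (cd), not a(b|c)d
print(whole("ab|cd", "ab"), whole("ab|cd", "acd"))       # True False
print(whole("a(b|c)d", "acd"), whole("a(b|c)d", "ab"))   # True False

# {2,3} means two or three repetitions
print([n for n in range(1, 5) if whole("a{2,3}", "a" * n)])   # [2, 3]
```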
Other Methods in Module
Module provides many other tools
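re.split(pattern, string, maxsplit=0): split string at each match (0 means no limit)
re.findall(pattern, string): list of all non-overlapping matches
re.sub(pattern, repl, string, count=0): replace matches with repl (0 means replace all)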
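A brief sketch of the three functions on an invented string. The limits are passed by keyword (maxsplit and count), as named in the Python re module.

```python
import re

text = "alpha, beta;gamma ,delta"
sep = "\\s*[,;]\\s*"   # a comma or semicolon with optional spaces around it

print(re.split(sep, text))               # ['alpha', 'beta', 'gamma', 'delta']
print(re.split(sep, text, maxsplit=1))   # ['alpha', 'beta;gamma ,delta']
print(re.findall("[a-z]+", text))        # ['alpha', 'beta', 'gamma', 'delta']
print(re.sub("[,;]", "|", text))         # alpha| beta|gamma |delta
print(re.sub("[,;]", "|", text, count=1))   # alpha| beta;gamma ,delta
```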
Examples for Self-Test
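Make sure you understand why each of these does what it does
A pattern does not have to match all of the data to succeed; the results are consistent with re.search()
A - in the Pattern column means the same pattern as the row above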
| Pattern | Data | Result | Groups |
| a | a | match | - |
| - | b | fail | - |
| a* | a | match | - |
| - | b | match | - |
| ab|cd | ab | match | - |
| (ab|cd) | ab | match | g1="ab" |
| - | abcd | match | g1="ab" |
| ab* | abbbb | match | - |
| - | bbbbb | fail | - |
| a+b | aaaab | match | - |
| - | b | fail | - |
| \w* | alex | match | - |
| - | - | match | - |
| a?b?c? | c | match | - |
| - | abbc | match | - |
| 1?[a-c]{2,4} | abc | match | - |
| - | 1abcc | match | - |
| ba{3,} | babababa | fail | - |
| - | baba | fail | - |
| th.*s | the word that is | match | - |
| \d+ street|\d+\s\w+ | 50 street | match | - |
| - | 50 St George Street | match | - |
| - | 1 stgeorgestreet | match | - |
| - | 1 streetstreet50 | match | - |
| \s*(\d+)([\w\s]*) | 50 St George Street | match | g1="50", g2=" St George Street" |
| \s*(\d+)\s*([\w\s]*) | 50 St George Street | match | g1="50", g2="St George Street" |
| a(b+(c|d))e | abbce | match | g1="bbc", g2="c" |
| - | abde | match | g1="bd", g2="d" |
| csc{1,1}\d{3,3}f|s\d | csc207f1 | match | - |
| - | csc209s | fail | - |
| (2*(3|4+)[2-4](a|3.*4)) | 433ha14 | match | g1="433ha14", g2="4", g3="3ha14" |
| - | 2343af4 | match | g1="2343af4", g2="3", g3="3af4" |
| (a(ab)*)* | a | match | g1="a", g2=None |
| - | aabaab | match | g1="aab", g2="ab" |
| \w+\s+[a-z]+\s+=\s+\d+\s+; | int i=5; | fail | - |
| - | double digit = 3; | fail | - |
| - | string name = test; | fail | - |
| [^a-y]+ | z | match | - |
| - | b | fail | - |
| (1)*2(3)+(4|6*) | 1123 | match | g1="1", g2="3", g3="" |
| - | 23334 | match | g1=None, g2="3", g3="4" |
| - | 1112333346666 | match | g1="1", g2="3", g3="4" |
| (((123*2*)*)4)* | 1212124 | match | g1="1212124", g2="121212", g3="12" |
| - | 12333312341234 | match | g1="1234", g2="123", g3="123" |
| - | 1232123 | match | g1=None, g2=None, g3=None |
| \w+\@\w+\.com | 123@123.com | match | - |
| - | name9@utoronto.ca | fail | - |
| ^v.*\s{,2}x+$ | victor x | match | - |
| - | va xt | fail | - |
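One way to check answers is to run a few rows through Python. This sketch assumes the table's results correspond to re.search and compares the tuple returned by groups() (with None for a group that took no part in the match) against the Groups column; the rows list is a hand-picked subset.

```python
import re

# (pattern, data, expected groups, or None if the match should fail)
rows = [
    ("(ab|cd)",         "abcd",  ("ab",)),
    ("a(b+(c|d))e",     "abbce", ("bbc", "c")),
    ("(a(ab)*)*",       "a",     ("a", None)),
    ("(1)*2(3)+(4|6*)", "1123",  ("1", "3", "")),
    ("ba{3,}",          "baba",  None),
    ("[^a-y]+",         "b",     None),
]
for pattern, data, expected in rows:
    mo = re.search(pattern, data)
    actual = mo.groups() if mo else None
    assert actual == expected, (pattern, data, actual)
    print(pattern, data, "match" if mo else "fail", actual)
```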
$Id: regexp.html,v 1.1.1.1 2004/01/04 05:02:31 reid Exp $