CSC207 Software Design
Lectures
Regular Expressions

Motivation

How to count blank lines?

Most people consider a line with just spaces and tabs to be blank

Examining characters one by one is painful

Use regular expressions instead

Represent patterns as strings

*.txt is a familiar pattern of this kind (a shell wildcard, a simpler cousin of regular expressions)

Warning: the notation is ugly

Only so many characters on the keyboard

Six Simple Patterns

Pattern   Matches              Explanation
a*        '', 'a', 'aa', ...   Zero or more
b+        'b', 'bb', ...       One or more
ab?c      'ac', 'abc'          Optional (zero or one)
[abc]     'a', 'b', 'c'        One from a set
[a-c]     'a', 'b', 'c'        Abbreviation
[abc]*    '', 'accb', ...      Combination
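
The rows of this table can be checked mechanically. The following is a minimal sketch, assuming Python 3.4 or later, where re.fullmatch requires the whole string to match. The helper name matches_all is made up for this illustration.

```python
import re

def matches_all(pattern, candidates):
    """Return True if every candidate string matches the pattern in full."""
    return all(re.fullmatch(pattern, text) is not None for text in candidates)

# Each row of the table: a pattern, then strings it should match completely.
rows = [
    ("a*",     ["", "a", "aa"]),
    ("b+",     ["b", "bb"]),
    ("ab?c",   ["ac", "abc"]),
    ("[abc]",  ["a", "b", "c"]),
    ("[a-c]",  ["a", "b", "c"]),
    ("[abc]*", ["", "accb"]),
]

for pattern, candidates in rows:
    print("%-7s %s" % (pattern, matches_all(pattern, candidates)))

# b+ needs at least one b, so the empty string is rejected.
print(re.fullmatch("b+", "") is None)
```

Adding a string that should not match (for example 'abbc' for ab?c) is a quick way to test your understanding of a row.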

How to Use in Python

Load the re module

Use re.search(pattern, text)

import sys, re

# First argument is the pattern; the rest are strings to test against it.
pat = sys.argv[1]
for text in sys.argv[2:]:
    if re.search(pat, text):
        result = "FOUND"
    else:
        result = "NOT FOUND"
    print(pat, text, result)

$ testMatch "a[bc]*" b ab accb add
a[bc]* b NOT FOUND
a[bc]* ab FOUND
a[bc]* accb FOUND
a[bc]* add FOUND

Note quotes around pattern on command line

Otherwise, shell tries to interpret the '*'

And notice that the pattern matches the last string

A pattern doesn't have to match all of text

In add, a matches the a and [bc]* matches zero characters
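
To see which part of each string was actually matched, one can look at the object that re.search returns. This is a short sketch in Python 3; the group() method it uses is covered under Match Objects.

```python
import re

# The pattern from the example run: an 'a' followed by any number of b's or c's.
pat = "a[bc]*"

for text in ["b", "ab", "accb", "add"]:
    mo = re.search(pat, text)
    if mo:
        # group() is the part of text that matched, not the whole text
        print("%s matched %r" % (text, mo.group()))
    else:
        print("%s: no match" % text)
```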

Anchoring

Force position of match using anchors

^ matches beginning of line

$ matches end

Neither consumes any characters

Pattern   Text    Result
b+        abbc    Matches
^b+       abbc    Fails (no b at start)
^a*$      aabaa   Fails (not all a's)
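
The rows above can be reproduced with re.search. This is a small sketch in Python 3; the last case is an extra row added here to show the anchored pattern succeeding.

```python
import re

# (pattern, text, expected to match?) taken from the table above
cases = [
    ("b+",   "abbc",  True),
    ("^b+",  "abbc",  False),   # no b at the start
    ("^a*$", "aabaa", False),   # not all a's
    ("^a*$", "aaaa",  True),    # extra row: every character is an a
]

results = [re.search(p, t) is not None for p, t, _ in cases]
for (p, t, expected), got in zip(cases, results):
    print("%-5s %-6s %s" % (p, t, "Matches" if got else "Fails"))
```

In Python, $ also matches just before a newline at the very end of the text, which is why anchored patterns work on lines read from a file.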

Escaping

Match actual ^ and $ using escape sequences \^ and \$

Must represent these in strings as "\\^" and "\\$"

Two layers of compilation:

Python/Java turn double backslashes into single backslash character

Regular expression library then compiles single backslash plus something into special operation

Use regular escape sequences for other special characters

"\t" is a tab character

Which matches a tab character

"\\t" is the two-character sequence \t

Which also matches a tab character

\t Tab
\n Newline
\* Asterisk
\\ Backslash
\b Word boundary (between a word character and a non-word character)
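
The two layers can be observed directly by checking string lengths before handing the strings to the re module. This is a sketch in Python 3. The raw-string form r"..." is not mentioned on these slides; it is shown here only as an alternative spelling of the doubled backslash.

```python
import re

tab_char = "\t"      # Python turns this into one character: a real tab
two_chars = "\\t"    # Python turns this into two characters: backslash, t
raw_form = r"\t"     # raw string: the same two characters, no doubling needed

print(len(tab_char), len(two_chars), two_chars == raw_form)

# Both forms match a tab: the first directly, the second via the re library.
text = "a\tb"
print(re.search(tab_char, text) is not None)
print(re.search(two_chars, text) is not None)

# An escaped $ matches the literal character instead of acting as an anchor.
print(re.search("\\$", "cost: $5").start())

# Careful: "\b" in an ordinary string is a backspace, so write "\\b" or r"\b".
print(re.search("\\bcat\\b", "a cat sat") is not None)
print(re.search("\\bcat\\b", "concatenate") is None)
```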

Counting Blank Lines

import sys, re

# start of line, any number of spaces, tabs, carriage returns,
# and newlines, end of line
blank = "^[ \t\r\n]*$"

count = 0
for line in sys.stdin:
    if re.search(blank, line):
        count += 1
print(count)
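
For experimenting without piping a file into the script, the same test can be wrapped in a function that takes any sequence of lines. This is a sketch in Python 3; the name count_blank and the sample lines are invented for illustration.

```python
import re

blank = "^[ \t\r\n]*$"

def count_blank(lines):
    """Count the lines that contain nothing but spaces, tabs, CR, or LF."""
    return sum(1 for line in lines if re.search(blank, line))

sample = ["first\n", "\n", "  \t \n", "last line\n", "\r\n"]
print(count_blank(sample))
```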

Character Sets

Use escape sequences for common character sets

Remember: double backslash in source becomes single backslash in string

\d Digits [0-9]
\w Word [a-zA-Z0-9_]
\s Whitespace [ \t\r\n\f\v]
. Anything except end-of-line [^\n]

Note: the notation [^abc] means "anything except the characters in this set"

Yes, the notation is confusing
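
A few one-line experiments with these sets, as a sketch in Python 3. The sample strings are made up.

```python
import re

# Double backslashes in source become single backslashes in the pattern.
digits = re.search("\\d+", "room 207B").group()
word = re.search("\\w+", "  snake_case9 !").group()
spaces = re.search("\\s+", "a \t b").group()
dots = re.search("a.c", "xabcx").group()

# [^abc] is a negated set: any character other than a, b, or c.
not_abc = re.search("[^abc]+", "abcxyzabc").group()

print(digits, word, repr(spaces), dots, not_abc)
```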

Match Objects

Result of re.search() is a match object, or None if nothing matched

mo.group() returns string that matched

mo.start() and mo.end() are the match's location

import re

mo = re.search("b+", "abbcb")
print(mo.group(), mo.start(), mo.end())
bb 1 3
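
Because a failed search gives None rather than a match object, code that calls group() usually tests the result first. This is a minimal sketch in Python 3; the helper name describe is hypothetical.

```python
import re

def describe(pattern, text):
    """Return (matched string, start, end), or None if nothing matched."""
    mo = re.search(pattern, text)
    if mo is None:
        return None
    return (mo.group(), mo.start(), mo.end())

print(describe("b+", "abbcb"))
print(describe("b+", "aaa"))

# end() is one past the last matched character, so slicing recovers the match.
text = "abbcb"
mo = re.search("b+", text)
print(text[mo.start():mo.end()] == mo.group())
```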

Sub-Matches

All parenthesized sub-patterns are remembered

Text that matched Nth parentheses (counting from left) is group N

import sys, re

# optional spaces, a number (group 1), optional spaces, then a colon
numbered = "\\s*(\\d+)\\s*:"
for line in sys.stdin:
    mo = re.search(numbered, line)
    if mo:
        num = mo.group(1)
        print(num)
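
Group numbering with nested parentheses follows the opening parentheses from left to right. This is a short sketch in Python 3; the groups() call, which returns all groups as a tuple, is part of the re module but is not shown on these slides.

```python
import re

mo = re.search("a(b+(c|d))e", "xabbcex")
whole = mo.group(0)     # group 0 (same as group()) is the entire match
outer = mo.group(1)     # first '(' from the left
inner = mo.group(2)     # second '(' from the left
print(whole, outer, inner, mo.groups())

# A group that took no part in the match is reported as None.
mo2 = re.search("(x)?y", "y")
print(mo2.group(1))
```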

Reverse Two Columns of Numbers

import sys, re

# two numbers separated by whitespace: group 1 and group 2
cols = "\\s*(\\d+)\\s+(\\d+)\\s*"
for line in sys.stdin:
    mo = re.match(cols, line)
    if mo:
        a, b = mo.group(1), mo.group(2)
        print("%s\t%s" % (b, a))
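
The same idea as a function that can be tried on individual strings. This is a sketch in Python 3; the name swap_columns and the sample lines are invented. Note that re.match only tries the pattern at the beginning of the string, whereas re.search looks for it anywhere.

```python
import re

cols = "\\s*(\\d+)\\s+(\\d+)\\s*"

def swap_columns(line):
    """Return the two numbers on the line in reverse order, or None."""
    mo = re.match(cols, line)
    if mo is None:
        return None
    return "%s\t%s" % (mo.group(2), mo.group(1))

for line in ["10 20\n", "   3\t400  \n", "no numbers here\n"]:
    print(repr(swap_columns(line)))
```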

Compiling

Regular expression library compiles patterns into an internal form that is faster to match

Can improve performance by doing this once, and re-using the compiled form

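import sys, re

# skip non-capitals, capture one capitalized word (group 1),
# and capture the rest of the line (group 2) to search again
nameCase = "[^A-Z]*([A-Z][a-z]*)(.*)"
matcher = re.compile(nameCase)
for line in sys.stdin:
    mo = matcher.search(line)
    while mo:
        print(mo.group(1))
        mo = matcher.search(mo.group(2))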

This is a sample document.  It has several words in name case on
the same line.  It was written in August of 2003.

This
It
It
August
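
A compiled pattern offers the same operations as the module-level functions. As an alternative to the loop above (not the approach on the slide), the sketch below uses the compiled pattern's finditer method to walk over every match in a line. It assumes Python 3, and the simpler pattern and the name name_case_words are choices made for this illustration.

```python
import re

# Compile once, reuse for every line.
name_word = re.compile("[A-Z][a-z]*")

def name_case_words(line):
    """Return every capitalized word in the line, left to right."""
    return [mo.group() for mo in name_word.finditer(line)]

text = ("This is a sample document.  It has several words in name case on\n"
        "the same line.  It was written in August of 2003.\n")
for line in text.splitlines():
    for word in name_case_words(line):
        print(word)
```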

How to Use in Java

The java.util.regex package contains:

Pattern: a compiled regular expression

Matcher: applies a Pattern to a particular string and holds the result of the match

Typical usage:

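import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns the letter between a and d if data is exactly "abd" or "acd",
// otherwise null.
public static String matchMiddle(String data) {
    String result = null;
    Pattern p = Pattern.compile("a(b|c)d");
    Matcher m = p.matcher(data);
    // matches() succeeds only if the entire string fits the pattern
    if (m.matches()) {
        result = m.group(1);
    }
    return result;
}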
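
For comparison with the earlier Python examples, here is a rough Python counterpart of matchMiddle. It is a sketch that assumes Python 3.4 or later, where fullmatch plays the role of Java's matches() (both require the entire input to match). The name match_middle is just a transliteration.

```python
import re

pattern = re.compile("a(b|c)d")

def match_middle(data):
    """Return the middle letter if data is exactly 'abd' or 'acd', else None."""
    mo = pattern.fullmatch(data)
    if mo:
        return mo.group(1)
    return None

print(match_middle("abd"), match_middle("acd"), match_middle("xabdx"))
```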

Other Patterns

Pattern Matches
a|b 'a', 'b'
ab|cd 'ab', 'cd'
a(b|c)d 'abd', 'acd'
a{2,3} 'aa', 'aaa'
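
These rows can be tried out as follows. This is a sketch assuming Python 3.4 or later for re.fullmatch; the helper name full is made up.

```python
import re

def full(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# Alternation binds loosely: ab|cd means (ab) or (cd), not a(b|c)d.
print(full("ab|cd", "ab"), full("ab|cd", "acd"))
print(full("a(b|c)d", "acd"), full("a(b|c)d", "ab"))

# {2,3} means at least two and at most three repetitions.
counts = [full("a{2,3}", "a" * n) for n in range(1, 5)]
print(counts)
```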

Other Methods in Module

Module provides many other tools

split(pattern, string, maxsplit=0)   (0 means no limit)

findall(pattern, string)

sub(pattern, repl, string, count=0)   (0 means replace all)
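
A brief sketch of the three functions in use, in Python 3. The sample strings are invented for illustration.

```python
import re

line = "alpha, beta;gamma ,delta"

# split: break the string wherever the pattern matches
fields = re.split("\\s*[,;]\\s*", line)

# maxsplit limits the number of splits; the rest stays in the last piece
first_two = re.split("\\s*[,;]\\s*", line, maxsplit=1)

# findall: every non-overlapping match, as a list of strings
numbers = re.findall("\\d+", "3 apples, 14 pears, 159 plums")

# sub: replace matches; count limits how many are replaced
squeezed = re.sub("\\s+", " ", "too    many   spaces")
once = re.sub("a", "A", "banana", count=1)

print(fields, first_two, numbers, squeezed, once)
```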

Examples for Self-Test

Make sure you understand why each of these does what it does

A - in the Pattern column means the same pattern as the row above

Pattern                     Data                 Result  Groups
a                           a                    match   -
-                           b                    fail    -
a*                          a                    match   -
-                           b                    match   -
ab|cd                       ab                   match   -
(ab|cd)                     ab                   match   g1="ab"
-                           abcd                 match   g1="ab"
ab*                         abbbb                match   -
-                           bbbbb                fail    -
a+b                         aaaab                match   -
-                           b                    fail    -
\w*                         alex                 match   -
-                           -                    match   -
a?b?c?                      c                    match   -
-                           abbc                 match   -
1?[a-c]{2,4}                abc                  match   -
-                           1abcc                match   -
ba{3,}                      babababa             fail    -
-                           baba                 fail    -
th.*s                       the word that is     match   -
\d+ street|\d+\s\w+         50 street            match   -
-                           50 St George Street  match   -
-                           1 stgeorgestreet     match   -
-                           1 streetstreet50     match   -
\s*(\d+)([\w\s]*)           50 St George Street  match   g1="50", g2=" St George Street"
\s*(\d+)\s*([\w\s]*)        50 St George Street  match   g1="50", g2="St George Street"
a(b+(c|d))e                 abbce                match   g1="bbc", g2="c"
-                           abde                 match   g1="bd", g2="d"
csc{1,1}\d{3,3}f|s\d        csc207f1             match   -
-                           csc209s              fail    -
(2*(3|4+)[2-4](a|3.*4))     433ha14              match   g1="433ha14", g2="4", g3="3ha14"
-                           2343af4              match   g1="2343af4", g2="3", g3="3af4"
(a(ab)*)*                   a                    match   g1="a", g2="None"
-                           aabaab               match   g1="aab", g2="ab"
\w+\s+[a-z]+\s+=\s+\d+\s+;  int i=5;             fail    -
-                           double digit = 3;    fail    -
-                           string name = test;  fail    -
[^a-y]+                     z                    match   -
-                           b                    fail    -
(1)*2(3)+(4|6*)             1123                 match   g1="1", g2="3", g3=""
-                           23334                match   g1="None", g2="3", g3="4"
-                           1112333346666        match   g1="1", g2="3", g3="4"
(((123*2*)*)4)*             1212124              match   g1="1212124", g2="121212", g3="12"
-                           12333312341234       match   g1="1234", g2="123", g3="123"
-                           1232123              match   g1="None", g2="None", g3="None"
\w+\@\w+\.com               123@123.com          match   -
-                           name9@utoronto.ca    fail    -
^v.*\s{,2}x+$               victor x             match   -
-                           va xt                fail    -
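
To check a row yourself, a small helper along these lines may be useful. It is a sketch in Python 3 that assumes the table's results correspond to re.match, which tries the pattern at the start of the data but does not require it to reach the end. That assumption is not stated on the slide, and the helper name check is made up.

```python
import re

def check(pattern, data):
    """Return ('match', groups) or ('fail', ()) using re.match."""
    mo = re.match(pattern, data)
    if mo is None:
        return ("fail", ())
    return ("match", mo.groups())

print(check("(ab|cd)", "abcd"))
print(check("a(b+(c|d))e", "abde"))
print(check("(1)*2(3)+(4|6*)", "23334"))
print(check("ba{3,}", "babababa"))
```

Groups that did not take part in the match come back as None, which the table writes as "None".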

$Id: regexp.html,v 1.1.1.1 2004/01/04 05:02:31 reid Exp $