CSC401: Regular Expressions

Motivation

How to count blank lines?

Most people consider a line with just spaces and tabs to be blank

How to extract phone numbers from a text file?

Many valid formats: 416-111-2222, (416) 111 2222, etc. but not 4 161 112 222

Examining characters one by one is painful

Use regular expressions instead

Represent patterns as strings

Search for substrings that match patterns

Warning: the notation is ugly

Only so many characters on the keyboard

Matching Characters

Most characters will simply match themselves

Metacharacters are special, and don't match themselves.

Here is a complete list of metacharacters: . ^ $ * + ? { [ ] \ | ( )

Escape metacharacters to match a literal character

E.g. \+ matches a literal plus sign

Character classes

Character classes are specified by square brackets

[abc] or [a-c] will match any of the characters "a", "b", or "c"

[^a-c] matches everything except a to c

N.B. If you want to match a hyphen inside a character class, make sure to escape the hyphen!

Character Sets

Use escape sequences for common character sets

\d Digits [0-9]
\w Words [a-zA-Z0-9_]
\s Space [ \t\n\r\f\v]
. Anything except end-of-line [^\n]
\D Any non-digit character [^0-9]
\W Non-word character [^a-zA-Z0-9_]
\S Non-space character [^ \t\n\r\f\v]

Six Simple Patterns

Pattern Matches Explanation
a* '', 'a', 'aa', ... Zero or more
b+ 'b', 'bb', ... One or more
ab?c 'ac', 'abc' Optional (zero or one)
[abc] 'a', 'b', 'c' One from a set
[a-c] 'a', 'b', 'c' Abbreviation
[abc]* '', 'accb', ... Combination

How to Use in Python

Load the re module

Use re.search(pattern, text)

#!/bin/env python2.3
import sys, re
pat = sys.argv[1]
for text in sys.argv[2:]:
    if re.search(pat, text):
        result = "FOUND"
    else:
        result = "NOT FOUND"
    print pat, text, result
Save the above program as "testMatch.py", and run the below command.
$ testMatch "a.*a.+a" salamander aardvark snuffeluffegus
salamander FOUND
aardvark FOUND
snuffeluffegus NOT FOUND

Note quotes around pattern on command line

Otherwise, shell tries to interpret the '*'

And notice that a pattern doesn't have to match all of the text

Anchoring

Force position of match using anchors

^ matches beginning of line

$ matches end

Neither consumes any characters

Pattern Text Result
b+ abbc Matches
^b+ abbc Fails (no b at start)
^a*$ aabaa Fails (not all a's)

Can use \A and \Z (begin/end of string) instead of ^ and $ if no newlines in the string

Escaping

Match actual ^ and $ using escape sequences \^ and \$

Must represent these in strings as "\\^" and "\\$"

Two layers of compilation:

Python/Java turn double backslashes into single backslash character

Regular expression library then compiles single backslash plus something into special operation

Idea: Get rid of the first layer of compilation

We want our strings to be passed directly into the regular expression engine

To allow this, use "raw" strings

Raw strings leave all backslashes "as-is" in the string

So r"\n": is a literal backslash followed by an n

And r"\"" is a literal backslash followed by a quote

When you include this raw string as part of a regular expression, the regexp engine parses the backslash-n and the backslash-quote and converts them into newlines and quotes.

Common escape sequences

\t Tab
\n Newline
\* Asterisk
\\ Backslash
\b Break between word and space

Escaping Example

Double-backslashing can give you headaches, but it just looks scarier than it trully is

Here's one example where escaping is quite important:

import sys
import re
for line in sys.stdin:
    print line,
    if re.search("\section",line):
        print "FOUND PAT1"
    else:
        print "NOT FOUND PAT1"
    if re.search("\\section",line):
        print "FOUND PAT2"
    else:
        print "NOT FOUND PAT2"
    if re.search("\\\\section",line):
        print "FOUND PAT3"
    else:
        print "NOT FOUND PAT3"
    if re.search(r"\\section",line):
        print "FOUND PAT4"
    else:
        print "NOT FOUND PAT4"
For the input file:
this is a Latex file
\section{Section1}
The output will be:
this is a Latex file
NOT FOUND PAT1
NOT FOUND PAT2
NOT FOUND PAT3
NOT FOUND PAT4
\section{Section1}
NOT FOUND PAT1
NOT FOUND PAT2
FOUND PAT3
FOUND PAT4

And one where it doesn't make too much of a difference:

import sys
import re
for line in sys.stdin:
    print line,
    if re.search("\\t",line):
        print "FOUND PAT1"
    else:
        print "NOT FOUND PAT1"
    if re.search("\t",line):
        print "FOUND PAT2"
    else:
        print "NOT FOUND PAT2"
    if re.search(r"\t",line):
        print "FOUND PAT3"
    else:
        print "NOT FOUND PAT3"
    if re.search(r"\\t",line):
        print "FOUND PAT4"
    else:
        print "NOT FOUND PAT4"
For the input file:
3	5
3\t5
The output will be:
3	5
FOUND PAT1
FOUND PAT2
FOUND PAT3
NOT FOUND PAT4
3\t5
NOT FOUND PAT1
NOT FOUND PAT2
NOT FOUND PAT3
FOUND PAT4

Moral of the story?

Use r"" to keep RE readable!

Counting Blank Lines

import sys, re

# start of line, any number of spaces, tabs, carriage returns,
# and newlines, end of line
blank = r"^\s*$"

count = 0
for line in sys.stdin:
    if re.search(blank, line):
        count += 1
print count

Match Objects

Result of re.search() is a match object

mo.group() returns string that matched

mo.start() and mo.end() are the match's location

mo = re.search(r"b+", "abbcb")
print mo.group(), mo.start(), mo.end()
bb 1 3

Sub-Matches

All parenthesized sub-patterns are remembered

Text that matched Nth parentheses (counting from left) is group N

pat = r"\s*(\w+)\s+(\d+)\s*"
for line in sys.stdin:
    mo = re.search(pat,line)
    if mo:
        name = mo.group(1)
        grade = mo.group(2)
        print name,"has a score of",grade,"on A1"
Input file:
John 80
Anne	92
Dan		34
Output:
John has a score of 80 on A1
Anne has a score of 92 on A1
Dan has a score of 34 on A1

Reverse Two Columns of Numbers

cols = r"\s*(\d+)\s+(\d+)\s*"
for line in sys.stdin:
    mo = re.match(cols, line)
    if mo:
        a, b = mo.group(1), mo.group(2)
        print "%s\t%s" % (b, a)

Match vs. Search

import re
if re.match(r"box","inbox"):
    print "match succeeded"
else:
    print "match failed"    
if re.search(r"box","inbox"):
    print "search succeeded"
else:
    print "search failed"
match failed
search succeeded

Compiling

Regular expression library compiles patterns into more concise form for matching

Can improve performance by doing this once, and re-using the compiled form

The compiled pattern is an object that has RE methods

nameCase = r"[^A-Z]*([A-Z][a-z]*)(.*)"
matcher = re.compile(nameCase)
for line in sys.stdin:
    mo = matcher.search(line)
    while mo:
        print mo.group(1)
        mo = matcher.search(mo.group(2))

What does the program output given the following input:

Everybody loves CSC401.
Everybody
C
S
C

What does this program do?

It is supposed to print all of the words starting with capital letters.
... but it doesn't include numbers as part of a word.
Don't make this mistake on your assignments!

Other Patterns

Pattern Matches
a|b 'a', 'b'
ab|cd 'ab', 'cd'
a(b|c)d 'abd', 'acd'
(\w)\1 'aa', 'bb', 'zz', '11', ...
a{2,3} 'aa', 'aaa'

Search and Replace

re.sub(pattern=..., replacement=... , string=...)

Returns the string obtained by replacing every non-overlapping occurrence of the pattern in the string by the replacement.

If the pattern isn't found, string is returned unchanged.

Example: replace colour names with the word "colour"

>>> p.sub( '(blue|white|red)', 'colour', 'blue socks and red shoes')
'colour socks and colour shoes'
>>> p.sub( '(blue|white|red)', 'colour', 'blue socks and red shoes', count=1)
'colour socks and red shoes'

Empty matches are replaced only when they're not adjacent to a previous match

>>> import re
>>> re.sub('x*','-', 'abxd')
'-a-b-d-'

You will need to expand clitics in assignment one.

Splitting strings

re.split(pattern=..., string=..)

Splits string into a list of substrings separated by the pattern in the regular expression.

Example

>>> import re
>>> re.split(r'\s+', 'The Cat in the Hat');
['The','Cat','in','the','Hat']
>>> re.split(r'(\s+)', 'The Cat in the Hat');
['The',' ', 'Cat', ' ', 'in', ' ', 'the', ' ', 'Hat']
>>> re.split(r'(a)(t)', 'The Cat in the Hat');
['The C', 'a', 't', ' in the H', 'a', 't', '']
>>> re.split(r'a(t)', 'The Cat in the Hat');
['The C', 't', ' in the H', 't', '']

Note: groups in patterns result in the matched sub-string being interleaved in the split list

Hint: This is good for tokenizing sentences

Another Split example

Remove whitespaces in time representation from a scheduler

(only whitespaces between a colon and a digit)

Input file:

Dentist 10: 30AM
CSC 401 8:15PM

Solution 1:

pat = re.compile('(:)\s+(\d+)')
for line in sys.stdin:
    print pat.split(line)
['Dentist 10', ':', '30', 'AM\n']
['CSC 401 8:15PM\n']

That didn't look too good, let's try Solution 2:

pat = re.compile('(:)\s+(\d+)')
for line in sys.stdin:
    print "".join(pat.split(line))
Dentist 10:30AM
CSC 401 8:15PM

And now Solution 3, using only REs:

pat1 = re.compile(':\s+\d')
def rem_spaces_in_hour(mo):
    return mo.group(0)[0]+mo.group(0)[-1]
for line in sys.stdin:
    print pat1.sub(rem_spaces_in_hour,line)
Dentist 10:30AM
CSC 401 8:15PM

Other Methods in Module

Module provides many other tools

findall(pattern, string)

finditer(pattern, string)

sub(old, new, string, count=all)

You'll need these for your assignment


Slides originally created by Greg Wilson. Initial adaptation for CSC401 by David James. Revisions by Michelle Craig, Michael Szamosi, Karen Reid, and David James. Revisions for CSC401 Winter 2006 by Cosmin Munteanu.