MIE453 - Bioinformatics Systems (Fall 06)

Tutorial 4 - Regular Expression

Contents

Overview of Regexp
Match operators and patterns
More on patterns
A Bigger Example
Another One

1. Overview of Regexp

Regular expression describe patterns in strings.
- A pattern may match many different strings.
- A pattern may only match a part of a string.
Use of regular expression in Perl
- In match operator m//, to match a string with a pattern
- In substitution operator s///, to change a string, if a match is found
- In translation operator tr///, to transliterate several characters into replacement characters throughout a string
Regular expression are case-sensitive (unless explicitly told otherwise)

2. Match operators and patterns

Match operator

Matches a string on the left of the operator with a pattern enclosed in the two forward slashes.

Syntax

$VAR =~ m/PATTERN/

Notice:

=~ is the binding operator: it binds a pattern match to a string
this expression returns a boolean value: True when a match is found, False otherwise.
if a match is found, the matched pattern is stored in the special variable $& (ampersand).
m is optional

Example: Match operator

while (<>) {
	chomp $_;
	if ($_ =~ /GAATTC/) {
		print "Found an EcoRI site! \n";
		$sites++;
	} elsif ($_ eq '') {
		last;
	}
}

Example: Match operator

$dna = 'cggcatgcaatattcc';
if ($dna =~ /ca*[gt]at*/) {
	print $&;
}

Substitution operator

Replace the part of the string that matches the specified pattern with supplied string

Syntax

$VAR =~ s/PATTERN/STR/g

Notice:

if a match is found, the matched pattern in $VAR is replaced by STR
g means global, i.e., replace every matched pattern (not just the first one)

Example: Transcribe DNA to RNA

#!/usr/bin/perl -w
# Transcribing DNA into RNA

# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";

print "$DNA\n\n";

# Transcribe the DNA to RNA by substituting all T's with U's.
$RNA = $DNA;

$RNA =~ s/T/U/g;

# Print the RNA onto the screen
print "Here is the result of transcribing the DNA to RNA:\n\n";

print "$RNA\n";

exit;

Transliterate operator

Translate a set of characters into a new set of characters. Each character in the original set is translated into the corresponding character at the same potision in the new set.

Syntax

$VAR =~ tr/CHAR_LIST/NEW_CHAR_LIST/

Notice:

if a match is found, the matched pattern in $VAR is replaced by STR
g means global, i.e., replace every matched pattern (not just the first one)

Example: Reverse complement of DNA strand

#!/usr/bin/perl -w
# Calculating the reverse complement of a strand of DNA

# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

print "Here is the starting DNA:\n\n$DNA\n\n";

# Calculate the reverse complement
# XXX  Warning: this attempt will fail!  XXX
#
# First, copy the DNA into new variable $revcom 
# (short for REVerse COMplement)
# Notice that variable names can use lowercase letters like
# "revcom" as well as uppercase like "DNA".  
# In fact, lowercase is more common.
#
# It doesn't matter if we first reverse the string and then
# do the complementation; or if we first do the complementation
# and then reverse the string.  Same result each time.
# So when we make the copy we'll do the reverse in the same statement.

$revcom = reverse $DNA;

# Next substitute all bases by their complements,
# A->T, T->A, G->C, C->G

$revcom =~ s/A/T/g;
$revcom =~ s/T/A/g;
$revcom =~ s/G/C/g;
$revcom =~ s/C/G/g;

# Print the reverse complement DNA onto the screen
print "Here is the reverse complement DNA:\n\n$revcom\n";

# Our reverse complement should have all the bases in it, since the
# original DNA had all the bases-but ours only has A and G!
#
# Do you see why?
#
# The problem is that the first two substitute commands above change
# all the A's to T's (so there are no A's) and then all the
# T's to A's (so all the original A's and T's are all now A's).
# Same thing happens to the G's and C's all turning into G's.

print "\nThat was a bad algorithm, and the reverse complement was wrong!

\n";
print "Try again ... \n\n";

# Make a new copy of the DNA (see why we saved the original?)
$revcom = reverse $DNA;

# The Perl translate/transliterate command is just what we need:
$revcom =~ tr/ACGTacgt/TGCAtgca/;

print "Here is the reverse complement DNA:\n\n$revcom\n";

print "\nThis time it worked!\n\n";
exit;

Metacharacter

There two kinds of characters in regular expressions

Characters that match themselves, such as 'a', 'Z'
Metacharacters that have special meanings
- to match metacharacters literally, preced them with the backslash (\)

Escaping

A backslash \ before a metacharacter cause it to match itself

Example: /\\/ matches '\' in the string

Alternation

The pipe | indicates alternation

Example: /ab|cd/ matches 'ab' or 'cd'

Repetition

The Kleene star * after an item means that the item appears 0 or more times

Example: /ab*c/ matches 'ac', 'abc', 'abbbbbbc', etc

The + metacharacter after an item means that the item appear 1 or more thems

Example: /ab+c/ matches 'abc', 'abbbbbbc', etc, but not 'ac',

The ? metacharacter after an item means that the item appears 0 or 1 times

Example: /ab?c/ matches 'ac', 'abc', etc, but nothing else

The brace {} specify exactly the number times an item appears

Example: /ab{2, 4}c/ matches 'abbc', 'abbbc', or 'abbbbc', but nothing else

Minimal repetition match

Regular expression with repetition quantifiers (i.e. ?, +, * {MIN, }, {MIN, MAX}) match as many items as possible.

Example:

'abc a abc a abc a' =~ /abc.*a/;
print $&;

Output: abc a abc a abc a

Follow each of the repetition quantifiers with a ? to override this behavior to match as few items as possible.

Example:

'abc a abc a abc a' =~ /abc.*?a/;
print $&;

Output: abc a

Grouping

The parentheses () provide grouping

Example:

/a(b|c|d)e/ matches 'abe', 'ace' or 'ade'
/a(bc)*/ matches 'a', 'abc', 'abcbcbc', etc

Character Class

The square brackets [] specify a character class, which matches exactly one character as specified.

Example:

/[abc]/ matches 'a', 'b' or 'c'
/[a-zA-Z]/ matches any lowercase or uppercase letter
/[0-9]/ matches any digit
/[^0-9]/ matches any non-digit character
/./ matches any character except a newline

Matching Beginning and End

The caret ^ and dollar sign $, when used in regular expression don't match a character; rather then ^ asserts that the item that follows must be at the beginning of the string, $ asserts that the item that precedes it must be at the end of the string (or before the final newline).

Example:

/^abc/ matches any string that starts with 'abc'
/abc$/ matches any string that ends with 'abc'

Escaping

A backslash \ before a metacharacter cause it to match itself

Metasymbols

Metasymbols are sequences of two or more characters consisting of backslashes before normal characters. They have special meaning in Perl regular expressions.

Symbol	Meaning
\r	Carriage return
\s	Any whitespace (\n, \t, \r)
\S	Any non-whitespace
\t	Tab character
\w	Any "word" character (alphanumeric plus _)
\W	Any nonword character
\d	Any digit

3. More on Patterns

Pattern Modifiers

Pattern modifiers are single-letter commands placed after the forward slashes. They are used to change the behavior of some regular expression features.

Modifier	Meaning
/i	Ignore uper- or lowercase distinctions
/s	Let . match newline
/g	Find all matches, not just the first one

Example: Find all matches

$dna = 'cggcatgcaatattcc';
print "DNA Sequence: $dna\n";
print "Pattern to match: c.{2}\n";
while ($dna =~ /c.{2}/g) {
	print "Matched pattern = ", $&," ending at position ", pos($dna)-1, "\n";
}

Notice: function pos always returns the position just past the match

Capturing Matched Patterns

Retrieve entire matached pattern

Use the special ampersand variable ($&) (see above).

Retrieve partial matached pattern

Use parenthese around parts of the pattern

can place as many pairs of parenthese in the regular expression
retrieve partial matched patterns using special variables $1, $2, ...
matches are numbered in order of the left-to-right appearce of their opening parenthesis

Example: Capture Matched Pattrens

$dna = 'cggcatgcaatattcc';
if ($dna =~ /(cg*(.*)ca*)(ta)/) {
	print "Entire match = ", $&, "\n";
	print "First pattern = ", $1, "\n"; 
	print "Second pattern = ", $2, "\n"; 
	print "Third pattern = ", $3, "\n";
}

Output:

Entire match = cggcatgcaata
First pattern = cggcatgcaa
Second pattern = catg
Third pattern = ta

4. A Bigger Example

Search data for a motif

#!/usr/bin/perl -w
# Searching for motifs

# Ask the user for the filename of the file containing
# the protein sequence data, and collect it from the keyboard
print "Please type the filename of the protein sequence data: ";

$proteinfilename = ;

# Remove the newline from the protein filename
chomp $proteinfilename;

# open the file, or exit
unless ( open(PROTEINFILE, $proteinfilename) ) {
    print "Cannot open file \"$proteinfilename\"\n\n";
    exit;
}

# Read the protein sequence data from the file, and store it
# into the array variable @protein
@protein = <PROTEINFILE>;

# Close the file since we've read all the data into @protein now.
close PROTEINFILE;

# Put the protein sequence data into a single string, as it's easier
# to search for a motif in a string than in an array of
# lines (what if the motif occurs over a line break?)
$protein = join( '', @protein);

# Remove whitespace
$protein =~ s/\s//g;

# In a loop, ask the user for a motif, search for the motif,
# and report if it was found.
# Exit if no motif is entered.

do {
    print "Enter a motif (one line) to search for: ";
    $motif = ;
    chomp $motif;

    # Look for the motif
    if ( $protein =~ /$motif/ ) {
        print "Found it!\n\n";
    } else {
        print "Not found.\n\n";
    }
} until ( $motif =~ /^\s*$/ );

Output:

Please type the filename of the protein sequence data: fragment.pep
Enter a motif to search for: SVLQ
Found it!

Enter a motif to search for: JKL
Not found.

Enter a motif to search for:

Not found.

This perl script is adopted from the book Beginning Perl for Bioinformatics, James Tisdall, ISBN, 0-596-00080-4, 2001.

4. Another One

Parse a SWISS-PROT entry (for input into MySQL) (Sample entries: 1, 2, 3, Database Table: proteins)

#!/usr/bin/perl -w
# Searching for motifs
#! /usr/bin/perl -w
# get_proteins - given a list of SWISS-PROT files, extract data
# from them in preparation for importation into a database system.
#
# Note that the results produced are TAB-delimited.

use strict;
# $table_line - holds the tab-delimited line
# $code       - holds the protein code
# $species    - holds the species value

my ( $table_line, $code, $species );

# run the statements in the while loop until there are no line arriving from the standard input
while ( <> ) {
	# the current line is assigned to the Perl's default variable $_

	# the current is matched against the pattern to look for the ID line type
	# recall: ? override the behaviours of repetition quantifiers to match as few chars as possible
	# recall: use parentheses to retrieve parts of the matched pattern
	if ( /^ID   (.+)_(.+?) / ) {
		( $code, $species ) = ( $1, $2 );
	}


	# the current is matched against the pattern to look for the AC line type
	if ( /^AC   (.+?);/ ) {
		$table_line = $1 . "\t" . $code . "\t" . $species . "\t";
		# since are only care about the first AC line, we simply disgard the rests of them if present.
		while ( <> ) {
			last unless /^AC/;
		}
	}
	
	# the current is matched against the pattern to look for the LAST DT line type
	if ( /^DT/ ) {
		my $date_line = $_;
		# disgard all the DT lines except the last one
		while ( <> ) {
			last unless /^DT/;
			$date_line = $_;
		}
		$date_line =~ /^DT   (.+?) /;
		# use the subroutine "biodb2mysql" to convert the SWISS-PROT" date format into that of MySQL
		$table_line = $table_line . biodb2mysql( $1 ) . "\t";
	}
	
	# the current is matched against the pattern to look for ALL the DE line type
	if ( /^DE   (.+)/ ) {
		my $descr_lines = $1;
		# find and cancatenate all the DE lines
		while ( <> ) {
			last unless /^DE   (.+)/;
			$descr_lines = $descr_lines . ' ' . $1;
		}
		$table_line = $table_line . $descr_lines . "\t";
	}
	
	# the current is matched against the pattern to look for the SQ line type
	if ( /^SQ   (.+)/ ) {
		my $header = $1;
		# extract the sequence length from the header
		$header =~ /(\d+)/;
		$table_line = $table_line . $header . "\t" . $1 . "\t";
	}
	
	# the current is matched against the pattern to look for ALL sequence data
	if ( /^     (.+)/ ) {
		my $sequence_lines = $1;
		while ( <> ) {
			# the square brackets are used as delimiters around the "//" pattern
			# since they a forward slash character is the default pattern mataching delimiter
			if ( m[^//] ) {
				last;
			} else {
				/^     (.+)/;
				$sequence_lines = $sequence_lines . $1;
			}
		}
		$table_line = $table_line . $sequence_lines;
	}
	
	if ( m[^//] )
	{
		print "$table_line\n";
		$table_line = '';
	}
}

sub biodb2mysql {
    #
    # Given:  a date in DD-MMM-YYYY format.
    # Return: a date in YYYY-MM-DD format.
    #
    # Notes:  the returned date format is supported by MySQL.
    #

    my $original = shift;
    
    $original =~ /(\d\d)-(\w\w\w)-(\d\d\d\d)/;

    my ( $day, $month, $year ) = ( $1, $2, $3 );

    $month = '01' if $month eq 'JAN'; 
    $month = '02' if $month eq 'FEB'; 
    $month = '03' if $month eq 'MAR'; 
    $month = '04' if $month eq 'APR'; 
    $month = '05' if $month eq 'MAY'; 
    $month = '06' if $month eq 'JUN'; 
    $month = '07' if $month eq 'JUL'; 
    $month = '08' if $month eq 'AUG'; 
    $month = '09' if $month eq 'SEP'; 
    $month = '10' if $month eq 'OCT'; 
    $month = '11' if $month eq 'NOV'; 
    $month = '12' if $month eq 'DEC'; 
    
    return $year . '-' . $month . '-' . $day;
}

>perl t4-5.pl < Q52106.swp

Output:

Q52106  MERT    ACICA   2006-03-07      Mercuric transport protein (Mercury ion
transport protein).     SEQUENCE   116 AA;  12510 MW;  2930A92CF88EB10F CRC64;
116     MSEPQNGRGA LFAGGLAAIL ASACCLGPLV LIALGFSGAW IGNLTVLEPY RPIFIGAALVALFFAWR
RIV RPTAACKPGE VCAIPQVRTT YKLIFWFVAV LVLVALGFPY VMPFFY

This perl script is adopted from the book Bioinformatics, Biocomputing and Perl : an introduction to bioinformatics computing skills and practice, Chichester ; Hoboken, NJ : Wiley, 047085331X (pbk.), 2004