Contents
Match operator
Matches a string on the left of the operator with a pattern enclosed in the two forward slashes.
Syntax
$VAR =~ m/PATTERN/
Notice:
Example: Match operator
while (<>) { chomp $_; if ($_ =~ /GAATTC/) { print "Found an EcoRI site! \n"; $sites++; } elsif ($_ eq '') { last; } }
Example: Match operator
$dna = 'cggcatgcaatattcc'; if ($dna =~ /ca*[gt]at*/) { print $&; }
Substitution operator
Replace the part of the string that matches the specified pattern with supplied string
Syntax
$VAR =~ s/PATTERN/STR/g
Notice:
Example: Transcribe DNA to RNA
#!/usr/bin/perl -w # Transcribing DNA into RNA # The DNA $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; # Print the DNA onto the screen print "Here is the starting DNA:\n\n"; print "$DNA\n\n"; # Transcribe the DNA to RNA by substituting all T's with U's. $RNA = $DNA; $RNA =~ s/T/U/g; # Print the RNA onto the screen print "Here is the result of transcribing the DNA to RNA:\n\n"; print "$RNA\n"; exit;
Transliterate operator
Translate a set of characters into a new set of characters. Each character in the original set is translated into the corresponding character at the same potision in the new set.
Syntax
$VAR =~ tr/CHAR_LIST/NEW_CHAR_LIST/
Notice:
Example: Reverse complement of DNA strand
#!/usr/bin/perl -w # Calculating the reverse complement of a strand of DNA # The DNA $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; print "Here is the starting DNA:\n\n$DNA\n\n"; # Calculate the reverse complement # XXX Warning: this attempt will fail! XXX # # First, copy the DNA into new variable $revcom # (short for REVerse COMplement) # Notice that variable names can use lowercase letters like # "revcom" as well as uppercase like "DNA". # In fact, lowercase is more common. # # It doesn't matter if we first reverse the string and then # do the complementation; or if we first do the complementation # and then reverse the string. Same result each time. # So when we make the copy we'll do the reverse in the same statement. $revcom = reverse $DNA; # Next substitute all bases by their complements, # A->T, T->A, G->C, C->G $revcom =~ s/A/T/g; $revcom =~ s/T/A/g; $revcom =~ s/G/C/g; $revcom =~ s/C/G/g; # Print the reverse complement DNA onto the screen print "Here is the reverse complement DNA:\n\n$revcom\n"; # Our reverse complement should have all the bases in it, since the # original DNA had all the bases-but ours only has A and G! # # Do you see why? # # The problem is that the first two substitute commands above change # all the A's to T's (so there are no A's) and then all the # T's to A's (so all the original A's and T's are all now A's). # Same thing happens to the G's and C's all turning into G's. print "\nThat was a bad algorithm, and the reverse complement was wrong! \n"; print "Try again ... \n\n"; # Make a new copy of the DNA (see why we saved the original?) $revcom = reverse $DNA; # The Perl translate/transliterate command is just what we need: $revcom =~ tr/ACGTacgt/TGCAtgca/; print "Here is the reverse complement DNA:\n\n$revcom\n"; print "\nThis time it worked!\n\n"; exit;
Metacharacter
There two kinds of characters in regular expressions
Escaping
A backslash \ before a metacharacter cause it to match itself
Example: /\\/ matches '\' in the string
Alternation
The pipe | indicates alternation
Example: /ab|cd/ matches 'ab' or 'cd'
RepetitionThe Kleene star * after an item means that the item appears 0 or more times
The + metacharacter after an item means that the item appear 1 or more thems
The ? metacharacter after an item means that the item appears 0 or 1 times
The brace {} specify exactly the number times an item appears
Minimal repetition match
Regular expression with repetition quantifiers (i.e. ?, +, * {MIN, }, {MIN, MAX}) match as many items as possible.
Example:
'abc a abc a abc a' =~ /abc.*a/; print $&;
Output: abc a abc a abc a
Follow each of the repetition quantifiers with a ? to override this behavior to match as few items as possible.
Example:
'abc a abc a abc a' =~ /abc.*?a/; print $&;
Output: abc a
Grouping
The parentheses () provide grouping
Example:
Character Class
The square brackets [] specify a character class, which matches exactly one character as specified.
Example:
Matching Beginning and End
The caret ^ and dollar sign $, when used in regular expression don't match a character; rather then ^ asserts that the item that follows must be at the beginning of the string, $ asserts that the item that precedes it must be at the end of the string (or before the final newline).
Example:
Escaping
A backslash \ before a metacharacter cause it to match itself
Metasymbols
Metasymbols are sequences of two or more characters consisting of backslashes before normal characters. They have special meaning in Perl regular expressions.
Symbol | Meaning |
\r | Carriage return |
\s | Any whitespace (\n, \t, \r) |
\S | Any non-whitespace |
\t | Tab character |
\w | Any "word" character (alphanumeric plus _) |
\W | Any nonword character |
\d | Any digit |
Pattern Modifiers
Pattern modifiers are single-letter commands placed after the forward slashes. They are used to change the behavior of some regular expression features.
Modifier | Meaning |
/i | Ignore uper- or lowercase distinctions |
/s | Let . match newline |
/g | Find all matches, not just the first one |
Example: Find all matches
$dna = 'cggcatgcaatattcc'; print "DNA Sequence: $dna\n"; print "Pattern to match: c.{2}\n";
while ($dna =~ /c.{2}/g) {
print "Matched pattern = ", $&," ending at position ", pos($dna)-1, "\n";
}
Notice: function pos always returns the position just past the match
Capturing Matched Patterns
Retrieve entire matached pattern
Use the special ampersand variable ($&) (see above).
Retrieve partial matached pattern
Use parenthese around parts of the pattern
Example: Capture Matched Pattrens
$dna = 'cggcatgcaatattcc';
if ($dna =~ /(cg*(.*)ca*)(ta)/) {
print "Entire match = ", $&, "\n";
print "First pattern = ", $1, "\n";
print "Second pattern = ", $2, "\n";
print "Third pattern = ", $3, "\n";
}
Output:
Entire match = cggcatgcaata
First pattern = cggcatgcaa
Second pattern = catg
Third pattern = ta
#!/usr/bin/perl -w # Searching for motifs # Ask the user for the filename of the file containing # the protein sequence data, and collect it from the keyboard print "Please type the filename of the protein sequence data: "; $proteinfilename =; # Remove the newline from the protein filename chomp $proteinfilename; # open the file, or exit unless ( open(PROTEINFILE, $proteinfilename) ) { print "Cannot open file \"$proteinfilename\"\n\n"; exit; } # Read the protein sequence data from the file, and store it # into the array variable @protein @protein = <PROTEINFILE> ; # Close the file since we've read all the data into @protein now. close PROTEINFILE; # Put the protein sequence data into a single string, as it's easier # to search for a motif in a string than in an array of # lines (what if the motif occurs over a line break?) $protein = join( '', @protein); # Remove whitespace $protein =~ s/\s//g; # In a loop, ask the user for a motif, search for the motif, # and report if it was found. # Exit if no motif is entered. do { print "Enter a motif (one line) to search for: "; $motif = ; chomp $motif; # Look for the motif if ( $protein =~ /$motif/ ) { print "Found it!\n\n"; } else { print "Not found.\n\n"; } } until ( $motif =~ /^\s*$/ );
Output:
Please type the filename of the protein sequence data: fragment.pep
Enter a motif to search for: SVLQ
Found it!
Enter a motif to search for: JKL
Not found.
Enter a motif to search for:
Not found.
This perl script is adopted from the book Beginning Perl for Bioinformatics, James Tisdall, ISBN, 0-596-00080-4, 2001.
Parse a SWISS-PROT entry (for input into MySQL) (Sample entries: 1, 2, 3, Database Table: proteins)
#!/usr/bin/perl -w # Searching for motifs #! /usr/bin/perl -w # get_proteins - given a list of SWISS-PROT files, extract data # from them in preparation for importation into a database system. # # Note that the results produced are TAB-delimited. use strict; # $table_line - holds the tab-delimited line # $code - holds the protein code # $species - holds the species value my ( $table_line, $code, $species ); # run the statements in the while loop until there are no line arriving from the standard input while ( <> ) { # the current line is assigned to the Perl's default variable $_ # the current is matched against the pattern to look for the ID line type # recall: ? override the behaviours of repetition quantifiers to match as few chars as possible # recall: use parentheses to retrieve parts of the matched pattern if ( /^ID (.+)_(.+?) / ) { ( $code, $species ) = ( $1, $2 ); } # the current is matched against the pattern to look for the AC line type if ( /^AC (.+?);/ ) { $table_line = $1 . "\t" . $code . "\t" . $species . "\t"; # since are only care about the first AC line, we simply disgard the rests of them if present. while ( <> ) { last unless /^AC/; } } # the current is matched against the pattern to look for the LAST DT line type if ( /^DT/ ) { my $date_line = $_; # disgard all the DT lines except the last one while ( <> ) { last unless /^DT/; $date_line = $_; } $date_line =~ /^DT (.+?) /; # use the subroutine "biodb2mysql" to convert the SWISS-PROT" date format into that of MySQL $table_line = $table_line . biodb2mysql( $1 ) . "\t"; } # the current is matched against the pattern to look for ALL the DE line type if ( /^DE (.+)/ ) { my $descr_lines = $1; # find and cancatenate all the DE lines while ( <> ) { last unless /^DE (.+)/; $descr_lines = $descr_lines . ' ' . $1; } $table_line = $table_line . $descr_lines . "\t"; } # the current is matched against the pattern to look for the SQ line type if ( /^SQ (.+)/ ) { my $header = $1; # extract the sequence length from the header $header =~ /(\d+)/; $table_line = $table_line . $header . "\t" . $1 . "\t"; } # the current is matched against the pattern to look for ALL sequence data if ( /^ (.+)/ ) { my $sequence_lines = $1; while ( <> ) { # the square brackets are used as delimiters around the "//" pattern # since they a forward slash character is the default pattern mataching delimiter if ( m[^//] ) { last; } else { /^ (.+)/; $sequence_lines = $sequence_lines . $1; } } $table_line = $table_line . $sequence_lines; } if ( m[^//] ) { print "$table_line\n"; $table_line = ''; } } sub biodb2mysql { # # Given: a date in DD-MMM-YYYY format. # Return: a date in YYYY-MM-DD format. # # Notes: the returned date format is supported by MySQL. # my $original = shift; $original =~ /(\d\d)-(\w\w\w)-(\d\d\d\d)/; my ( $day, $month, $year ) = ( $1, $2, $3 ); $month = '01' if $month eq 'JAN'; $month = '02' if $month eq 'FEB'; $month = '03' if $month eq 'MAR'; $month = '04' if $month eq 'APR'; $month = '05' if $month eq 'MAY'; $month = '06' if $month eq 'JUN'; $month = '07' if $month eq 'JUL'; $month = '08' if $month eq 'AUG'; $month = '09' if $month eq 'SEP'; $month = '10' if $month eq 'OCT'; $month = '11' if $month eq 'NOV'; $month = '12' if $month eq 'DEC'; return $year . '-' . $month . '-' . $day; }
>perl t4-5.pl < Q52106.swp
Output:
Q52106 MERT ACICA 2006-03-07 Mercuric transport protein (Mercury ion transport protein). SEQUENCE 116 AA; 12510 MW; 2930A92CF88EB10F CRC64; 116 MSEPQNGRGA LFAGGLAAIL ASACCLGPLV LIALGFSGAW IGNLTVLEPY RPIFIGAALVALFFAWR RIV RPTAACKPGE VCAIPQVRTT YKLIFWFVAV LVLVALGFPY VMPFFY
This perl script is adopted from the book Bioinformatics, Biocomputing and Perl : an introduction to bioinformatics computing skills and practice, Chichester ; Hoboken, NJ : Wiley, 047085331X (pbk.), 2004