MIE453 - Bioinformatics Systems (Fall 06)

Tutorial 2 - Arrays & Hashes

Variables & Operators
Arrays
Hashes

1. Variables & Operators

Comment

A comment in a Perl script begins with a # and continues to the end of the same line
It is ignored by the Perl interpreter and is there only for programmers to read
A comment can include any text

Variables

Kinds of variables

Scalar: capable of holding a single data item
- Numerical values (integers, e.g., 12, and floating point numbers, e.g., 2.05)
- Strings (any sequence of characters)
Aggregate: capable of holding collections of values
- Arrays (i.e., lists)
- Hashes (i.e., dictionaries)

Variable declaration

Each kind of variable is identified by a special character preceding the variable name
- $ - for scalars
- @ - for arrays
- % - for hashes
Variable names begin with a letter or underscore and can contain any number of letters, underscores or digits
- Variable names cannot start with a digit
Variable names are case sensitive
- $dna and $DNA are two different variables
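
As a minimal sketch of the three kinds of variable and their leading characters: the names and values below are made up for illustration, and the use of `use strict` with `my` is an addition that the tutorial's own scripts do not rely on.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $ marks a scalar: one number or one string
my $gene_count = 3;
my $dna        = 'ACGT';

# @ marks an array: an ordered list of scalars
my @fragments = ('AGT', 'CGG', 'GGCGGA');

# % marks a hash: a set of key-value pairs
my %lengths = ('AGT' => 3, 'CGG' => 3, 'GGCGGA' => 6);

# Names are case sensitive: $dna and $DNA are different variables
my $DNA = 'TTTT';

print "scalars: $gene_count, $dna and $DNA\n";
print "array:   @fragments\n";
print "hash:    GGCGGA has length $lengths{'GGCGGA'}\n";
```

The same name can be reused with different leading characters (for example $dna and @dna), since scalars, arrays and hashes are kept apart.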

Values

String values are delimited using either single quotation marks (e.g., 'this is a string') or double quotation marks (e.g., "this is a string"); variables are interpolated only inside double quotation marks
Boolean values
- Perl evaluates a value as False if
  - it is equal to zero (e.g. $x=0)
  - it is an empty string (e.g., $x='') or the string '0'
  - it is an empty array or hash
  - it is undefined
- Everything else is True
- Examples
  - 1 == 2 evaluates to FALSE
  - 1 != 2 evaluates to TRUE
  - 'ATG' eq 'GAT' evaluates to FALSE
  - (1+1) <= (1+2) evaluates to TRUE
  - (1-1) evaluates to FALSE
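
The rules above can be checked with a short script. This is only a sketch: the helper `truth` is a hypothetical name introduced here, and the case of the string '0' (which Perl also treats as false) is added for completeness.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: report how Perl treats a value in a Boolean test
sub truth {
    my ($value) = @_;
    return $value ? 'TRUE' : 'FALSE';
}

my @empty = ();

print "0           is ", truth(0),             "\n";   # FALSE
print "''          is ", truth(''),            "\n";   # FALSE
print "'0'         is ", truth('0'),           "\n";   # FALSE
print "undef       is ", truth(undef),         "\n";   # FALSE
print "empty array is ", truth(scalar @empty), "\n";   # FALSE: zero elements
print "'ATG'       is ", truth('ATG'),         "\n";   # TRUE
print "'0.0'       is ", truth('0.0'),         "\n";   # TRUE: a non-empty string other than '0'
print "1 != 2      is ", truth(1 != 2),        "\n";   # TRUE
print "(1-1)       is ", truth(1 - 1),         "\n";   # FALSE
```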

Variable Typing

Perl is weakly typed
- e.g., a scalar can hold an integer at one time and a string at another time
- e.g., an array can be evaluated in different contexts (scalar or list)
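
A small sketch of both points, using made-up values: one scalar is reused for a number and then a string, and one array is read in scalar and in list context.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The same scalar can hold a number, then a string
my $x = 12;
$x = $x + 1;          # numeric use: 13
$x = "codon $x";      # now a string: "codon 13"
print "$x\n";

# A numeric-looking string is converted when used as a number
my $sum = '2.05' + 1;
print "$sum\n";

# The same array gives different results depending on context
my @bases = ('A', 'C', 'G', 'T');
my $count   = @bases;   # scalar context: number of elements
my ($first) = @bases;   # list context: first element
print "$count $first\n";
```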

Example: Print a scalar

#!/usr/bin/perl -w
# Storing DNA in a variable, and printing it out

# First we store the DNA in a variable called $DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Next, we print the DNA onto the screen
print $DNA;

# Finally, we'll specifically tell the program to exit.
exit;

Example: Concatenate a scalar

#!/usr/bin/perl -w
# Concatenating DNA

# Store two DNA fragments into two variables called $DNA1 and $DNA2
$DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
$DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA';

# Print the DNA onto the screen
print "Here are the original two DNA fragments:\n\n";

print $DNA1, "\n";

print $DNA2, "\n\n";

# Concatenate the DNA fragments into a third variable and print them
# Using "string interpolation"
$DNA3 = "$DNA1$DNA2";

print "Here is the concatenation of the first two fragments (version 1):\n\n";

print "$DNA3\n\n";

# An alternative way using the "dot operator":
# Concatenate the DNA fragments into a third variable and print them
$DNA3 = $DNA1 . $DNA2;

print "Here is the concatenation of the first two fragments (version 2):\n\n";

print "$DNA3\n\n";

# Print the same thing without using the variable $DNA3
print "Here is the concatenation of the first two fragments (version 3):\n\n";

print $DNA1, $DNA2, "\n";

exit;

Operators

Arithmetic Operators

Numeric Comparisons

String Comparisons
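
As a brief sketch of the three operator groups named above, the script below exercises Perl's standard arithmetic, numeric comparison and string comparison operators. The values are arbitrary and chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my ($p, $q) = (7, 2);

# Arithmetic operators: +  -  *  /  %  **
print "sum:        ", $p + $q,  "\n";   # 9
print "difference: ", $p - $q,  "\n";   # 5
print "product:    ", $p * $q,  "\n";   # 14
print "quotient:   ", $p / $q,  "\n";   # 3.5
print "remainder:  ", $p % $q,  "\n";   # 1
print "power:      ", $p ** $q, "\n";   # 49

# Numeric comparisons: ==  !=  <  >  <=  >=  <=>
print "7 is greater than 2\n" if $p > $q;
print "7 <=> 2 gives ", $p <=> $q, "\n";       # 1

# String comparisons: eq  ne  lt  gt  le  ge  cmp
my ($s, $t) = ('ATG', 'GAT');
print "ATG and GAT differ\n"   if $s ne $t;
print "ATG sorts before GAT\n" if $s lt $t;
print "ATG cmp GAT gives ", $s cmp $t, "\n";   # -1
```

Using == on two strings compares their numeric values rather than their text, so eq and ne are the safer choice for DNA sequences.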

2. Arrays

Arrays are ordered collections of zero or more scalar values, indexed by position.

Array assignment

using parentheses
- e.g. @my_array = ()
- e.g. @dna_fragments = ('AGT','CGG', 'GGCGGA')

Accessing array elements

using square brackets
array index starts at 0
An individual array element is referenced by a $ instead of a @ at the beginning of the array name
- e.g. $dna = $dna_fragments[0]
- e.g. $dna_fragments[0] = 'GGT'
- e.g. $dna = $dna_fragments[3] # No such element: $dna is undefined (not a fatal error)
- e.g. @new_dna = @dna_fragments[0,2]
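
The access forms above can be tried in one short script. This is a sketch only; the negative index and `$#dna_fragments` are standard Perl features added here for illustration and are not covered in the list above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @dna_fragments = ('AGT', 'CGG', 'GGCGGA');

my $first = $dna_fragments[0];     # 'AGT': indexing starts at 0
my $last  = $dna_fragments[-1];    # 'GGCGGA': negative indices count from the end
my @ends  = @dna_fragments[0, 2];  # slice: ('AGT', 'GGCGGA')

# Reading past the end gives undef rather than stopping the program
my $missing = $dna_fragments[3];
print "index 3 is undefined\n" unless defined $missing;

# $#dna_fragments is the index of the last element
print "last index: $#dna_fragments\n";
print "first: $first, last: $last, slice: @ends\n";
```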

Array copy (using assignment operator)

The assignment copies the contents, not a pointer: the two arrays are independent afterwards

example: array copy

#!/usr/bin/perl -w
# Array copies

# Initialize two arrays with same content
@array_1 = (1, 2, 3);
@array_2 = @array_1;

print "--- Initial values of two arrays ---\n";
print "array 1 is: @array_1\n";
print "array 2 is: @array_2\n";

# Modify the first array
$array_1[0] = 3;

print "--- New values of two arrays ---\n";
print "array 1 is: @array_1\n";
print "array 2 is: @array_2\n";

exit;

Scalar vs List context

if an array is evaluated in a scalar context, the value is the number of elements in the array

example: scalar and list context

#!/usr/bin/perl -w
# Demonstration of "scalar context" and "list context"

@bases = ('A', 'C', 'G', 'T');

print "@bases\n";

$a = @bases;

print $a, "\n";

($a) = @bases;

print $a, "\n";

exit;

Array operators

shift: take an element off the start of an array

#!/usr/bin/perl -w

@bases = ('A', 'C', 'G', 'T');
$base1 = shift @bases;
print "@bases";

output: ?

unshift: put an element at the beginning of an array

#!/usr/bin/perl -w

@bases = ('A', 'C', 'G', 'T');
unshift(@bases, 'U');
print "@bases";

output: ?

pop: take an element off the end of an array

#!/usr/bin/perl -w

@bases = ('A', 'C', 'G', 'T');
$base1 = pop @bases;
print "@bases";

output: ?

push: put an element at the end of an array

#!/usr/bin/perl -w

@bases = ('A', 'C', 'G', 'T');
push(@bases, 'U');
print "@bases";

output: ?

reverse: reverse an array

#!/usr/bin/perl -w

@bases = ('A', 'C', 'G', 'T');
@reverse = reverse @bases;
print "@reverse";

output: ?

scalar: get the length of an array

#!/usr/bin/perl -w

@bases = ('A', 'C', 'G', 'T');
$len = scalar @bases;
print $len;

output: ?

splice: insert an element at an arbitrary place in an array

#!/usr/bin/perl -w

@bases = ('A', 'C', 'G', 'T');
splice (@bases, 2, 0, 'X');
print "@bases";

output: ?

how about splice (@bases, 2, 1, 'X');

split: split a string into an array

#!/usr/bin/perl -w

$bases = 'ACGT';
@bases=split('', $bases);
print "@bases";

output: ?

join: join an array into a string

#!/usr/bin/perl -w

@bases = ('A', 'C', 'G', 'T');
$bases=join('', @bases);
print $bases;

output: ?

sort: sort an array

#!/usr/bin/perl -w

@array = ('a', 'b', 'C',3, 1);
@sorted = sort (@array);
print "@sorted";

output: ?

A bigger example: Reverse complement of DNA strand

#!/usr/bin/perl -w

#
# Calculating the reverse complement of a strand of DNA using string
# 
# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; 

print "Here is the starting DNA:\n\n$DNA\n\n";

# Calculate the reverse 
$revcom1 = reverse $DNA;

# Calculate the complement 
$revcom1 =~ tr/ACGTacgt/TGCAtgca/;

print "Here is the reverse complement DNA using STRING:\n\n$revcom1\n\n"; 

#
# Calculating the reverse complement of a strand of DNA using array
# 

# Split the DNA string into an array of characters
@DNA = split('', $DNA);

# Calculate the reverse 
@reverse = reverse @DNA;

# Join the array of characters of the reverse
$revcom2 = join('', @reverse);

# Calculate the complement 
$revcom2 =~ tr/ACGTacgt/TGCAtgca/;

print "Here is the reverse complement DNA using ARRAY:\n\n$revcom2\n";

3. Hashes

A hash (also called an associative array) is a collection of zero or more pairs of scalar values, called keys and values

The values are indexed by the keys
Think of a dictionary: the keys are the words, and the values are the definitions of the words

Hash assignment

using parentheses
each pair corresponds to a key-value pair in the hash
- e.g. %genes = (
  'gene1', 'AACCCGGTTGGTT',
  'gene2', 'CCTTTDGGAAGGTC'
  );
a more intuitive way
- e.g. %genes = (
  'gene1' => 'AACCCGGTTGGTT',
  'gene2'=>'CCTTTDGGAAGGTC'
  );

Accessing Hash elements

using curly braces
Single hash value is referenced by a $ instead of a % at the beginning of the hash name
- e.g. $genes{'gene1'}
- e.g. $term{'bioinformatics'}='the use of computers to extract and analyze biological data';
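
A short sketch of reading, adding and testing hash elements. The key 'gene3', its sequence and the missing key 'gene4' are hypothetical, and `exists` and `foreach` are standard Perl features used here for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %genes = (
    'gene1' => 'AACCCGGTTGGTT',
    'gene2' => 'CCTTTDGGAAGGTC',
);

# Read one value by its key
print "gene1 is $genes{'gene1'}\n";

# Assigning to a new key adds a pair; assigning to an old key overwrites it
$genes{'gene3'} = 'ATGATG';

# exists tests for a key without creating it
print "gene4 is not stored\n" unless exists $genes{'gene4'};

# A hash has no guaranteed order, so sort the keys for repeatable output
foreach my $name (sort keys %genes) {
    print "$name => $genes{$name}\n";
}
```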

Hash operators

keys: return a list of keys in a hash
values: return a list of values in a hash

#!/usr/bin/perl -w

%genes = (
           'gene1' => 'AACCCGGTTGGTT', 
           'gene2'=>'CCTTTDGGAAGGTC'
);

            
@keys = keys %genes;
@values = values %genes;

print "Keys are: @keys\n";
print "Values are: @values";

output: ?

reverse: map the values to keys

#!/usr/bin/perl -w

%genes = (
           'gene1' => 'AACCCGGTTGGTT', 
           'gene2'=>'CCTTTDGGAAGGTC'
);

%rev_genes = reverse %genes;            

@keys = keys %rev_genes;
@values = values %rev_genes;

print "Keys are: @keys\n";
print "Values are: @values";

output: ?

what if there are duplicates in the values?
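
One way to explore this question is the sketch below, which uses hypothetical data in which two genes share a sequence. Because hash keys must be unique, the reversed hash ends up with fewer pairs than the original.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two genes share one sequence (hypothetical data)
my %genes = (
    'gene1' => 'AACCCGGTTGGTT',
    'gene2' => 'AACCCGGTTGGTT',
    'gene3' => 'ATGATG',
);

my %rev_genes = reverse %genes;

# keys in scalar context gives the number of pairs
my $before = keys %genes;
my $after  = keys %rev_genes;
print "pairs before: $before, after: $after\n";

# Which of gene1 or gene2 survives depends on the hash's internal order
print "AACCCGGTTGGTT maps to $rev_genes{'AACCCGGTTGGTT'}\n";
```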

delete: remove elements from a hash

#!/usr/bin/perl -w

%genes = (
           'gene1' => 'AACCCGGTTGGTT', 
           'gene2'=>'CCTTTDGGAAGGTC'
);

delete $genes{'gene1'};            

@keys = keys %genes;
@values = values %genes;

print "Keys are: @keys\n";
print "Values are: @values";

output: ?

Example: restriction enzyme hash

#!/usr/bin/perl -w
# Restriction enzymes are proteins that cut DNA at short, specific sequences
# e.g., EcoRI cuts where it finds GAATTC, between G and A
#
# Initialize restriction enzyme hash
# keys are the names of restriction enzymes, values are the DNA sequences they cut

%re_lookup = (
          'Eco47III'=> 'AGCGCT',
          'EcoRI'   => 'GAATTC',
          'HindIII' => 'AAGCTT',
);

print "Enter restriction enzyme name\n";
$re=<STDIN>;
chomp $re;
$seq = $re_lookup{$re};
if (defined($seq)) {
    print "RE sequence for $re is: $seq\n";
}
else {
    print "Sorry, I don't know about \"$re\"\n";
}

Example: Genetic code

#
# codon2aa
#
# A subroutine to translate a DNA 3-character codon to an amino acid
#   Version 3, using hash lookup

sub codon2aa {
    my($codon) = @_;

    $codon = uc $codon;
 
    my(%genetic_code) = (
    
    'TCA' => 'S',    # Serine
    'TCC' => 'S',    # Serine
    'TCG' => 'S',    # Serine
    'TCT' => 'S',    # Serine
    'TTC' => 'F',    # Phenylalanine
    'TTT' => 'F',    # Phenylalanine
    'TTA' => 'L',    # Leucine
    'TTG' => 'L',    # Leucine
    'TAC' => 'Y',    # Tyrosine
    'TAT' => 'Y',    # Tyrosine
    'TAA' => '_',    # Stop
    'TAG' => '_',    # Stop
    'TGC' => 'C',    # Cysteine
    'TGT' => 'C',    # Cysteine
    'TGA' => '_',    # Stop
    'TGG' => 'W',    # Tryptophan
    'CTA' => 'L',    # Leucine
    'CTC' => 'L',    # Leucine
    'CTG' => 'L',    # Leucine
    'CTT' => 'L',    # Leucine
    'CCA' => 'P',    # Proline
    'CCC' => 'P',    # Proline
    'CCG' => 'P',    # Proline
    'CCT' => 'P',    # Proline
    'CAC' => 'H',    # Histidine
    'CAT' => 'H',    # Histidine
    'CAA' => 'Q',    # Glutamine
    'CAG' => 'Q',    # Glutamine
    'CGA' => 'R',    # Arginine
    'CGC' => 'R',    # Arginine
    'CGG' => 'R',    # Arginine
    'CGT' => 'R',    # Arginine
    'ATA' => 'I',    # Isoleucine
    'ATC' => 'I',    # Isoleucine
    'ATT' => 'I',    # Isoleucine
    'ATG' => 'M',    # Methionine
    'ACA' => 'T',    # Threonine
    'ACC' => 'T',    # Threonine
    'ACG' => 'T',    # Threonine
    'ACT' => 'T',    # Threonine
    'AAC' => 'N',    # Asparagine
    'AAT' => 'N',    # Asparagine
    'AAA' => 'K',    # Lysine
    'AAG' => 'K',    # Lysine
    'AGC' => 'S',    # Serine
    'AGT' => 'S',    # Serine
    'AGA' => 'R',    # Arginine
    'AGG' => 'R',    # Arginine
    'GTA' => 'V',    # Valine
    'GTC' => 'V',    # Valine
    'GTG' => 'V',    # Valine
    'GTT' => 'V',    # Valine
    'GCA' => 'A',    # Alanine
    'GCC' => 'A',    # Alanine
    'GCG' => 'A',    # Alanine
    'GCT' => 'A',    # Alanine
    'GAC' => 'D',    # Aspartic Acid
    'GAT' => 'D',    # Aspartic Acid
    'GAA' => 'E',    # Glutamic Acid
    'GAG' => 'E',    # Glutamic Acid
    'GGA' => 'G',    # Glycine
    'GGC' => 'G',    # Glycine
    'GGG' => 'G',    # Glycine
    'GGT' => 'G',    # Glycine
    );

    if(exists $genetic_code{$codon}) {
        return $genetic_code{$codon};
    }else{
        print STDERR "Bad codon \"$codon\"!!\n";
        exit;
    }
}

# dna2peptide
#
# A subroutine to translate DNA sequence into a peptide

sub dna2peptide {

    my($dna) = @_;

    use strict;
    use warnings;

    # Initialize variables
    my $protein = '';

    # Translate each three-base codon to an amino acid, and append to a protein 
    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        $protein .= codon2aa(substr($dna,$i,3) );
    }

    return $protein;
}

print "Please enter your dna sequence:\n";
$dna = <STDIN>;
chomp $dna;    # remove the trailing newline so it is not read as part of a codon
$peptide = dna2peptide($dna);
print "Here is the translated protein sequence: $peptide\n";

exit;

How would you modify the above code to accommodate the six reading frames?
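
One possible approach, offered as a sketch rather than the intended answer: translate frames 1 to 3 of the sequence, then frames 1 to 3 of its reverse complement. The subroutine names `revcom`, `translate_frame` and `six_frames` are hypothetical. To keep the block short, the genetic code is built from a 64-letter string instead of being typed out as in codon2aa, and unknown codons become 'X' instead of ending the program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the standard genetic code compactly: codons in TCAG order,
# with '_' for stop as in codon2aa
my @bases = ('T', 'C', 'A', 'G');
my @amino = split('', 'FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG');
my %genetic_code;
foreach my $b1 (@bases) {
    foreach my $b2 (@bases) {
        foreach my $b3 (@bases) {
            $genetic_code{"$b1$b2$b3"} = shift @amino;
        }
    }
}

# Reverse complement of a DNA string
sub revcom {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Translate one reading frame; $frame is 1, 2 or 3.
# Unknown codons become 'X' (an assumption made for this sketch).
sub translate_frame {
    my ($dna, $frame) = @_;
    $dna = uc $dna;
    my $protein = '';
    for (my $i = $frame - 1; $i <= length($dna) - 3; $i += 3) {
        my $codon = substr($dna, $i, 3);
        $protein .= exists $genetic_code{$codon} ? $genetic_code{$codon} : 'X';
    }
    return $protein;
}

# Return six translations: frames 1-3 forward, then 1-3 of the reverse complement
sub six_frames {
    my ($dna) = @_;
    my @proteins;
    foreach my $strand ($dna, revcom($dna)) {
        foreach my $frame (1 .. 3) {
            push @proteins, translate_frame($strand, $frame);
        }
    }
    return @proteins;
}

my @frames = six_frames('ATGGCC');
print join("\n", @frames), "\n";
```

Starting each frame at offset 0, 1 or 2 and stopping when fewer than three bases remain means incomplete trailing codons are simply skipped.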

Some examples and Perl scripts are adapted from the book Beginning Perl for Bioinformatics, James Tisdall, 2001, ISBN 0-596-00080-4.
