RNAcontext is a motif model that predicts the sequence and structure preferences of RNA-binding
proteins. RNAcontext requires two input files:
- A set of sequences together with their estimated binding affinities
- RNA secondary structure annotations of the sequences, estimated using SFOLD (details are below).

-----------------------------------------------------------------------------
                           Publications             
-----------------------------------------------------------------------------
RNAcontext: a new method for learning the sequence and structure binding preferences of RNA-binding proteins
Hilal Kazan, Debashish Ray, Esther T Chan, Timothy R Hughes, Quaid Morris
PLoS Comput Biol 6(7): e1000832. doi:10.1371/journal.pcbi.1000832

Rapid and systematic analysis of the RNA recognition specificities of RNA-binding proteins
Debashish Ray and Hilal Kazan et al. 
Nature Biotechnology 27, 667-670 (2009)

-----------------------------------------------------------------------------
                           Install             
-----------------------------------------------------------------------------

The program should run under any Linux distribution; g++ is required. To build it, type:

 make

and run using the options described below.

-----------------------------------------------------------------------------
                           How to generate the annotation profiles?             
-----------------------------------------------------------------------------

SFOLD is required; you can download it from http://sfold.wadsworth.org/SFOLD-EXE-ACADEMIC.html

Please follow the guidelines provided in the SFOLD package to install the software.

By default, SFOLD generates the probability profile for a single sequence. With the helper code that
we provide (in helper_code_for_SFOLD), you can generate profiles for a set of sequences in a FASTA file.

How to compile:
First, copy the files (sfold_helper.cpp and sprofile.cpp) from the helper_code_for_SFOLD directory to the
SFOLD directory. Then compile them there:

g++ sfold_helper.cpp -o sfold_helper
g++ sprofile.cpp -o sprofile

Requirements:
- Create folders named data and out under the SFOLD directory.
- Your FASTA file should be in the SFOLD directory.
- The sfold_helper and sprofile executables should be in the SFOLD directory.

How to run:

./sfold_helper <input_file_name> <output_file_name>

Example:
./sfold_helper sequences.fasta  sequences_profiles.txt 
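
The requirements and the run command above can also be scripted. The following Python sketch is only
an illustration: the function names and the sfold_dir argument are hypothetical, and it assumes that
sfold_helper and sprofile have already been compiled in the SFOLD directory.

```python
import subprocess
from pathlib import Path


def prepare_sfold_dirs(sfold_dir):
    """Create the 'data' and 'out' folders required under the SFOLD directory."""
    root = Path(sfold_dir)
    for name in ("data", "out"):
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def build_helper_command(fasta_name, profile_name):
    """Return the sfold_helper command line as an argument list."""
    return ["./sfold_helper", fasta_name, profile_name]


def run_sfold_helper(sfold_dir, fasta_name, profile_name):
    """Run sfold_helper from inside the SFOLD directory (hypothetical wrapper)."""
    root = prepare_sfold_dirs(sfold_dir)
    # The FASTA file is expected to be in the SFOLD directory itself.
    if not (root / fasta_name).is_file():
        raise FileNotFoundError(f"{fasta_name} must be placed in {root}")
    subprocess.run(build_helper_command(fasta_name, profile_name),
                   cwd=root, check=True)
```

For instance, run_sfold_helper("/path/to/sfold", "sequences.fasta", "sequences_profiles.txt") would
correspond to the example command above; the SFOLD path here is a placeholder.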


Note: For long sequences, RNAplfold is more suitable for generating annotation profiles. You can download the scripts
and helper code from the RNAcontext website.


-----------------------------------------------------------------------------
                           Usage        
-----------------------------------------------------------------------------
Description of options

    -a    <alphabet> (default ACGU)
          Sets the sequence alphabet, e.g. ACGT if you are using DNA sequences and ACGU for RNA sequences.

    -e    <annotation alphabet> (default PLMU)
          Sets the annotation alphabet. For instance, if the structure profile file has two rows for each sequence,
          for paired and unpaired contexts respectively, then you should use a two-letter alphabet. You can choose
          the letters for your convenience, e.g. PU.

    -w    <motifwidth range> (default 4-10)
          Controls the range of motif widths. For example, if you run the program with -w 4-7, the algorithm searches
          for motifs of width 4 through width 7. It is highly recommended that you start from a small motif width even
          when a longer motif is suitable, because the initialization procedure uses the models previously learned for
          smaller motif widths to initialize longer ones. For example, if you would like to run RNAcontext with motif
          width 8, you should run it with -w 4-8 and not -w 8-8.
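
          The recommendation above amounts to always starting the range at a small width. A minimal Python sketch
          (the helper names are hypothetical, and the start width of 4 simply mirrors the default range) that expands
          a -w value and builds the recommended range for a target width:

```python
def expand_width_range(spec):
    """Expand a -w value such as '4-7' into the list of widths it covers."""
    start_text, end_text = spec.split("-")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"invalid width range: {spec}")
    return list(range(start, end + 1))


def recommended_range(target_width, start_width=4):
    """Build a -w value that starts at a small width, as advised above."""
    return f"{min(start_width, target_width)}-{target_width}"


print(expand_width_range("4-7"))  # [4, 5, 6, 7]
print(recommended_range(8))       # 4-8
```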

    -c    <training input filename>
          The name of the input file which contains the sequences of interest with corresponding intensities. RNAcontext
          will use these sequences to find the motif model which explains the data best. 

          Each line of the input file should contain an intensity and a sequence separated by a tab:

          intensity <TAB> sequence

          0.34	AGCGAGUCGAGAGCUCUUAGAGGCUAUAUAUGCGAG
          -1.45	GGAGAGCGGAGAUCUUCUAGAGCUUAGAGGCGAGAGAG

          If only binary information (i.e. bound or unbound) is available for the data, please use an intensity of 1
          for bound sequences and -1 for unbound sequences.
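
          As an illustration of the format described above, the following Python sketch writes such a file. The
          function names are hypothetical and not part of RNAcontext; the only assumptions are the tab-delimited
          layout and the 1 / -1 convention for binary data.

```python
def format_input_lines(records):
    """Format (intensity, sequence) pairs as 'intensity<TAB>sequence' lines."""
    return [f"{intensity}\t{sequence}" for intensity, sequence in records]


def binary_records(bound, unbound):
    """Assign intensity 1 to bound sequences and -1 to unbound sequences."""
    return [(1, seq) for seq in bound] + [(-1, seq) for seq in unbound]


def write_input_file(path, records):
    """Write one tab-delimited record per line."""
    with open(path, "w") as handle:
        for line in format_input_lines(records):
            handle.write(line + "\n")
```

          For example, write_input_file("training.txt", binary_records(bound_seqs, unbound_seqs)) would produce a
          file suitable for -c, where bound_seqs and unbound_seqs are lists of sequence strings.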
	
    -d 	  <test input filename>
	  The name of the input file which contains the test sequences. The motif model learned from the training sequences
          will be used to score test sequences.
	 
          The format of the input file should be:
          intensity <TAB> sequence

    -h 	  <annotation profile for training sequences>
	  The name of the file which contains annotation profiles for the training sequences

    -n 	  <annotation profile for test sequences>
	  The name of the file which contains annotation profiles for the test sequences

    -o 	  <output filename key>
          Several output files with filenames containing <output filename key> are generated under the ./outputs directory.
 

      	  - model_<output filename key>_<motifwidth>.txt   e.g. model_VTS1_4.txt

             This file contains the training error, the number of fitted parameters, the predicted base parameters,
             annotation parameters, bias parameters, and the scaling factor and intercept of the least squares optimization.

             The four lines following "Base Parameters" give the predicted sequence preference for each base (rows)
             at each position (columns).
       
	  - params_<output filename key>.txt

            In short, this file contains the PWMs and relative structural context affinities for each motif width.

            If you would like to plot logos to show sequence preference, you can use the matrix following the
            "Base parameters" line. We recommend that you use the enologos software
            (http://www.benoslab.pitt.edu/cgi-bin/enologos/enologos.cgi) with weight type "energies" to plot logos.
            To see the relative affinities for each structural context (these are plotted in Figure 4 of the
            PLoS Comput Biol paper), you can use the values provided for each structural context.

	 - train_<output filename key>_<motifwidth>.txt

            This is a tab-delimited file where, on each line, the first entry is the experimentally determined affinity
            of a sequence in the training set and the second entry is the affinity (score) predicted by RNAcontext for that sequence.

         - test_<output filename key>_<motifwidth>.txt
	    
            This is a tab-delimited file where, on each line, the first entry is the experimentally determined affinity
            of a sequence in the test set and the second entry is the affinity (score) predicted by RNAcontext for that sequence.
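
          Because the train_ and test_ files pair measured and predicted affinities line by line, they are easy to
          summarize with a script. The Python sketch below reads such a file and computes the Pearson correlation
          between the two columns. Pearson correlation is just one possible summary chosen for this illustration,
          and the function names are hypothetical.

```python
import math


def read_predictions(path):
    """Read (measured, predicted) pairs from a train_/test_ output file."""
    pairs = []
    with open(path) as handle:
        for line in handle:
            fields = line.split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            pairs.append((float(fields[0]), float(fields[1])))
    return pairs


def pearson(pairs):
    """Pearson correlation between measured and predicted affinities."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two (measured, predicted) pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant values")
    return cov / math.sqrt(var_x * var_y)
```

          For instance, with output key VTS1_demo and motif width 5, the filename pattern above gives
          pearson(read_predictions("outputs/test_VTS1_demo_5.txt")).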


    -s    <number of initializations or restarts> (default 5)
          It is useful to set -s to at least 3.


Example Run 

 ./bin/rnacontext -w 4-5 -a ACGU -e PLMU -s 3 -c VTS1_training_sequences.txt -h VTS1_training_annotations.txt -d VTS1_test_sequences.txt -n VTS1_test_annotations.txt -o VTS1_demo
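
The same run can be launched from a script. The Python sketch below assembles the command from the options
described above. The function names are hypothetical; the default values mirror the defaults listed in the
option descriptions, and the binary path is taken from the example run.

```python
import subprocess


def build_rnacontext_command(key, train_seqs, train_annots, test_seqs, test_annots,
                             widths="4-10", alphabet="ACGU",
                             annotation_alphabet="PLMU", restarts=5,
                             binary="./bin/rnacontext"):
    """Assemble the rnacontext argument list from the documented options."""
    return [binary,
            "-w", widths,
            "-a", alphabet,
            "-e", annotation_alphabet,
            "-s", str(restarts),
            "-c", train_seqs, "-h", train_annots,
            "-d", test_seqs, "-n", test_annots,
            "-o", key]


def run_rnacontext(command):
    """Launch RNAcontext and raise an error if it exits with a failure code."""
    subprocess.run(command, check=True)


command = build_rnacontext_command(
    "VTS1_demo",
    "VTS1_training_sequences.txt", "VTS1_training_annotations.txt",
    "VTS1_test_sequences.txt", "VTS1_test_annotations.txt",
    widths="4-5", restarts=3)
print(" ".join(command))
```

Calling run_rnacontext(command) would then start the program from the directory that contains ./bin.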