This repository contains an implementation of Profile HMMs (hidden Markov models) intended for aligning multiple words together as described in our paper, Multiple Word Alignment with Profile Hidden Markov Models. Profile HMMs are used in biological sequence analysis, specifically for the alignment of sequences and the testing of sequences for membership in a family. At the time of implementation, the existing Profile HMM libraries (SAM and HMMER) were too specific to biological sequences and difficult to adapt to word-related tasks.
The tools and the library are licensed under the GPL for non-commercial use only. If you use it in your research, please cite the corresponding paper:
- Aditya Bhargava and Grzegorz Kondrak. 2009. Multiple Word Alignment with Profile Hidden Markov Models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium, pages 43–48, Boulder, Colorado. Association for Computational Linguistics.
Only Linux is supported.1 Other than the usual build dependencies (like g++
, make
, etc.), you will need:
Most (if not all) distributions provide packages for these; you'll have to consult your distribution's package repository to determine what they're called. When installing the prerequisites, make sure you include the development header packages as well as the runtime libraries (most, but I think not all, distributions separate the two).
Given that the code was written around 2008, it's somewhat miraculous that it still compiles with just a few warnings (that I am choosing to ignore). Simply check out the code from this repository and then run make
from within the phmmtools
subdirectory.2
The code is split into a Profile HMM library and a pair of CLI (command-line interface) tools.
The library, libprofilehmm
, provides all of the necessary functions to create and train a Profile HMM, as well as align or score words to a model. The library is undocumented, though, so the best way to make use of the library by itself is to take a look at the source code for the CLI tools.
We tested Profile HMMs for applicability to multiple word alignment and cognate set matching (I refer to "cognate sets" as "clusters" here). There are thus two programs:
multialign
for multiple alignment; andfindcluster
to match a word to a cluster.
For either tool, call the program with the -h
argument to get a list of options:
$ ./multialign -h
Allowed options:
Generic options:
-h [ --help ] Prints help message.
Configuration options:
-o [ --alignments ] arg (=alignments.txt)
The file to which to write the
alignments. Note that the file will be
overwritten if it exists.
-P [ --bw-epsilon ] arg (=1.0000000000000001e-05)
The value within which log-likelihood
scores must converge for Baum-Welch.
-M [ --bw-max-iterations ] arg (=0) The maximum number of iterations for
Baum-Welch. A value of zero (0) means
no maximum and Baum-Welch is performed
to convergence.
-R [ --bw-pc-during ] arg (=1) Specify whether to add pseudocounts
during each Baum-Welch iteration (true)
or after it is finished (false).
-c [ --clusters ] arg (=clusters.txt) The file from which to read the
clusters. The clusters can have
alignment data; this is ignored unless
using --prealigned=1.
-x [ --cpmatrix ] arg (=cp.txt) The file from which to read the
conditional probabilities for a
substitution-matrix-based pseudocount
method. A 26x26 matrix is read in,
where matrix[a][b] = P(a|b). In order
to use this, we must have
--pc-method=2.
-F [ --favour-m ] arg (=1) Specify whether to initailise the
profile HMM with a favouring of the
transitions to match states. Does
nothing if --prealigned=1.
-q [ --frequencies ] arg (=q.txt) The file containing the character
frequencies.
-Q [ --freq-scale ] arg (=3500) The scaling factor for the frequencies
which are used as parameters for a
Dirichlet distribution when
initializing a random model.
-i [ --inserts ] arg (=1) Specify whether to construct the
profile HMM from the clusters using
both match and insert states (true) or
using match states only (false) when
building using --prealigned=1.
-m [ --pc-method ] arg (=2) Specify the pseudocount method used.
Options are 0 for a constant being
added to each value (set the weight to
the alphabet size for Laplace's rule),
1 for a background-frequency-based
pseudocount, and 2 for a
substitution-matrix-based pseudocount.
Option 2 requires the --cpmatrix
parameter to be set.
-w [ --pc-weight ] arg (=5) The weight used for the emission-state
pseudocounts.
-p [ --prealigned ] arg (=0) Specify whether to initialize the
profile HMM using pre-existing
alignment information, and then perform
Baum-Welch to optimize the model and
generate new alignments.
-S [ --seed ] arg (=1678045268) The seed used for the random number
generator. Useful only if
--prealigned=0. If no value is
specified, a call to std::time(NULL) is
used.
-a [ --transition-a ] arg (=5) The alpha-values used for a uniform
Dirichlet distribution from which the
initial transition probabilities are
sampled.
Any and all parameters can be specified via a multialign.cfg file. Use the following format:
#Anything after a # on a line is not processed (i.e. is a comment)
param=value #e.g. prealigned=1
$ ./findcluster -h
Allowed options:
Generic options:
-h [ --help ] Prints help message.
Configuration options:
-P [ --bw-epsilon ] arg (=1.0000000000000001e-05)
The value within which log-likelihood
scores must converge for Baum-Welch.
-M [ --bw-max-iterations ] arg (=0) The maximum number of iterations for
Baum-Welch. A value of zero (0) means
no maximum and Baum-Welch is performed
to convergence.
-R [ --bw-pc-during ] arg (=1) Specify whether to add pseudocounts
during each Baum-Welch iteration (true)
or after it is finished (false).
-c [ --clusters ] arg (=clusters.txt) The file from which to read the
clusters. The clusters can have
alignment data; this is ignored unless
using --prealigned=1.
-x [ --cpmatrix ] arg (=cp.txt) The file from which to read the
conditional probabilities for a
substitution-matrix-based pseudocount
method. A 26x26 matrix is read in,
where matrix[a][b] = P(a|b). In order
to use this, we must have
--pc-method=2.
-F [ --favour-m ] arg (=1) Specify whether to initailise the
profile HMM with a favouring of the
transitions to match states. Does
nothing if --prealigned=1.
-q [ --frequencies ] arg (=q.txt) The file containing the character
frequencies.
-Q [ --freq-scale ] arg (=3500) The scaling factor for the frequencies
which are used as parameters for a
Dirichlet distribution when
initializing a random model.
-i [ --inserts ] arg (=1) Specify whether to construct the
profile HMM from the clusters using
both match and insert states (true) or
using match states only (false) when
building using --prealigned=1.
-m [ --pc-method ] arg (=2) Specify the pseudocount method used.
Options are 0 for a constant being
added to each value (set the weight to
the alphabet size for Laplace's rule),
1 for a background-frequency-based
pseudocount, and 2 for a
substitution-matrix-based pseudocount.
Option 2 requires the --cpmatrix
parameter to be set.
-w [ --pc-weight ] arg (=5) The weight used for the emission-state
pseudocounts.
-p [ --prealigned ] arg (=0) Specify whether to initialize the
profile HMM using pre-existing
alignment information, and then perform
Baum-Welch to optimize the model and
generate new alignments.
-S [ --seed ] arg (=1678045341) The seed used for the random number
generator. Useful only if
--prealigned=0. If no value is
specified, a call to std::time(NULL) is
used.
-s [ --singles ] arg (=0) Specify whether to score against
single-word clusters.
-a [ --transition-a ] arg (=5) The alpha-values used for a uniform
Dirichlet distribution from which the
initial transition probabilities are
sampled.
Any and all parameters can be specified via a findcluster.cfg file. Use the following format:
#Anything after a # on a line is not processed (i.e. is a comment)
param=value #e.g. prealigned=1
The data format uses, for each cluster, a single line starting with a hash (#
)3; anything after the hash is considered a label for the cluster. The next lines in the file are all considered part of the cluster until the next hash is read. Anything past a double-hash (##
) is not read at all. See testclusters.txt for an example, which shows a couple of test clusters that I used.
The background frequency file expects 26 lines, each containing the background frequency (i.e. frequency of occurrence) of the corresponding character (A-Z). See q.txt for an example, which is the one I used in my experiments.
The conditional probability file for the substitution matrix pseudocount method expects a 26×26 matrix. Each line corresponds to a character, and each entry on the line is separated by a space. Entry
None of these is likely to be implemented (by me), but I had the following list of TODO items for a v1.0 release:
- Parameterize the alphabet size (currently hard-coded to 26)
- Maximum a posteriori (MAP) model construction; there is some code present that does not work correctly
- Any other TODOs marked in the code
- Add some proper versioning info
- Code profiling to find possible speed-ups
- Documentation
Footnotes
-
The code used to work just fine on Windows, as all the prerequisites are cross-platform and the code was written to work with both
g++
and MSVC. But I don't have access to (or the patience for) a Windows machine anymore, so I have no idea the extent to which this holds true 15 years after the code was written. ↩ -
The code compiles, but whether or not it works... 🤷♂️ ↩
-
But I'm also a descriptivist, so, you know, no (overt) judgement or anything. ↩