evolutionary tree

tqDist - a triplet and quartet distance library

Distance measures between trees are useful for comparing trees in a systematic manner and several different distance measures have been proposed. The triplet and quartet distances, for rooted and unrooted trees, are defined as the number of subsets of three or four leaves, respectively, where the topologies of the induced sub-trees differ. These distances can trivially be computed by explicitly enumerating all sets of three or four leaves and testing if the topologies are different, but this leads to time complexities at least of the order n3 or n4 just for enumerating the sets. The different topologies can be counted implicitly, however, and this tqDist computes the triplet distance between rooted trees in O(n log n) time and the quartet distance between unrooted trees in O(dn log n) time, where d degree of the tree with the smallest degree.

tqDist has been implemented in C and C++ to achieve maximal performance, but bindings to Python and R are also supplied such that the library easily can be used for scripting. In addition to this several binary executables are provided.

In the following text we describe how to install tqDist on Mac OS X, Windows and Linux and illustrate how to use the enclosed Python module and R package. Finally, we explain how to use the binary executables.

  1. Citing tqDist
  2. Files
  3. Build procedure
  4. Using the Python bindings
  5. Using the R bindings
  6. Executables
  7. File format
  8. Literature
  9. Contact

Citing tqDist

If you use tqDist in your research, please cite it in the following way:

You can e.g. use this BibTex entry.

The datasets used for the performance experiments in this Applications Note are available for download at http://birc.au.dk/~cstorm/software/tqDist/data/.

Files

Build procedure

Install binaries from source

To install tqDist from source on Linux or Mac OS X download tqDist-1.0.0.tar.gz and execute the following commands in a terminal:

$ cd <path-to-file>
$ tar -xvf tqDist-1.0.0.tar.gz
$ cd tqDist-1.0.0/
$ cmake .
$ make
$ make test
$ make install

This will also install the python module pyTQDist and/or the R package rtqdist if python and/or R is installed on your system.

If you do not have permisions to install the package in the default location, you need to replace the fourth command above by something on this form:

$ cmake -DCMAKE_INSTALL_PREFIX=<install prefix> \
        -DPYTHON_PREFIX=<python libs> -DR_PREFIX=<R libs> .

Finally, if CMake is not installed on your system, you can download the newest version from http://cmake.org/.

Mac OS X

To install the python module pyTQDist on Mac OS X download pyTQDist_1.0.zip and execute the following commands in a terminal:

$ cd <path-to-file>
$ unzip pyTQDist_1.0.zip
$ cd pyTQDist_1.0
$ python setup.py install

To install the R package rtqdist on Mac OS X download rtqdist_1.0.tar.gz and execute the following commands in a terminal:

$ cd <path-to-file>
$ R CMD INSTALL rtqdist_1.0.tar.gz

Windows

To install the tqDist executables on Windows download and run tqDist-1.0.0-x86_32.exe.

To install the python module pyTQDist on Windows download and unzip pyTQDist_1.0.zip and run the following commands in a command window:

$ cd <path-to-pyTQDist_1.0>
$ python setup.py install

If python is not recognized as a command in your command prompt, you need to add python to your path.

To install the R package rtqdist on Windows download rtqdist_1.0.tar.gz and run the following commands:

$ cd <path-to-file>
$ R CMD INSTALL --no-multiarch rtqdist_1.0.tar.gz

If R is not recognized as a command in your command prompt, you need to update your path.

Linux

To install tqDist on Linux you need to follow the instructions to install binaries from source. This will also install the R package and/or the python module if R and/or python is installed on your system.

Using the Python bindings

The following example illustrates how the Python module pyTQDist can be used:

from pyTQDist import *

filename1 = "trees/tree1.new"
filename2 = "trees/tree2.new"

print "triplet distance:", tripletDistance(filename1, filename2)
print "quartet distance:", quartetDistance(filename1, filename2)

print allPairsTripletDistance("trees/8trees.new")
print allPairsQuartetDistance("trees/8trees.new")

print pairsTripletDistance("trees/pairs1.new", "trees/pairs2.new")
print pairsQuartetDistance("trees/pairs1.new", "trees/pairs2.new")

tripletDistance and quartetDistance each take two parameter, which should be filenames pointing to files containing a single tree in Newick format each. For each tree all leaves should be labeled and the two trees should share the same set of leaf labels. The two functions return the distance between the two trees as a single integer.

pairsTripletDistance and pairsQuartetDistance each take two parameter, which should be pointing to files containing the same number of trees in Newick format. The trees on line i in the two files must have the same set of leaf labels. The two functions return a list of distances where the ith distance is the pairwise distance between the two trees on line i in the two files.

allPairsTripletDistance and allPairsQuartetDistance each take a single parameter, which should be a filename containing multiple trees in Newick format. Each tree should be on a seperate line. In each tree all leaves should be labeled and all trees should share the same set of leaf labels. The two functions return a lower triangular as a list of lists in which entry [i][j] is the pairwise triplet or quartet distance between the trees on line i and j.

Using the R bindings

The following example illustrates how the R package rtqdist can be used:

library("rtqdist")

filename1 = "trees/tree1.new"
filename2 = "trees/tree2.new"

t = tripletDistance(filename1, filename2)
q = quartetDistance(filename1, filename2)

allT = allPairsTripletDistance("trees/8trees.new")
allQ = allPairsQuartetDistance("trees/8trees.new")

pT = pairsTripletDistance("trees/pairs1.new", "trees/pairs2.new")
pQ = pairsQuartetDistance("trees/pairs1.new", "trees/pairs2.new")

tripletDistance and quartetDistance each take two parameter, which should be filenames pointing to files containing a single tree in Newick format each. For each tree all leaves should be labeled and the two trees should share the same set of leaf labels. The two functions return the distance between the two trees as a single integer.

pairsTripletDistance and pairsQuartetDistance each take two parameter, which should be pointing to files containing the same number of trees in Newick format. The trees on line i in the two files must have the same set of leaf labels. The two functions return a list of distances where the ith distance is the pairwise distance between the two trees on line i in the two files.

allPairsTripletDistance and allPairsQuartetDistance each take a single parameter, which should be a filename containing multiple trees in Newick format. Each tree should be on a seperate line. In each tree all leaves should be labeled and all trees should share the same set of leaf labels. The two functions return a lower triangular as a list of lists in which entry [i][j] is the pairwise triplet or quartet distance between the trees on line i and j.

Executables

The following executables are installed with tqDist:

triplet_dist

Usage: triplet_dist [-v] <filename1> <filename2>

Where <filename1> and <filename2> point to two files each containing one tree in Newick format. In both trees all leaves should be labeled and the two trees should have the same set of leaf labels. The triplet distance between the two trees will be printed to stdout.

If the -v option is used, the following numbers will be reported (in this order):

quartet_dist

Usage: quartet_dist [-v] <filename1> <filename2>

Where: <filename1> and <filename2> point to two files each containing one tree in Newick format. In both trees all leaves should be labeled and the two trees should have the same set of leaf labels. The quartet distance between the two trees will be printed to stdout.

If the -v option is used, the following numbers will be reported (in this order):

pairs_triplet_dist

Usage: bin/pairs_triplet_dist [-v] <filename1> <filename2> [<output filename>]

Where: <filename1> and <filename2> point to two files each containing the same number of trees in Newick format. The two trees on line i in the two files must have the same set of leaf labels. The output is a list of numbers, where the i'th number is the triplet distance between the two trees on line i in the two files. If [output filename] is specified the output is written to the file pointed to (if the file already exists the current content is deleted first), otherwise the output is written to stdout.

If the -v option is used, the following numbers will be reported for each pair of trees (in this order):

pairs_quartet_dist

Usage: bin/pairs_quartet_dist [-v] <filename1> <filename2> [<output filename>]

Where: <filename1> and <filename2> point to two files each containing the same number of trees in Newick format. The two trees on line i in the two files must have the same set of leaf labels. The output is a list of numbers, where the i'th number is the quartet distance between the two trees on line i in the two files. If [output filename] is specified the output is written to the file pointed to (if the file already exists the current content is deleted first), otherwise the output is written to stdout.

If the -v option is used, the following numbers will be reported for each pair of trees (in this order):

all_pairs_triplet_dist

Usage: all_pairs_triplet_dist <input filename> [output filename]

Where: <input filename> is the name of a file containing multiple trees in Newick format. Each tree should be on a seperate line. In each tree all leaves should be labeled and all trees should have the same set of leaf labels. If [output filename] is specified the output is written to the file pointed to (if the file already exists the current content is deleted first), otherwise the output is written to stdout. The output is a lower triangular matrix in which the i, j'th entry is the pairwise triplet distance between the tree on line i and the tree on line j in <input filename>.

all_pairt_quartet_dist

Usage: all_pairs_quartet_dist <input filename> [output filename]

Where: <input filename> is the name of a file containing multiple trees in Newick format. Each tree should be on a seperate line. In each tree all leaves should be labeled and all trees should have the same set of leaf labels. If [output filename] is specified the output is written to the file pointed to (if the file already exists the current content is deleted first), otherwise the output is written to stdout. The output is a lower triangular matrix in which the i, j'th entry is the pairwise quartet distance between the tree on line i and the tree on line j in <input filename>.

File format

All input files for tqDist should contain trees encoded in Newick format. Here is couple of examples that can be used with the quartet_distance and triplet_distance executables:

And here is an example of a file containing multiple trees, which can be used with the all_pairs_quartet_distance, all_pairs_triplet_distance, pairs_triplet_distance and pairs_triplet_distance executables:

For details on how to encode trees in Newick format please see http://en.wikipedia.org/wiki/Newick_format.

Literature

The algorithms implemented in tqDist was developed by Andreas Sand, Gerth Stølting Brodal, Rolf Fagerberg, Thomas Mailund, Christian N. S. Pedersen, Jens Johansen and Morten K. Holt, and they are explained in the following two papers:

The implementation is desribed in:

Contact

If you encounter any problems or have questions about using tqDist, please contact Christian N. S. Pedersen.