Forward Genomics Pipeline

This document provides information on how to use the 'forwardgen' pipeline. The method used by this pipeline is based on the 'Forward Genomics' paper by Michael Hiller et al. published in 2012. It is used to identify candidate genes which might be able to explain the loss of a trait within a specific species.

Pipeline created by: Peter Neleman, VU University Amsterdam, Animal Ecology, 2015.

Last update: September 2017


1 General Information
2 Installation
3 Data Preparation
4 Quickstart
5 Example
6 Appendix
7 Troubleshooting

1 General Information

Forwardgen is designed to find genes associated with the loss of a phenotypic trait in one or multiple species. A common example is the loss of the ability to synthesize Vitamin C in humans and some other vertebrates.

An overview of the pipeline workflow:

This document describes how to install and use this pipeline. It assumes all commands are started from the directory the script is located in.

Back to top

2 Installation

The pipeline was build for Ubuntu 14.04 LTS (64 bit) and uses Python 2.7, which should be installed by default. Other UNIX based operating systems will probably work as well, but have not been tested. Newer Ubuntu versions may use Python 3 as default version, which will not work. In that case make sure to set the correct Python version before proceeding.

Before running the script, first install the matplotlib python package which is used to generate various plots:

sudo apt-get install python-matplotlib

Automatic Install
An install function is included that will automatically download and compile various programs used by the pipeline. An internet connection is required. Advanced users can also install the pipeline manually with the instructions below.

python install

Note: The install function was created in 2016. Included download links may change over time. Refer to the manual install when any problems occur.

Manual Install
First create all directories the pipeline will make use of:

mkdir 2bit algn anc axt bed bin chn cons etc fa fab gene maf net sing size

Download all listed software. Their binaries must be placed in the 'bin' directory. Some programs may need to be compiled first. Please refer to the respective program manuals.

UCSC Genome Browser [Website]

Tandem Repeats Finder [Website]
Miller Lab [Website]

Siepel Lab [Website]

You may need to change user permission of the binaries in the 'bin' directory:

chmod 755 ./bin/*

Back to top

3 Data Preparation

To get started with the pipeline a dataset has to be prepared first. The most important question is: what is the trait of interest? Then identify species that preserved this trait and species that lost this trait. At least two species of each category are required in your dataset, preferably the trait-loss group being a minority. There is no limit on the maximum number of species, but a minimum of five is recommended. The following data has to be acquired:

Whole Genome Sequence Data
For each species a FASTA file containing the whole genome is required. These can be downloaded from websites like GenBank, or you might have assembled your own.

FASTA Format:


Forwardgen uses 6-character codes to identify species accross a lot of different files. All FASTA files should use one of these codes as name. These codes can be anything, but it is recommended to use codes that are easy to relate to the data. For example, 'DroMel.fa' is a good name for a file containing the genome of Drosophila melanogaster. All files should have the '.fa' extension. This limitation has been set to keep a lot of things easy for the human eye while still maintaining enough information to check the source of the data.

Apply these codes to all your FASTA files and place them in the 'fa' directory, which should have appeared after running the install command.

Phylogenetic Tree
The phylogeny of all species within your dataset is necessary to map the data correctly. A tree containing all species of the dataset (and no other species) must be constructed and written into the Newick Tree Format. The names of the species should be the 6-character codes, matching with the FASTA file names. Branch lenghts are not needed.


   ┌───── ZooNev
│  │
│  └───── TriCas

│  ┌───── AnoDar
│  │
└──┤  ┌── PayMac
   │  │
      └── DroMel

Newick Tree Format: ((ZooNev,TriCas),(AnoDar,(PayMac,DroMel)));

There are various programs, for example Mesquite, that can convert a tree image into Newick format. You can of course also do it by hand.

The text tree has to be stored in a text file which must be placed in the 'etc' directory. This file can have any name.

Only two species can be aligned at the same time. To make a multiple alignment first all species are aligned to a single species (the 'reference'), and subsequently, all pairwise alignments are merged. Any species can be the reference, but it should have a well annotated genome as the pipeline requires an annotation file (GFF3) of the reference to find conserved regions within genes. These files can often be found on the same websites as where you can download genomic data or you can acquire them through various programs. The file must be placed in the 'etc' directory.

GFF3 Format:

ChLG10 RepeatMasker repeat_region 1 576 . - . assembly_name=Tcas3;description=Kolobok1-1_TCa
ChLG10 dust repeat_region 145 165 . . . assembly_name=Tcas3;description=dust
ChLG10 trf repeat_region 2976 3004 . . . assembly_name=Tcas3;description=trf

Alignment Scoring (Optional)
By default the pipeline uses a standard scoring matrix for alignment. This works decently, but the quality of alignment can be improved by using a custom scoring matrix matching your dataset.

Default scoring matrix:

gap_open_penalty   =  400
gap_extend_penalty =   30

   A     C     G     T
A   91  -114   -31  -123
C -114   100  -125   -31
G  -31  -125   100  -114
T -123   -31  -114   91

Scoring table examples for different datasets can be found on the website of the UCSC Genome Browser. Click on species to find their '(b)lastz scoring matrix' listed in the pairwise alignments. If you want to use your own matrix you can save it in a file and place it in the 'etc' directory. When in doubt the default matrix will do just fine.

Forwardgen automatically processes all files found in directories. When a command is run again it will remove the previously available results, allowing easy modifications of scoring arguments. For this reason you should never change files in a sub-directory unless told by this manual, as this may result in the loss of data or cause the pipeline to stop functioning. The 'fa' directory is used for the initial genomic input data and the 'etc' directory is used for general input and output. In general the user does not have to change files in any of the other directories.

Everything is split in steps on purpose to allow easy modification of values and troubleshooting.

Back to top

4 Quickstart

This section is for users who used the pipeline before and want a quick reminder of the order of commands with their required arguments. Refer to the example or appendix for more details on all commands.

python mask (optional)
python binary
python align <reference>
python chain
python size
python net
python maf
python mwga <reference> <tree file>
python tree <tree file>
python cons
python filter
python annotate <annotation file>
python regions
python ancestral <reference> <node>
python violate <tree file> <list of trait species>
python plot <gene>

Back to top

5 Example

This section is a step-by-step guide through the whole analysis. The aim is to find genes associated with the loss of lipogenesis in parasitoids. The dataset contains 42 insects. Tribolium castaneum was chosen as reference.

Species Order Tag Assembly Version
Aleochara bilineata Coleoptera AleBil -
Anopheles darlinigi Diptera AnoDar GCA_000211455.3
Apis mellifera Hymenoptera ApiMel GCA_000002195.1
Athalia rosae Hymenoptera AthRos GCA_000344095.1
Atta cephalotes Hymenoptera AttCep GCA_000143395.2
Belgica antarctica Diptera BelAnt GCA_000775305.1
Blattella germanica Blattodea BlaGer GCA_000762945.1
Bombus terrestris Hymenoptera BomTer GCA_000214255.1
Bombyx mori Lepidoptera BomMor GCA_000151625.1
Chironomus tentans Diptera ChiTen GCA_000786525.1
Cimex lectularius Hemiptera CimLec GCA_000648675.1
Clunio marinus Diptera CluMar GCA_900005825.1
Culex quinquefasciatus Diptera CulQui GCA_000209185.1
Danaus plexippus Lepidoptera DanPle GCA_000235995.1
Drosophila melanogaster Diptera DroMel GCA_000001215.4
Drosophila willistoni Diptera DroWil GCA_000005925.1
Frankliniella occidentalis Thysanoptera FraOcc GCA_000697945.1
Glossina morsitans Diptera GloMor GCA_001077435.1
Goniozus legneri Hymenoptera GonLeg -
Heliconius melpomene Lepidoptera HelMel GCA_000313835.1
Isoperla grammatica Orthoptera IsoGra GCA_001676475.1
Leptopilina clavipes Hymenoptera LepCla -
Lucilia sericata Diptera LucSer GCA_001014835.1
Megachile rotundata Hymenoptera MegRot GCA_000220905.1
Melitaea cinxia Lepidoptera MelCin GCA_000716385.1
Mengenilla moldrzyki Strepsiptera MenMol GCA_000281935.1
Musca domestica Diptera MusDom GCA_000371365.1
Nasonia vitripennis Hymenoptera NasVit GCA_000002325.2
Neobellieria bullata Diptera NeoBul GCA_001017455.1
Nicrophorus vespilloides Coleoptera NicVes GCA_001412225.1
Oncopeltus fasciatus Hemiptera OncFas GCA_000696205.1
Orussus abietinus Hymenoptera OruAbi GCA_000612105.1
Paykullia maculata Diptera PayMac -
Pediculus humanus Phthiraptera PedHum GCA_000006295.1
Polistes dominula Hymenoptera PolDom GCA_001465965.1
Rhodnius prolixus Hemiptera RhoPro GCA_000181055.2
Solenopsis invicta Hymenoptera SolInv GCA_000188075.1
Stomoxys calcitrans Diptera StoCal GCA_001015335.1
Timema cristinae Phasmatodea TimCri GCA_002009905.3
Tipula oleracea Diptera TipOle GCA_001017535.1
Tribolium castaneum Coleoptera TriCas GCA_000002335.2
Zootermopsis nevadensis Isoptera ZooNev GCA_000696155.1

Note for species without assembly version: These species were sequenced and assembled by the VU Animal Ecology department and were not publicly released yet at the time of this writing. For more information visit the Animal Ecology Website.


Scoring Matrix:
gap_open_penalty   =  400
gap_extend_penalty =   30

   A     C     G     T
A   91   -90   -25  -100
C  -90   100  -100   -25
G  -25  -100   100   -90
T -100   -25   -90    91

All genomic data was put in the 'fa' directory using the 6-character codes (AleBil.fa, AnoDar.fa, ...). The tree named as 'tree.newick', the scoring matrix as 'scores.txt' and the annotation file of Tribolium castaneum, by default named 'Tribolium_castaneum.Tcas3.29.gff3', were all stored in the 'etc' directory.

The pipeline is always run in a standard order. If you wish to change a parameter of a command after the initial run, you can simply start at that command. However, you also need to run every command after the one you started with as well.

In this example not all optional arguments are being used. Optional commands always start with two dashes. You can view optional arguments by running the following command:

python <mode> -h

In case you don't know the available modes please refer to the appendix or run the following command:

python -h

Information of computer used for this example:

OS: Ubuntu 14.04 LTS (64 Bit)
CPU: Intel Core i7-4790K (3.6 Ghz, 8 Threats)
RAM: 32GB (1.6 Ghz, DDR3)

1. Masking (optional)
First the data has to be prepared for alignment. AT repeats can be a bottleneck for alignments, resulting in long run times. This step will mask the data. It can be skipped if the data is already masked.

python mask

Note: A backup of the unmasked data will be placed in the 'fab' directory.

2. Binary
Various programs require binary FASTA files as secondary input. This step will make a binary copy of all FASTA files.

python binary

3. Alignment
The next step is to align the data. Only one species will be aligned to the reference at a time, specified with its code using the first argument. Since this dataset exists out of insects only an insect scoring matrix will be used for optimal results. Computers with multiple threads can run multiple alignments in parallel to speed up the process.

python align TriCas --scoring scores.txt --cpu 8

Note: The alignment is one of the slower steps in the pipeline and may take several hours to complete depending on file sizes.

4. Chaining
After alignment this step will merge overlapping sequences on the same strands into a longer alignment with a higher score.

python chain

5. Size
Various programs require the length of each chromosome/contig in sequence files as secondary input. This step will count all lenghts.

python size

6. Netting
Next the chains need to be converted into hierarchical clusters of chains. This step will do that conversion.

python net

7. Maffing
The multiple alignment tool uses a different alignment format (MAF) than the previous programs. However, nets can not be directly turned into this format. First the nets will be converted into AXT alignments which are then converted into MAF.

python maf

8. Multiple Alignment
All species are mapped to the reference (TriCas) in the correct format now. Using this reference all pairwise alignments can now be merged into a single alignment containing all species. The first argument specifies the reference using the 6-character code, just like the alignment step. The second argument specifies the tree file located in the 'etc' directory.

python mwga TriCas tree.txt

9. Tree
Some steps require a tree file (mod) as secondary input. This is something different than the Newick format file created by the user! This step will create this file. Calculating the mod over the whole alignment may take a long time. Instead, the alignment will be split by chromosome/contig and only the biggest is used. This does not influence the outcome and greatly reduces the required time.

python tree tree.txt

10. Conserved Regions
Next the conserved regions can be determined. It will mark all regions for each chromosome/contig in a BED track. It is possible to change the region detection settings using optional arguments, but for this dataset the default settings are used.

python cons

11. Filter
The conserved regions need to be filtered. There are a lot of regions too small to work with, down to almost single basepairs. The filter will first merge regions close to each other (< 30BP). After that regions which are still shorter than the threshold (< 70BP) are removed. When working with chromosomes instead of contigs it is also wise to filter out haplotype chromosomes ('ChLGX' for TriCas) using an optional argument. Please refer to the optional arguments to change filter settings when desired.

python filter --ignore ChLGX

12. Annotation
Most of the regions that passed the filter will be outside coding regions and thus not of prime interest to the current research question. Using the annotation GFF3 file, specified as argument, this step can determine whether a region is located within a gene. Regions not located within genes are discarded.

python annotate Tribolium_castaneum.Tcas3.29.gff3

Note: In case you generated your own GFF3 file make sure the contig names in that file match with your reference FASTA or Forwardgen will not be able to find target regions.

13. Alignment Reconstruction
Now that the conserved regions of interest are determined the sequence data has to be extracted from the initial alignment. Using the BED tracks the sequences will be cut and put into a new multiple alignment. Just like the alignment step this can be run in parallel.

python regions --cpu 8

14. Ancestral Sequences
For each node in the tree an ancestral sequence can be calculated. The sequence of interest is the common ancestor, being the node where all species originate from. For this dataset that is the node that is closest to the root, which in this case is the node of Isoperla grammatica and Timema cristinae. The first argument specifies the reference and the second argument specifies the ancestral node, using two tags seperated by a comma.

python ancestral TriCas IsoGra,TimCri

Note: Sometimes the ancestral sequence program 'prequel' will store the node in a flipped order (e.g. TimCri,IsoGra), resulting in no hits. The pipeline automatically detects this and will flip the node. Refer to the troubleshooting section in case there are still zero genes mapped to nodes.

15. Violations
The alignment data of the species is compared to the ancestral sequences to calculate their percent identity with the ancestral sequence. It is expected that trait-loss species have accumulated more nucelotide differences than the trait-preserving species. The data should match the 'phenotree' instead of the phylogenetic tree. Species which do not follow this pattern are marked as 'violation'. An explanation of this algorithm can be found in the appendix.

For each gene the violations will be counted. Genes with a low violation count (preferably 0) are good candidates for further research.

The first argument is the tree file in Newick format. The second argument is a list of trait-loss species, seperated by a comma (the tag order is not important).

Two files will be created in the 'etc' directory ('violations.txt' and 'violations.png') showing a list of counts and a plot showing the violation distribution. All genes will be written into Phylip (.phy) format in the 'gene' directory, which can be loaded into various other programs.

python violate tree.newick AleBil,GonLeg,LepCla,NasVit,PayMac

Note: The Phylip format files may contain more than one region, so you should not load it into other programs straight away. Please check out the details below.

16. Plotting
The analysis is now finished. It can be visualized using the plot mode. In the violation count file ('violations.txt') genes are listed with the violation count, lowest being on top. In this dataset the gene with the lowest count is 'TC014425', which will be used as argument. A plot with the name of the gene will be created in the 'etc' directory.

python plot TC011290

Red regions in the plot indicate mismatches, light gray areas are matches and dark gray areas represent missing sequence data.

Identity percentages are shown on the right. The [T] tag means you specified that the species is in the trait-loss group.

The plot shows all regions found within the specific gene merged together. It may be one region or it may be multiple regions spread over the gene. If multiple regions are present within the gene they are marked with blue vertical lines. Take a good look before loading in the Phylip files into programs. You may need to do some file cutting.

Analysing Results
Genes listed as zero or low violation counts are good candidates for further research. Keep in mind that if one of your trait-loss species is in the node closest to the common ancestor it is unlikely you will find zero violations due to the nature that these species are more similar to the ancestor.

There is a variety of things you can do with the data. One option is to open the Phylip file of your target in the 'gene' directory. The "@@@@@@" tag represents the ancestral sequence to which all species are mapped to calculate their identity percent. One of the sequences can be put into tools like BLAST to possibly get more information.

Because it is possible to run commands again halfway the pipeline after you finished the analysis at least once it is really easy to look for multiple traits! Sfimply run step 15 (violations) again with a different list of species to look for a different trait. Since it is one of the last steps you only have to optionally run the plot command again afterwards.

It is also possible to add a new organism to your dataset after the initial run in case you need more data for a specific trait. The appendix contains more details about this.

Back to top

6 Appendix

Appending Dataset
When new data is available it sometimes is desired to add a new species to the dataset (e.g. AaaAaa). However, the normal procedure should not be used because the pipeline automatically processes all files in directories. This means if a new species has to be aligned all other species will be aligned again as well.

To avoid this the pipeline has an option to add species. This will automatically run step one through seven for that species only. All steps after that must be executed again as the phylogenetic tree will change upon adding a new species.

Using the 'add' mode the 6-letter tag of the new species and reference are required as argument. The pipeline assumes that all data from the 'original' run is still located on your harddrive. Masking is skipped by default, but can be activated using an optional argument. Once finished proceed with the multiple alignment. Don't forget to add the new species to the tree file!

python add AaaAaa TriCas --mask

The following directories are used by the pipeline:

2bit Binary FASTA
algn Alignment data, split by gene
anc Ancestral sequences
axt Whole genome alignments
bed Conserved region BED tracks
bin Software
chn Chains
cons Conserved regions, split by chromosome/contig
etc General input/output
fa Genome data
fab Genome data (masking backup)
gene Conserved regions, split by gene
maf Multiple whole genome alignments
net Nets
sing Alignment format conversion data
size Scaffold sizes

add - add species to dataset
usage: add <species> <reference> [--mask]

   input  - 2bit, axt, chn, fa, net, size
   output - 2bit, axt, chn, fa, fab (optional), net, sing, size

This will run the first seven steps of the pipeline for a specified species. This is useful when adding a new species to the dataset, as the original dataset does not have to be aligned again. It assumes the user already did a full run once and that the data of that run is still available. Step eight and onwards need to be executed to make all changes take place.

species Specifies which species has to be added to the dataset, identified using a 6-character code. The code should match with the FASTA file which should be added to the other data in the 'fa' directory.
reference Reference species the new species should be mapped against, identified using a 6-character code. The reference should be the same as the one used in the initial dataset.
--mask Optional: adding this argument will mask the data of the species to be added to the dataset. (Default = False)
--scoring FILE Optional: Scoring matrix file name to apply custom scoring. The file should be located in the 'etc' directory. Leaving this argument out will apply the default 'lastz' scoring matrix. (Default = None)
--xdrop INT Optional: HSP x-drop extension threshold for 'lastz'.
(Default = 2200)
--gap INT Optional: Gap extend penalty score for 'lastz'.
(Default = 4000)
--ydrop INT Optional: Threshold for terminating gapped extensions for 'lastz'. (Default = 3400)

align - whole genome alignment
usage: align <reference> [--scoring FILE] [--xdrop INT] [--gap INT] [--ydrop INT] [--cpu INT] [--default-scoring] [--show-scoring]

   input  - 2bit
   output - axt

Aligns all binary FASTA files in the '2bit' directory to the specified reference using 'lastz'. The alignment process is regulated using MAKE. If multiple cores are assigned multiple species will be aligned at once. Alignments are stored in the 'axt' directory.

reference Reference species the all species should be mapped against, identified using a 6-character code.
--scoring FILE Optional: Scoring matrix file name to apply custom scoring. The file should be located in the 'etc' directory. Leaving this argument out will apply the default 'lastz' scoring matrix. (Default = None)
--xdrop INT Optional: HSP x-drop extension threshold for 'lastz'.
(Default = 2200)
--gap INT Optional: Gap extend penalty score for 'lastz'.
(Default = 4000)
--ydrop INT Optional: Threshold for terminating gapped extensions for 'lastz'. (Default = 3400)
--cpu INT Optional: Amount of cores to use for alignment. Multiple cores allow multiple species to be aligned at once. (Default = 1)

ancestral - determine ancestral sequences
usage: add <reference> <node>

   input  - algn
   output - anc

Using 'prequel' this will determine ancestral sequences based on alignment files per gene in the 'algn' directory. By default all nodes of the tree are calculated, but the mode will only take out the common ancestor - defined by the user - and store it in the 'anc' directory.

reference Species used to align the current dataset with, identified using a 6-character code.
node Node that is closest to the root of the phylogenetic tree, indentified by using two 6-character codes split by a comma.

annotate - split bedtrack by genes
usage: annotate <annotation>

   input  - etc
   output - bed

Using an annotation GFF3 file all regions within the filtered BED track that are located within a gene are cut out. BED tracks are stored by gene in the 'bed' directory.

annotation Name of the GFF3 file containing the annotation of the reference, located in the 'etc' directory.

binary - generate binary fasta files
usage: binary

   input  - fa
   output - 2bit

This will convert all FASTA files in the 'fa' directory into binary FASTA files using 'faToTwoBit'. These binary files are stored in the '2bit' directory.

chain - chain alignments
usage: chain

   input  - 2bit, axt
   output - chn

Chains are merged overlapping alignments on the same stand with a higher score. All alignments in the 'axt' directory are processed using 'axtChain' and stored in the 'chn' directory.

clean - remove intermediate files
usage: clean [--force]

Removes all intermediate files generated by the pipeline from the system. User input data in the 'fa', 'fab' and 'etc' directory are left intact.

--force Optional: Ignores the warning message and confirmation to allow implementation of this mode into scripts. (Default = False)

cons - determine conserved regions
usage: cons [--rho FLOAT] [--coverage FLOAT] [--length INT]

   input  - etc, maf
   output - cons, etc

Using 'phastCons' all conserverd regions within every chromosome/contig of the alignments in the 'maf' directory are determined. These regions are stored in BED tracks in the 'cons' directory.

--rho Optional: Ratio between conserved and non-conserved state. Must be inbetween 0 and 1. (Default = 0.3)
--coverage Optional: Target coverage. (Default = 0.25)
--length Optional: Expected minimal length of regions. Requires the use of --coverage. (Default = 12)

filter - filter bedtrack
usage: filter [--ingore STR] [--maxgap INT] [--minlen INT]

   input  - etc
   output - etc

Regions are merged based on their relative position. Based on the final lengths they are either stored in a BED track or discarded.

--ignore Optional: Chromosome to ignore, meaning all of the regions within that chromosome will not pass the filter. If multiple chromosomes need to be excluded they can be listed, seperated by a comma. (Default = None)
--maxgap Optional: Maximum amount of base pairs regions can be away from each other to allow a merge. (Default = 30)
--minlen Optional: Minimal amount of base pairs regions must be to pass the filter. (Default = 70)

install - install pipeline software
usage: install

Creates all directories and installs all software used by the pipeline. An internet connection is required.

maf - convert net to alignment
usage: maf

   input  - 2bit, net, size
   output - sing

Converts nets into the correct format for multiple alignment. Nets can not directly be converted into MAF, so they are first converted into AXT using 'netToAxt'. Afterwards these files are converted to MAF using 'axtToMaf'. Files are stored in the 'sing' directory.

mask - STR masking
usage: mask

   input  - fa
   output - fa, fab

Masks all FASTA files in the 'fa' directory. A backup of the original FASTA files is made in the 'fab' directory. Masking is done by 'trfBig', which will automatically call the Tandem Repeat Finder on all data.

mwga - multiple whole genome alignment
usage: mwga <reference> <tree>

   input  - sing
   output - etc

Merges all MAF alignments in the 'sing' directory into one multiple alignment using 'roast'. The alignment is stored as 'alignment.maf' in the 'etc' directory.

reference Species used to align the current dataset with, identified using a 6-character code.
tree Name of the file containing the phylogenetic tree in Newick format, located in the 'etc' directory.

net - net chains
usage: net

   input  - chn, size
   output - net

Nets are hierarchical clusters of chains. All chains in the 'chn' directory are sorted, prepared and converted using 'chainSort', 'chainPreNet' and 'chainNet'. All nets are stored in the 'net' directory.

plot - visualize data
usage: plot <gene> [--print-friendly]

   input  - gene
   output - etc

Visualizes all conserved regions within a specific gene in a plot. Images are saved in the 'etc' directory with the name of the gene.

gene Name of the gene to create a plot of.
--print-friendly Optional: Changes the colors of the plot to a more printer friendly selection to save ink.

regions - cut regions from alignment
usage: regions [--cpu INT]

   input  - bed, etc
   output - algn

Cuts regions of interested out of the multiple alignment using BED tracks and 'mafsInRegion'.

--cpu INT Optional: Amount of cores to use for region cutting. Multiple cores allow multiple regions to be cut at once. (Default = 1)

remove - remove species from dataset
usage: remove <species>

   input  - 2bit, axt, chain, fa, sing, size
   output - fab

Removes all data of the specified species of the first seven steps to decrease dataset to desired size. Step eight and onwards need to be executed to make all changes take place. A copy of the original FASTA is stored in the 'fab' directory with a '.removed' extension.

species Species to be removed from the dataset, identified using a 6-character code.

size - calculate scaffold sizes
usage: size

   input  - fa
   output - size

Lists the size of each chromosome/scaffold within each FASTA file in the 'size' directory using 'faSize'.

tree - create phylogenetic tree file
usage: tree <tree>

   input  - etc
   output - maf

Creates a tree file using 'phyloFit' based on the alignment and provided phylogeny. The tree file is stored as 'phyloFit.mod' in the 'etc' directory.

tree Name of the file containing the phylogenetic tree in Newick format, located in the 'etc' directory.

violate - count violating species
usage: violate <tree> <target> [--traitgap INT] [--minlen INT] [--missing INT]

   input  - anc
   output - gene

Counts phenotree violations for each gene. Counts are stored as 'violations.txt' in the 'etc' directory. It also creates a phylip format file of each gene in the 'gene' directory which are used by the pipeline to generate plots.

tree Name of the file containing the phylogenetic tree in Newick format, located in the 'etc' directory.
target List of trait-loss species, seperated by a comma.
--traitgap INT Optional: Minimum percent between trait groups. (Default = 1)
--minlen INT Optional: Minimum region length in basepairs. Due to the ancestral step some regions may have become very short after the filter step, resulting in false postives. (Default = 70)
--missing INT Optional: Maximum percent of species that are allowed to have 100% missing data. (Default = 20)

Violation Algorithm
This section explains how the algorithm works. It will use custom values for demonstration purposes.

Let's assume we have the following species with their percent identity to the ancestral sequence. Trait-loss species are marked as '1' while trait-preserving species are marked as '0'. The order below represents the phenotree. Subsequently the species will be sorted on percentage identity, from lowest to highest: In this example this changes the sequence of numbers: "1,0,1,0,0,0,1,0"

Now there are several violations compared to the phenotree above. The algorhythm aims to find the minimal number of permutations to retrieve the original format (trait-loss on top, trait-preserving on bottom). Every permutation counts as a violation.

In this example we can remove the zero at position 2, 4, 5 and 6. Resulting in "1,1,1,0". This, however, would mean four violations. If instead we would remove the zero in position 2 and the one in position 7 it would result in "1,1,0,0,0,0". In this case it would only take two violations, being the best solution.

If we would take a look at this new pattern it looks like this: The final rule is that after fixing the sequence of species the gap inbetween the one and zero must be at least 1% by default. 42.5 and 43.1 only differ 0.6% and thus do not match this rule. The algorithm will keep cutting off species from the zeroes until the gap is more than 1%. In this case only one zero has to be cut off, as the next species results in a gap of 10.3%, which is more than enough. Each cut adds to the violation count. Species with no data available (0% identity) are ignored. If too many species have no data (default >= 20%) the gene is discarded.

This example gene has 2 (sorting) + 1 (cutting) = 3 violations.

Back to top

7 Troubleshooting

Running pipeline returns TabError
This means you are running the wrong Python version. Please make sure you are using Python 2.7 and not Python 3.

Alignment step returns empty files
The alignment program 'lastz' ignores lower-case characters in FASTA files sometimes created by masking tools. The masking step in the pipeline automatically changes everything in upper-case. When using already masked data the following bash command can be used to change all files in the 'fa' directory to upper-case without changing the headers:

for file in ./fa/*; do
   awk '{ if ($0 !~ />/) {print toupper($0)} else {print $0} }' $file > $file.tmp
   rm $file
   mv $file.tmp $file

Chain step returns empty files
Some datasets will yield empty chains and the chaining itself will give errors about the sequence headers. This happens because 'lastz' sees a space and the pipeline ("|") and possibly more symbols as a line end. If the first part of all sequence haders is the same this will result in positions being called from the wrong sequence. For example headers like ">gi|697262783|emb|LK055284.1" and ">gi|634534612|emb|LK055381.2" both end up as "gi", resulting in the loss of difference. Headers like ">ChLGX_61_2016_ATCG" work fine.

A quick fix is to transform all spaces and pipes into an underscore:

for file in ./fa/*; do
   tr "| " "_" < $file > $file.tmp
   rm $file
   mv $file.tmp $file

Alignments have to be restarted because the headers changed!

Annotation step maps no regions to genes
This means the FASTA and GFF3 file do not use the same headers and thus can not be used for this process. Make sure your annotation matches your reference data.

Violation step returns "no ancestral sequences found"
During some steps data will be discarded. For example the conserved regions are too small or could not be mapped onto a gene. This is more likely to happen when there is limited data available. Check the 'anc' directory for empty files. Try different settings in the filter step or reconsider the dataset design.

Ancestral step returns no output files / zero genes mapped to nodes
The ancestral sequence program 'prequel' does not always write to every node in your tree. This can also happen to the node you are supposed to fill in. In the terminal prequel will output to which nodes it writes. Find out which nodes are most consistently being written to and choose the one closest to the ancestral node. Use this node as new input and try again.

Back to top