Forward Genomics Pipeline
This document provides information on how to use the 'forwardgen' pipeline. The method used by this pipeline is based on the 'Forward Genomics' paper by Michael Hiller et al. published in 2012. It is used to identify candidate genes which might be able to explain the loss of a trait within a specific species.
Pipeline created by: Peter Neleman, VU University Amsterdam, Animal Ecology, 2015.
Last update: September 2017
Index
Forwardgen is designed to find genes associated with the loss of a phenotypic trait in one or multiple species. A common example is the loss of the ability to synthesize Vitamin C in humans and some other vertebrates.
An overview of the pipeline workflow:
- Whole genome data is aligned
- Highly conserved regions are identified
- Regions located within genes are selected
- These genes are mapped against their own ancestral state
- For each gene species are ordered according to the magnitude of genetic difference, with species with lowest similarity on top
- The number of violations is calculated by comparing the order based on genetic differentiation with the phenotree, a clustering of species based on the phenotype trait (loss of trait is expected to have lowest similarity)
This document describes how to install and use this pipeline. It assumes all commands are started from the directory the script is located in.
Back to top
The pipeline was build for Ubuntu 14.04 LTS (64 bit) and uses Python 2.7, which should be installed by default. Other UNIX based operating systems will probably work as well, but have not been tested. Newer Ubuntu versions may use Python 3 as default version, which will not work. In that case make sure to set the correct Python version before proceeding.
Before running the script, first install the matplotlib python package which is used to generate various plots:
sudo apt-get install python-matplotlib
Automatic Install
An install function is included that will automatically download and compile various programs used by the pipeline. An internet connection is required. Advanced users can also install the pipeline manually with the instructions below.
python forwardgen.py install
Note: The install function was created in 2016. Included download links may change over time. Refer to the manual install when any problems occur.
Manual Install
First create all directories the pipeline will make use of:
mkdir 2bit algn anc axt bed bin chn cons etc fa fab gene maf net sing size
Download all listed software. Their binaries must be placed in the 'bin' directory. Some programs may need to be compiled first. Please refer to the respective program manuals.
UCSC Genome Browser [Website]
- axtChain
- axtSort
- axtToMaf
- chainNet
- chainPreNet
- chainSort
- faSize
- faToTwoBit
- mafsInRegion
- mafSplit
- netToAxt
- trfBig
Tandem Repeats Finder [Website]
Miller Lab [Website]
- lastz
- maf_project
- multiz
- roast
Siepel Lab [Website]
You may need to change user permission of the binaries in the 'bin' directory:
chmod 755 ./bin/*
Back to top
To get started with the pipeline a dataset has to be prepared first. The most important question is: what is the trait of interest? Then identify species that preserved this trait and species that lost this trait. At least two species of each category are required in your dataset, preferably the trait-loss group being a minority. There is no limit on the maximum number of species, but a minimum of five is recommended. The following data has to be acquired:
- Whole genome sequence data in FASTA format for each species
- Phylogenetic tree in Newick format
- Annotation of the reference species in GFF3 format
- Alignment scoring file for lastz (optional)
Whole Genome Sequence Data
For each species a FASTA file containing the whole genome is required. These can be downloaded from websites like GenBank, or you might have assembled your own.
FASTA Format:
>Sequence1
ATGGCGGCGAAAAATTGCCTGGTTAAGAATCTGGAAGC
>Sequence2
CGTGGAGACTCTCGGGTCCACTTCGGTGATCTGTTCGG
...
Forwardgen uses 6-character codes to identify species accross a lot of different files. All FASTA files should use one of these codes as name. These codes can be anything, but it is recommended to use codes that are easy to relate to the data. For example, 'DroMel.fa' is a good name for a file containing the genome of Drosophila melanogaster. All files should have the '.fa' extension. This limitation has been set to keep a lot of things easy for the human eye while still maintaining enough information to check the source of the data.
Apply these codes to all your FASTA files and place them in the 'fa' directory, which should have appeared after running the install command.
Phylogenetic Tree
The phylogeny of all species within your dataset is necessary to map the data correctly. A tree containing all species of the dataset (and no other species) must be constructed and written into the Newick Tree Format. The names of the species should be the 6-character codes, matching with the FASTA file names. Branch lenghts are not needed.
Example:
┌───── ZooNev
│
┌──┤
│ │
│ └───── TriCas
│
│ ┌───── AnoDar
│ │
└──┤ ┌── PayMac
│ │
└──┤
│
└── DroMel
Newick Tree Format: ((ZooNev,TriCas),(AnoDar,(PayMac,DroMel)));
There are various programs, for example Mesquite, that can convert a tree image into Newick format. You can of course also do it by hand.
The text tree has to be stored in a text file which must be placed in the 'etc' directory. This file can have any name.
Annotation
Only two species can be aligned at the same time. To make a multiple alignment first all species are aligned to a single species (the 'reference'), and subsequently, all pairwise alignments are merged. Any species can be the reference, but it should have a well annotated genome as the pipeline requires an annotation file (GFF3) of the reference to find conserved regions within genes. These files can often be found on the same websites as where you can download genomic data or you can acquire them through various programs. The file must be placed in the 'etc' directory.
GFF3 Format:
ChLG10 RepeatMasker repeat_region 1 576 . - . assembly_name=Tcas3;description=Kolobok1-1_TCa
ChLG10 dust repeat_region 145 165 . . . assembly_name=Tcas3;description=dust
ChLG10 trf repeat_region 2976 3004 . . . assembly_name=Tcas3;description=trf
...
Alignment Scoring (Optional)
By default the pipeline uses a standard scoring matrix for alignment. This works decently, but the quality of alignment can be improved by using a custom scoring matrix matching your dataset.
Default scoring matrix:
gap_open_penalty = 400
gap_extend_penalty = 30
A C G T
A 91 -114 -31 -123
C -114 100 -125 -31
G -31 -125 100 -114
T -123 -31 -114 91
Scoring table examples for different datasets can be found on the website of the UCSC Genome Browser. Click on species to find their '(b)lastz scoring matrix' listed in the pairwise alignments. If you want to use your own matrix you can save it in a file and place it in the 'etc' directory. When in doubt the default matrix will do just fine.
Forwardgen automatically processes all files found in directories. When a command is run again it will remove the previously available results, allowing easy modifications of scoring arguments. For this reason you should never change files in a sub-directory unless told by this manual, as this may result in the loss of data or cause the pipeline to stop functioning. The 'fa' directory is used for the initial genomic input data and the 'etc' directory is used for general input and output. In general the user does not have to change files in any of the other directories.
Everything is split in steps on purpose to allow easy modification of values and troubleshooting.
Back to top
This section is for users who used the pipeline before and want a quick reminder of the order of commands with their required arguments. Refer to the example or appendix for more details on all commands.
python forwardgen.py mask (optional)
python forwardgen.py binary
python forwardgen.py align <reference>
python forwardgen.py chain
python forwardgen.py size
python forwardgen.py net
python forwardgen.py maf
python forwardgen.py mwga <reference> <tree file>
python forwardgen.py tree <tree file>
python forwardgen.py cons
python forwardgen.py filter
python forwardgen.py annotate <annotation file>
python forwardgen.py regions
python forwardgen.py ancestral <reference> <node>
python forwardgen.py violate <tree file> <list of trait species>
python forwardgen.py plot <gene>
Back to top
This section is a step-by-step guide through the whole analysis. The aim is to find genes associated with the loss of lipogenesis in parasitoids. The dataset contains 42 insects. Tribolium castaneum was chosen as reference.
Species |
Order |
Tag |
Assembly Version |
Aleochara bilineata |
Coleoptera |
AleBil |
- |
Anopheles darlinigi |
Diptera |
AnoDar |
GCA_000211455.3 |
Apis mellifera |
Hymenoptera |
ApiMel |
GCA_000002195.1 |
Athalia rosae |
Hymenoptera |
AthRos |
GCA_000344095.1 |
Atta cephalotes |
Hymenoptera |
AttCep |
GCA_000143395.2 |
Belgica antarctica |
Diptera |
BelAnt |
GCA_000775305.1 |
Blattella germanica |
Blattodea |
BlaGer |
GCA_000762945.1 |
Bombus terrestris |
Hymenoptera |
BomTer |
GCA_000214255.1 |
Bombyx mori |
Lepidoptera |
BomMor |
GCA_000151625.1 |
Chironomus tentans |
Diptera |
ChiTen |
GCA_000786525.1 |
Cimex lectularius |
Hemiptera |
CimLec |
GCA_000648675.1 |
Clunio marinus |
Diptera |
CluMar |
GCA_900005825.1 |
Culex quinquefasciatus |
Diptera |
CulQui |
GCA_000209185.1 |
Danaus plexippus |
Lepidoptera |
DanPle |
GCA_000235995.1 |
Drosophila melanogaster |
Diptera |
DroMel |
GCA_000001215.4 |
Drosophila willistoni |
Diptera |
DroWil |
GCA_000005925.1 |
Frankliniella occidentalis |
Thysanoptera |
FraOcc |
GCA_000697945.1 |
Glossina morsitans |
Diptera |
GloMor |
GCA_001077435.1 |
Goniozus legneri |
Hymenoptera |
GonLeg |
- |
Heliconius melpomene |
Lepidoptera |
HelMel |
GCA_000313835.1 |
Isoperla grammatica |
Orthoptera |
IsoGra |
GCA_001676475.1 |
Leptopilina clavipes |
Hymenoptera |
LepCla |
- |
Lucilia sericata |
Diptera |
LucSer |
GCA_001014835.1 |
Megachile rotundata |
Hymenoptera |
MegRot |
GCA_000220905.1 |
Melitaea cinxia |
Lepidoptera |
MelCin |
GCA_000716385.1 |
Mengenilla moldrzyki |
Strepsiptera |
MenMol |
GCA_000281935.1 |
Musca domestica |
Diptera |
MusDom |
GCA_000371365.1 |
Nasonia vitripennis |
Hymenoptera |
NasVit |
GCA_000002325.2 |
Neobellieria bullata |
Diptera |
NeoBul |
GCA_001017455.1 |
Nicrophorus vespilloides |
Coleoptera |
NicVes |
GCA_001412225.1 |
Oncopeltus fasciatus |
Hemiptera |
OncFas |
GCA_000696205.1 |
Orussus abietinus |
Hymenoptera |
OruAbi |
GCA_000612105.1 |
Paykullia maculata |
Diptera |
PayMac |
- |
Pediculus humanus |
Phthiraptera |
PedHum |
GCA_000006295.1 |
Polistes dominula |
Hymenoptera |
PolDom |
GCA_001465965.1 |
Rhodnius prolixus |
Hemiptera |
RhoPro |
GCA_000181055.2 |
Solenopsis invicta |
Hymenoptera |
SolInv |
GCA_000188075.1 |
Stomoxys calcitrans |
Diptera |
StoCal |
GCA_001015335.1 |
Timema cristinae |
Phasmatodea |
TimCri |
GCA_002009905.3 |
Tipula oleracea |
Diptera |
TipOle |
GCA_001017535.1 |
Tribolium castaneum |
Coleoptera |
TriCas |
GCA_000002335.2 |
Zootermopsis nevadensis |
Isoptera |
ZooNev |
GCA_000696155.1 |
Note for species without assembly version: These species were sequenced and assembled by the VU Animal Ecology department and were not publicly released yet at the time of this writing. For more information visit the Animal Ecology Website.
Phylogeny:
((IsoGra,(TimCri,(ZooNev,BlaGer))),((PedHum,(FraOcc,(OncFas,(CimLec,RhoPro)))),((AthRos,(OruAbi,((NasVit,LepCla),(GonLeg,(PolDom,((AttCep,SolInv),(MegRot,(ApiMel,BomTer)))))))),((MenMol,((NicVes,AleBil),TriCas)),((BomMor,(MelCin,(HelMel,DanPle))),(TipOle,(((AnoDar,CulQui),(ChiTen,(CluMar,BelAnt))),((DroMel,DroWil),(GloMor,((NeoBul,(LucSer,PayMac)),(StoCal,MusDom)))))))))));
Scoring Matrix:
gap_open_penalty = 400
gap_extend_penalty = 30
A C G T
A 91 -90 -25 -100
C -90 100 -100 -25
G -25 -100 100 -90
T -100 -25 -90 91
All genomic data was put in the 'fa' directory using the 6-character codes (AleBil.fa, AnoDar.fa, ...). The tree named as 'tree.newick', the scoring matrix as 'scores.txt' and the annotation file of Tribolium castaneum, by default named 'Tribolium_castaneum.Tcas3.29.gff3', were all stored in the 'etc' directory.
The pipeline is always run in a standard order. If you wish to change a parameter of a command after the initial run, you can simply start at that command. However, you also need to run every command after the one you started with as well.
In this example not all optional arguments are being used. Optional commands always start with two dashes. You can view optional arguments by running the following command:
python forwargen.py <mode> -h
In case you don't know the available modes please refer to the appendix or run the following command:
python forwargen.py -h
Information of computer used for this example:
OS: |
Ubuntu 14.04 LTS (64 Bit) |
CPU: |
Intel Core i7-4790K (3.6 Ghz, 8 Threats)
|
RAM: |
32GB (1.6 Ghz, DDR3)
|
1. Masking (optional)
First the data has to be prepared for alignment. AT repeats can be a bottleneck for alignments, resulting in long run times. This step will mask the data. It can be skipped if the data is already masked.
python forwardgen.py mask
Note: A backup of the unmasked data will be placed in the 'fab' directory.
2. Binary
Various programs require binary FASTA files as secondary input. This step will make a binary copy of all FASTA files.
python forwardgen.py binary
3. Alignment
The next step is to align the data. Only one species will be aligned to the reference at a time, specified with its code using the first argument. Since this dataset exists out of insects only an insect scoring matrix will be used for optimal results. Computers with multiple threads can run multiple alignments in parallel to speed up the process.
python forwardgen.py align TriCas --scoring scores.txt --cpu 8
Note: The alignment is one of the slower steps in the pipeline and may take several hours to complete depending on file sizes.
4. Chaining
After alignment this step will merge overlapping sequences on the same strands into a longer alignment with a higher score.
python forwardgen.py chain
5. Size
Various programs require the length of each chromosome/contig in sequence files as secondary input. This step will count all lenghts.
python forwardgen.py size
6. Netting
Next the chains need to be converted into hierarchical clusters of chains. This step will do that conversion.
python forwardgen.py net
7. Maffing
The multiple alignment tool uses a different alignment format (MAF) than the previous programs. However, nets can not be directly turned into this format. First the nets will be converted into AXT alignments which are then converted into MAF.
python forwardgen.py maf
8. Multiple Alignment
All species are mapped to the reference (TriCas) in the correct format now. Using this reference all pairwise alignments can now be merged into a single alignment containing all species. The first argument specifies the reference using the 6-character code, just like the alignment step. The second argument specifies the tree file located in the 'etc' directory.
python forwardgen.py mwga TriCas tree.txt
9. Tree
Some steps require a tree file (mod) as secondary input. This is something different than the Newick format file created by the user! This step will create this file. Calculating the mod over the whole alignment may take a long time. Instead, the alignment will be split by chromosome/contig and only the biggest is used. This does not influence the outcome and greatly reduces the required time.
python forwardgen.py tree tree.txt
10. Conserved Regions
Next the conserved regions can be determined. It will mark all regions for each chromosome/contig in a BED track. It is possible to change the region detection settings using optional arguments, but for this dataset the default settings are used.
python forwardgen.py cons
11. Filter
The conserved regions need to be filtered. There are a lot of regions too small to work with, down to almost single basepairs. The filter will first merge regions close to each other (< 30BP). After that regions which are still shorter than the threshold (< 70BP) are removed. When working with chromosomes instead of contigs it is also wise to filter out haplotype chromosomes ('ChLGX' for TriCas) using an optional argument. Please refer to the optional arguments to change filter settings when desired.
python forwardgen.py filter --ignore ChLGX
12. Annotation
Most of the regions that passed the filter will be outside coding regions and thus not of prime interest to the current research question. Using the annotation GFF3 file, specified as argument, this step can determine whether a region is located within a gene. Regions not located within genes are discarded.
python forwardgen.py annotate Tribolium_castaneum.Tcas3.29.gff3
Note: In case you generated your own GFF3 file make sure the contig names in that file match with your reference FASTA or Forwardgen will not be able to find target regions.
13. Alignment Reconstruction
Now that the conserved regions of interest are determined the sequence data has to be extracted from the initial alignment. Using the BED tracks the sequences will be cut and put into a new multiple alignment. Just like the alignment step this can be run in parallel.
python forwardgen.py regions --cpu 8
14. Ancestral Sequences
For each node in the tree an ancestral sequence can be calculated. The sequence of interest is the common ancestor, being the node where all species originate from. For this dataset that is the node that is closest to the root, which in this case is the node of Isoperla grammatica and Timema cristinae. The first argument specifies the reference and the second argument specifies the ancestral node, using two tags seperated by a comma.
python forwardgen.py ancestral TriCas IsoGra,TimCri
Note: Sometimes the ancestral sequence program 'prequel' will store the node in a flipped order (e.g. TimCri,IsoGra), resulting in no hits. The pipeline automatically detects this and will flip the node. Refer to the troubleshooting section in case there are still zero genes mapped to nodes.
15. Violations
The alignment data of the species is compared to the ancestral sequences to calculate their percent identity with the ancestral sequence. It is expected that trait-loss species have accumulated more nucelotide differences than the trait-preserving species. The data should match the 'phenotree' instead of the phylogenetic tree. Species which do not follow this pattern are marked as 'violation'. An explanation of this algorithm can be found in the appendix.
For each gene the violations will be counted. Genes with a low violation count (preferably 0) are good candidates for further research.
The first argument is the tree file in Newick format. The second argument is a list of trait-loss species, seperated by a comma (the tag order is not important).
Two files will be created in the 'etc' directory ('violations.txt' and 'violations.png') showing a list of counts and a plot showing the violation distribution. All genes will be written into Phylip (.phy) format in the 'gene' directory, which can be loaded into various other programs.
python forwardgen.py violate tree.newick AleBil,GonLeg,LepCla,NasVit,PayMac
Note: The Phylip format files may contain more than one region, so you should not load it into other programs straight away. Please check out the details below.
16. Plotting
The analysis is now finished. It can be visualized using the plot mode. In the violation count file ('violations.txt') genes are listed with the violation count, lowest being on top. In this dataset the gene with the lowest count is 'TC014425', which will be used as argument. A plot with the name of the gene will be created in the 'etc' directory.
python forwardgen.py plot TC011290
Red regions in the plot indicate mismatches, light gray areas are matches and dark gray areas represent missing sequence data.
Identity percentages are shown on the right. The [T] tag means you specified that the species is in the trait-loss group.
The plot shows all regions found within the specific gene merged together. It may be one region or it may be multiple regions spread over the gene. If multiple regions are present within the gene they are marked with blue vertical lines. Take a good look before loading in the Phylip files into programs. You may need to do some file cutting.
Analysing Results
Genes listed as zero or low violation counts are good candidates for further research. Keep in mind that if one of your trait-loss species is in the node closest to the common ancestor it is unlikely you will find zero violations due to the nature that these species are more similar to the ancestor.
There is a variety of things you can do with the data. One option is to open the Phylip file of your target in the 'gene' directory. The "@@@@@@" tag represents the ancestral sequence to which all species are mapped to calculate their identity percent. One of the sequences can be put into tools like BLAST to possibly get more information.
Because it is possible to run commands again halfway the pipeline after you finished the analysis at least once it is really easy to look for multiple traits! Sfimply run step 15 (violations) again with a different list of species to look for a different trait. Since it is one of the last steps you only have to optionally run the plot command again afterwards.
It is also possible to add a new organism to your dataset after the initial run in case you need more data for a specific trait. The appendix contains more details about this.
Back to top
Appending Dataset
When new data is available it sometimes is desired to add a new species to the dataset (e.g. AaaAaa). However, the normal procedure should not be used because the pipeline automatically processes all files in directories. This means if a new species has to be aligned all other species will be aligned again as well.
To avoid this the pipeline has an option to add species. This will automatically run step one through seven for that species only. All steps after that must be executed again as the phylogenetic tree will change upon adding a new species.
Using the 'add' mode the 6-letter tag of the new species and reference are required as argument. The pipeline assumes that all data from the 'original' run is still located on your harddrive. Masking is skipped by default, but can be activated using an optional argument. Once finished proceed with the multiple alignment. Don't forget to add the new species to the tree file!
python forwardgen.py add AaaAaa TriCas --mask
Directories
The following directories are used by the pipeline:
2bit |
Binary FASTA |
algn |
Alignment data, split by gene |
anc |
Ancestral sequences |
axt |
Whole genome alignments |
bed |
Conserved region BED tracks |
bin |
Software |
chn |
Chains |
cons |
Conserved regions, split by chromosome/contig |
etc |
General input/output |
fa |
Genome data |
fab |
Genome data (masking backup) |
gene |
Conserved regions, split by gene |
maf |
Multiple whole genome alignments |
net |
Nets |
sing |
Alignment format conversion data |
size |
Scaffold sizes |
Commands
============================
add - add species to dataset
============================
usage: forwardgen.py add <species> <reference> [--mask]
directories:
input - 2bit, axt, chn, fa, net, size
output - 2bit, axt, chn, fa, fab (optional), net, sing, size
This will run the first seven steps of the pipeline for a specified species. This is useful when adding a new species to the dataset, as the original dataset does not have to be aligned again. It assumes the user already did a full run once and that the data of that run is still available. Step eight and onwards need to be executed to make all changes take place.
species |
Specifies which species has to be added to the dataset, identified using a 6-character code. The code should match with the FASTA file which should be added to the other data in the 'fa' directory. |
reference |
Reference species the new species should be mapped against, identified using a 6-character code. The reference should be the same as the one used in the initial dataset. |
--mask |
Optional: adding this argument will mask the data of the species to be added to the dataset. (Default = False) |
--scoring FILE |
Optional: Scoring matrix file name to apply custom scoring. The file should be located in the 'etc' directory. Leaving this argument out will apply the default 'lastz' scoring matrix. (Default = None) |
--xdrop INT |
Optional: HSP x-drop extension threshold for 'lastz'. (Default = 2200) |
--gap INT |
Optional: Gap extend penalty score for 'lastz'. (Default = 4000) |
--ydrop INT |
Optional: Threshold for terminating gapped extensions for 'lastz'. (Default = 3400) |
==============================
align - whole genome alignment
==============================
usage: forwardgen.py align <reference> [--scoring FILE] [--xdrop INT] [--gap INT] [--ydrop INT] [--cpu INT] [--default-scoring] [--show-scoring]
directories:
input - 2bit
output - axt
Aligns all binary FASTA files in the '2bit' directory to the specified reference using 'lastz'. The alignment process is regulated using MAKE. If multiple cores are assigned multiple species will be aligned at once. Alignments are stored in the 'axt' directory.
reference |
Reference species the all species should be mapped against, identified using a 6-character code. |
--scoring FILE |
Optional: Scoring matrix file name to apply custom scoring. The file should be located in the 'etc' directory. Leaving this argument out will apply the default 'lastz' scoring matrix. (Default = None) |
--xdrop INT |
Optional: HSP x-drop extension threshold for 'lastz'. (Default = 2200) |
--gap INT |
Optional: Gap extend penalty score for 'lastz'. (Default = 4000) |
--ydrop INT |
Optional: Threshold for terminating gapped extensions for 'lastz'. (Default = 3400) |
--cpu INT |
Optional: Amount of cores to use for alignment. Multiple cores allow multiple species to be aligned at once. (Default = 1) |
=========================================
ancestral - determine ancestral sequences
=========================================
usage: forwardgen.py add <reference> <node>
directories:
input - algn
output - anc
Using 'prequel' this will determine ancestral sequences based on alignment files per gene in the 'algn' directory. By default all nodes of the tree are calculated, but the mode will only take out the common ancestor - defined by the user - and store it in the 'anc' directory.
reference |
Species used to align the current dataset with, identified using a 6-character code. |
node |
Node that is closest to the root of the phylogenetic tree, indentified by using two 6-character codes split by a comma. |
==================================
annotate - split bedtrack by genes
==================================
usage: forwardgen.py annotate <annotation>
directories:
input - etc
output - bed
Using an annotation GFF3 file all regions within the filtered BED track that are located within a gene are cut out. BED tracks are stored by gene in the 'bed' directory.
annotation |
Name of the GFF3 file containing the annotation of the reference, located in the 'etc' directory. |
====================================
binary - generate binary fasta files
====================================
usage: forwardgen.py binary
directories:
input - fa
output - 2bit
This will convert all FASTA files in the 'fa' directory into binary FASTA files using 'faToTwoBit'. These binary files are stored in the '2bit' directory.
========================
chain - chain alignments
========================
usage: forwardgen.py chain
directories:
input - 2bit, axt
output - chn
Chains are merged overlapping alignments on the same stand with a higher score. All alignments in the 'axt' directory are processed using 'axtChain' and stored in the 'chn' directory.
=================================
clean - remove intermediate files
=================================
usage: forwardgen.py clean [--force]
Removes all intermediate files generated by the pipeline from the system. User input data in the 'fa', 'fab' and 'etc' directory are left intact.
--force |
Optional: Ignores the warning message and confirmation to allow implementation of this mode into scripts. (Default = False) |
==================================
cons - determine conserved regions
==================================
usage: forwardgen.py cons [--rho FLOAT] [--coverage FLOAT] [--length INT]
directories:
input - etc, maf
output - cons, etc
Using 'phastCons' all conserverd regions within every chromosome/contig of the alignments in the 'maf' directory are determined. These regions are stored in BED tracks in the 'cons' directory.
--rho |
Optional: Ratio between conserved and non-conserved state. Must be inbetween 0 and 1. (Default = 0.3) |
--coverage |
Optional: Target coverage. (Default = 0.25) |
--length |
Optional: Expected minimal length of regions. Requires the use of --coverage. (Default = 12) |
========================
filter - filter bedtrack
========================
usage: forwardgen.py filter [--ingore STR] [--maxgap INT] [--minlen INT]
directories:
input - etc
output - etc
Regions are merged based on their relative position. Based on the final lengths they are either stored in a BED track or discarded.
--ignore |
Optional: Chromosome to ignore, meaning all of the regions within that chromosome will not pass the filter. If multiple chromosomes need to be excluded they can be listed, seperated by a comma. (Default = None) |
--maxgap |
Optional: Maximum amount of base pairs regions can be away from each other to allow a merge. (Default = 30) |
--minlen |
Optional: Minimal amount of base pairs regions must be to pass the filter. (Default = 70) |
===================================
install - install pipeline software
===================================
usage: forwardgen.py install
Creates all directories and installs all software used by the pipeline. An internet connection is required.
==============================
maf - convert net to alignment
==============================
usage: forwardgen.py maf
directories:
input - 2bit, net, size
output - sing
Converts nets into the correct format for multiple alignment. Nets can not directly be converted into MAF, so they are first converted into AXT using 'netToAxt'. Afterwards these files are converted to MAF using 'axtToMaf'. Files are stored in the 'sing' directory.
==================
mask - STR masking
==================
usage: forwardgen.py mask
directories:
input - fa
output - fa, fab
Masks all FASTA files in the 'fa' directory. A backup of the original FASTA files is made in the 'fab' directory. Masking is done by 'trfBig', which will automatically call the Tandem Repeat Finder on all data.
======================================
mwga - multiple whole genome alignment
======================================
usage: forwardgen.py mwga <reference> <tree>
directories:
input - sing
output - etc
Merges all MAF alignments in the 'sing' directory into one multiple alignment using 'roast'. The alignment is stored as 'alignment.maf' in the 'etc' directory.
reference |
Species used to align the current dataset with, identified using a 6-character code. |
tree |
Name of the file containing the phylogenetic tree in Newick format, located in the 'etc' directory. |
================
net - net chains
================
usage: forwardgen.py net
directories:
input - chn, size
output - net
Nets are hierarchical clusters of chains. All chains in the 'chn' directory are sorted, prepared and converted using 'chainSort', 'chainPreNet' and 'chainNet'. All nets are stored in the 'net' directory.
=====================
plot - visualize data
=====================
usage: forwardgen.py plot <gene> [--print-friendly]
directories:
input - gene
output - etc
Visualizes all conserved regions within a specific gene in a plot. Images are saved in the 'etc' directory with the name of the gene.
gene |
Name of the gene to create a plot of. |
--print-friendly |
Optional: Changes the colors of the plot to a more printer friendly selection to save ink. |
====================================
regions - cut regions from alignment
====================================
usage: forwardgen.py regions [--cpu INT]
directories:
input - bed, etc
output - algn
Cuts regions of interested out of the multiple alignment using BED tracks and 'mafsInRegion'.
--cpu INT |
Optional: Amount of cores to use for region cutting. Multiple cores allow multiple regions to be cut at once. (Default = 1) |
====================================
remove - remove species from dataset
====================================
usage: forwardgen.py remove <species>
directories:
input - 2bit, axt, chain, fa, sing, size
output - fab
Removes all data of the specified species of the first seven steps to decrease dataset to desired size. Step eight and onwards need to be executed to make all changes take place. A copy of the original FASTA is stored in the 'fab' directory with a '.removed' extension.
species |
Species to be removed from the dataset, identified using a 6-character code. |
===============================
size - calculate scaffold sizes
===============================
usage: forwardgen.py size
directories:
input - fa
output - size
Lists the size of each chromosome/scaffold within each FASTA file in the 'size' directory using 'faSize'.
====================================
tree - create phylogenetic tree file
====================================
usage: forwardgen.py tree <tree>
directories:
input - etc
output - maf
Creates a tree file using 'phyloFit' based on the alignment and provided phylogeny. The tree file is stored as 'phyloFit.mod' in the 'etc' directory.
tree |
Name of the file containing the phylogenetic tree in Newick format, located in the 'etc' directory. |
=================================
violate - count violating species
=================================
usage: forwardgen.py violate <tree> <target> [--traitgap INT] [--minlen INT] [--missing INT]
directories:
input - anc
output - gene
Counts phenotree violations for each gene. Counts are stored as 'violations.txt' in the 'etc' directory. It also creates a phylip format file of each gene in the 'gene' directory which are used by the pipeline to generate plots.
tree |
Name of the file containing the phylogenetic tree in Newick format, located in the 'etc' directory. |
target |
List of trait-loss species, seperated by a comma. |
--traitgap INT |
Optional: Minimum percent between trait groups. (Default = 1) |
--minlen INT |
Optional: Minimum region length in basepairs. Due to the ancestral step some regions may have become very short after the filter step, resulting in false postives. (Default = 70) |
--missing INT |
Optional: Maximum percent of species that are allowed to have 100% missing data. (Default = 20) |
Violation Algorithm
This section explains how the algorithm works. It will use custom values for demonstration purposes.
Let's assume we have the following species with their percent identity to the ancestral sequence. Trait-loss species are marked as '1' while trait-preserving species are marked as '0'. The order below represents the phenotree.
- Species-A 33.4% 1
- Species-B 42.5% 1
- Species-C 83.4% 1
- Species-D 33.9% 0
- Species-E 43.1% 0
- Species-F 52.8% 0
- Species-G 73.5% 0
- Species-H 90.1% 0
Subsequently the species will be sorted on percentage identity, from lowest to highest:
- Species-A 33.4% 1
- Species-D 33.9% 0
- Species-B 42.5% 1
- Species-E 43.1% 0
- Species-F 52.8% 0
- Species-G 73.5% 0
- Species-C 83.4% 1
- Species-H 90.1% 0
In this example this changes the sequence of numbers: "1,0,1,0,0,0,1,0"
Now there are several violations compared to the phenotree above. The algorhythm aims to find the minimal number of permutations to retrieve the original format (trait-loss on top, trait-preserving on bottom). Every permutation counts as a violation.
In this example we can remove the zero at position 2, 4, 5 and 6. Resulting in "1,1,1,0". This, however, would mean four violations. If instead we would remove the zero in position 2 and the one in position 7 it would result in "1,1,0,0,0,0". In this case it would only take two violations, being the best solution.
If we would take a look at this new pattern it looks like this:
- Species-A 33.4% 1
- Species-B 42.5% 1
- Species-E 43.1% 0
- Species-F 52.8% 0
- Species-G 73.5% 0
- Species-H 90.1% 0
The final rule is that after fixing the sequence of species the gap inbetween the one and zero must be at least 1% by default. 42.5 and 43.1 only differ 0.6% and thus do not match this rule. The algorithm will keep cutting off species from the zeroes until the gap is more than 1%. In this case only one zero has to be cut off, as the next species results in a gap of 10.3%, which is more than enough. Each cut adds to the violation count. Species with no data available (0% identity) are ignored. If too many species have no data (default >= 20%) the gene is discarded.
This example gene has 2 (sorting) + 1 (cutting) = 3 violations.
Back to top
Running pipeline returns TabError
This means you are running the wrong Python version. Please make sure you are using Python 2.7 and not Python 3.
Alignment step returns empty files
The alignment program 'lastz' ignores lower-case characters in FASTA files sometimes created by masking tools. The masking step in the pipeline automatically changes everything in upper-case. When using already masked data the following bash command can be used to change all files in the 'fa' directory to upper-case without changing the headers:
for file in ./fa/*; do
awk '{ if ($0 !~ />/) {print toupper($0)} else {print $0} }' $file > $file.tmp
rm $file
mv $file.tmp $file
done
Chain step returns empty files
Some datasets will yield empty chains and the chaining itself will give errors about the sequence headers. This happens because 'lastz' sees a space and the pipeline ("|") and possibly more symbols as a line end. If the first part of all sequence haders is the same this will result in positions being called from the wrong sequence. For example headers like ">gi|697262783|emb|LK055284.1" and ">gi|634534612|emb|LK055381.2" both end up as "gi", resulting in the loss of difference. Headers like ">ChLGX_61_2016_ATCG" work fine.
A quick fix is to transform all spaces and pipes into an underscore:
for file in ./fa/*; do
tr "| " "_" < $file > $file.tmp
rm $file
mv $file.tmp $file
done
Alignments have to be restarted because the headers changed!
Annotation step maps no regions to genes
This means the FASTA and GFF3 file do not use the same headers and thus can not be used for this process. Make sure your annotation matches your reference data.
Violation step returns "no ancestral sequences found"
During some steps data will be discarded. For example the conserved regions are too small or could not be mapped onto a gene. This is more likely to happen when there is limited data available. Check the 'anc' directory for empty files. Try different settings in the filter step or reconsider the dataset design.
Ancestral step returns no output files / zero genes mapped to nodes
The ancestral sequence program 'prequel' does not always write to every node in your tree. This can also happen to the node you are supposed to fill in. In the terminal prequel will output to which nodes it writes. Find out which nodes are most consistently being written to and choose the one closest to the ancestral node. Use this node as new input and try again.
Back to top