Methods
Generation of initial gene sets
·
Objective: identify all genes that are homologues of a given COG.
·
COG scoring matrices from CDD were used as the rpsblast search database.
·
Identification of COG families in SwissProt/TrEMBL
and RefSeq virus genomes (1,520,638 entries) using rpsblast.
·
Classification of all genes by their phylogenomic
origin (Level 1: Eubacteria, Archaebacteria, Eukaryotes, viruses; Level 2: proteobacteria,
firmicutes, fungi, metazoans, double-stranded DNA viruses, etc.).
·
Cutoff: a hit was retained only if its bitscore was higher than
that of the lowest-scoring original COG member [CHECK].
·
Identification of COGs with hits to one of the
911 Mimivirus genes.
·
=> 106 Mimivirus genes were found to have a
hit to a COG with a bitscore higher than the cutoff score.
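The cutoff rule above can be sketched as follows. This is a minimal illustration, not the original pipeline: it assumes the rpsblast output has already been parsed into (gene, COG, bitscore) tuples, and the function names, COG identifiers and scores are all hypothetical.

```python
from collections import defaultdict

def cog_cutoffs(member_scores):
    """Per-COG cutoff: the lowest bitscore among the original COG members."""
    return {cog: min(scores) for cog, scores in member_scores.items()}

def filter_hits(hits, cutoffs):
    """Keep hits whose bitscore is strictly higher than the cutoff of their COG."""
    kept = defaultdict(list)
    for gene, cog, bitscore in hits:
        if cog in cutoffs and bitscore > cutoffs[cog]:
            kept[cog].append(gene)
    return dict(kept)

# Hypothetical bitscores of the original COG members against their own COG.
member_scores = {"COG0001": [210.0, 185.5, 96.3], "COG0002": [150.0, 120.0]}

# Hypothetical parsed rpsblast hits: (gene, COG, bitscore).
hits = [
    ("geneA", "COG0001", 101.2),  # above the cutoff of 96.3: kept
    ("geneB", "COG0001", 80.0),   # below the cutoff: dropped
    ("geneC", "COG0002", 120.0),  # equal to the cutoff, not higher: dropped
    ("geneD", "COG0002", 300.0),  # kept
]

selected = filter_hits(hits, cog_cutoffs(member_scores))
```

A hit that merely equals the cutoff is dropped here because the rule is stated as "higher than"; if the intended comparison is inclusive, the `>` would become `>=`.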
Reduction of the gene set
·
Objective: reduce the gene set of each COG family to at most 150 genes
without losing phylogenomic information.
·
STEP1: elimination of all genes that cover less
than 70% of the COG reference gene from CDD (in order not to lose
too many informative positions in the alignment).
·
STEP2: elimination of all "redundant"
genes (more than 80% sequence identity over 80% of the shorter gene's length)
using blastclust.
·
Exception: Mimivirus genes were not eliminated
and at least one representative from each phylogenomic family was kept.
·
If the number of genes remaining after that
step was larger than 150, the procedure was repeated using decreasing homology
levels (70%, 60%, ...).
·
After that cycle, the three highest-scoring
BLAST hits to Mimivirus were added (in order to identify possible cases
of horizontal gene transfer).
·
In a few cases, when the original gene set was
very large and the family very diverse, this reduction process led to gene
sets that were too diverse to yield usable multiple alignments (fewer than 50
ungapped columns). In these cases, the following alternative procedure was
applied:
o
Reduction of the initial gene set to a 70%
homology level as described before.
o
Selection of the 150 best BLAST hits to the
corresponding Mimivirus gene.
o
Manual removal of alignment breaking sequences.
o
Manual addition of genes from distant phylogenomic groups
that were not among the 150 closest genes, wherever this was possible
without breaking the alignment.
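The standard reduction cycle (STEP1, STEP2 and the repetition at decreasing homology levels) can be sketched as below. This is an illustration under stated assumptions rather than the original scripts: `cluster` is a hypothetical callable standing in for a blastclust run, `group_of` is a hypothetical mapping from gene to phylogenomic group, and the `floor` parameter is an assumed safeguard so the loop always terminates (the text does not state a lowest level).

```python
def filter_by_coverage(coverage, min_coverage=0.70):
    """STEP1: keep genes covering at least min_coverage of the COG reference."""
    return [gene for gene, cov in coverage.items() if cov >= min_coverage]

def reduce_gene_set(genes, cluster, group_of, protected=frozenset(),
                    max_genes=150, start=80, step=10, floor=10):
    """STEP2, repeated at decreasing identity levels until the set is small enough.

    cluster(genes, identity) returns a list of clusters (lists of gene ids).
    Returns the reduced gene list and the identity level of the last round.
    """
    current = list(genes)
    identity = start
    while True:
        reduced = []
        for members in cluster(current, identity):
            # Protected genes (e.g. Mimivirus) are never eliminated;
            # otherwise one representative per cluster is kept.
            kept = [g for g in members if g in protected]
            reduced.extend(kept or members[:1])
        # Re-add one gene for any phylogenomic group that was lost entirely.
        present = {group_of[g] for g in reduced}
        for g in current:
            if group_of[g] not in present:
                reduced.append(g)
                present.add(group_of[g])
        current = reduced
        if len(current) <= max_genes or identity - step < floor:
            return current, identity
        identity -= step

def toy_cluster(genes, identity):
    """Hypothetical stand-in for blastclust: singletons at 80%, grouped by first letter below."""
    if identity >= 80:
        return [[g] for g in genes]
    groups = {}
    for g in genes:
        groups.setdefault(g[0], []).append(g)
    return list(groups.values())

genes = ["a1", "a2", "a3", "b1", "b2", "m1"]
group_of = {"a1": "bacteria", "a2": "bacteria", "a3": "bacteria",
            "b1": "eukaryotes", "b2": "eukaryotes", "m1": "viruses"}
reduced, level = reduce_gene_set(genes, toy_cluster, group_of,
                                 protected={"m1"}, max_genes=4)
```

With the toy data the 80% round removes nothing, so the loop drops to 70%, where the two larger groups collapse to one representative each while the protected gene `m1` is retained.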
Phylogenetic tree reconstruction
·
Objective: computation
of bootstrapped phylogenetic trees.
·
Computation of a multiple alignment using muscle.
·
Removal of gapped columns (seq_reformat) and conversion to Phylip
format (ClustalW).
·
Visual inspection of all multiple alignments and removal
of alignment-breaking sequences (mostly remote homologues, fragmented genes
or genes with large deletions). In these cases the alignments were recomputed
as described before.
·
100 bootstrapped trees were computed using the
Phylip programs seqboot, protdist, neighbor and consense (default options).
·
A non-bootstrapped neighbor-joining tree was
computed and rerooted using retree. The bootstrap values from the consense output were reported
on this tree.
·
Branches were colored according to the phylogenomic
origin of the corresponding genes and labeled with the original annotations
from SwissProt.
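Two of the steps above, removal of gapped columns and bootstrap resampling, can be illustrated with a short sketch. It is not the code of seq_reformat or seqboot: it only shows the underlying idea, assumes that a column counts as gapped if any sequence has a gap in it, and uses hypothetical function names and a toy alignment.

```python
import random

def remove_gapped_columns(alignment, gap="-"):
    """Drop every alignment column in which at least one sequence has a gap."""
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if gap not in col]
    return {name: "".join(col[i] for col in kept)
            for i, name in enumerate(names)}

def bootstrap_replicate(alignment, rng):
    """Resample columns with replacement, keeping the original alignment length."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}

# Toy alignment (hypothetical sequences of equal length).
alignment = {"seq1": "MK-LV", "seq2": "MKALV", "seq3": "MR-LI"}

ungapped = remove_gapped_columns(alignment)

# 100 replicates, one seeded generator per replicate for reproducibility.
replicates = [bootstrap_replicate(ungapped, random.Random(i)) for i in range(100)]
```

The number of columns left in `ungapped` is the quantity behind the usability criterion in the reduction step (fewer than 50 ungapped columns); each replicate would then be passed to a distance and tree-building step, and the resulting trees summarised in a consensus.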