
Methods

 

 

Generation of initial gene sets

 

·        Objective: Identify all genes that are homologues of a given COG.

·        We use COG matrices from CDD for rpsblast.

·        Identification of COG families in SwissProt/TrEMBL and RefSeq Virus genomes (1,520,638 entries) using rpsblast.

·        Classification of all genes by their phylogenomic origin (Level 1: Eubacteria, Archaebacteria, Eukaryotes, viruses; Level 2: proteobacteria, firmicutes, fungi, metazoans, double-stranded DNA viruses, etc.).

·        Cutoff: a hit is retained if its bitscore is higher than the bitscore of the lowest-scoring original COG member [CHECK].

·        Identification of COGs with hits to one of the 911 Mimivirus genes.

·        => 106 Mimivirus genes were found to have a hit to a COG with a bitscore higher than the cutoff score.
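
The cutoff rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used for the analysis: the function names `cog_cutoffs` and `filter_hits` and the tuple layouts are hypothetical, and the rpsblast output is assumed to have already been parsed into (gene, COG, bitscore) records. Note that the cutoff rule itself is marked [CHECK] above.

```python
from collections import defaultdict


def cog_cutoffs(member_hits):
    """Return the lowest bitscore seen per COG among its original members.

    member_hits: iterable of (cog_id, bitscore) pairs, one per original
    COG member scored against its own COG profile (assumed input layout).
    """
    cutoffs = {}
    for cog_id, bitscore in member_hits:
        if cog_id not in cutoffs or bitscore < cutoffs[cog_id]:
            cutoffs[cog_id] = bitscore
    return cutoffs


def filter_hits(hits, cutoffs):
    """Keep hits whose bitscore is strictly above the cutoff of their COG.

    hits: iterable of (gene_id, cog_id, bitscore) records (assumed layout).
    Hits to a COG without a known cutoff are discarded.
    """
    kept = defaultdict(list)
    for gene_id, cog_id, bitscore in hits:
        cutoff = cutoffs.get(cog_id)
        if cutoff is not None and bitscore > cutoff:
            kept[cog_id].append(gene_id)
    return dict(kept)
```

Applied to the Mimivirus genes only, counting the distinct gene identifiers in the result would give the number of genes assigned to a COG.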

 

Reduction of the gene set

 

·        Objective: reduce the gene set of each COG family to at most 150 genes, without losing phylogenomic information.

·        STEP1: elimination of all genes that cover less than 70% of the COG reference gene from CDD (in order not to lose too many informative positions in the alignment).

·        STEP2: elimination of all "redundant" genes (more than 80% sequence identity over 80% of the shorter gene's length) using blastclust.

·        Exception: Mimivirus genes were not eliminated and at least one representative from each phylogenomic family was kept.

·        If the number of genes remaining after that step was larger than 150, the procedure was repeated using decreasing homology levels (70%, 60%, ...).

·        After that cycle, the three highest scoring BLAST hits to Mimivirus were added (in order to identify possible cases of horizontal gene transfer).

·        In a few cases, when the original gene set was very large and the family very diverse, this gene reduction process led to gene sets that were too diverse to yield usable multiple alignments (fewer than 50 ungapped columns). In these cases, the following alternative procedure was applied:

o       Reduction of the initial gene set to a 70% homology level as described before.

o       Selection of the 150 best BLAST hits to the corresponding Mimivirus gene.

o       Manual removal of alignment-breaking sequences.

o       Manual addition of genes from distant phylogenies that were not among the 150 closest genes, where this was possible without breaking the alignment.
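
The main reduction cycle (STEP1, STEP2 and the repeat at lower identity levels) can be sketched as follows. This is a rough illustration and not the original pipeline: the greedy pass in `drop_redundant` is a simple stand-in for blastclust rather than its clustering algorithm, `is_redundant` is a hypothetical callback that would wrap the pairwise identity and length test, the dictionary keys are invented for the example, and the threshold list below 60% is an assumption since the text only gives "70%, 60%, ...". Keeping Mimivirus genes in STEP1 as well is one reading of the exception stated above.

```python
def drop_redundant(genes, is_redundant, threshold):
    """Greedy redundancy filter at one identity threshold (stand-in for blastclust).

    Mimivirus genes are always kept, and a redundant gene is still kept if
    its phylogenomic family has no representative yet.
    """
    # Stable sort: Mimivirus genes are processed first, so they are never dropped.
    ordered = sorted(genes, key=lambda g: not g["is_mimivirus"])
    representatives = []
    for gene in ordered:
        if gene["is_mimivirus"]:
            representatives.append(gene)
            continue
        family_present = any(r["family"] == gene["family"] for r in representatives)
        redundant = any(is_redundant(gene, r, threshold) for r in representatives)
        if redundant and family_present:
            continue
        representatives.append(gene)
    return representatives


def reduce_gene_set(genes, is_redundant, max_genes=150,
                    thresholds=(80, 70, 60, 50, 40, 30)):
    """Apply the coverage filter, then lower the identity threshold until
    at most max_genes remain or the thresholds are exhausted.

    genes: list of dicts with keys 'id', 'coverage' (fraction of the COG
    reference covered), 'family' and 'is_mimivirus' (hypothetical layout).
    """
    # STEP1: drop genes covering less than 70% of the COG reference gene.
    kept = [g for g in genes if g["is_mimivirus"] or g["coverage"] >= 0.70]
    # STEP2 and repeats: remove redundant genes at decreasing identity levels.
    for threshold in thresholds:
        kept = drop_redundant(kept, is_redundant, threshold)
        if len(kept) <= max_genes:
            break
    return kept
```

The three best BLAST hits to Mimivirus would then be added to the returned list, and the alternative procedure would only be applied if the resulting alignment turned out to be unusable.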

 

Phylogenetic tree reconstruction

 

·        Objective: computation of bootstrapped phylogenetic trees.

·        Computation of a multiple alignment using muscle.

·        Removal of gapped columns (seq_reformat) and conversion to Phylip format (ClustalW).

·        Visual inspection of all multiple alignments and removal of alignment-breaking sequences (mostly remote homologues, fragmented genes, or genes with large deletions); in these cases the alignments were recomputed as described before.

·        100 bootstrapped trees were computed using the Phylip programs seqboot, protdist, neighbor and consense (default options).

·        A non-bootstrapped neighbor-joining tree was computed and rerooted using retree. The bootstrap values from the consense output were reported on this tree.

·        Branches were colored according to the phylogenomic origin of the corresponding genes and labeled with the original annotations from SwissProt.
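
As an illustration of the gapped-column removal step and of the "fewer than 50 ungapped columns" criterion used in the gene set reduction, here is a minimal Python sketch. It is not a replacement for seq_reformat and does not reproduce its options: the function names and the dictionary input (sequence name mapped to aligned sequence) are assumptions made for the example.

```python
def remove_gapped_columns(alignment, gap_chars="-."):
    """Return the alignment restricted to columns that contain no gap.

    alignment: dict mapping sequence name to aligned sequence; all
    sequences must have the same length.
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if len({len(seq) for seq in seqs}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    # Indices of columns in which no sequence carries a gap character.
    keep = [i for i, column in enumerate(zip(*seqs))
            if not any(residue in gap_chars for residue in column)]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in zip(names, seqs)}


def is_usable(alignment, min_columns=50):
    """True if at least min_columns ungapped columns remain."""
    ungapped = remove_gapped_columns(alignment)
    length = len(next(iter(ungapped.values()), ""))
    return length >= min_columns
```

For example, `remove_gapped_columns({"s1": "MK-LV", "s2": "MKALV", "s3": "M-ALV"})` keeps only the first, fourth and fifth columns, so every sequence becomes `"MLV"`.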

 


Contact: Hiro Ogata and Karsten Suhre

Last modification: 25 May 2005