
Drosophila species genomes: Gene duplicate analysis

See this updated analysis: tandy-gi-cshl07-poster

Fig. 1. Proteins: Tandem gene x Predictor effect (PDF)

[Figure: tdgene_predictor_effect]

Fig. 2. Exons: Tandem gene x Predictor effect (PDF)

[Figure: genepairs_dmel_pred4]

Protein analysis: In Figure 1, gene duplicates are measured with BlastP similarity among all predicted proteins. The bar graphs show an interaction of gene predictor type and duplicate distance.

The gene mapping programs (GeneWise is the example shown) and the GleanR combined set, which weights gene mappers highly for Dmel homologs, predict a higher rate of tandem genes for the near-Dmel species when Dmel genes are the mapping source.

Gnomon, which uses an ab initio predictor with protein homology confirmation, shows a more taxonomically neutral number of tandem gene predictions across species.

Details: The bars are grouped by gene predictor, at very near (1 kb), medium near (6 kb - 20 kb) and far (> 20 kb) distances between duplicates. Within each group, the bars are seven Drosophila species, arranged taxonomically from near-Dmel to far-Dmel (Dsec, Dyak, Dere, Dpse, Dmoj, Dvir, Dgri), and colored from dark red to light gold. The Y-axis shows gene duplicate counts. Counts for the far duplicate group are scaled by 1/5 to highlight the tandem duplicate counts.
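The distance grouping and the 1/5 scaling described above can be sketched as follows. This is only an illustration, not the code used for the figure. It assumes that "very near" means up to 1 kb, and it reports distances between 1 kb and 6 kb as "unbinned" because the text does not assign them to a group. The function names (distance_bin, binned_counts) and bin labels are hypothetical.

```python
from collections import Counter

KB = 1000  # base pairs per kilobase


def distance_bin(distance_bp):
    """Assign a duplicate-pair distance (in bp) to a distance group.

    Thresholds follow the figure description; treating "1 kb" as an
    upper bound is an assumption.
    """
    if distance_bp <= 1 * KB:
        return "very_near"
    if 6 * KB <= distance_bp <= 20 * KB:
        return "medium_near"
    if distance_bp > 20 * KB:
        return "far"
    # Between 1 kb and 6 kb: no group is stated in the text.
    return "unbinned"


def binned_counts(distances_bp, far_scale=1 / 5):
    """Count duplicates per group, scaling the far group for plotting."""
    counts = Counter(distance_bin(d) for d in distances_bp)
    plotted = {name: float(n) for name, n in counts.items()}
    if "far" in plotted:
        plotted["far"] *= far_scale
    return plotted
```

For example, five far duplicates would be plotted at the same bar height as one tandem duplicate, which keeps the near groups readable next to the much larger far group.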

The criteria used for protein similarity are reciprocal matches with bitscore >= 150 (e-value <~ 1e-50). Genes have been filtered to remove transposon repeat matches (PilerTE) and alternate transcripts at the same gene location. Detail figures for tandem gene prediction effect: 2-5 gene, and 10-80 gene family size.
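A minimal sketch of the reciprocal-match criterion follows. It is an illustration under stated assumptions, not the pipeline used here: it assumes BlastP hits are available as standard 12-column tab-separated rows (query ID in column 1, subject ID in column 2, bitscore in column 12), and it reads "reciprocal" as "both A-to-B and B-to-A pass the bitscore threshold". The function names are hypothetical, and the transposon and alternate-transcript filters are not shown.

```python
def parse_tabular(lines):
    """Yield (query, subject, bitscore) from 12-column tab-separated rows.

    Assumes the standard tabular BLAST column order; blank lines and
    comment lines starting with '#' are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        yield fields[0], fields[1], float(fields[11])


def reciprocal_pairs(hits, min_bitscore=150.0):
    """Return unordered protein pairs that match in both directions.

    A pair is kept only if A hits B and B hits A, each with
    bitscore >= min_bitscore. Self-hits are ignored.
    """
    passing = {
        (query, subject)
        for query, subject, bitscore in hits
        if query != subject and bitscore >= min_bitscore
    }
    return {tuple(sorted((q, s))) for q, s in passing if (s, q) in passing}
```

Typical use would be reciprocal_pairs(parse_tabular(open_file)), where open_file is an all-against-all BlastP result for one species' predicted proteins.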


Exon analysis: In Figure 2 (A,B,C,D), gene duplicates are measured by ungapped mapping of predicted exons to the genome sequence. The bar graphs show another interaction between gene predictor type and duplicate distance.

The gene mapping programs (GeneWise and GeneMapper) and the GleanR combined set, which selectively weights these mapper predictions, show a roughly equivalent number of nearby duplicate exons across the species.

In Figure 2, only genes with Dros. melanogaster homologs are examined. Four predictors are compared (Gnomon, GleanR, GeneWise and GeneMapper). Results for eight Drosophila genomes are shown in order of taxonomic distance from Dros. melanogaster (left to right).

A dotted horizontal line runs from the Dsec same_near exon count through each bar plot. For GeneWise and GleanR, the same_near count is close to that of Dsec for all species. For GeneMapper, there is a peak at Dpse, a nadir at Dsec, and the others fall in between. For Gnomon, the near-Dmel group (Dsec, Dyak, Dere, with Dana) have similar same_near counts, and the far-Dmel group (Dpse, Dwil, Dmoj and Dgri) show a higher, nearly double count of nearby duplicate exons.

The left bar of the three per species is "same_near". These are exons that are both predicted as part of a gene at that location and found by similarity to be near another predicted gene. The "near_other" bar counts exons with a nearby duplicate that was not predicted by this predictor, but was predicted by another. "near_only" counts exons with nearby matches that were not predicted by any method.
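The three bar categories can be expressed as a small classification rule. The sketch below is one reading of the definitions above, not the analysis code itself. It assumes each nearby duplicate exon location is represented by the set of predictors that called an exon there (an empty set if none did). The function names and the input layout are hypothetical.

```python
from collections import Counter


def exon_category(predicted_by, predictor):
    """Classify one nearby duplicate exon for a given predictor.

    predicted_by: set of predictor names that predicted this exon.
    """
    if predictor in predicted_by:
        return "same_near"   # predicted by this predictor
    if predicted_by:
        return "near_other"  # missed here, predicted by another method
    return "near_only"       # found by similarity only, no prediction


def category_counts(nearby_exons, predictor):
    """Count the three categories over all nearby duplicate exons."""
    return Counter(exon_category(p, predictor) for p in nearby_exons)
```

Under this reading, the near_only count is the same for every predictor within a species, while same_near and near_other shift between predictors.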

Discussion

This tandem duplicate analysis of gene predictors indicates there is a computational bias in gene mapping methods. Detailed examination of cases shows that gene finding methods that rely on gapped homology alignment make computational errors in regions of tandem duplicate genes.

Gene models from the homology-mapped gene calls (e.g. GeneWise, Exonerate, GeneMapper) show roughly 40% to 50% gene model mistakes in areas of tandem gene duplication, using Daphnia and Dros. genomes, and different genome software. Ab initio gene predictions are less subject to this type of error, but do show it. The Gnomon prediction pipeline, which combines ab initio prediction with homology evidence and various post-prediction checks, is least subject to error in tandem duplicate regions.

This common problem with homology mapping is also seen with other methods that use gapped alignments. Where tandems are within 1 kb to 5 kb of each other and have nearly identical exons, the mappers skip over exons to join distant ones, sometimes in complex webs of cross-connected exons. EST assembly software such as PASA makes the same class of tandem region errors (Brian Haas, TIGR.org).

The Drosophila species Glean models are subject to this error roughly in proportion to the weighting of the mapper models in the final set. The error is amplified in the GleanR set, which selected the homology-mapped calls over the ab initio predictions, except for the Gnomon pipeline, which combines methods.

The problem of gene mapping errors in tandem duplicate regions is known, though not widely enough.

Alexander Souvorov (Gnomon main author, pers. comm. 2007)
"Connecting tandem genes is a common plague for all predictors. If we did slightly better, this is due to our "compartmentisation" step (see the document I sent you earlier). It should work well enough for genes with good homology. When the identity drops to 30% or so, things can turn bad quite quickly."
Brian Haas (PASA main author, pers. comm. 2007)
"We've encountered this tandem-gene duplication problem before; it ususally isn't much of a problem for the ab-initio predictors, other than they might merge or split some of them, but genewise will sometimes borrow exons from different genes to piece together its prediction."
S. Chatterji and L. Pachter, Reference based annotation with GeneMapper, Genome Biology 7 (2006), R29, p5.
"The GeneMapper algorithm is unable to account for certain assembly and sequencing errors. For example, we found many cases of duplicated chicken exons, most probably due to errors in the assembly. In such cases there is no way to distinguish between the duplicate exons, and the prediction is made randomly among the duplicates. "

Implications: The gene mappers introduce errors in tandem gene regions, and cause a bias toward the species with closer homology to the mapping source proteins. This bias reduces gene gain estimates and increases gene loss estimates, often strongly, and it affects any other analysis that relies on an unbiased estimate of multi-gene families.


Related links

Daphnia gene duplicates
Celegans, Daphnia, Drosophila, Mouse protein duplicates
Tandem gene analyses

Don Gilbert, Aug 2007, gilbertd@indiana.edu

Developed at the Genome Informatics Lab of Indiana University Biology Department