Drosophila species genomes: Gene duplicate analysis
Fig. 1. Proteins: Tandem gene x Predictor effect
Fig. 2. Exons: Tandem gene x Predictor effect
In Figure 1, gene duplicates are measured with BlastP similarity
among all predicted proteins. The bar graphs show an interaction
of gene predictor type and duplicate distance.
The gene mapping programs (Genewise example) and the GleanR
combined set, produced by weighting gene mappers highly for Dmel
homologs, show a higher rate of tandem genes predicted for the
near-Dmel species, using Dmel-genes as the mapping source.
The predictor Gnomon, using an ab-initio predictor with
protein homology confirmations, shows a more taxonomically
neutral number of tandem gene predictions across species.
The bar groupings are for gene predictors, at very near (1 kb),
medium near (6 kb - 20 kb) and far (> 20 kb) distances between
duplicate genes.
The intra-group bars are seven Drosophila species, taxonomically
arranged from near-Dmel to far-Dmel (Dsec, Dyak, Dere, Dpse, Dmoj,
Dvir, Dgri), and colored from dark-red to light-gold.
The Y-axis shows gene duplicate counts.
Far duplicate group counts are scaled at 1/5 to highlight
tandem duplicate counts.
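The distance grouping and far-count scaling described above can be
sketched as follows. This is a minimal illustration, not the code used
for the figure: the function names are hypothetical, "very near (1 kb)"
is assumed to mean within 1 kb, and distances between 1 kb and 6 kb
are left unclassified because the text does not assign them to a group.

```python
from collections import Counter

def distance_class(distance_bp):
    """Return the distance group for a duplicate pair, or None.

    Bin edges follow the figure text; treating "1 kb" as "<= 1 kb"
    is an assumption.
    """
    if distance_bp <= 1_000:
        return "very_near"
    if 6_000 <= distance_bp <= 20_000:
        return "medium_near"
    if distance_bp > 20_000:
        return "far"
    return None  # 1 kb - 6 kb is not described in the text

def plotted_counts(distances_bp, far_divisor=5):
    """Count duplicates per group, scaling the far group to 1/5."""
    classes = (distance_class(d) for d in distances_bp)
    counts = dict(Counter(c for c in classes if c is not None))
    if "far" in counts:
        counts["far"] = counts["far"] / far_divisor
    return counts
```

Dividing only the far group keeps the tandem (near) bars readable on
the same Y-axis, which is the stated purpose of the scaling.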
The criteria used for protein similarity are reciprocal matches with
bitscore >= 150 (e-value <~ 1e-50).
Genes have been filtered to remove transposon repeat matches (PilerTE),
and remove alternate transcripts at the same gene location.
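As an illustration of the similarity criterion, the sketch below keeps
protein pairs that match in both directions at bitscore >= 150. It
assumes hits have already been parsed into (query, subject, bitscore)
tuples, and it reads "reciprocal" as "both directions pass the
threshold" rather than reciprocal best hit; both are assumptions, and
the function name is hypothetical.

```python
def reciprocal_pairs(hits, min_bitscore=150.0):
    """Return unordered protein pairs with passing hits in both directions.

    hits: iterable of (query_id, subject_id, bitscore) tuples.
    Self-hits are ignored.
    """
    passing = {
        (query, subject)
        for query, subject, bitscore in hits
        if query != subject and bitscore >= min_bitscore
    }
    # Keep a pair only if the reverse hit also passed the threshold.
    return {
        tuple(sorted((query, subject)))
        for query, subject in passing
        if (subject, query) in passing
    }
```

The transposon and alternate-transcript filters would be applied to the
protein set before this step, so that each gene location contributes a
single protein.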
Detail figures for tandem gene prediction effect:
2-5 gene and 10-80 gene family size.
In Figure 2 (A,B,C,D), gene duplicates are measured by ungapped
mapping of predicted exons to the genome sequence.
The bar graphs show another interaction with gene predictor
type and duplicate distance.
The gene mapping programs (Genewise and GeneMapper) and the
GleanR combined set which selectively weights these mapper
predictions, show a roughly equivalent number of nearby
duplicate exons across the species.
In this Figure 2, only genes with Dros. melanogaster homologs
are examined. Four predictors are compared (Gnomon, GleanR,
Genewise and GeneMapper). Results for eight Drosophila genomes are
shown in order of taxonomic distance from Dros. mel. (left to right).
A dotted horizontal line runs from the Dsec same_near exon count
through each bar plot. For Genewise and GleanR, the same_near
count is close to Dsec for all species. For GeneMapper, there
is a peak with Dpse, a nadir for Dsec, and others in between.
For Gnomon, the near-Dmel group (Dsec, Dyak, Dere with Dana)
have similar same_near counts, and the far-Dmel group (Dpse,
Dwil, Dmoj and Dgri) show a higher, near double count of nearby
duplicate exons.
The left bar of three per species is "same_near". These are exons
that are both predicted as part of a gene at that location, and
found by similarity as nearby to another predicted gene.
"Near_other" bar means an exon has a nearby duplicate that
wasn't predicted by this predictor, but was by another predictor.
"near_only" is for those exons with nearby matches not predicted
by any method.
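The three exon categories can be expressed as a small decision rule.
The sketch below is one reading of the definitions above, not the
original analysis code: for each exon of a given predictor that has a
nearby similar match, it assumes we know which predictors called a
gene at the matched location. Function and argument names are
hypothetical.

```python
from collections import Counter

def exon_category(predictor, duplicate_callers):
    """Classify one exon that has a nearby similar match.

    predictor: the predictor whose exon is being examined.
    duplicate_callers: predictors that called a gene at the
        nearby matching location (may be empty).
    """
    callers = set(duplicate_callers)
    if predictor in callers:
        return "same_near"   # duplicate predicted by this predictor
    if callers:
        return "near_other"  # duplicate predicted only by other predictors
    return "near_only"       # duplicate not predicted by any method

def tally(predictor, exons):
    """Count categories over exons; each item is that exon's duplicate_callers."""
    return Counter(exon_category(predictor, callers) for callers in exons)
```

For example, tally("Gnomon", ...) over the exons of one species gives
the three bar heights for that species in the Gnomon panel.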
This tandem duplicate analysis of gene predictors indicates
there is a computational bias in gene mapping methods.
Detailed examination of cases shows that gene finding methods
that rely on gapped homology alignment are making computational
errors in regions of tandem duplicate genes.
Gene models from the homology mapped gene calls (e.g. GeneWise,
Exonerate, GeneMapper) show roughly 40% to 50% gene model
mistakes in areas of tandem gene duplication, using Daphnia and
Dros. genomes, and different genome software. Ab initio gene
predictions are less subject to this type of error, but do show
it. The Gnomon prediction pipeline, combining ab initio with
homology evidence and various post prediction checks is least
subject to error in tandem duplicate regions.
This common problem with homology mapping is also seen with other
methods that use gapped alignments. Where tandems are within 1 kb -
5 kb of each other, and have near identical exons, the mappers
skip over exons to join distant ones, sometimes in complex webs
of cross-connected exons. EST assembly software such as PASA
makes the same class of tandem region errors (Brian Haas,
pers. comm. 2007).
The Drosophila species Glean models are subject to this error roughly
in proportion to the weighting of the mapper models in its final
set. This error is amplified in the GleanR set, which selected
the homology mapped calls over the ab-initio prediction,
excepting the Gnomon pipeline that combines methods.
This problem of errors in using gene mapping in tandem duplicate
regions is known, if not widely enough.
- Alexander Souvorov (Gnomon main author, pers. comm. 2007)
"Connecting tandem genes is a common plague for all predictors.
If we did slightly better, this is due to our
"compartmentisation" step (see the document I sent you
earlier). It should work well enough for genes with good
homology. When the identity drops to 30% or so, things can turn
bad quite quickly."
- Brian Haas (PASA main author, pers. comm. 2007)
"We've encountered this tandem-gene duplication problem before;
it usually isn't much of a problem for the ab-initio
predictors, other than they might merge or split some of them,
but genewise will sometimes borrow exons from different genes
to piece together its prediction."
- S. Chatterji and L. Pachter, Reference based annotation with
GeneMapper, Genome Biology 7 (2006), R29, p5.
"The GeneMapper algorithm is unable to account for certain
assembly and sequencing errors. For example, we found
many cases of duplicated chicken exons, most probably due to
errors in the assembly. In such cases there is no way to distinguish
between the duplicate exons, and the prediction is
made randomly among the duplicates."
The gene mappers introduce errors in tandem gene regions,
and cause a bias toward the species with closer homology to
the mapping source proteins.
This bias reduces gene gain estimates and inflates gene loss
estimates, often strongly, and distorts any other analysis that
relies on an unbiased estimate of multi-gene families.
Daphnia gene duplicates,
Celegans, Daphnia, Drosophila, Mouse protein duplicates,
Tandem gene analyses
Don Gilbert, Aug 2007, firstname.lastname@example.org