2000 new D. melanogaster genes and coding exons
identified with phylogenetic comparisons
The data file all_caf1_DGIL_TEX.gff lists the
predicted Dmel coding exons that are not part of known genes, along with
their cross-species matches and predictions. These predicted exons are
provided in the dmel-predexons (set 1, see below) and dmel-predexoninside
(set 2) files, which also include additional predicted exons that did not
meet the criteria.
The dmel-*-texon.html files show these data
per Dmel chromosome, in tabular form with links to GBrowse map views
for each exon. A summary of methods and results follows.
Examples, shown in GBrowse aligned-genome views:
-- two new multi-exon X genes:
texon7201..texon7214 [R], texon7216..texon7222 [C]
[R = RhoGAP1A; C = CG17707?]
-- two new genes between 3L rho and stet:
texon1943, texon1946
-- two new genes above 2R dpr:
texon1660,61 and texon1664
-- new Ubx exon, texon11429;
new Antp intronic gene, texon11244,45 [see GenBank:BK002361];
and new Abd-B exon, texon11435
Name Last modified Size Description
newgenes-summary.txt 28-Aug-2006 18:14 4k
dmel-X-texon.html 23-Aug-2006 21:11 224k
dmel-U-texon.html 23-Aug-2006 21:11 329k
dmel-4-texon.html 23-Aug-2006 21:11 12k
dmel-3R-texon.html 23-Aug-2006 21:11 263k
dmel-3L-texon.html 23-Aug-2006 21:11 223k
dmel-2R-texon.html 23-Aug-2006 21:11 203k
dmel-2L-texon.html 23-Aug-2006 21:11 159k
all_caf1_DGIL_TEX.gff 23-Aug-2006 20:35 8.4M
dmel-predexons.fa 23-Aug-2006 17:46 734k
dmel-predexoninside.fa 23-Aug-2006 12:26 159k
dmel-predexoninside.gff 23-Aug-2006 12:25 68k
dmel-predexons.gff 22-Aug-2006 23:08 250k
Around 2000 new D. melanogaster genes and exons were identified using
phylogenetic comparison of protein-coding region matches.
Newly predicted exons in Dmel, both outside of and inside of known genes,
were selected and BLAST-compared to other species. Dmel predicted
exons are classified as valid if they have significant similarity and
a high rate of synonymous relative to nonsynonymous substitution
(low Ka_Ks) in one or more other species.
About 1800 new exons were found outside of known gene regions, and 300 inside
known gene regions. Classification as a new gene or as a new exon of an
existing gene is not yet done. These will probably resolve into (a) new genes
of 2-3 exons and (b) alternatively spliced exons of known genes. This is a
preliminary rather than exhaustive search for new Dmel genes. Additional new
exons can likely be found by including Dmel-sibling gene predictions.
-- Don Gilbert, August 2006
Average cross-species coding exon statistics.

Species  N.exon  Bitscore  AlignIdent%  Ka_Ks
----------------------------------------------
dmel       2173        --           --     --
dsec      13476       539         94.8  0.660
dsim       6300       453         94.7  0.639
dyak       6590       393         93.5  0.610
dere       3870       310         92.6  0.593
dana       4953        90         90.1  0.382
dpse       4057        67         93.6  0.235
dmoj       3217        61         93.6  0.190
----------------------------------------------
Success rate of prediction methods at matching new Dmel exons
in the above non-Dmel species.

Species  Method         New Dmel exon prediction matches
--------------------------------------------------------
all      DGIL_SNO       71.2%
all      DGIL_SNP       69.3%
all      RGUI_GID_mRNA  53.5%
all      GLEAN          52.5%
all      BREN_NSC       50.4%
all      GLEANR         44.2%
all      BATZ_CNA       42%
all      NCBI_GNO       37.7%
all      none           15.4%
all      EISE_CEX       11.1%
all      OXFD_GPI       7.5%
all      EISE_CGW       7.3%
all      EISE_CGM       7%
all      PACH_GMP       6.2%
all      JIGSAW         4.5%
all      Total HSPs     44648 (exons matched in all species)
--------------------------------------------------------
The new gene data are available at the URL below as tables per chromosome
arm with map views, and in GFF format (Dmel plus cross-species matches).
http://insects.eugenes.org/DroSpeGe/data/dmel-dspp/newgenes/
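As a quick check on a downloaded GFF file, the sketch below counts features
per sequence (chromosome arm). It is a minimal example, not part of the
original pipeline: it assumes only the standard tab-separated GFF column
layout (sequence ID in column 1, feature type in column 3). The function name
count_features_by_seqid is hypothetical, and the feature type names actually
used in these files are not stated here, so the filter is optional.

```python
from collections import Counter


def count_features_by_seqid(lines, feature_type=None):
    """Count GFF features per sequence ID (column 1).

    lines        -- any iterable of text lines, e.g. an open file handle
    feature_type -- optional column-3 value to keep; None keeps every feature
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines such as "##gff-version".
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # A GFF record has 8 mandatory columns plus an attributes column.
        if len(cols) < 8:
            continue
        seqid, ftype = cols[0], cols[2]
        if feature_type is not None and ftype != feature_type:
            continue
        counts[seqid] += 1
    return counts


# Possible usage, assuming the file was downloaded to the working directory:
#
#     with open("dmel-predexons.gff") as handle:
#         for seqid, n in sorted(count_features_by_seqid(handle).items()):
#             print(seqid, n)
```

Passing an open file handle keeps memory use flat for the larger
all_caf1_DGIL_TEX.gff file, since lines are read one at a time.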
METHODS
-------
1. Select Dmel gene predicted coding exons that are outside of known
genes or transposons, and have consensus of two or more predictors
(set 1, ID texon1..texon9000), and predicted exons from 2+ predictors
inside gene boundaries but outside known exons or transposons (set 2, ID
texon10000..12000).
The DroSpeGe BioMart database was used for this, created
from Drosophila CAF1 assemblies and annotations. The web
interface allows this selection in part
(http://insects.eugenes.org/BioMart/martview/).
The FlyBase Dmel release 4.2 gene/feature set was used for known
features. Note that some genes added as known in release 4.3 are
also found in these results. Current DroSpeGe map views
display these, and a later update will remove them as texons.
Coding exon predictions for Dmel are drawn from these groups:
exon_DGIL_SNO CDS_BATZ_CON CDS_RGUI_GID CDS_NCBI_GNO
Extract GFF and FASTA sequence for these Dmel predicted new exons.
Set 1 predicted exons are in files dmel-predexons.gff and .fa,
and set 2 exons are in dmel-predexoninside.gff and .fa.
2. BLASTn the new exons against each species genome, using -e 1e-3.
Parse the BLAST alignment output of significant matches for conserved
and changed positions. Calculate the Ka/Ks ratio from alignment
mismatches (excluding gaps), treating mismatches at exon codon
positions 1 and 2 as amino-acid-changing (nonsynonymous) and
mismatches at position 3 as synonymous.
3. Select putative new Dmel exons with at least one species
showing significant alignment and Ka/Ks < 1. Convert to GFF
for all species with exons matching, and extract overlapping
exon predictions in those species. These results are in
file all_caf1_DGIL_TEX.gff, sorted by texon ID and species.
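The codon-position Ka/Ks approximation in step 2 and the selection rule in
step 3 can be sketched as follows. This is an illustration under stated
assumptions, not the code used to produce these files: the function names are
hypothetical, the two inputs are assumed to be the gapped query (Dmel exon)
and subject strings of one BLAST alignment, the codon frame is assumed to be
given as a GFF-style phase of the first query base, and mismatch counts are
assumed to be normalized by the number of aligned sites in each position
class (the text above does not say how the ratio was normalized).

```python
def ka_ks_proxy(query, subject, phase=0):
    """Approximate Ka/Ks from one gapped pairwise alignment.

    Mismatches at codon positions 1 and 2 are counted as nonsynonymous,
    mismatches at codon position 3 as synonymous. Gapped columns are skipped.
    phase is the number of leading query bases before the first full codon.
    Returns None when the ratio is undefined (no synonymous mismatches or
    no aligned sites in one of the classes).
    """
    if len(query) != len(subject):
        raise ValueError("aligned strings must have equal length")
    sites = [0, 0]  # aligned sites: [positions 1 and 2, position 3]
    diffs = [0, 0]  # mismatches in the same two classes
    qpos = 0        # index of the current base in the ungapped query
    for q, s in zip(query.upper(), subject.upper()):
        if q == "-":
            continue  # gap in query: no exon base, frame does not advance
        codon_index = (qpos - phase) % 3  # 0, 1 -> positions 1, 2; 2 -> position 3
        qpos += 1
        if s == "-":
            continue  # gap in subject: excluded from the counts
        kind = 1 if codon_index == 2 else 0
        sites[kind] += 1
        if q != s:
            diffs[kind] += 1
    if sites[0] == 0 or sites[1] == 0 or diffs[1] == 0:
        return None
    ka = diffs[0] / sites[0]
    ks = diffs[1] / sites[1]
    return ka / ks


def passes_selection(ratios):
    """Step 3 rule: keep an exon if any species match has Ka/Ks < 1.

    ratios holds one ka_ks_proxy() result per significant species match.
    """
    return any(r is not None and r < 1.0 for r in ratios)
```

Returning None for an undefined ratio, rather than zero or infinity, keeps
alignments with no synonymous mismatches from passing or failing the Ka/Ks < 1
test by accident; how the original analysis handled that case is not stated.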