
2000 new D. melanogaster genes and coding exons
identified with phylogenetic comparisons

The data file all_caf1_DGIL_TEX.gff lists the predicted Dmel coding exons that are not part of known genes, together with their cross-species matches and predictions. These predicted exons are provided in the dmel-predexons (set 1, see below) and dmel-predexoninside (set 2) files, which also include additional predicted exons that did not meet the criteria. The dmel-*-texon.html files show this data per Dmel chromosome, in tabular form with links to GBrowse map views for each exon. A summary of methods and results follows below.

GBrowse aligned genomes views of examples.
-- two new multi-exon X genes: texon7201 .. texon7214 [R], texon7216..texon7222 [C] [R= RhoGAP1A; C= CG17707?]
-- two new genes between 3L rho and stet: texon1943, texon1946
-- two new genes above 2R dpr: texon1660,61 texon1664
-- new Ubx exon, texon11429; new Antp intronic gene, texon11244,45 [see GenBank:BK002361]; and new Abd-B exon, texon11435

Name                        Last modified        Size
------------------------------------------------------
newgenes-summary.txt        28-Aug-2006 18:14      4k
dmel-X-texon.html           23-Aug-2006 21:11    224k
dmel-U-texon.html           23-Aug-2006 21:11    329k
dmel-4-texon.html           23-Aug-2006 21:11     12k
dmel-3R-texon.html          23-Aug-2006 21:11    263k
dmel-3L-texon.html          23-Aug-2006 21:11    223k
dmel-2R-texon.html          23-Aug-2006 21:11    203k
dmel-2L-texon.html          23-Aug-2006 21:11    159k
all_caf1_DGIL_TEX.gff       23-Aug-2006 20:35    8.4M
dmel-predexons.fa           23-Aug-2006 17:46    734k
dmel-predexoninside.fa      23-Aug-2006 12:26    159k
dmel-predexoninside.gff     23-Aug-2006 12:25     68k
dmel-predexons.gff          22-Aug-2006 23:08    250k
------------------------------------------------------


Around 2000 new D. melanogaster genes and exons were identified using
phylogenetic comparison of protein-coding region matches.

Newly predicted exons in Dmel, both outside and inside known genes,
were selected and BLAST-compared to other species.  Dmel predicted
exons are classified as valid if they have significant similarity and
a high synonymous substitution rate (low Ka_Ks) in one or more other
species.

About 1800 new exons were found outside of known gene regions, and 300 inside
known gene regions.  Classification as new gene or new exon of an existing
gene is not yet done.  Probably these will boil down to (a) new genes of 2-3 exons
each and (b) alternate splice exons of known genes. This is a preliminary rather
than exhaustive search for new Dmel genes.  Additional new exons can likely be found
by including Dmel-sibling gene predictions.

-- Don Gilbert, August 2006

Average cross-species coding exon statistics.

Species N.exon Bitscore  AlignIdent% Ka_Ks
------------------------------------------
dmel    2173        --      --        --
dsec   13476       539     94.8      0.660
dsim    6300       453     94.7      0.639
dyak    6590       393     93.5      0.610
dere    3870       310     92.6      0.593
dana    4953        90     90.1      0.382
dpse    4057        67     93.6      0.235
dmoj    3217        61     93.6      0.190
------------------------------------------

Success rate of each prediction method at matching new exons
in the above non-Dmel species.
                     New Dmel exon 
Species Method       prediction matches
------------------------------------------
all     DGIL_SNO        71.2%
all     DGIL_SNP        69.3%
all     RGUI_GID_mRNA   53.5%
all     GLEAN           52.5%
all     BREN_NSC        50.4%
all     GLEANR          44.2%
all     BATZ_CNA        42%
all     NCBI_GNO        37.7%
all     none            15.4%
all     EISE_CEX        11.1%
all     OXFD_GPI         7.5%
all     EISE_CGW         7.3%
all     EISE_CGM         7%
all     PACH_GMP         6.2%
all     JIGSAW           4.5%
all     Total HSPs     44648 (exons matched in all species)
------------------------------------------

The new gene data are available here as tables per chromosome arm with map
views, and in GFF format with Dmel and cross-species matches:
http://insects.eugenes.org/DroSpeGe/data/dmel-dspp/newgenes/


METHODS
-------

1. Select Dmel predicted coding exons that are outside of known
genes or transposons and have a consensus of two or more predictors
(set 1, ID texon1..texon9000), and predicted exons from 2+ predictors
inside gene boundaries but outside known exons or transposons (set 2, ID
texon10000..12000).
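
The set 1 selection in step 1 can be illustrated with a minimal Python
sketch. This is not the BioMart query that was actually used; the Feature
record, the overlaps() test and the select_candidates() function are
hypothetical names introduced for illustration, and "consensus" is assumed
here to mean that predictions from at least two distinct predictors overlap
on the same chromosome arm.

```python
from typing import NamedTuple

class Feature(NamedTuple):
    """A genome interval; coordinates are 1-based inclusive, as in GFF."""
    chrom: str
    start: int
    end: int
    source: str  # predictor name, or a label such as "known" (hypothetical)

def overlaps(a, b):
    """True if two features share at least one base on the same arm."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end

def select_candidates(predicted, known, min_predictors=2):
    """Keep predicted exons that overlap no known feature and are
    supported by at least min_predictors distinct predictors."""
    selected = []
    for exon in predicted:
        # Discard anything touching a known gene or transposon.
        if any(overlaps(exon, k) for k in known):
            continue
        # Count the distinct predictors with an overlapping prediction
        # (the exon itself is included in this count).
        supporters = {p.source for p in predicted if overlaps(exon, p)}
        if len(supporters) >= min_predictors:
            selected.append(exon)
    return selected
```

The sketch returns every supported prediction separately; merging
overlapping predictions into a single texon record is left out.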

The DroSpeGe BioMart database, created from Drosophila CAF1
assemblies and annotations, was used for this. The web
interface allows part of this selection
(http://insects.eugenes.org/BioMart/martview/).

The FlyBase Dmel release 4.2 gene/feature set was used for known
features. Note that some genes added as known in release 4.3 are
also found in these results.  Current DroSpeGe map views
display these, and a later update will remove them from the texon set.

Coding exon predictions for Dmel are drawn from these groups: 
exon_DGIL_SNO CDS_BATZ_CON CDS_RGUI_GID CDS_NCBI_GNO 

Extract GFF and FASTA sequence for these Dmel predicted new exons.
Set 1 predicted exons are in files dmel-predexons.gff and dmel-predexons.fa,
and set 2 in dmel-predexoninside.gff and dmel-predexoninside.fa.

2. BLASTn the new exons against the other species' genomes, using
-e 1e-3 (E-value cutoff).  Parse the BLAST alignment output of
significant matches for conserved and changed positions.  Calculate
the Ka/Ks ratio from alignment mismatches (excluding gaps), counting
changes at exon codon positions 1 and 2 as amino-acid (nonsynonymous)
changes and changes at position 3 as synonymous changes.
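
The codon-position approximation of Ka/Ks in step 2 could look roughly
like the Python sketch below. The function name ka_ks_from_alignment and
its frame_offset parameter are hypothetical, and the normalization
(mismatches per compared site in each class) is an assumption, since the
exact formula used for this data set is not given here.

```python
def ka_ks_from_alignment(query, subject, frame_offset=0):
    """Approximate Ka/Ks from one gapped pairwise alignment.

    query, subject: aligned strings of equal length, '-' marks a gap.
    frame_offset:   codon position (0, 1 or 2) of the first query base.
    Returns None when the ratio is undefined (no sites or Ks == 0).
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    sites = [0, 0, 0]   # compared sites per codon position
    diffs = [0, 0, 0]   # mismatches per codon position
    pos = frame_offset  # running index over ungapped query bases
    for q, s in zip(query.upper(), subject.upper()):
        if q == "-":
            continue            # insertion in subject: no query codon position
        codon_pos = pos % 3
        pos += 1
        if s == "-":
            continue            # deletion in subject: excluded, frame still advances
        sites[codon_pos] += 1
        if q != s:
            diffs[codon_pos] += 1
    nonsyn_sites = sites[0] + sites[1]
    syn_sites = sites[2]
    if nonsyn_sites == 0 or syn_sites == 0:
        return None
    ka = (diffs[0] + diffs[1]) / nonsyn_sites  # positions 1,2: nonsynonymous
    ks = diffs[2] / syn_sites                  # position 3: synonymous
    if ks == 0:
        return None
    return ka / ks
```

Treating every third-position change as synonymous and every first- or
second-position change as nonsynonymous is a coarse shortcut compared with
codon-model estimators, but it needs only the alignment and the exon's
reading frame.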

3. Select putative new Dmel exons with at least one species
showing significant alignment and Ka/Ks < 1.  Convert to GFF
for all species with exons matching, and extract overlapping
exon predictions in those species.  These results are in
file all_caf1_DGIL_TEX.gff, sorted by texon ID and species.
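
The selection rule in step 3 can be sketched in Python as follows. The
input layout, a list of (texon ID, species, Ka/Ks) tuples for significant
alignments, and the function name select_valid_texons are assumptions made
for illustration; they do not reflect the attribute layout of
all_caf1_DGIL_TEX.gff.

```python
from collections import defaultdict

def select_valid_texons(matches, max_ka_ks=1.0):
    """Return {texon_id: [(species, ka_ks), ...]} for texons that have
    at least one species match with Ka/Ks below max_ka_ks.

    matches: iterable of (texon_id, species, ka_ks) tuples, one per
             significant alignment; ka_ks may be None if undefined.
    """
    by_texon = defaultdict(list)
    for texon_id, species, ka_ks in matches:
        if ka_ks is not None:
            by_texon[texon_id].append((species, ka_ks))
    valid = {}
    for texon_id, hits in by_texon.items():
        # One qualifying species is enough; all matching species are
        # then kept for that texon, sorted by species name.
        if any(ratio < max_ka_ks for _, ratio in hits):
            valid[texon_id] = sorted(hits)
    return valid
```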



Developed at the Genome Informatics Lab of Indiana University Biology Department