Around 2000 new D. melanogaster genes and exons were identified using
phylogenetic comparison of protein-coding region matches. Newly predicted
exons in Dmel, both outside and inside known genes, were selected and
BLAST-compared to other species. Predicted Dmel exons were classified as
valid if they had significant similarity and a high proportion of
synonymous substitutions (low Ka/Ks) in one or more other species.

About 1800 new exons were found outside of known gene regions, and 300
inside known gene regions. Classification of these as new genes or as new
exons of existing genes has not yet been done. Probably these will boil
down to (a) 2-3 exons of new genes and (b) alternate splice exons of
known genes.

This is a preliminary rather than exhaustive search for new Dmel genes.
Additional new exons can likely be found by including Dmel-sibling gene
predictions.

-- Don Gilbert, August 2006

Average cross-species coding exon statistics.

Species  N.exon  Bitscore  AlignIdent%  Ka_Ks
------------------------------------------
dmel       2173    --        --         --
dsec      13476   539       94.8       0.660
dsim       6300   453       94.7       0.639
dyak       6590   393       93.5       0.610
dere       3870   310       92.6       0.593
dana       4953    90       90.1       0.382
dpse       4057    67       93.6       0.235
dmoj       3217    61       93.6       0.190
------------------------------------------

Success rate of each prediction method at matching the new exons in the
non-Dmel species above.

                         New Dmel exon
Species  Method          prediction matches
------------------------------------------
all      DGIL_SNO        71.2%
all      DGIL_SNP        69.3%
all      RGUI_GID_mRNA   53.5%
all      GLEAN           52.5%
all      BREN_NSC        50.4%
all      GLEANR          44.2%
all      BATZ_CNA        42%
all      NCBI_GNO        37.7%
all      none            15.4%
all      EISE_CEX        11.1%
all      OXFD_GPI        7.5%
all      EISE_CGW        7.3%
all      EISE_CGM        7%
all      PACH_GMP        6.2%
all      JIGSAW          4.5%
all      Total HSPs      44648 (exons matched in all species)
------------------------------------------

The new gene data are available here as tables per chromosome arm with map
views, and in GFF format, for Dmel plus cross-species matches.
  http://insects.eugenes.org/DroSpeGe/data/dmel-dspp/newgenes/

METHODS
-------

1. Select Dmel predicted coding exons that are outside of known genes or
   transposons and have a consensus of two or more predictors
   (set 1, ID texon1..texon9000), and exons predicted by 2+ predictors
   that are inside gene boundaries but outside known exons or transposons
   (set 2, ID texon10000..12000).

   The DroSpeGe Biomart database was used for this, created from
   Drosophila CAF1 assemblies and annotations. The web interface allows
   this selection in part (http://insects.eugenes.org/BioMart/martview/).

   The Flybase Dmel release 4.2 gene/feature set was used for known
   features. Note that some genes added as known in release 4.3 are also
   found in these results. Current DroSpeGe map views display these, and
   a later update will remove them from the texon set.

   Coding exon predictions for Dmel are drawn from these groups:
     exon_DGIL_SNO
     CDS_BATZ_CON
     CDS_RGUI_GID
     CDS_NCBI_GNO

   Extract GFF and Fasta sequence for these Dmel predicted new exons.
   Set 1 predicted exons are in files dmel-predexons.gff, .fa and
   set 2 are in dmel-predexoninside.gff, .fa

2. BLASTn the new exons against the species genomes, using -e 1e-3.
   Parse the blast alignment output for conserved/changed alignment
   positions in significant matches. Calculate the Ka/Ks ratio from
   alignment mismatches (excluding gaps), counting mismatches at exon
   codon positions 1 and 2 as amino-acid-changing (nonsynonymous) and
   mismatches at position 3 as synonymous.

3. Select putative new Dmel exons with at least one species showing a
   significant alignment and Ka/Ks < 1. Convert to GFF for all species
   with matching exons, and extract overlapping exon predictions in those
   species. These results are in file all_caf1_DGIL_TEX.gff, sorted by
   texon ID and species.
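
The codon-position Ka/Ks estimate of step 2 and the filter of step 3 can be
sketched roughly as below. This is an illustrative sketch, not the script
used for these results. The function names, the `phase` and `query_offset`
parameters, and the per-site normalization (mismatches divided by the number
of aligned sites of each class) are assumptions; the notes above state only
that gaps are excluded, positions 1,2 count as nonsynonymous and position 3
as synonymous.

```python
def ka_ks_proxy(query_aln, subject_aln, phase=0, query_offset=0):
    """Rough Ka/Ks from one gapped pairwise alignment (hypothetical helper).

    query_aln, subject_aln: aligned strings of equal length, '-' for gaps.
    phase: GFF-style phase of the exon (bases to skip to reach a codon start).
    query_offset: 0-based exon position of the first aligned query base.
    Returns None when the ratio is undefined (no synonymous mismatches).
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have equal length")

    sites = {"nonsyn": 0, "syn": 0}
    diffs = {"nonsyn": 0, "syn": 0}
    pos = query_offset  # index of the current base within the Dmel exon

    for q, s in zip(query_aln.upper(), subject_aln.upper()):
        if q == "-":
            continue  # gap in query: no exon base consumed, column excluded
        codon_pos = (pos - phase) % 3 + 1  # 1, 2 or 3
        pos += 1
        if s == "-":
            continue  # gap in subject: column excluded
        kind = "syn" if codon_pos == 3 else "nonsyn"
        sites[kind] += 1
        if q != s:
            diffs[kind] += 1

    if sites["nonsyn"] == 0 or sites["syn"] == 0 or diffs["syn"] == 0:
        return None
    ka = diffs["nonsyn"] / sites["nonsyn"]  # assumed per-site normalization
    ks = diffs["syn"] / sites["syn"]
    return ka / ks


def passes_filter(ratio_by_species):
    """Step 3 criterion: at least one species with a defined Ka/Ks < 1."""
    return any(r is not None and r < 1 for r in ratio_by_species.values())


if __name__ == "__main__":
    # Toy alignment: one position-1 change and one position-3 change.
    ratio = ka_ks_proxy("ATGGCCAAA", "ATGACTAAA")
    print(ratio, passes_filter({"dsim": ratio, "dmoj": None}))
```

Treating every third-position change as synonymous is a simplification
(some third-position changes alter the amino acid), so values from a sketch
like this are only a coarse screen for coding-like conservation.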