
Finding Genes in Drosophila RNA-Seq data
Gene assembly for rGASP, Sept. 2009

Drosophila melanogaster tile expression gene predictions
from Affymetrix modENCODE transcriptome data

by Don Gilbert, 2008 May, gilbertd@indiana.edu

This is a set of Drosophila melanogaster gene predictions from Affymetrix tile expression data sets (modENCODE and Manak 2006), with a few summaries. The approach combines gene prediction software with tile transcription expression to produce gene models (with splice sites, start/stop and proteins) from this genome-wide transcription evidence. The intention, and approximate result, is a gene model view of this transcription. Concordance between tile transcription fragments and predicted exons is high: 83% of predicted exons have a transfrag match.

Key points in the first 3 graphs: most known genes are expressed in all conditions (cell lines, developmental times), while most newly predicted tile-genes are expressed in only a subset of conditions. The expression score is lower for genes expressed in a subset of conditions, whether known or new. See Tile-predicted gene summary below.

A second clear value is that 880 third party gene annotations (TPA) based on RT-PCR experiments have now been located on the genome with this tile expression evidence. See New tile-predicted genes below.

There are many newly predicted genes, adding about 70% more exon-bases than the reference gene set. It is unclear how many of these are real, but alternate evidence exists for about 1/4 of them (EST, homology, alternate predictions). The tile transcription signal lacks information on gene boundaries and direction. As a result, these gene models are less accurate than reference genes, with roughly 2 tile-called genes per reference gene. However, the exon calls are more accurate than the gene models.

Gene and tile expression views at http://insects.eugenes.org/cgi-bin/gbrowsenew/gbrowse/drosmel5dg/
These data and summaries are available at http://insects.eugenes.org/DroSpeGe/data/dmel5/modencode/ or ftp://eugenes.org/eugenes/genomes/dmel5/modencode/

2008 Sept: Big Map of Transcriptome expression

2009 March: Tile expression at Gene Structure (UTR, Intron-Exon) boundaries

Tile-predicted gene summary

Frequency of genes found in 1 to n expression groups. Most known genes are expressed in all treatments, while most new genes are found in only some treatments.
  dmel5-augtilepairaf : affy 38bp tile expression for 21 cell lines (Mar-2008)
  dmel5-augtilepairmk : affy Manak 2006 tile expression for 12 development stages

Density distribution of tile expression scores per gene, separated into genes found in all groups versus genes found in some (1 to n-1) groups. Known and new genes have the same score distribution. Expression level is higher for all-group genes.
  dmel5-augtilescoredistaf : affy 38bp tile expression score distribution per gene

Frequency of genes in each expression group, excluding genes found in all groups. Known and new genes are found in the same relative proportions across groups.
  dmel5-augtilegroupaf2 : affy 38bp tile expression for 21 cell lines (Mar-2008)
  dmel5-augtilegroupmk2 : affy Manak 2006 tile expression for 12 development stages

Plots are for DroMel chr2R only. These figures show three gene sets:
  fbgene  : the FlyBase reference gene set
  tilefb  : tile-predicted genes that overlap the same fbgene set
  tilenew : tile-predicted genes that do not overlap the fbgene set
So fbgene and tilefb cover the same known genes, and should show about the same effect. The legends leave something to be desired on these preliminary charts.

The predicted genes are provided below as dmel5-augmap30an.gff.gz (locations with annotations) and dmel5-augmap30.aa.gz (proteins).

Methods of predicting genes from tile expression

The Augustus gene predictor is used, fully trained on current Drosophila melanogaster EST-based genes from PASA's EST assembly pipeline, using 500K DrosMel ESTs in GenBank (March 2008). This EST assembly data set is available at DroSpeGe/data/dmel5/PASA_EST/. As well as training Augustus, it provides new DrosMel gene evidence and valuable information on gene annotation conflicts.

Evidence from tile expression transcription fragments (transfrags) is used as prediction hints with high weight. A few software modifications were made to Augustus to use these tile data effectively. The Affymetrix tile expression transfrag data sets used are listed below.

Of 192409 predicted exons in Augustus-tilex run30 (March 2008), 160088 have tile transfrag matches (83%) using these methods. Of 69,000 distinct gene models, some 28500 are at known gene locations (2 predicted / 1 known), with 40720 new tile-gene predictions. The gene models produced with this forced match to tile evidence are short; predictions at known genes average about 2 to 3 tile-genes per known gene.

Some tile transfrags still have no predicted exons; however, most distinct transfrags, 733858 of 1001098 (73%), have corresponding exon calls. There are also known genes with no expression in these studies.
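The two percentages quoted above follow directly from the counts. As a quick arithmetic check, here is a minimal Python sketch that uses only the numbers stated in this section; the function and variable names are illustrative and are not part of the original pipeline.

```python
def percent(part, total):
    """Return part/total as a percentage, rounded to the nearest whole number."""
    return round(100.0 * part / total)

# Counts quoted in the text for Augustus-tilex run30 (March 2008)
exons_predicted = 192409
exons_with_transfrag = 160088
transfrags_total = 1001098
transfrags_with_exon = 733858

exon_support = percent(exons_with_transfrag, exons_predicted)
transfrag_coverage = percent(transfrags_with_exon, transfrags_total)

print(f"predicted exons with a transfrag match: {exon_support}%")
print(f"distinct transfrags with an exon call:  {transfrag_coverage}%")
```

The two measures look in opposite directions: the first asks how many predicted exons are backed by tile evidence, the second how much of the tile evidence is explained by a predicted exon.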

New tile-predicted genes

dmel5-augmap30newrgns.html : List and maps for regions with 2+ new genes. This lists a subset of regions (2+ Kb) without reference genes, with new tile-predicted genes.
dmel5-aug30-tpa.html : Table locating 880 TPA genes, with their locations and GenBank IDs.
dmel5-aug30-oprot.html : Some 1800 other new tile-genes found here that have GenBank protein matches.

Alternate evidence for these tile-new predicted genes includes:

Homology: Of 38200 total tile-new genes (or 85209 tile-new exons),
  2724 new genes have some homology (with blastp 1e-3 to NCBI NR db)
  8011 new exons have homology
  Homology is in these groups (excluding Uextra, another 2000):
    Drosophila= 1633, Insect= 26, TE-gene= 966, other= 99
  The Drosophila matches are mostly DrosMel, including 880 third party annotations (TPA).

ESTs: Of 72379 total tile-expressed exons with ESTs, 5768 are new exons.

Other-predicted: Of 55379 total tile-expressed exons with NCBI_GNO CAF1 predictions, 12787 are new exons (co-predicted by Gnomon CAF1 and Augustus-tile data).

Any of the above:
  Of 106903 tile-exons with homology, EST or other-predict, 28992 are new exons.
  Of 39181 tile-genes with the above, 12951 are new genes.

This analysis does not yet discriminate between new alternate transcripts/exons of known genes and distinct new genes. Models are called new when they have no exon overlap with known genes. Some of these are expected to belong to known genes. E.g., 3 new mod(mdg4) alternate-exons were detected (the odd trans-spliced gene in DrosMel with about 30 known alt-transcripts). Many new genes are in regions devoid of reference genes (see the list of large new-gene regions above).
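The rule used here for calling a model new (no exon overlap with known genes) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layouts, the function names and the toy coordinates are hypothetical, strand is ignored because the tile signal carries no direction, and the simple all-against-all scan stands in for whatever interval method was actually used.

```python
from collections import defaultdict

def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals [start, end] share at least one base."""
    return a_start <= b_end and b_start <= a_end

def classify_tile_genes(tile_exons, ref_exons):
    """Label each tile gene 'tilefb' if any of its exons overlaps a
    reference exon, otherwise 'tilenew'.

    tile_exons: iterable of (gene_id, chrom, start, end)   (assumed layout)
    ref_exons:  iterable of (chrom, start, end)            (assumed layout)
    """
    ref_by_chrom = defaultdict(list)
    for chrom, start, end in ref_exons:
        ref_by_chrom[chrom].append((start, end))

    labels = {}
    for gene_id, chrom, start, end in tile_exons:
        hit = any(overlaps(start, end, r_start, r_end)
                  for r_start, r_end in ref_by_chrom[chrom])
        if hit:
            labels[gene_id] = "tilefb"              # one overlapping exon is enough
        else:
            labels.setdefault(gene_id, "tilenew")   # keep an earlier 'tilefb' label
    return labels

# Toy, made-up coordinates for illustration only
tile = [("g1", "2R", 100, 200), ("g1", "2R", 300, 400), ("g2", "2R", 1000, 1100)]
ref = [("2R", 350, 500)]
labels = classify_tile_genes(tile, ref)
print(labels)
```

Under this rule a new alternate exon of a known gene, predicted as its own short model, is labelled tilenew, which is the ambiguity noted above.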

Total CDS bases/genome compared

The important numbers here are c/t=, the fraction of coding bases / total genome bases.

Key: ntr: number of transcripts; n: number of CDS-exons; m: mean exon size; cds: cds bases; tb: total gene-region genome bases; c/t: cds/total ratio

# DrosMel, affy transfrags + augustus predicts (excluding Uextra)
CDSbases dmel5-aug30: ntr=64433, n=163447, m=233.26, cds=38126023, tb=139637899, c/t=0.273

# Flybase r5.5 CDS
CDSbases dmel5.5r: ntr=20924, n=56885, m=401.51, cds=22839944, tb=139214367, c/t=0.164

Compared with Daphnia Nimblegen data

# Augustus with TAR, augmap19.gff, all scaffolds
CDSbases dpx1-aug19: ntr=56928, n=197323, m=214.95, cds=42413860, tb=162548203, c/t=0.261

# Daphnia v1.1 genome gene set coding sequence bases / total genome bases
CDSbases dpx1-Gnomon: ntr=37466, n=151668, m=237.45, cds=36014074, tb=200738384, c/t=0.179

John Manak's study (Nature Genetics, 2006, doi:10.1038/ng1875) with DrosMel tile array expression suggests 30% transcription outside predicted genes, e.g. DrosMel c/t=0.24 versus DrosMel known genes c/t=0.18
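The c/t and m columns can be recomputed from the cds, tb and n values listed above. The short Python sketch below does that; the dictionary layout and function names are illustrative, and only numbers from the listing are used.

```python
# (cds bases, total genome bases, number of CDS-exons) taken from the listing above
gene_sets = {
    "dmel5-aug30": (38126023, 139637899, 163447),
    "dmel5.5r":    (22839944, 139214367, 56885),
    "dpx1-aug19":  (42413860, 162548203, 197323),
    "dpx1-Gnomon": (36014074, 200738384, 151668),
}

def cds_fraction(cds, tb):
    """c/t: coding bases divided by total genome bases."""
    return cds / tb

def mean_exon_size(cds, n_exons):
    """m: coding bases divided by the number of CDS-exons."""
    return cds / n_exons

ratios = {name: round(cds_fraction(cds, tb), 3)
          for name, (cds, tb, n) in gene_sets.items()}
means = {name: round(mean_exon_size(cds, n), 2)
         for name, (cds, tb, n) in gene_sets.items()}

for name in gene_sets:
    print(f"{name:12s} c/t={ratios[name]:.3f}  m={means[name]:.2f}")
```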

Affy modENCODE cell-line groups

Drosophila melanogaster tile expression data from Affymetrix (transcriptome.affymetrix.com) for 21 cell lines were used (38bp-arrays, March 2008, modENCODE project), in the form of transfrag data. Transfrag data are the set of 'transcription fragments' as analyzed by Affymetrix methods from raw signal data, based on a minimum number of consecutive high scoring tiles. The transfrag data set used here is bandwidth0_maxgap90_minrun50 (no window/bandwidth smoothing, with 90 base max. gap and 50 base minimum consecutive tiles) from the 38-base tiles array.

  gr_558 Dro2_AS_CME-L1 leg disc
  gr_559 Dro2_AS_Sg4 embryo
  gr_560 Dro2_AS_ML-DmD11 eye-antenna disc
  gr_561 Dro2_AS_ML-DmD20c2 antenna disc
  gr_562 Dro2_AS_ML-DmD20c5 antenna disc
  gr_563 Dro2_AS_Kc167 embryo
  gr_564 Dro2_AS_GM2 embryo
  gr_565 Dro2_AS_S2-DRSC embryo, isolate of S2 used for RNAi in the DRSC
  gr_566 Dro2_AS_S2R+ embryo
  gr_567 Dro2_AS_S1 embryo
  gr_568 Dro2_AS_1182-4H embryo, haploid
  gr_569 Dro2_AS_ML-DmD16c3 wing disc
  gr_570 Dro2_AS_ML-DmD32 wing disc
  gr_571 Dro2_AS_ML-DmD17c3 haltere disc
  gr_572 Dro2_AS_ML-DmD8 wing disc
  gr_577 Dro2_AS_CME-W1-CL8 wing disc
  gr_578 Dro2_AS_Dm_emb_2h
  gr_579 Dro2_AS_Dm_emb_2h_RWP+
  gr_580 Dro2_AS_ML-DmD9_C01 wing disc (?)
  gr_581 Dro2_AS_ML-DmBG1c1 CNS
  gr_582 Dro2_AS_ML-DmD21 wing disc

Cell types are described at https://dgrc.cgb.indiana.edu/cells/store/catalog.html

Manak study, 12 development time groups (Dro_Total_AS_n_B1):
  AS_1_B1 AS_2_B1 AS_3_B1 AS_4_B1 AS_5_B1 AS_6_B1
  AS_7_B1 AS_8_B1 AS_9_B1 AS_10_B1 AS_11_B1 AS_12_B1
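To make the maxgap and minrun parameters concrete, here is a rough Python sketch of how positive tiles could be joined into transfrags. It is an assumption-laden illustration and not the Affymetrix method: the score threshold, the way the gap is measured (next tile start minus previous run end), the reading of minrun as a minimum run length in bases, and the toy tile coordinates are all hypothetical.

```python
def call_transfrags(tiles, threshold, maxgap=90, minrun=50):
    """Join positive tiles into transfrags.

    tiles: list of (start, end, score), sorted by start, on one chromosome.
    A tile counts as positive if score >= threshold (threshold is assumed).
    Positive tiles whose start lies within maxgap bases of the current run's
    end are joined; runs spanning fewer than minrun bases are dropped.
    """
    transfrags = []
    run_start = run_end = None
    for start, end, score in tiles:
        if score < threshold:
            continue  # low-scoring tiles never start or extend a run
        if run_start is not None and start - run_end <= maxgap:
            run_end = max(run_end, end)  # extend the current run
        else:
            # close the previous run, keeping it only if it is long enough
            if run_start is not None and run_end - run_start + 1 >= minrun:
                transfrags.append((run_start, run_end))
            run_start, run_end = start, end
    if run_start is not None and run_end - run_start + 1 >= minrun:
        transfrags.append((run_start, run_end))
    return transfrags

# Toy tiles (made-up positions and scores)
tiles = [(1, 25, 5.0), (39, 63, 6.0), (77, 101, 1.0),
         (115, 139, 7.0), (400, 424, 8.0)]
transfrags = call_transfrags(tiles, threshold=4.0)
print(transfrags)
```

In the toy data the low-scoring tile at 77-101 is bridged because the neighbouring positive tiles are within 90 bases of each other, while the isolated tile at 400-424 is dropped for being shorter than 50 bases.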

Gene prediction annotations

The gene prediction locations and proteins are provided in files dmel5-augmap30an.gff.gz and dmel5-augmap30.aa.gz. Annotation fields in dmel5-augmap30an.gff are:

  tf= transfrag overlap for Affymetrix modENCODE and Manak tiles, with treatment group ID (mkAS1 = Manak Dro_Total_AS_1_B1; tf564 = Affy transfrag 564 Dro2_AS_GM2)
  xid= known, reference exon overlap (FB DrosMel r5.5), with gene ID
  est= EST overlap, with GenBank ID
  prot= Protein homology (GenBank ID/taxonid/description of best hit to NCBI NR BlastP)
  pred= NCBI Gnomon CAF1 predicted exon overlap
  te= transposable element overlap (FB DrosMel r5.5), with TE ID
  pct_support= Augustus percent support of model from evidence
  evd_fTF, evd_pTF= Augustus evidence from transfrags (see tf=)
  evd_fE, evd_pE= Augustus evidence from cDNA/EST PASA assemblies

An overlap of at least 50% of bases is required for an annotation, except for EST overlap, which requires 80%.
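As a starting point for working with these annotations, the Python sketch below reads the fields from a GFF line and applies the overlap thresholds. Several things are assumptions and have not been checked against the actual file: that column 9 holds semicolon-separated key=value pairs with comma-separated values, that the overlap fraction is measured against the predicted exon's bases, and the example line itself, which is made up.

```python
def parse_attributes(column9):
    """Split 'key=value;key=value' into a dict of lists (comma-separated values)."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" not in field:
            continue
        key, value = field.split("=", 1)
        attrs[key.strip()] = value.strip().split(",")
    return attrs

def parse_gff_line(line):
    """Return a dict for one 9-column GFF feature line, or None for comments/short lines."""
    cols = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(cols) < 9:
        return None
    return {"seqid": cols[0], "type": cols[2],
            "start": int(cols[3]), "end": int(cols[4]),
            "strand": cols[6], "attributes": parse_attributes(cols[8])}

# Thresholds stated in the text: 50% of bases in general, 80% for ESTs
MIN_OVERLAP = {"est": 0.80}
DEFAULT_MIN_OVERLAP = 0.50

def overlap_fraction(exon_start, exon_end, feat_start, feat_end):
    """Fraction of the exon's bases covered by the feature (assumed denominator)."""
    shared = min(exon_end, feat_end) - max(exon_start, feat_start) + 1
    return max(shared, 0) / (exon_end - exon_start + 1)

def passes_overlap(kind, fraction):
    """True if the overlap fraction meets the threshold for this annotation kind."""
    return fraction >= MIN_OVERLAP.get(kind, DEFAULT_MIN_OVERLAP)

# Hypothetical example line; IDs and coordinates are for illustration only
line = "2R\taugustus\tCDS\t1000\t1199\t0.9\t+\t0\tParent=g1.t1;tf=tf564,mkAS1;pct_support=100"
rec = parse_gff_line(line)
frac = overlap_fraction(rec["start"], rec["end"], 1100, 1300)
print(rec["attributes"]["tf"], frac, passes_overlap("xid", frac), passes_overlap("est", frac))
```

Keeping the thresholds in one small table makes the EST exception explicit and easy to adjust if the real criteria differ from this reading.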

Developed at the Genome Informatics Lab of Indiana University Biology Department