Drosophila melanogaster tile expression gene predictions
from Affymetrix modENCODE transcriptome data
by Don Gilbert, 2008 May, gilbertd@indiana.edu
This is a set of Drosophila melanogaster
gene predictions from Affymetrix tile expression (modENCODE
& Manak'06) data sets with a few summaries. The approach used is to combine gene
prediction software with tile transcription expression to produce gene models (with
splice sites, start/stop and proteins) from this genome-wide transcription evidence.
The intention, and approximate result, is a gene model view of this transcription.
There is high concordance (83%) between tile transcription fragments and predicted
exons.
Key points in the first 3 graphs: most known genes are expressed in all conditions
(cell lines, dev. times) while most newly predicted tile-genes are expressed in only
some subset of conditions. The expression score is lower for genes expressed in a
subset of conditions, whether known or new.
See "Tile-predicted gene summary" below.
A second clear benefit is that 880 third party gene annotations (TPA) based on RT-PCR
experiments have now been located on the genome with this tile
expression evidence. See "New tile-predicted genes" below.
Many genes are newly predicted, adding about 70% more exon-bases than the
reference gene set. It is unclear how many of these are real, but alternate
evidence exists for about 1/4 of them (EST, homology, alternate predictions). The
tile transcription signal lacks information on gene boundaries and direction. As a
result, these gene models are less accurate than reference genes, with roughly 2
tile-called genes per reference gene. The exon calls, however, are more accurate
than the gene models.
Gene and tile expression views at
http://insects.eugenes.org/cgi-bin/gbrowsenew/gbrowse/drosmel5dg/
These data and summaries are available at
http://insects.eugenes.org/DroSpeGe/data/dmel5/modencode/
or ftp://eugenes.org/eugenes/genomes/dmel5/modencode/
Tile-predicted gene summary
Frequency of genes found in 1 to n expression groups. Shows that most known genes are expressed
in all treatments, while most new genes are found in only some treatments.
dmel5-augtilepairaf : affy 38bp tile expression for 21 cell lines (Mar-2008)
dmel5-augtilepairmk : affy Manak 2006 tile expression for 12 development stages
Density distribution of tile expression scores/gene, separated by genes found in all groups
versus genes found in some (1 to n-1) groups. Shows that known and new genes have the same
score distribution. Expression level is higher for all-group genes.
dmel5-augtilescoredistaf : affy 38bp tile expression score distribution/gene.
Frequency of genes in each expression group, excluding genes found in all groups.
Known and new genes are found in the same relative proportions across groups.
dmel5-augtilegroupaf2 : affy 38bp tile expression for 21 cell lines (Mar-2008)
dmel5-augtilegroupmk2 : affy Manak 2006 tile expression for 12 development stages
Plots are for DroMel chr2R only.
On these figures, there are three gene sets: fbgene is the flybase reference gene set,
tilefb are the tile-predicted genes that overlap the same fbgene set,
tilenew are the tile-predicted genes that do not overlap fbgene set.
So the fbgene and tilefb cover the same known genes, and should show about
the same effect. The legends leave something to be desired on these
preliminary charts.
The predicted genes are provided below as dmel5-augmap30an.gff.gz (locations with annotations)
and dmel5-augmap30.aa.gz (proteins)
Methods of predicting genes from tile expression
The Augustus gene predictor is used, fully trained on current Drosophila melanogaster
EST-based genes from PASA's EST assembly pipeline, using 500K DrosMel ESTs in GenBank
(March 2008). This EST assembly data set is available at DroSpeGe/data/dmel5/PASA_EST/
As well as training Augustus, this provides new DrosMel gene evidence and valuable
information on gene annotation conflicts.
Evidence from tile expression transcription fragments is used as
prediction hints with high weight. A few software modifications were made to Augustus
to use these tile data effectively. The Affymetrix tile expression transfrag data sets
used are listed below.
Of 192409 predicted exons in Augustus-tilex run30 (2008 march), 160088 have tile
transfrag matches (83%) using these methods. Of 69,000 distinct gene models,
some 28500 are at known gene locations (2 predicted / 1 known), with
40720 new tile-gene predictions. The gene models produced with this
forced match to tile evidence are short; predictions at known genes average about
2 to 3 tile-genes per known gene.
There remain tile transfrags that do not have predicted exons; however most,
733858/1001098 of distinct transfrags, have corresponding exon calls (73%). There are
also known genes with no expression in these studies.
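The two concordance percentages quoted above follow directly from the counts given in the text. A minimal arithmetic check, where the variable and function names are illustrative only:

```python
# Counts quoted in the text for Augustus-tilex run30.
exons_total = 192409            # predicted exons
exons_with_transfrag = 160088   # predicted exons with a tile transfrag match
transfrags_total = 1001098      # distinct transfrags
transfrags_with_exon = 733858   # distinct transfrags with an exon call


def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100.0 * part / whole)


exon_pct = percent(exons_with_transfrag, exons_total)
transfrag_pct = percent(transfrags_with_exon, transfrags_total)

print(f"predicted exons with transfrag match: {exon_pct}%")
print(f"transfrags with predicted exon:       {transfrag_pct}%")
```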
New tile-predicted genes
dmel5-augmap30newrgns.html : List and maps for regions with 2+ new genes
This lists a subset of regions (2+ Kb) without reference genes, with new tile-predicted genes.
The table dmel5-aug30-tpa.html locates 880 TPA genes here, listing their
locations and GenBank IDs. Some 1800 other new tile-genes found here have
GenBank protein matches, listed in dmel5-aug30-oprot.html.
Alternate evidence for these tile-new predicted genes includes
Homology:
Of 38200 total tile-new genes (or 85209 tile-new exons),
2724 new genes have some homology (with blastp 1e-3 to NCBI NR db)
8011 new exons have homology
Homology is in these groups (excluding Uextra, another 2000):
Drosophila= 1633, Insect= 26, TE-gene= 966, other= 99
The Drosophila matches are mostly DrosMel including 880 third party annotations (TPA).
ESTs:
Of 72379 total tile-expressed exons w/ ESTs,
5768 are new exons.
Other-predicted:
Of 55379 total tile-expressed exons with NCBI_GNO CAF1 predictions,
12787 are new exons (co-predicted by Gnomon CAF1 and Augustus-tile data).
Any of the above:
Of 106903 tile-exons with homology, EST or other-predict,
28992 are new exons.
Of 39181 tile-genes with above,
12951 are new genes.
This analysis does not yet discriminate between new alternate transcripts/exons of
known genes and distinct new genes. What is called new are those models with no exon
overlap to known genes. Some of these are expected to belong to known genes. E.g.,
3 new mod(mdg4) alternate-exons were detected (the odd trans-spliced gene in DrosMel
with about 30 known alt-transcripts). Many new genes are in regions devoid of
reference genes (see above list of large new-gene regions).
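The rule described above (a model is "new" when none of its exons overlaps a known exon) can be sketched as follows. This is an illustration under stated assumptions, not the script used for this analysis: the function names are hypothetical, coordinates are taken as 0-based half-open intervals on a single chromosome, and strand is ignored because the tile signal carries no direction. The labels tilefb and tilenew are the ones used in the figures.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def classify_models(models, known_exons):
    """Label each predicted gene model as 'tilefb' or 'tilenew'.

    models      : dict of model id -> list of (start, end) exon intervals
    known_exons : list of (start, end) reference exon intervals
    A model is 'tilenew' only if none of its exons overlaps a known exon.
    """
    labels = {}
    for model_id, exons in models.items():
        hit = any(overlaps(exon, known) for exon in exons for known in known_exons)
        labels[model_id] = "tilefb" if hit else "tilenew"
    return labels


# Hypothetical toy data, not taken from the real annotation files.
known_exons = [(100, 200), (300, 400)]
models = {
    "g1": [(150, 250)],              # overlaps the first known exon
    "g2": [(500, 600), (700, 800)],  # no overlap with any known exon
    "g3": [(50, 100)],               # abuts a known exon but shares no base
}
labels = classify_models(models, known_exons)
print(labels)
```

As the text notes, a model labeled tilenew by this rule may still be an unannotated alternate exon or transcript of a nearby known gene.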
Total CDS bases/genome compared
The important numbers here are c/t, the fraction of coding bases / total genome bases.
Key: ntr: number transcripts; n:number CDS-exons; m:mean exon size;
cds:cds bases, tb:total gene-region genome bases, c/t: cds/total ratio
# DrosMel, affy transfrags + augustus predicts (excluding Uextra)
CDSbases dmel5-aug30: ntr=64433, n=163447, m=233.26, cds=38126023, tb=139637899, c/t=0.273
# Flybase r5.5 CDS
CDSbases dmel5.5r: ntr=20924, n=56885, m=401.51, cds=22839944, tb=139214367, c/t=0.164
Compared with Daphnia Nimblegen data
# Augustus with TAR, augmap19.gff, all scaffolds
CDSbases dpx1-aug19 : ntr=56928, n=197323, m=214.95, cds=42413860, tb=162548203, c/t=0.261
# Daphnia v1.1 genome gene set coding sequence bases / total genome bases
CDSbases dpx1-Gnomon : ntr=37466, n=151668, m=237.45, cds=36014074, tb=200738384, c/t=0.179
John Manak's study (Nature Genetics, 2006, doi:10.1038/ng1875) with DrosMel
tile array expression suggests 30% transcription outside predicted genes,
e.g. DrosMel c/t=0.24 versus DrosMel known genes c/t=0.18
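The c/t values in the CDSbases lines above can be recomputed from the cds= and tb= fields. The sketch below parses the four summary lines as printed here and recomputes each ratio; the parsing function is hypothetical and assumes the key=value layout shown. The last step is a rough cross-check on the additional coding bases of the tile-based set relative to FlyBase r5.5, assuming CDS bases are a fair proxy for exon bases.

```python
import re

summary_lines = [
    "CDSbases dmel5-aug30: ntr=64433, n=163447, m=233.26, cds=38126023, tb=139637899, c/t=0.273",
    "CDSbases dmel5.5r: ntr=20924, n=56885, m=401.51, cds=22839944, tb=139214367, c/t=0.164",
    "CDSbases dpx1-aug19 : ntr=56928, n=197323, m=214.95, cds=42413860, tb=162548203, c/t=0.261",
    "CDSbases dpx1-Gnomon : ntr=37466, n=151668, m=237.45, cds=36014074, tb=200738384, c/t=0.179",
]


def parse_cdsbases(line):
    """Return (label, fields) where fields maps keys such as 'cds' to floats."""
    label = line.split()[1].rstrip(":")
    pairs = re.findall(r"(\w+(?:/\w+)?)=([\d.]+)", line)
    return label, {key: float(value) for key, value in pairs}


ratios = {}
cds_bases = {}
for line in summary_lines:
    label, fields = parse_cdsbases(line)
    cds_bases[label] = fields["cds"]
    ratios[label] = round(fields["cds"] / fields["tb"], 3)
    print(f"{label:12s} c/t recomputed = {ratios[label]:.3f}, reported = {fields['c/t']:.3f}")

# Additional coding bases in the tile-based set relative to the reference set.
extra_fraction = cds_bases["dmel5-aug30"] / cds_bases["dmel5.5r"] - 1.0
print(f"additional CDS bases vs FlyBase r5.5: {100 * extra_fraction:.0f}%")
```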
Affy modENCODE cell-line groups
Drosophila melanogaster transcription tile expression data from Affymetrix
(transcriptome.affymetrix.com) for 21 cell lines were used (38bp-arrays, March 2008,
modENCODE project), in the form of transfrag data. Transfrag data are the set of
'transcription fragments' as analyzed by Affymetrix methods from raw signal data,
based on a minimum number of consecutive high scoring tiles. The transfrag data set
used here is bandwidth0_maxgap90_minrun50 (no window/bandwidth smoothing, with 90
base max. gap and 50 base minimum consecutive tiles) from the 38-base tiles array.
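To make the maxgap and minrun parameters concrete, here is a simplified sketch of how transfrags might be called from scored tiles. It is not the Affymetrix implementation: the score threshold is a hypothetical parameter, tiles are assumed to be sorted 0-based half-open intervals, and minrun is interpreted as the minimum number of bases spanned by a run.

```python
def call_transfrags(tiles, threshold, maxgap=90, minrun=50):
    """Join above-threshold tiles into transfrags.

    tiles     : list of (start, end, score), sorted by start
    threshold : minimum score for a tile to count as expressed (assumed value)
    maxgap    : largest gap in bases bridged between expressed tiles
    minrun    : smallest span in bases for a run to be kept
    """
    transfrags = []
    run_start = run_end = None
    for start, end, score in tiles:
        if score < threshold:
            continue  # low tiles are skipped; they may fall inside a bridged gap
        if run_start is not None and start - run_end <= maxgap:
            run_end = max(run_end, end)  # extend the current run
        else:
            if run_start is not None and run_end - run_start >= minrun:
                transfrags.append((run_start, run_end))
            run_start, run_end = start, end  # begin a new run
    if run_start is not None and run_end - run_start >= minrun:
        transfrags.append((run_start, run_end))
    return transfrags


# Hypothetical 38-base tiles with made-up scores.
tiles = [
    (0, 38, 5.0), (38, 76, 6.0), (76, 114, 0.5), (114, 152, 7.0),
    (400, 438, 9.0),                  # isolated tile, shorter than minrun
    (700, 738, 8.0), (738, 776, 8.5),
]
transfrags = call_transfrags(tiles, threshold=2.0)
print(transfrags)
```

In this toy input the low-scoring tile at 76-114 is bridged because the gap is under 90 bases, while the isolated tile at 400-438 is dropped because it spans fewer than 50 bases.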
gr_558 Dro2_AS_CME-L1 leg disc
gr_559 Dro2_AS_Sg4 embryo
gr_560 Dro2_AS_ML-DmD11 eye-antenna disc
gr_561 Dro2_AS_ML-DmD20c2 antenna disc
gr_562 Dro2_AS_ML-DmD20c5 antenna disc
gr_563 Dro2_AS_Kc167 embryo
gr_564 Dro2_AS_GM2 embryo
gr_565 Dro2_AS_S2-DRSC embryo isolate of S2 used for RNAi in the DRSC
gr_566 Dro2_AS_S2R+ embryo
gr_567 Dro2_AS_S1 embryo
gr_568 Dro2_AS_1182-4H embryo haploid
gr_569 Dro2_AS_ML-DmD16c3 wing disc
gr_570 Dro2_AS_ML-DmD32 wing disc
gr_571 Dro2_AS_ML-DmD17c3 haltere disc
gr_572 Dro2_AS_ML-DmD8 wing disc
gr_577 Dro2_AS_CME-W1-CL8 wing disc
gr_578 Dro2_AS_Dm_emb_2h
gr_579 Dro2_AS_Dm_emb_2h_RWP+
gr_580 Dro2_AS_ML-DmD9_C01 wing disc (?)
gr_581 Dro2_AS_ML-DmBG1c1 CNS
gr_582 Dro2_AS_ML-DmD21 wing disc
cell types described at https://dgrc.cgb.indiana.edu/cells/store/catalog.html
Manak study 12 development time groups (Dro_Total_AS_n_B1)
AS_1_B1 AS_2_B1 AS_3_B1 AS_4_B1 AS_5_B1 AS_6_B1 AS_7_B1 AS_8_B1
AS_9_B1 AS_10_B1 AS_11_B1 AS_12_B1
Gene prediction annotations
The gene prediction locations and proteins are provided in files
dmel5-augmap30an.gff.gz and dmel5-augmap30.aa.gz.
Annotation fields in dmel5-augmap30an.gff are
tf= transfrag overlap for Affymetrix modENCODE and Manak tiles, with treatment group ID
(mkAS1 = Manak Dro_Total_AS_1_B1; tf564 = Affy transfrag 564 Dro2_AS_GM2)
xid= known, reference exon overlap (FB DrosMel r5.5), with gene ID
est= EST overlap with GenBank ID
prot= Protein homology (GenBank ID/taxonid/description of best hit to NCBI NR BlastP)
pred= NCBI Gnomon CAF1 predicted exon overlap
te= transposable element overlap (FB DrosMel r5.5), with TE ID
pct_support= Augustus percent support of model from evidence
evd_fTF,evd_pTF= Augustus evidence from transfrags (see tf=)
evd_fE, evd_pE= Augustus evidence from cDNA/EST PASA assemblies
An annotation overlap criterion of 50% of bases is used, except for EST overlap, which requires 80%.
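The annotation fields and overlap thresholds above can be read and applied along the following lines. This is a sketch under assumptions: the example GFF line is invented, the attribute column is assumed to use ";" between fields and "," between multiple values, and the overlap fraction is assumed to be measured against the predicted exon's own length. None of these details is confirmed by the file description above.

```python
def parse_attributes(column9):
    """Split a GFF attribute column into a dict of key -> list of values."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" not in item:
            continue
        key, value = item.split("=", 1)
        attrs[key.strip()] = value.strip().split(",")
    return attrs


def overlap_fraction(feature, annotation):
    """Fraction of feature bases covered by annotation (1-based inclusive coords)."""
    start = max(feature[0], annotation[0])
    end = min(feature[1], annotation[1])
    covered = max(0, end - start + 1)
    return covered / (feature[1] - feature[0] + 1)


def passes_overlap(feature, annotation, kind):
    """Apply the 50% criterion, or 80% when the annotation is an EST."""
    min_fraction = 0.80 if kind == "est" else 0.50
    return overlap_fraction(feature, annotation) >= min_fraction


# Invented example line; only the field names and tf IDs come from the text above.
line = "chr2R\taugustus\tCDS\t1001\t1100\t0.9\t+\t0\tParent=g1.t1;tf=tf564,mkAS1;pct_support=100"
cols = line.split("\t")
attrs = parse_attributes(cols[8])
exon = (int(cols[3]), int(cols[4]))

print(attrs["tf"])
print(overlap_fraction(exon, (1041, 1200)))
print(passes_overlap(exon, (1041, 1200), "xid"), passes_overlap(exon, (1041, 1200), "est"))
```

Here the annotation covers 60 of the exon's 100 bases, so it would count as overlap for a reference exon (xid) but not for an EST.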