
Augustus Tile Expression calls for Daphnia pulex

2008 February

Tile expression outside of gene predictions

The following samples are drawn from tests on Daphnia scaffold_4 (containing 8 Hb tandem genes and others).
List of scaffold_4 Tile expression outside gene predictions with map views.


Sample Augustus-Tile expression predictions

Example 3 Tile expr. in unpredicted region

Tile expr. hints call genes with strong Tile expression, missed by other predictors.
Tracks:
  • Tarallx = tile expression (TAR) data used for predictions, abstracted from 'Genome Tile Expression' using runs of data, control data.
  • Augustus-noevid+pasa = no-evidence prediction from Daphnia-gene trained Augustus.
  • Aug+TarAll, +Tarall9 = exon-part hints from TAR data, various weights.
  • Aug+TArAll.14, .17, .18 = CDS-part hints from TAR data, various weights.
  • Gnomon, JGI = other predictions.

Example 2 Tile expr. in unpredicted region

Many exons supported by low/medium Tile expr. CDS-hints call several short genes w/ CDS. Exon-hints call fewer, long-exon genes w/ no CDS. vMiss2.png

Tandem genes mismodelled by Gnomon/JGI

Augustus trained, no evidence calls this region best. TAR hints add a few exons but reduce gene model quality. vDup7.png

Hemoglobin 8-tandem region

No-evidence gene models are most accurate. Tile hints break genes into parts. Note also several valid exons lack TAR data due to high-identity exons (tile array requires unique oligos). vHb8.png

Example 1 Tile expr. in unpredicted region

vMiss1.png

Example 4 Tile expr. in unpredicted region

vMiss4.png

Augustus + TAR test runs w/ various hints, weights

I've tried several quality measures for assessing these runs, but this simple one seems most useful: CDS bases / total bases. The final run on the whole genome used Test 17pa: strong weights on CDS-hints, for TAR hints only. The baseline prediction with no hints is Test 1pa.
CDSbases Test 1 :  ntr=386,     n=3405,  m=236.62,    cds=805701,     tb=3067984,     c/t=0.263
CDSbases Test 1pa: ntr=412,     n=3638,  m=232.80,    cds=846910,     tb=3067984,     c/t=0.276  + base predict, no TAR data
CDSbases Test 2 :  ntr=701,     n=2636,  m=261.11,    cds=688275,     tb=3067984,     c/t=0.224
CDSbases Test 3 :  ntr=737,     n=2234,  m=274.32,    cds=612840,     tb=3067984,     c/t=0.200
CDSbases Test 4 :  ntr=555,     n=3364,  m=239.09,    cds=804282,     tb=3067984,     c/t=0.262
CDSbases Test 5 :  ntr=641,     n=2849,  m=258.98,    cds=737841,     tb=3067984,     c/t=0.240
CDSbases Test 6 :  ntr=565,     n=3382,  m=240.33,    cds=812790,     tb=3067984,     c/t=0.265
CDSbases Test 7 :  ntr=541,     n=3425,  m=238.88,    cds=818151,     tb=3067984,     c/t=0.267
CDSbases Test 8 :  ntr=716,     n=1220,  m=387.49,    cds=472743,     tb=3074807,     c/t=0.154
CDSbases Test 9 :  ntr=765,     n=4039,  m=229.10,    cds=925320,     tb=3069252,     c/t=0.301  +
CDSbases Test 10 :  ntr=645,    n=2423,  m=277.91,    cds=673374,     tb=3069252,     c/t=0.219
CDSbases Test 11 :  ntr=600,    n=2985,  m=254.18,    cds=758730,     tb=3067984,     c/t=0.247
CDSbases Test 12 :  ntr=524,    n=4118,  m=227.18,    cds=935511,     tb=3068017,     c/t=0.305  +
CDSbases Test 13 :  ntr=428,    n=3701,  m=233.91,    cds=865719,     tb=3067984,     c/t=0.282
CDSbases Test 14 :  ntr=1684,   n=5858,  m=190.33,    cds=1114974,    tb=3070403,     c/t=0.363  +
CDSbases Test 16pa : ntr=1198,  n=4132,  m=188.78,    cds=780040,     tb=3074819,     c/t=0.254
CDSbases Test 17pa : ntr=1601,  n=5602,  m=207.05,    cds=1159882,    tb=3074819,     c/t=0.377  * c/t about 37% higher than base level (Test 1pa, 0.276)
CDSbases Test 18pa : ntr=1124,  n=5042,  m=214.57,    cds=1081885,    tb=3074819,     c/t=0.352  *

ntr= number of transcripts/genes; n=no. CDS; m=mean CDS length; cds=sum of CDS-bases; 
tb=total bases in gene regions; c/t= CDS/total bases ratio
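
The derived columns follow from the raw counts by simple arithmetic: m = cds / n and c/t = cds / tb. The sketch below recomputes them for three rows copied from the table and reports the change in c/t relative to the Test 1pa baseline. The function and variable names are illustrative only and are not part of the pipeline used for these runs.

```python
# Recompute the derived columns of the CDSbases table from its raw counts.
# Rows copied from the table: (test name, n CDS, cds bases, total bases).
ROWS = [
    ("1pa", 3638, 846910, 3067984),    # baseline, no TAR data
    ("14", 5858, 1114974, 3070403),
    ("17pa", 5602, 1159882, 3074819),  # used for the whole-genome run
]


def mean_cds_length(n_cds, cds_bases):
    """m column: mean CDS length in bases."""
    return cds_bases / n_cds


def cds_ratio(cds_bases, total_bases):
    """c/t column: fraction of gene-region bases that are CDS."""
    return cds_bases / total_bases


def summarize(rows, baseline="1pa"):
    """Return {test: (m, c/t, percent change in c/t vs. baseline)}."""
    base = next(cds_ratio(c, t) for name, _, c, t in rows if name == baseline)
    out = {}
    for name, n, cds, tb in rows:
        ratio = cds_ratio(cds, tb)
        out[name] = (mean_cds_length(n, cds), ratio, 100 * (ratio / base - 1))
    return out


SUMMARY = summarize(ROWS)
for name, (m, ct, gain) in SUMMARY.items():
    print(f"Test {name}: m={m:.2f}  c/t={ct:.3f}  change vs 1pa={gain:+.1f}%")
```

A ratio like this rewards runs that call more coding sequence within the same gene regions, which is why it separates the CDS-part runs (14, 17pa, 18pa) from the rest.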

Sample Augustus HINTS configuration for TAR data

Notes:
  • Use only *part hints (exonpart, CDSpart, intronpart, ...); exact-boundary hints fail because TAR boundaries are imprecise by multiple bases.
  • Non-exon hints can also be derived from TAR data, but tests show the available intron/intergenic hints hurt gene calls, probably due to their uncertainty.
  • Exon-part versus CDS-part: TAR data measures exon expression (including UTR). However, using only TAR exon-part hints yields almost no CDS predictions, only non-CDS exons. CDS-part hints appear most effective, but the resulting gene models are short and partial compared to no-hint predictions.
    [SOURCES]
    M T P E
    # source of extrinsic information:
    # M manual anchor (required)
    # T transcript active regions from genome tile array
    # P protein database hit
    # E est database hit
    
    [SOURCE-PARAMETERS]
    E individual_liability
    P individual_liability
    T individual_liability
    
    #   feature        bonus         malus   gradelevelcolumns
    #		r+/r-
    # tests show G:intronpart,exonpart,CDSpart,irpart are useful
    # CDS,exon,intron start,stop,ass,dss are problematic: drop
    
    [GENERAL]
       exonpart        1       1  M    1  1e+100   P 1 1  E 1 1e+29 G 1 1e+9   
                		T 4 94 98 998 1 1e+19 1e+19 1e+35
     intronpart        1       1  M    1  1e+100   P 1 1  E 1 1e+9  G 1 1e+9   T 1 1e+8
        CDSpart        1       1  M    1  1e+100   P 4 40 90 140 1 1e+4 1e+7 1e+9   E 1 1  G 1 1e+15  
    		            T 4 94 98 998 1 1e+19 1e+19 1e+35
        UTRpart        1       1  M    1  1e+100   P 1 1  E 1  1e+5  G 1 1e+0   T 1 1
         irpart        1       1  M    1  1e+100   P 1 1  E 1  1e+5  G 1 1e+9   T 1 1e+8
    nonexonpart        1       1  M    1  1e+100   P 1 1  E 1  1e+5  G 1 1e+0   T 1 1e+8
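
To illustrate how the `T 4 94 98 998 1 1e+19 1e+19 1e+35` entries relate to the hint scores (95, 99, 999) used in the TAR hint files, the sketch below bins a score into one of the four grade classes and looks up the value configured for that class. It assumes the usual reading of an Augustus extrinsic line (number of classes, then the class boundaries, then one value per class). The function name is hypothetical, and how Augustus itself treats a score that falls exactly on a boundary is not modelled here.

```python
from bisect import bisect_left

# T entry from the exonpart and CDSpart rows above:
# 4 grade classes, 3 boundaries between them, 4 per-class values.
T_BOUNDARIES = [94, 98, 998]
T_VALUES = [1, 1e19, 1e19, 1e35]


def grade_value(score, boundaries=T_BOUNDARIES, values=T_VALUES):
    """Return (class index, configured value) for a hint score (GFF column 6).

    Assumption: a score equal to a boundary is placed in the lower class.
    The scores used in the TAR hints (95, 99, 999) never hit a boundary.
    """
    idx = bisect_left(boundaries, score)
    return idx, values[idx]


for s in (95, 99, 999):
    cls, val = grade_value(s)
    print(f"score={s}: class {cls}, value {val:g}")
```

With these settings the 95 and 99 hints fall in different classes but receive the same value, so only the 999 hints are weighted more strongly.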
    
    

Sample TAR hints: Hb genes

    # dpulex-sc4hbregion-tar-all.gff
    # Hemoglobin region: (from PASA)
    # scaffold_4:2366641-2390393 
    # from annotated NCBI GNO genes
    # view at http://server2.eugenes.org/cgi-bin/gbrowsenew/gbrowse/dpxaugust/
    
    # Hb1 scaffold_4:2366621-2368356:+  
    scaffold_4	tarall	ep	2366607	2367052	95	.	.	g=Hb1.ex1-ex2;exp=male;exp99=male;ovr=1;src=T
    scaffold_4	tarall	ep	2366657	2366732	999	.	.	g=Hb1.ex1;exp=male;exp99=fem,metal;exp95=fem,metal;ovr=1;src=T
    scaffold_4	tarall	ep	2366677	2366732	999	.	.	g=Hb1.ex1;exp=fem,metal;src=T
    scaffold_4	tarall	ep	2366807	2367052	99	.	.	g=Hb1.ex2+;exp=fem,male;exp95=fem;ovr=1;src=T
    scaffold_4	tarall	ep	2366877	2366952	999	.	.	g=Hb1.ex2-;exp=fem,male,metal;exp99=metal;exp95=metal;src=T
    scaffold_4	tarall	ep	2367157	2367327	999	.	.	g=Hb1.ex3;exp=fem,male,metal;exp99=fem,male,metal;exp95=male,fem,metal;src=T
    scaffold_4	tarall	ep	2367432	2367552	99	.	.	g=Hb1.ex4;exp=fem,male,metal;exp95=male,fem,metal;exp999=fem,male,metal;src=T
    scaffold_4	tarall	ep	2367602	2368577	99	.	.	g=Hb1.ex5-ex7;exp=male;exp95=male,fem;ovr=1;src=T
    scaffold_4	tarall	ep	2367657	2368552	99	.	.	g=Hb1.ex5-ex7;exp=fem,metal;exp999=male,fem;exp95=metal;ovr=1;src=T
    scaffold_4	tarall	ep	2367682	2367752	999	.	.	g=Hb1.ex5;exp=metal;src=T
    scaffold_4	tarall	ep	2367857	2368057	999	.	.	g=Hb1.ex6;exp=fem,metal;exp95=metal;exp99=metal;src=T
    scaffold_4	tarall	ep	2368132	2368332	95	.	.	g=Hb1.ex7;exp=metal;ovr=1;src=T
    scaffold_4	tarall	ep	2368152	2368507	999	.	.	g=Hb1.ex7+;exp=male,fem;exp99=metal;ovr=1;src=T
    scaffold_4	tarall	ep	2368207	2368332	999	.	.	g=Hb1.ex7-;exp=metal;src=T
    scaffold_4	tarall	ep	2368728	2368778	95	.	.	g=Hb1.ex7>;exp=male;src=T;flag=newexon:utr
    
    scaffold_4	Gnomon	mRNA	2366621	2368356	169.02	+	.	model=914044;ID=914044;Parent=gene914044;flags=EST,Prot,Start,Stop;protCDS=2366684 2368246;protein_hit=gi|5881967|gb|AAD55141.1|;support=1416043
    scaffold_4	Gnomon	exon	2366621	2366746	.	+	.	model=914044;Parent=914044
    scaffold_4	Gnomon	exon	2366863	2367023	.	+	.	model=914044;Parent=914044
    scaffold_4	Gnomon	exon	2367103	2367328	.	+	.	model=914044;Parent=914044
    scaffold_4	Gnomon	exon	2367424	2367548	.	+	.	model=914044;Parent=914044
    scaffold_4	Gnomon	exon	2367630	2367752	.	+	.	model=914044;Parent=914044
    scaffold_4	Gnomon	exon	2367831	2368059	.	+	.	model=914044;Parent=914044
    scaffold_4	Gnomon	exon	2368133	2368356	.	+	.	model=914044;Parent=914044
    
    
    # Hb2 scaffold_4:2370110-2371749:+
    scaffold_4	tarall	ep	2370033	2370203	95	.	.	exp=male,fem;ovr=1;src=T
    scaffold_4	tarall	ep	2370058	2370203	99	.	.	exp=male;ovr=1;src=T
    scaffold_4	tarall	ep	2370078	2370183	99	.	.	exp=fem;ovr=1;src=T
    scaffold_4	tarall	ep	2370103	2370183	999	.	.	g=Hb2.ex1;exp=fem,male,metal;exp99=metal;exp95=metal;src=T
    scaffold_4	tarall	ep	2370284	2370529	99	.	.	exp=male,fem;exp95=male,fem,metal;exp999=fem;src=T
    scaffold_4	tarall	ep	2370634	2371879	95	.	.	exp=male,fem,metal;exp99=male,fem,metal;exp999=fem,male,metal;src=T
    scaffold_4	tarall	ep	2370809	2371854	99	.	.	exp=male;exp95=fem;ovr=1;src=T
    scaffold_4	tarall	ep	2370829	2370929	999	.	.	g=Hb2.ex4;exp=fem,male;exp99=fem;ovr=1;src=T
    scaffold_4	tarall	ep	2370859	2370929	99	.	.	g=Hb2.ex4;exp=metal;exp95=metal;ovr=1;src=T
    scaffold_4	tarall	ep	2370879	2370929	999	.	.	g=Hb2.ex4;exp=metal;src=T
    scaffold_4	tarall	ep	2370979	2371479	99	.	.	exp=fem;ovr=1;src=T
    scaffold_4	tarall	ep	2371059	2371209	999	.	.	g=Hb2.ex5;exp=fem,male;ovr=1;src=T
    scaffold_4	tarall	ep	2371079	2371159	99	.	.	exp=metal;exp95=metal;ovr=1;src=T
    scaffold_4	tarall	ep	2371109	2371159	999	.	.	g=Hb2.ex5-;exp=metal;src=T
    scaffold_4	tarall	ep	2371254	2371459	999	.	.	g=Hb2.ex6;exp=fem,male,metal;exp99=metal;exp95=metal;src=T
    scaffold_4	tarall	ep	2371529	2371829	99	.	.	g=Hb2.ex7;exp=fem;ovr=1;src=T
    scaffold_4	tarall	ep	2371554	2371829	999	.	.	g=Hb2.ex7;exp=male,fem,metal;exp95=metal;exp99=metal;ovr=1;src=T
    scaffold_4	tarall	ep	2371679	2371729	999	.	.	g=Hb2.ex7-;exp=metal;src=T
    
    scaffold_4	Chainer	mRNA	2370110	2371749	163.191	+	.	model=1418043;ID=1418043;Parent=gene916044;flags=EST,Prot,Start,Stop,FullSupCDS;protein_hit=gi|5881967|gb|AAD55141.1|;support=1204041,1206041,1436041,1502041,1504041,1644041,2286041,2288041,2618041,2620041,4132041,7842042,7844042,7846042,7852042,*7854042
    scaffold_4	Chainer	exon	2370110	2370184	.	+	.	model=1418043;Parent=1418043
    scaffold_4	Chainer	exon	2370276	2370436	.	+	.	model=1418043;Parent=1418043
    scaffold_4	Chainer	exon	2370514	2370739	.	+	.	model=1418043;Parent=1418043
    scaffold_4	Chainer	exon	2370819	2370943	.	+	.	model=1418043;Parent=1418043
    scaffold_4	Chainer	exon	2371045	2371167	.	+	.	model=1418043;Parent=1418043
    scaffold_4	Chainer	exon	2371238	2371466	.	+	.	model=1418043;Parent=1418043
    scaffold_4	Chainer	exon	2371535	2371749	.	+	.	model=1418043;Parent=1418043
    
    # Hb3 scaffold_4:2372736-2374380:+ 
    scaffold_4	tarall	ep	2372736	2372826	95	.	.	exp=metal;ovr=1;src=T
    scaffold_4	tarall	ep	2372761	2372826	999	.	.	exp=male,metal;exp99=fem,male,metal;exp95=male,fem;src=T
    scaffold_4	tarall	ep	2372886	2373076	95	.	.	exp=fem;exp99=fem;ovr=1;src=T
    scaffold_4	tarall	ep	2372901	2373151	95	.	.	exp=male;src=T
    scaffold_4	tarall	ep	2373609	2374255	95	.	.	exp=male;ovr=1;src=T
    scaffold_4	tarall	ep	2373660	2373805	95	.	.	exp=fem;exp99=male;ovr=1;src=T
    scaffold_4	tarall	ep	2373685	2373780	999	.	.	exp=fem,male;exp99=fem;ovr=1;src=T
    scaffold_4	tarall	ep	2373705	2373780	99	.	.	exp=metal;exp95=metal;src=T
    scaffold_4	tarall	ep	2373905	2374255	99	.	.	exp=fem,male,metal;exp95=fem,metal;exp999=fem,male,metal;src=T
    scaffold_4	tarall	ep	2374180	2374255	999	.	.	exp=male;exp95=metal;ovr=1;src=T
    scaffold_4	tarall	ep	2374205	2374255	999	.	.	exp=fem,metal;exp99=metal;src=T
    
    scaffold_4	Chainer	mRNA	2372736	2374380	172.894	+	.	model=1426043;ID=1426043;Parent=gene918044;flags=EST,Prot,Start,Stop,FullSupCDS;maxCDS=2372736 2374287;protCDS=2372776 2374275;protein_hit=gi|2105139|gb|AAC47544.1|;support=958041,960041,998041,1000041,1272041,1274041,2108041,7856042,7858042,7860042,*7862042,7864042,7866042,7868042
    scaffold_4	Chainer	exon	2372736	2372832	.	+	.	model=1426043;Parent=1426043
    scaffold_4	Chainer	exon	2372901	2373058	.	+	.	model=1426043;Parent=1426043
    scaffold_4	Chainer	exon	2373121	2373346	.	+	.	model=1426043;Parent=1426043
    scaffold_4	Chainer	exon	2373419	2373543	.	+	.	model=1426043;Parent=1426043
    scaffold_4	Chainer	exon	2373673	2373795	.	+	.	model=1426043;Parent=1426043
    scaffold_4	Chainer	exon	2373868	2374093	.	+	.	model=1426043;Parent=1426043
    scaffold_4	Chainer	exon	2374171	2374380	.	+	.	model=1426043;Parent=1426043
    
    # Hb4 scaffold_4:2375586-2377700:+ ; missing tile data & weakly expressed
    scaffold_4	tarall	ep	2377333	2377403	95	.	.	g=Hb4.ex7-;exp=male,fem,metal;exp999=male,metal;exp99=fem,male,metal;src=T
    scaffold_4	tarall	ep	2377453	2377628	95	.	.	g=Hb4.ex8?;exp=male;ovr=1;src=T
    scaffold_4	tarall	ep	2377553	2377628	99	.	.	exp=fem,male;exp95=fem;exp999=fem;src=T
    
    scaffold_4	Gnomon	mRNA	2375586	2377700	175.064	+	.	model=920044;ID=920044;Parent=gene920044;flags=EST,Prot,Start,Stop;protCDS=2375724 2377558;protein_hit=gi|4589706|dbj|BAA76871.1|;support=1428043
    scaffold_4	Gnomon	exon	2375586	2375732	.	+	.	model=920044;Parent=920044
    scaffold_4	Gnomon	exon	2376093	2376143	.	+	.	model=920044;Parent=920044
    scaffold_4	Gnomon	exon	2376211	2376368	.	+	.	model=920044;Parent=920044
    scaffold_4	Gnomon	exon	2376433	2376658	.	+	.	model=920044;Parent=920044
    scaffold_4	Gnomon	exon	2376738	2376862	.	+	.	model=920044;Parent=920044
    scaffold_4	Gnomon	exon	2376975	2377097	.	+	.	model=920044;Parent=920044
    scaffold_4	Gnomon	exon	2377165	2377390	.	+	.	model=920044;Parent=920044
    scaffold_4	Gnomon	exon	2377451	2377700	.	+	.	model=920044;Parent=920044
    
    # Hb5 scaffold_4:2380765-2382320:+ ; ** several TAR spots missed due to duplicate exon problem
    scaffold_4	tarall	ep	2380101	2380161	95	.	.	g=Hb5.ex1-;exp=fem;src=T;gene=Hb5.below
    scaffold_4	tarall	ep	2381276	2381336	95	.	.	g=Hb5.ex3;exp=fem;src=T;gene=Hb5.ex3
    scaffold_4	tarall	ep	2381983	2382033	999	.	.	g=Hb5.ex6;exp=fem,male;gene=Hb5.ex6;exp99=fem,male,metal;exp95=male,fem,metal;src=T
    scaffold_4	tarall	ep	2382210	2382276	99	.	.	g=Hb5.ex7;exp=fem;gene=hb5.ex7;exp95=male,fem;src=T
    
    scaffold_4	Gnomon	mRNA	2380765	2382320	172.019	+	.	model=924044;ID=924044;Parent=gene924044;flags=EST,Prot,Start,Stop;protCDS=2380774 2382201;protein_hit=gi|4589706|dbj|BAA76871.1|;support=1424043
    scaffold_4	Gnomon	exon	2380765	2380827	.	+	.	model=924044;Parent=924044
    scaffold_4	Gnomon	exon	2380895	2381052	.	+	.	model=924044;Parent=924044
    scaffold_4	Gnomon	exon	2381117	2381342	.	+	.	model=924044;Parent=924044
    scaffold_4	Gnomon	exon	2381416	2381540	.	+	.	model=924044;Parent=924044
    scaffold_4	Gnomon	exon	2381623	2381745	.	+	.	model=924044;Parent=924044
    scaffold_4	Gnomon	exon	2381813	2382038	.	+	.	model=924044;Parent=924044
    scaffold_4	Gnomon	exon	2382097	2382320	.	+	.	model=924044;Parent=924044
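
One way to quantify the "missing tile data" noted for some Hb genes is to count how many predicted exons are overlapped by at least one TAR hint. The sketch below does this for Hb4, using the exon and `ep` coordinates from the listing above with the attribute column trimmed to `src=T` and `Parent=...` for brevity. The helper names are hypothetical; this is not the script used to build the hint files.

```python
# Hb4 lines from the listing above (attributes trimmed for brevity).
HB4_GFF = """\
scaffold_4\ttarall\tep\t2377333\t2377403\t95\t.\t.\tsrc=T
scaffold_4\ttarall\tep\t2377453\t2377628\t95\t.\t.\tsrc=T
scaffold_4\ttarall\tep\t2377553\t2377628\t99\t.\t.\tsrc=T
scaffold_4\tGnomon\texon\t2375586\t2375732\t.\t+\t.\tParent=920044
scaffold_4\tGnomon\texon\t2376093\t2376143\t.\t+\t.\tParent=920044
scaffold_4\tGnomon\texon\t2376211\t2376368\t.\t+\t.\tParent=920044
scaffold_4\tGnomon\texon\t2376433\t2376658\t.\t+\t.\tParent=920044
scaffold_4\tGnomon\texon\t2376738\t2376862\t.\t+\t.\tParent=920044
scaffold_4\tGnomon\texon\t2376975\t2377097\t.\t+\t.\tParent=920044
scaffold_4\tGnomon\texon\t2377165\t2377390\t.\t+\t.\tParent=920044
scaffold_4\tGnomon\texon\t2377451\t2377700\t.\t+\t.\tParent=920044
"""


def parse_intervals(gff_text, feature):
    """Collect (start, end) pairs for rows whose feature column matches."""
    intervals = []
    for line in gff_text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        cols = line.strip().split("\t")
        if cols[2] == feature:
            intervals.append((int(cols[3]), int(cols[4])))
    return intervals


def tar_supported(exons, hints):
    """For each exon, True if any hint interval overlaps it (closed intervals)."""
    return [any(hs <= ee and es <= he for hs, he in hints) for es, ee in exons]


EXONS = parse_intervals(HB4_GFF, "exon")
HINTS = parse_intervals(HB4_GFF, "ep")
FLAGS = tar_supported(EXONS, HINTS)
print(f"{sum(FLAGS)} of {len(EXONS)} Hb4 exons have overlapping TAR hints")
```

For Hb4 only the last two exons overlap a TAR hint, consistent with the comment that this gene is missing tile data and is weakly expressed.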
    
    Don Gilbert gilbertd [A] indiana.edu

  • Developed at the Genome Informatics Lab of Indiana University Biology Department