Augustus Tile Expression calls for Daphnia pulex
2008 February
Tile expression outside of gene predictions
The following samples are drawn from tests on Daphnia scaffold_4
(containing 8 Hb tandem genes and others).
List of scaffold_4 Tile expression outside gene predictions
with map views.
Sample Augustus-Tile expression predictions
Example 3 Tile expr. in unpredicted region
Tile expr. hints let Augustus call genes in regions of strong Tile expression
that the other predictors miss.
Tracks:
Tarallx = tile expression (TAR) data used for predictions, abstracted
from 'Genome Tile Expression' using runs of data, control data.
Augustus-noevid+pasa = No evidence prediction from Daphnia-gene trained Augustus.
Aug+TarAll,+Tarall9 = Exon-part hints from TAR data, various weights.
Aug+TarAll.14,.17,.18 = CDS-part hints from TAR data, various weights.
Gnomon, JGI = other predictions
Example 2 Tile expr. in unpredicted region
Many exons supported by low/medium Tile expr.
CDS-hints call several short genes w/ CDS. Exon-hints
call fewer, long-exon genes w/ no CDS.
Tandem genes mismodelled by Gnomon/JGI
Daphnia-trained Augustus with no evidence calls this region best. TAR hints add
a few exons but reduce gene model quality.
Hemoglobin 8-tandem region
No-evidence gene models are most accurate. Tile hints break genes into
parts. Note also that several valid exons lack TAR data because the exons are
highly identical to one another (the tile array requires unique oligos).
Example 1 Tile expr. in unpredicted region
Example 4 Tile expr. in unpredicted region
Augustus + TAR test runs w/ various hints, weights
I've tried several quality measures for assessing these runs, but
this simple one seems most useful: CDS bases / total bases.
The final run on the whole genome used Test 17pa: strong weights on CDS-part hints, applied to TAR hints only.
The baseline prediction with no hints is Test 1pa.
CDSbases Test 1 : ntr=386, n=3405, m=236.62, cds=805701, tb=3067984, c/t=0.263
CDSbases Test 1pa: ntr=412, n=3638, m=232.80, cds=846910, tb=3067984, c/t=0.276 + base predict, no TAR data
CDSbases Test 2 : ntr=701, n=2636, m=261.11, cds=688275, tb=3067984, c/t=0.224
CDSbases Test 3 : ntr=737, n=2234, m=274.32, cds=612840, tb=3067984, c/t=0.200
CDSbases Test 4 : ntr=555, n=3364, m=239.09, cds=804282, tb=3067984, c/t=0.262
CDSbases Test 5 : ntr=641, n=2849, m=258.98, cds=737841, tb=3067984, c/t=0.240
CDSbases Test 6 : ntr=565, n=3382, m=240.33, cds=812790, tb=3067984, c/t=0.265
CDSbases Test 7 : ntr=541, n=3425, m=238.88, cds=818151, tb=3067984, c/t=0.267
CDSbases Test 8 : ntr=716, n=1220, m=387.49, cds=472743, tb=3074807, c/t=0.154
CDSbases Test 9 : ntr=765, n=4039, m=229.10, cds=925320, tb=3069252, c/t=0.301 +
CDSbases Test 10 : ntr=645, n=2423, m=277.91, cds=673374, tb=3069252, c/t=0.219
CDSbases Test 11 : ntr=600, n=2985, m=254.18, cds=758730, tb=3067984, c/t=0.247
CDSbases Test 12 : ntr=524, n=4118, m=227.18, cds=935511, tb=3068017, c/t=0.305 +
CDSbases Test 13 : ntr=428, n=3701, m=233.91, cds=865719, tb=3067984, c/t=0.282
CDSbases Test 14 : ntr=1684, n=5858, m=190.33, cds=1114974, tb=3070403, c/t=0.363 +
CDSbases Test 16pa : ntr=1198, n=4132, m=188.78, cds=780040, tb=3074819, c/t=0.254
CDSbases Test 17pa : ntr=1601, n=5602, m=207.05, cds=1159882, tb=3074819, c/t=0.377 * about 37% higher than baseline (Test 1pa)
CDSbases Test 18pa : ntr=1124, n=5042, m=214.57, cds=1081885, tb=3074819, c/t=0.352 *
ntr= number of transcripts/genes; n=no. CDS; m=mean CDS length; cds=sum of CDS-bases;
tb=total bases in gene regions; c/t= CDS/total bases ratio
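The c/t measure is easy to recompute from the rows themselves. The following is a minimal sketch, not the script used for these runs: it parses two rows copied from the table, recomputes cds/tb, and reports the relative gain of Test 17pa over the Test 1pa baseline. The function names are made up for illustration.

```python
import re

# Two rows copied from the table above (trailing notes dropped).
ROWS = """\
CDSbases Test 1pa: ntr=412, n=3638, m=232.80, cds=846910, tb=3067984, c/t=0.276
CDSbases Test 17pa : ntr=1601, n=5602, m=207.05, cds=1159882, tb=3074819, c/t=0.377
"""

ROW_RE = re.compile(r"CDSbases Test\s+(\S+?)\s*:\s*(.*)")


def parse_row(line):
    """Return (test_name, {key: float}) for one 'CDSbases' row."""
    m = ROW_RE.match(line)
    if m is None:
        raise ValueError("not a CDSbases row: %r" % line)
    name, rest = m.groups()
    fields = re.findall(r"([\w/]+)=([\d.]+)", rest)
    return name, {key: float(value) for key, value in fields}


def cds_ratio(stats):
    """CDS bases divided by total bases in gene regions."""
    return stats["cds"] / stats["tb"]


tests = dict(parse_row(line) for line in ROWS.splitlines())
ratios = {name: cds_ratio(stats) for name, stats in tests.items()}

# Relative gain of the final run over the no-hint baseline.
gain = ratios["17pa"] / ratios["1pa"] - 1.0

for name, ratio in ratios.items():
    print("Test %-5s c/t=%.3f" % (name, ratio))
print("gain over baseline: %.1f%%" % (100 * gain))
```

Note that tb differs slightly between the two runs, so the gain in the ratio is not exactly the gain in CDS bases.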
Sample Augustus HINTS configuration for TAR data
Notes:
Use only *part hints; precise hints (exact exon, CDS, intron, or splice-site
boundaries) fail because TAR boundaries are imprecise by multiple bases.
Non-exon hints are possible from TAR data, but tests show the available
intron/intergenic hints hurt gene calls, probably because they are uncertain.
Exon-part versus CDS-part: TAR data measure exon expression (including UTR);
however, using only TAR exon-part hints yields almost no CDS predictions, only
non-CDS exons. CDS-part hints appear most effective, but the gene models
are short and partial compared to no-hint predictions.
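To try the exon-part versus CDS-part comparison, the same TAR records can be emitted under either feature name. This is a minimal sketch under a loud assumption: that relabeling the feature column of the `tarall ep` records is enough to turn them into CDS-part hints. These notes do not say how the original hint files were built, and the function name `relabel_tar_hints` is hypothetical.

```python
def relabel_tar_hints(lines, source="tarall", feature="CDSpart"):
    """Yield TAR records with the feature column replaced by `feature`.

    Assumption: hint records are whitespace- or tab-separated GFF with
    9 columns, and the TAR records carry `source` in column 2.
    Comment lines and records from other sources are skipped.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split(None, 8)
        if len(cols) < 9 or cols[1] != source:
            continue
        cols[2] = feature
        yield "\t".join(col.strip() for col in cols)


sample = [
    "# Hb1 scaffold_4:2366621-2368356:+",
    "scaffold_4 tarall ep 2366677 2366732 999 . . g=Hb1.ex1;exp=fem,metal;src=T",
    "scaffold_4 Gnomon exon 2366621 2366746 . + . model=914044;Parent=914044",
]

cds_hints = list(relabel_tar_hints(sample))
for hint in cds_hints:
    print(hint)
```

Passing `feature="exonpart"` gives the exon-part variant from the same input, so both hint sets stay in step.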
[SOURCES]
M T P E
# source of extrinsic information:
# M manual anchor (required)
# T transcript active regions from genome tile array
# P protein database hit
# E est database hit
[SOURCE-PARAMETERS]
E individual_liability
P individual_liability
T individual_liability
# feature bonus malus gradelevelcolumns
# r+/r-
# tests show G: intronpart, exonpart, CDSpart, irpart are useful
# CDS, exon, intron, start, stop, ass, dss are problematic: dropped
[GENERAL]
exonpart 1 1 M 1 1e+100 P 1 1 E 1 1e+29 G 1 1e+9
T 4 94 98 998 1 1e+19 1e+19 1e+35
intronpart 1 1 M 1 1e+100 P 1 1 E 1 1e+9 G 1 1e+9 T 1 1e+8
CDSpart 1 1 M 1 1e+100 P 4 40 90 140 1 1e+4 1e+7 1e+9 E 1 1 G 1 1e+15
T 4 94 98 998 1 1e+19 1e+19 1e+35
UTRpart 1 1 M 1 1e+100 P 1 1 E 1 1e+5 G 1 1e+0 T 1 1
irpart 1 1 M 1 1e+100 P 1 1 E 1 1e+5 G 1 1e+9 T 1 1e+8
nonexonpart 1 1 M 1 1e+100 P 1 1 E 1 1e+5 G 1 1e+0 T 1 1e+8
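As I read the Augustus extrinsic config format, each source entry is a source letter, a number of grade classes, the class boundaries, and then one bonus factor per class. Under that reading, `T 4 94 98 998 1 1e+19 1e+19 1e+35` sorts TAR hint scores into four classes. The sketch below shows which factor the sample TAR scores (95, 99, 999) would get. It is an illustration only, not Augustus code; how Augustus treats a score exactly equal to a boundary is an assumption here, and none of the sample scores hits one.

```python
from bisect import bisect_right


def parse_grades(spec):
    """Split one source entry into (boundaries, bonus factors).

    Assumed layout: SOURCE NUMCLASSES boundary... factor...
    with NUMCLASSES-1 boundaries and NUMCLASSES factors.
    """
    tokens = spec.split()
    n = int(tokens[1])
    bounds = [float(t) for t in tokens[2:2 + n - 1]]
    factors = [float(t) for t in tokens[2 + n - 1:2 + 2 * n - 1]]
    if len(bounds) != n - 1 or len(factors) != n:
        raise ValueError("malformed grade entry: %r" % spec)
    return bounds, factors


def bonus_for(score, bounds, factors):
    """Return the factor of the grade class that `score` falls into."""
    # Assumption: a score equal to a boundary belongs to the higher class.
    return factors[bisect_right(bounds, score)]


bounds, factors = parse_grades("T 4 94 98 998 1 1e+19 1e+19 1e+35")
for score in (50, 95, 99, 999):
    print(score, bonus_for(score, bounds, factors))
```

With these settings the 95 and 99 classes get the same factor, so only the 999-score TAR spots are weighted more strongly.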
Sample TAR hints: Hb genes
# dpulex-sc4hbregion-tar-all.gff
# Hemoglobin region: (from PASA)
# scaffold_4:2366641-2390393
# from annotated NCBI GNO genes
# view at http://server2.eugenes.org/cgi-bin/gbrowsenew/gbrowse/dpxaugust/
# Hb1 scaffold_4:2366621-2368356:+
scaffold_4 tarall ep 2366607 2367052 95 . . g=Hb1.ex1-ex2;exp=male;exp99=male;ovr=1;src=T
scaffold_4 tarall ep 2366657 2366732 999 . . g=Hb1.ex1;exp=male;exp99=fem,metal;exp95=fem,metal;ovr=1;src=T
scaffold_4 tarall ep 2366677 2366732 999 . . g=Hb1.ex1;exp=fem,metal;src=T
scaffold_4 tarall ep 2366807 2367052 99 . . g=Hb1.ex2+;exp=fem,male;exp95=fem;ovr=1;src=T
scaffold_4 tarall ep 2366877 2366952 999 . . g=Hb1.ex2-;exp=fem,male,metal;exp99=metal;exp95=metal;src=T
scaffold_4 tarall ep 2367157 2367327 999 . . g=Hb1.ex3;exp=fem,male,metal;exp99=fem,male,metal;exp95=male,fem,metal;src=T
scaffold_4 tarall ep 2367432 2367552 99 . . g=Hb1.ex4;exp=fem,male,metal;exp95=male,fem,metal;exp999=fem,male,metal;src=T
scaffold_4 tarall ep 2367602 2368577 99 . . g=Hb1.ex5-ex7;exp=male;exp95=male,fem;ovr=1;src=T
scaffold_4 tarall ep 2367657 2368552 99 . . g=Hb1.ex5-ex7;exp=fem,metal;exp999=male,fem;exp95=metal;ovr=1;src=T
scaffold_4 tarall ep 2367682 2367752 999 . . g=Hb1.ex5;exp=metal;src=T
scaffold_4 tarall ep 2367857 2368057 999 . . g=Hb1.ex6;exp=fem,metal;exp95=metal;exp99=metal;src=T
scaffold_4 tarall ep 2368132 2368332 95 . . g=Hb1.ex7;exp=metal;ovr=1;src=T
scaffold_4 tarall ep 2368152 2368507 999 . . g=Hb1.ex7+;exp=male,fem;exp99=metal;ovr=1;src=T
scaffold_4 tarall ep 2368207 2368332 999 . . g=Hb1.ex7-;exp=metal;src=T
scaffold_4 tarall ep 2368728 2368778 95 . . g=Hb1.ex7>;exp=male;src=T;flag=newexon:utr
scaffold_4 Gnomon mRNA 2366621 2368356 169.02 + . model=914044;ID=914044;Parent=gene914044;flags=EST,Prot,Start,Stop;protCDS=2366684 2368246;protein_hit=gi|5881967|gb|AAD55141.1|;support=1416043
scaffold_4 Gnomon exon 2366621 2366746 . + . model=914044;Parent=914044
scaffold_4 Gnomon exon 2366863 2367023 . + . model=914044;Parent=914044
scaffold_4 Gnomon exon 2367103 2367328 . + . model=914044;Parent=914044
scaffold_4 Gnomon exon 2367424 2367548 . + . model=914044;Parent=914044
scaffold_4 Gnomon exon 2367630 2367752 . + . model=914044;Parent=914044
scaffold_4 Gnomon exon 2367831 2368059 . + . model=914044;Parent=914044
scaffold_4 Gnomon exon 2368133 2368356 . + . model=914044;Parent=914044
# Hb2 scaffold_4:2370110-2371749:+
scaffold_4 tarall ep 2370033 2370203 95 . . exp=male,fem;ovr=1;src=T
scaffold_4 tarall ep 2370058 2370203 99 . . exp=male;ovr=1;src=T
scaffold_4 tarall ep 2370078 2370183 99 . . exp=fem;ovr=1;src=T
scaffold_4 tarall ep 2370103 2370183 999 . . g=Hb2.ex1;exp=fem,male,metal;exp99=metal;exp95=metal;src=T
scaffold_4 tarall ep 2370284 2370529 99 . . exp=male,fem;exp95=male,fem,metal;exp999=fem;src=T
scaffold_4 tarall ep 2370634 2371879 95 . . exp=male,fem,metal;exp99=male,fem,metal;exp999=fem,male,metal;src=T
scaffold_4 tarall ep 2370809 2371854 99 . . exp=male;exp95=fem;ovr=1;src=T
scaffold_4 tarall ep 2370829 2370929 999 . . g=Hb2.ex4;exp=fem,male;exp99=fem;ovr=1;src=T
scaffold_4 tarall ep 2370859 2370929 99 . . g=Hb2.ex4;exp=metal;exp95=metal;ovr=1;src=T
scaffold_4 tarall ep 2370879 2370929 999 . . g=Hb2.ex4;exp=metal;src=T
scaffold_4 tarall ep 2370979 2371479 99 . . exp=fem;ovr=1;src=T
scaffold_4 tarall ep 2371059 2371209 999 . . g=Hb2.ex5;exp=fem,male;ovr=1;src=T
scaffold_4 tarall ep 2371079 2371159 99 . . exp=metal;exp95=metal;ovr=1;src=T
scaffold_4 tarall ep 2371109 2371159 999 . . g=Hb2.ex5-;exp=metal;src=T
scaffold_4 tarall ep 2371254 2371459 999 . . g=Hb2.ex6;exp=fem,male,metal;exp99=metal;exp95=metal;src=T
scaffold_4 tarall ep 2371529 2371829 99 . . g=Hb2.ex7;exp=fem;ovr=1;src=T
scaffold_4 tarall ep 2371554 2371829 999 . . g=Hb2.ex7;exp=male,fem,metal;exp95=metal;exp99=metal;ovr=1;src=T
scaffold_4 tarall ep 2371679 2371729 999 . . g=Hb2.ex7-;exp=metal;src=T
scaffold_4 Chainer mRNA 2370110 2371749 163.191 + . model=1418043;ID=1418043;Parent=gene916044;flags=EST,Prot,Start,Stop,FullSupCDS;protein_hit=gi|5881967|gb|AAD55141.1|;support=1204041,1206041,1436041,1502041,1504041,1644041,2286041,2288041,2618041,2620041,4132041,7842042,7844042,7846042,7852042,*7854042
scaffold_4 Chainer exon 2370110 2370184 . + . model=1418043;Parent=1418043
scaffold_4 Chainer exon 2370276 2370436 . + . model=1418043;Parent=1418043
scaffold_4 Chainer exon 2370514 2370739 . + . model=1418043;Parent=1418043
scaffold_4 Chainer exon 2370819 2370943 . + . model=1418043;Parent=1418043
scaffold_4 Chainer exon 2371045 2371167 . + . model=1418043;Parent=1418043
scaffold_4 Chainer exon 2371238 2371466 . + . model=1418043;Parent=1418043
scaffold_4 Chainer exon 2371535 2371749 . + . model=1418043;Parent=1418043
# Hb3 scaffold_4:2372736-2374380:+
scaffold_4 tarall ep 2372736 2372826 95 . . exp=metal;ovr=1;src=T
scaffold_4 tarall ep 2372761 2372826 999 . . exp=male,metal;exp99=fem,male,metal;exp95=male,fem;src=T
scaffold_4 tarall ep 2372886 2373076 95 . . exp=fem;exp99=fem;ovr=1;src=T
scaffold_4 tarall ep 2372901 2373151 95 . . exp=male;src=T
scaffold_4 tarall ep 2373609 2374255 95 . . exp=male;ovr=1;src=T
scaffold_4 tarall ep 2373660 2373805 95 . . exp=fem;exp99=male;ovr=1;src=T
scaffold_4 tarall ep 2373685 2373780 999 . . exp=fem,male;exp99=fem;ovr=1;src=T
scaffold_4 tarall ep 2373705 2373780 99 . . exp=metal;exp95=metal;src=T
scaffold_4 tarall ep 2373905 2374255 99 . . exp=fem,male,metal;exp95=fem,metal;exp999=fem,male,metal;src=T
scaffold_4 tarall ep 2374180 2374255 999 . . exp=male;exp95=metal;ovr=1;src=T
scaffold_4 tarall ep 2374205 2374255 999 . . exp=fem,metal;exp99=metal;src=T
scaffold_4 Chainer mRNA 2372736 2374380 172.894 + . model=1426043;ID=1426043;Parent=gene918044;flags=EST,Prot,Start,Stop,FullSupCDS;maxCDS=2372736 2374287;protCDS=2372776 2374275;protein_hit=gi|2105139|gb|AAC47544.1|;support=958041,960041,998041,1000041,1272041,1274041,2108041,7856042,7858042,7860042,*7862042,7864042,7866042,7868042
scaffold_4 Chainer exon 2372736 2372832 . + . model=1426043;Parent=1426043
scaffold_4 Chainer exon 2372901 2373058 . + . model=1426043;Parent=1426043
scaffold_4 Chainer exon 2373121 2373346 . + . model=1426043;Parent=1426043
scaffold_4 Chainer exon 2373419 2373543 . + . model=1426043;Parent=1426043
scaffold_4 Chainer exon 2373673 2373795 . + . model=1426043;Parent=1426043
scaffold_4 Chainer exon 2373868 2374093 . + . model=1426043;Parent=1426043
scaffold_4 Chainer exon 2374171 2374380 . + . model=1426043;Parent=1426043
# Hb4 scaffold_4:2375586-2377700:+ ; missing tile data & weakly expressed
scaffold_4 tarall ep 2377333 2377403 95 . . g=Hb4.ex7-;exp=male,fem,metal;exp999=male,metal;exp99=fem,male,metal;src=T
scaffold_4 tarall ep 2377453 2377628 95 . . g=Hb4.ex8?;exp=male;ovr=1;src=T
scaffold_4 tarall ep 2377553 2377628 99 . . exp=fem,male;exp95=fem;exp999=fem;src=T
scaffold_4 Gnomon mRNA 2375586 2377700 175.064 + . model=920044;ID=920044;Parent=gene920044;flags=EST,Prot,Start,Stop;protCDS=2375724 2377558;protein_hit=gi|4589706|dbj|BAA76871.1|;support=1428043
scaffold_4 Gnomon exon 2375586 2375732 . + . model=920044;Parent=920044
scaffold_4 Gnomon exon 2376093 2376143 . + . model=920044;Parent=920044
scaffold_4 Gnomon exon 2376211 2376368 . + . model=920044;Parent=920044
scaffold_4 Gnomon exon 2376433 2376658 . + . model=920044;Parent=920044
scaffold_4 Gnomon exon 2376738 2376862 . + . model=920044;Parent=920044
scaffold_4 Gnomon exon 2376975 2377097 . + . model=920044;Parent=920044
scaffold_4 Gnomon exon 2377165 2377390 . + . model=920044;Parent=920044
scaffold_4 Gnomon exon 2377451 2377700 . + . model=920044;Parent=920044
# Hb5 scaffold_4:2380765-2382320:+ ; ** several TAR spots missed due to duplicate exon problem
scaffold_4 tarall ep 2380101 2380161 95 . . g=Hb5.ex1-;exp=fem;src=T;gene=Hb5.below
scaffold_4 tarall ep 2381276 2381336 95 . . g=Hb5.ex3;exp=fem;src=T;gene=Hb5.ex3
scaffold_4 tarall ep 2381983 2382033 999 . . g=Hb5.ex6;exp=fem,male;gene=Hb5.ex6;exp99=fem,male,metal;exp95=male,fem,metal;src=T
scaffold_4 tarall ep 2382210 2382276 99 . . g=Hb5.ex7;exp=fem;gene=hb5.ex7;exp95=male,fem;src=T
scaffold_4 Gnomon mRNA 2380765 2382320 172.019 + . model=924044;ID=924044;Parent=gene924044;flags=EST,Prot,Start,Stop;protCDS=2380774 2382201;protein_hit=gi|4589706|dbj|BAA76871.1|;support=1424043
scaffold_4 Gnomon exon 2380765 2380827 . + . model=924044;Parent=924044
scaffold_4 Gnomon exon 2380895 2381052 . + . model=924044;Parent=924044
scaffold_4 Gnomon exon 2381117 2381342 . + . model=924044;Parent=924044
scaffold_4 Gnomon exon 2381416 2381540 . + . model=924044;Parent=924044
scaffold_4 Gnomon exon 2381623 2381745 . + . model=924044;Parent=924044
scaffold_4 Gnomon exon 2381813 2382038 . + . model=924044;Parent=924044
scaffold_4 Gnomon exon 2382097 2382320 . + . model=924044;Parent=924044
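The missing-TAR problem can be quantified by counting how many bases of each annotated exon are covered by TAR spots. A minimal sketch using the Hb5 coordinates listed above; the helper names are hypothetical, and coordinates are treated as 1-based inclusive, as in GFF.

```python
# Hb5 Gnomon exons and tarall spots, copied from the listing above.
HB5_EXONS = [
    (2380765, 2380827), (2380895, 2381052), (2381117, 2381342),
    (2381416, 2381540), (2381623, 2381745), (2381813, 2382038),
    (2382097, 2382320),
]
HB5_TAR = [
    (2380101, 2380161), (2381276, 2381336),
    (2381983, 2382033), (2382210, 2382276),
]


def merge_intervals(intervals):
    """Merge overlapping or adjacent inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(exon, spots):
    """Count exon bases covered by any spot (spots must be merged)."""
    total = 0
    for start, end in spots:
        total += max(0, min(exon[1], end) - max(exon[0], start) + 1)
    return total


tar = merge_intervals(HB5_TAR)
coverage = [covered_bases(exon, tar) for exon in HB5_EXONS]
supported = sum(1 for bases in coverage if bases > 0)

for exon, bases in zip(HB5_EXONS, coverage):
    length = exon[1] - exon[0] + 1
    print("%d-%d: %d of %d bases with TAR" % (exon[0], exon[1], bases, length))
print("exons with any TAR support: %d of %d" % (supported, len(HB5_EXONS)))
```

Merging the spots first matters for regions like Hb1 and Hb2, where the TAR records overlap one another and would otherwise be counted twice.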
Don Gilbert
gilbertd [A] indiana.edu