Don Gilbert's recipe for RNA-Seq assembly to gene models
2009 Sept, gilbertd at indiana edu
=====================================================================================
Evaluation:

 * Using DrosMel reference genes, FlyBase gene version 5.10, and comparison to
   a PASA EST assembly of 570,000 ESTs (dbEST, March 2008).
 * Reference genes with less than 40 bases of read coverage (>4 reads/base) are
   removed, so that gene models are evaluated only where coverage exists.
 * Models are scored for sensitivity and specificity at exon, transcript and
   gene levels, using bases matched and bases missed (over and under) of
   reference genes. These are compared with EST assembly scores.
 * Model quality errors are also tabulated by macovsup.

=====================================================================================
DrosMel all chroms:

All EST asm
#collect_overlaps=108684
ngene=16412; genehit=10400; gperfect=5550; nexon=52759; exonhit=41735
Exon:   Sn=69.7; Sp=82.7; Sp2=37.3; Ave=76.2; hit=18053972; miss=3751482; exonbase=25866420; allover=48395156
BestTr: Sn=57; Sp=83.1; Sp2=48; Ave=70.05; hit=23237010; miss=4715178; allbase=40728485;
Gene:   Sn=54.5; Sp=85.2; Sp2=25.5; Ave=69.85; hit=12378575; miss=2146236; genebase=22680550;
#overlaps found=41735

RNASeq.fine.genes; macovsup v1.19
exons:67608, introns:31835, genes:35122, alt-tr:6839 (excluding poor*)
#collect_overlaps=139547
ngene=16412; genehit=12409; gperfect=4156; nexon=52759; exonhit=40119
Exon:   Sn=74.1; Sp=55.2; Sp2=33.8; Ave=64.65; hit=19185234; miss=15521787; exonbase=25866420; allover=56641135
BestTr: Sn=69.5; Sp=71.5; Sp2=50; Ave=70.5; hit=28330955; miss=11288767; allbase=40728485;
Gene:   Sn=68.8; Sp=72.9; Sp2=27.5; Ave=70.85; hit=15615566; miss=5776000; genebase=22680550;
#overlaps found=40119

Quality error counts on all RNASeq.fine models
  21971 cdspoor *
   6019 exonabut
  12178 exonshort
   1389 intronabut
   3048 intronend
   2316 intronxgap
    991 mixedstrand
   8286 noscore *
   4223 overpoor *
  * = classes excluded from tally above
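
The notes do not spell out how Sn, Sp, Sp2 and Ave are derived from the base
counts. The sketch below is an inference from the printed numbers, not the
evaluation script itself: it assumes Sn = hit/reference bases,
Sp = hit/(hit+miss), Sp2 = hit/allover, Ave = mean of Sn and Sp, with
percentages truncated (not rounded) to one decimal. The function names
`trunc1` and `overlap_scores` are made up for illustration. Under those
assumptions it reproduces the rows checked here.

```python
import math


def trunc1(pct):
    """Truncate a percentage to one decimal place (assumed, not rounded)."""
    return math.floor(pct * 10) / 10


def overlap_scores(hit, miss, refbase, allover=None):
    """Base-level scores from hit/miss counts (formulas inferred from the tables)."""
    sn = trunc1(100 * hit / refbase)          # share of reference bases covered
    sp = trunc1(100 * hit / (hit + miss))     # share of scored model bases on reference
    scores = {"Sn": sn, "Sp": sp, "Ave": round((sn + sp) / 2, 2)}
    if allover is not None:                   # allover is only printed on Exon rows
        scores["Sp2"] = trunc1(100 * hit / allover)
    return scores


# Exon row of the EST assembly and Gene row of RNASeq.fine.genes, as tabulated.
est_exon = overlap_scores(hit=18053972, miss=3751482,
                          refbase=25866420, allover=48395156)
rna_gene = overlap_scores(hit=15615566, miss=5776000, refbase=22680550)

print(est_exon)
print(rna_gene)
```

The RNASeq Gene row is the useful check: its Sp is 72.9 when truncated but
would print as 73.0 if rounded, which is why truncation is assumed.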
=====================================================================================
Read and alignment counts per grouping, as count(per-reads)

all: nr=338896759, ni=39, avelen=1552(39), farmate=1836849(0.00542), mated=134908497(0.398), mismate=119680942(0.353), na=413953537(1.22), poor=39471276(0.116), stranded=4618360(0.0136),

# Cell line (pe = paired end reads, sr = single reads)
dmCMEpe: nr=50095520, ni=5, avelen=185(37), farmate=284143(0.00567), mated=19557742(0.39), mismate=34091560(0.681), na=56144412(1.12), poor=7408476(0.148), stranded=580675(0.0116),
dmKc167pe: nr=49099950, ni=5, avelen=185(37), farmate=344505(0.00702), mated=17906104(0.365), mismate=23757345(0.484), na=44607619(0.909), poor=7100159(0.145), stranded=577815(0.0118),
dmKc167sr: nr=46800876, ni=7, avelen=252(36), farmate=0(0), mated=0(0), mismate=0(0), na=79415242(1.7), poor=2621574(0.056), stranded=552195(0.0118),
dmMLpe: nr=46974396, ni=6, avelen=222(37), farmate=328651(0.007), mated=27902922(0.594), mismate=38088029(0.811), na=68272911(1.45), poor=5734540(0.122), stranded=669985(0.0143),
dmS2pe: nr=103022042, ni=9, avelen=333(37), farmate=879550(0.00854), mated=69541729(0.675), mismate=23744008(0.23), na=98524849(0.956), poor=7910580(0.0768), stranded=1643909(0.016),
dmS2sr: nr=42903975, ni=7, avelen=375(53), farmate=0(0), mated=0(0), mismate=0(0), na=66988504(1.56), poor=8695947(0.203), stranded=593781(0.0138),

# Read length
rl36: nr=46800876, ni=7, avelen=252(36), farmate=0(0), mated=0(0), mismate=0(0), na=79415242(1.7), poor=2621574(0.056), stranded=552195(0.0118),
rl37: nr=249191908, ni=25, avelen=925(37), farmate=1836849(0.00737), mated=134908497(0.541), mismate=119680942(0.48), na=267549791(1.07), poor=28153755(0.113), stranded=3472384(0.0139),
rl45: nr=27808540, ni=5, avelen=225(45), farmate=0(0), mated=0(0), mismate=0(0), na=43688500(1.57), poor=8695947(0.313), stranded=98549(0.00354),
rl75: nr=15095435, ni=2, avelen=150(75), farmate=0(0), mated=0(0), mismate=0(0), na=23300004(1.54), poor=0(0), stranded=495232(0.0328),

# Read type (PE = paired reads, SR = single reads)
typePE: nr=249191908, ni=25, avelen=925(37), farmate=1836849(0.00737), mated=134908497(0.541), mismate=119680942(0.48), na=267549791(1.07), poor=28153755(0.113), stranded=3472384(0.0139),
typeSR: nr=89704851, ni=14, avelen=627(44), farmate=0(0), mated=0(0), mismate=0(0), na=146403746(1.63), poor=11317521(0.126), stranded=1145976(0.0128),

# Run Date
080715: nr=9649105, ni=2, avelen=90(45), farmate=0(0), mated=0(0), mismate=0(0), na=13080945(1.36), poor=3348870(0.347), stranded=31924(0.00331),
080815: nr=18159435, ni=3, avelen=135(45), farmate=0(0), mated=0(0), mismate=0(0), na=30607555(1.69), poor=5347077(0.294), stranded=66625(0.00367),
081114: nr=10654850, ni=1, avelen=37(37), farmate=91445(0.00858), mated=8443925(0.792), mismate=1839736(0.173), na=10675788(1), poor=533863(0.0501), stranded=178834(0.0168),
081121: nr=88856312, ni=8, avelen=296(37), farmate=660804(0.00744), mated=46155339(0.519), mismate=41129024(0.463), na=92058505(1.04), poor=10640414(0.12), stranded=1152538(0.013),
081202: nr=47130556, ni=7, avelen=259(37), farmate=488576(0.0104), mated=24616410(0.522), mismate=26682125(0.566), na=53577983(1.14), poor=5203868(0.11), stranded=693305(0.0147),
081216: nr=67593194, ni=6, avelen=222(37), farmate=457444(0.00677), mated=34408898(0.509), mismate=39903606(0.59), na=77922250(1.15), poor=8750030(0.129), stranded=920950(0.0136),
081223: nr=34956996, ni=3, avelen=111(37), farmate=138580(0.00396), mated=21283925(0.609), mismate=10126451(0.29), na=33315265(0.953), poor=3025580(0.0866), stranded=526757(0.0151),
090311: nr=4988862, ni=1, avelen=75(75), farmate=0(0), mated=0(0), mismate=0(0), na=8513382(1.71), poor=0(0), stranded=188153(0.0377),
090512: nr=10106573, ni=1, avelen=75(75), farmate=0(0), mated=0(0), mismate=0(0), na=14786622(1.46), poor=0(0), stranded=307079(0.0304),
#...............
Value = sum (per-reads)
Keys:
  nr       = number of reads/group (2 paired reads are counted as 2)
  ni       = number of replicate runs/group
  na       = number of accepted_hits alignments (using tophat/bowtie -k 40 max alignments/read)
  avelen   = average read length, summed over runs (mean per run in parentheses)
  mated    = alignments with valid pairs
  mismate  = alignments with invalid mate
  poor     = under 97% identity (2+ mismatches for 37bp)
  stranded = spliced alignments
  farmate  = mates excluded by distance (25Kb in this run, should be increased greatly)

Note: the na/nr ratio is about 1.0 for PE but about 1.6 for SR, due to the way
Tophat discards mate-pair duplicates with longer inner spans.
=====================================================================================
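
To recheck the per-read values or the na/nr note, the `key=count(per-read)`
records can be read back into a small structure. This is a minimal sketch, not
part of the original pipeline; `parse_record` and the regular expression are
illustrative names, and it assumes each record follows the
`name: key=count(value), ...` layout used in the counts listing.

```python
import re

# key=count with an optional parenthesized per-read value, e.g. na=146403746(1.63)
FIELD = re.compile(r"(\w+)=(\d+)(?:\(([\d.]+)\))?")


def parse_record(line):
    """Split 'name: k=count(per), ...' into (name, {k: (count, per_or_None)})."""
    name, _, rest = line.partition(":")
    fields = {}
    for key, count, per in FIELD.findall(rest):
        fields[key] = (int(count), float(per) if per else None)
    return name.strip(), fields


# The typeSR record, copied from the counts listing.
line = ("typeSR: nr=89704851, ni=14, avelen=627(44), farmate=0(0), mated=0(0), "
        "mismate=0(0), na=146403746(1.63), poor=11317521(0.126), "
        "stranded=1145976(0.0128),")

name, f = parse_record(line)
ratio = f["na"][0] / f["nr"][0]   # alignments per read
print(name, round(ratio, 2))      # typeSR 1.63
```

The recomputed ratio matches the value printed in parentheses after `na`, which
is the SR figure the note refers to.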