# rgaspdg-model-notes.txt

Thanks much for your comments. ...

It took me a while to reach my criteria for being worthwhile, something better than EST assemblies, even after I saw that this data had what is needed to make full gene models. My approach pretty much guarantees the problems you noted at this stage: mixed-stranded genes, exons that don't touch adjacent introns, exon-exon and even intron-intron parts (I've removed the intron-only transcripts for rGASP :). [See dgg_rnaseq_eval.txt for error counts]

The last two weeks have been spent reducing those errors. One of the lessons in this, which I already knew, is not to take shortcuts, i.e. not to leave out detail checks. I can't eliminate all model errors quickly without unwarranted assumptions; there are real cases of mixed-stranded genes, but probably all of mine are spurious. To eliminate them, however, this method needs cleaner gene-end data. There is more work needed on this approach, including fuller use of mate-pair linkages, which would close many of those exon fragments you see and define gene models more accurately.

What I wanted to see done with this data, in addition to full gene prediction by you and others, is a relatively simple data interpretation that minimizes assumptions about the underlying biology. I've tried to do that in this software set, being conservative about changing strandedness or joining gaps, and trying to get the RNA-Seq data to say what the genes are like. I can see that this data is very good for that. Eventually someone will produce a good RNA-Seq assembler that doesn't assume things about the data that are not there.

I looked at alternatives to Tophat/Bowtie for mapping reads, but didn't want to spend too much time on the choice of mapping software. I also have some concerns that it isn't working perfectly, but I think the errors it contributes are the smallest concern in what I've done.
I ran Tophat on subsets of the data (each technical block) and then combined the results, so the overloading issue you note may be smaller that way. A larger concern to me is the issue of duplicated reads, and how ignoring them biases results for duplicated gene regions. About 1/2 of the paired reads did not have their mates in the results of the analysis I ran with Tophat/Bowtie; that is on the order of 1.5 billion missing mates versus 1.7 billion that did have their mate mapped. BLAT would have given me all the matches and let me decide what to do with the duplicates (which might have delayed me enough to miss this deadline :). Some of this missing-mate problem may be a technical quality issue, as the DrosMel data appeared to improve over time and has different rates in the different cell lines.

- Don
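
Note on the missing-mate counts above: a minimal sketch of how such a tally could be made, assuming the mapper output is available as SAM text with the standard FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The function names count_mate_status and missing_fraction are hypothetical, for illustration only; this is not the script used for the numbers in this note.

```python
# SAM FLAG bits (from the SAM format specification)
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def count_mate_status(sam_lines):
    """Count mapped paired reads with and without a mapped mate.

    Returns (with_mate, missing_mate). Header lines, unpaired reads,
    and unmapped reads are skipped.
    """
    with_mate = 0
    missing_mate = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # blank line or SAM header
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # FLAG is the second SAM column
        if not (flag & PAIRED) or (flag & UNMAPPED):
            continue
        if flag & MATE_UNMAPPED:
            missing_mate += 1
        else:
            with_mate += 1
    return with_mate, missing_mate


def missing_fraction(with_mate, missing_mate):
    """Fraction of mapped paired reads whose mate is missing."""
    total = with_mate + missing_mate
    return missing_mate / total if total else 0.0
```

With the round figures quoted above, missing_fraction(1_700_000_000, 1_500_000_000) comes to about 0.47, in line with "about 1/2". Since Tophat was run per technical block, the tally would be made per block and then summed.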