Don Gilbert's recipe for RNA-Seq assembly to gene models ===================================================================================== 1. Map reads to genome, with bowtie/tophat, using accepted_hits.sam result using options for 1,2 mismatches, filter out later as desired. tophat --min-isoform-fraction 0.1 --mate-inner-dist $matedist --splice-mismatches 1 \ --min-intron 40 --max-intron 300000 --min-anchor 5 --segment-length $readsize \ --min-closure-intron 40 --min-coverage-intron 40 --min-segment-intron 40 \ --max-closure-intron 50000 --max-coverage-intron 300000 --max-segment-intron 500000 \ -o $query $genome $seq1 $seq2 2. Process accepted_hits.sam to signal location count table hits=$tpart/accepted_hits.sam.gz gunzip -c $hits | $APPDIR/matecovtab.pl -debug -hits=stdin > matecov.tab This is a superset of tophat's coverage.wig 2b. merge tables where sensible; best result is from merge of all data sets. cat $parts/matecov.tab | sort -k1,1 -k2,2n -k3,3nr | matecovmerge.pl -sorted > matecov.merge.tab 3. Produce rough gene models from signal table using run-length encoding of signals. This is roughly analogous to transcript fragment calculations, using only the signal table from mapped reads. Rough models include exons, introns, and gene-ends with gene ids, and no-signal spans. Besides exon/transcript fragments from read signal, this step uses intron/splices and mate coverage to provide strand and gene ends; complete if rough gene models. matecov2gene.pl -place -step 1 -in matecov.merge.tab > matecov.rough.genes.gff rough.genes are surprisingly good given the simple method, but have various problems, no alternate transcripts. 4. Refine rough models, produce alternate transcripts, and resolve problems including overlappedIntrons, adjacentExons, exonsMissIntrons, endInIntron, exonsInIntrons, mixedstrand macovsup.pl -nodebug -model matecov.rough.genes.gff > matecov.fine.genes.gff fine.genes includes mRNA/exon/intron models, with alternate transcripts, read and mate scores and paths, gene quality scores. Quality scores include cdspoor, exonabut,exonshort,intronabut,intronend,intronxgap,mixedstrand,noscore,overpoor This step when attempting to resolve problems is conservative, e.g. it will close gaps between exon and intron only if smaller than a minimum intron span. There are various improvements to steps 3 and 4 that can be added, including fuller use of mate pair linkages, better interpretation of mate/read signal changes, finer resolution of exon end points, etc. While currently this recipe uses no signals interpreted from sequence structure, such as TATA, splice sites, etc., such would improve the result. This perhaps is better left to another program already designed for such. 2009 Sept 08, gilbertd at indiana edu ===================================================================================== Documents, results and software: http://insects.eugenes.org/species/data/dmel5/modencode/rgaspdg/ Gene map views: http://insects.eugenes.org/cgi-bin/gbrowsenew/gbrowse/drosmel5dg/ Notes: Evaluation: see dgg_rnaseq_eval.txt * processing can be done per chromosome sublocation. Multi-chrom processing hasn't been tested fully yet, may want some bug fixes. * One DrosMel chromosome (25Mb) is handled in 2 GB memory generally. Steps 2,3 are most memory intensive. Lots of reads reduce to approx. constant data size/chrom in step 2. * steps 3 and 4 still make mistakes that can be resolved. I.e. the data is better than this result. * should work for any euk. genome data. Some parameters likely need adjusting, mostly huge span settings that reduce data errors (max intron size, etc.). =====================================================================================