Finding Genes in Drosophila RNA-Seq dataThis document describes new software for gene (transcript) assembly from RNA-Seq data, for the rGASP challenge. Results include a Drosophila melanogaster gene assembly set, observations, genome maps of RNA-Seq signals and the gene assembly, and an evaluation comparing EST assemblies.
Don Gilbert, 2009 August, gilbertd at indiana.edu
RNA Seq assembly to genes for rGASPResults in brief
Methods and ResultsThe method used shares aspects of EST assembly, starting with ESTs mapped to a genome, and run-length encoding calculations of transcription fragments in use for NGS data (reads and tiling arrays).
Brief recipe for RNA-Seq assembly
Results of RNA-Seq assembly for DrosMel
Exons and introns from RNA-Seq read coverageThe count of reads overlapping any base gives read coverage, a value that is high for expressed exons, low or zero over introns and intergenic/unexpressed spans. Simply tabulating spans of high read coverage gives approximate exons. Finding reads that map to a genome in pieces identifies intron splices, rather precisely when coverage is high and many reads break at the same place.
Gene ends where Mate coverage dropsIn the quest to find genes from this new data, I wanted to satisfy myself whether it is ready to replace EST data. Until recently I would have said no, the ESTs still were still matching genes best. After racking my brains over the question of what ESTs have that short read RNA seq doesn't, a few weeks back I found an RNA seq signal that makes a big difference:
Mate coverage from paired end reads, versus read coverage, is a key signal. Mates cross introns but stop at the gene ends.
That is simple enough, what surprised me is how reliable this is as a gene-end signal. The signal graph of mate pair spans is high over the gene span, and drops at gene end points. This is especially helpful where genes are smooshed together, with overlapping UTRs (often on reverse strand), and the read coverage may be (equally) high across a span of several genes. In these cases often there is a drop in mate coverage at gene ends.
The simpe useful statistic is of this is the ratio of mate count / read count. This is roughly 1 over exons, much higher than 1 over introns (mates stay high while reads drop), and lower than 1 at gene ends (where mates stop). Also useful is log-mate-ratio = log((matecount+1)/(readcount+1)). The log-mate-ratio cut-offs I found that work well are +2 for intron, and -1 for gene end.
A small trick in calculating this: the mate span here is the inner span between mates, excluding their read length. This count will drop at read-length bases before gene end points. If you include the read span in mate count, their ratio tends to be 1 up to the gene end. Though you can still use the drop in mate count, the ratio isn't as helpful. Program matecovtab.pl counts this way in subroutine readSam().
ExamplesIn these examples, reference gene models are at top (blue), and RNA seq coverage graphs at bottom. Orange track "read_cover" is average read count for four cell lines, a usual expression signal graph. Above that, green track "mate_cover" is the paired mate count graph. The key differece from read cover is there are no drops in signal under introns. These green mountains cover entire gene spans, with drops at gene ends. The intermediate red track "mate_ratio" is the log ratio of these two, with high peaks under introns, and negative troughs at gene ends. The track "stranded RNA seq" at bottom is extra special data from strand-specific RNA seq, showing reverse strand expression below baseline.
Intermediate tracks "rnaexon", "rnageneend", "rnaintron", "rnalintron", "rnanoscore" are simple run length spans derived from those two read_cover and mate_cover graphs. These match roughly, but well, the reference genes, including gene breaks at overlapped genes.
Software notesThe following Perl programs will produce these results. These are still in development, and may be best used by incorporating in other gene finding programs.
The table has genome location (chr, start, end) and counts for these signals:
mate_u : mate coverage (short spans < average exon length, -nearmate=250 default) matefar_u: mate coverage over long introns (these can contain other genes) pread_u : paired read count sread_u : single read count -- pread_u and sread_u should sum to tophat's coverage.wig table strand_f : forward strand from intron splices strand_r : reverse strand from intron splice introns : intron location ID:count pairs : mate pair location ID:countThe strand columns are from spliced reads, where strand is detected by intron splice orientation. These span reads up to splice points, and its mate read. Note one can find both forward and reverse strands in same base span. "read_u" denotes unstranded.
Step 2b. matecovmerge.pl merges 2 or more of these tables. It should be working perfectly, that is a merged table will be identical to results if all hits.sam are processed by matecovtab.pl. Useful for parallel processing and combining subsets.
Step 3. matecov2gene.pl reads that matecov.tab and generates rough gene models (exons, introns, gene ends and noscore) in GFF format.
Step 4. macovsup.pl refines the rough models from matecov2gene.pl