DroSpeGe About Arthropods BLAST BioMart Maps Data News

Finding Genes in Drosophila RNA-Seq data

This document describes new software for gene (transcript) assembly from RNA-Seq data, for the rGASP challenge. Results include a Drosophila melanogaster gene assembly set, observations, genome maps of RNA-Seq signals and the gene assembly, and an evaluation comparing EST assemblies.

Don Gilbert, 2009 August, gilbertd at indiana.edu

RNA Seq assembly to genes for rGASP

Results in brief
Map views include reference gene models, PASA EST assemblies, this work's RNA-seq assemblies, RNA-seq read signal graphs for all data and each of 4 cell lines, mate pair coverage graphs, and the log ratio (Mate/Read).

Methods and Results

The method used shares aspects of EST assembly, starting with ESTs mapped to a genome, and run-length encoding calculations of transcription fragments in use for NGS data (reads and tiling arrays).

Brief recipe for RNA-Seq assembly

  1. Map reads to genome with Tophat and Bowtie.
  2. Process accepted_hits.sam to a location count table for mates, reads, introns and strands.
  3. Produce rough gene models from this table using run-length encoding of signals. This similar to, a superset of, transcript fragment calculations. Besides expanding exon/transcript fragments from read signal, it uses intron splices, mate coverage and mate pairings to determine strand, introns and gene ends: complete if rough gene models.
  4. Refine rough models, produce alternate transcripts, and resolve problems.
  5. The software in Perl is relatively simple and correctable.
Improvements can be made in steps 3 and 4, including fuller use of mate pair linkages, better interpretation of mate/read signal changes, finer resolution of exon end points, etc. This recipe uses no signals interpreted from genome sequence, such as TATA, splice sites, etc., that would improve the result. This perhaps is better left to a program already designed for such.

Results of RNA-Seq assembly for DrosMel

Sample gene map of results at 3L:199100..209400 dm3L200k-macovsup118.png Find at ftp://eugenes.org/eugenes/rgasp/ the following results:
  1. soft/ The software developed and used for this, aside from Tophat and Bowtie.
  2. dmel/readmap/ Step 1.
    The reads mapped by bowtie and tophat, including accepted_hits.sam for each sample group.
    dmel_kc167_readmap2.tar for cell line Kc167
    dmel_kcms_readmap3.tar for cell lines CME, ML, S2,
    dmel_kcms_readmap3l.tar for cell S2 single long-reads only (updated after 1st run)
  3. dmel/matecov39/ Step 2 - 4.
    Output combined for all lines (dmAll) and Per cell line (dmCME, dmKc167, dmML, dmS2) with
    matecov39.merg3.all.tab.gz (step 2 table)
    matecov39.merg3.rogenes.*.gff.gz (step 3 rough genes)
    matecov39.merg3.figenes.*.gff.gz (step 4 fine genes)
  4. dmel/gtf4rgasp/ GTF format as specified for rGASP details with one file per cell line and dmAll, from the above matecov39.merg3.figenes.*.gff
The file dmel/matecov39/th.matecov39.info contains counts of reads, mates, stranded, errors from Step 2 tables. This is the summary per cell line:

  • All: nr=214,300,805; ni=39, avelen=39, farmate=1,836,849 (0.00857), mated=134,908,497 (0.63), mismate=119,680,942 (0.558), na=413,953,537 (1.93), poor=39,471,276 (0.184), stranded=4,618,360 (0.0216),
  • dmCME: nr=25,047,760; ni=5, avelen=37, farmate=284,143 (0.0113), mated=19,557,742 (0.781), mismate=34,091,560 (1.36), na=56,144,412 (2.24), poor=7,408,476 (0.296), stranded=580,675 (0.0232),
  • dmKc167: nr=71,350,851; ni=12, avelen=36, farmate=344,505 (0.00483), mated=17,906,104 (0.251), mismate=23,757,345 (0.333), na=124,022,861 (1.74), poor=9,721,733 (0.136), stranded=1,130,010 (0.0158),
  • dmML: nr=23,487,198; ni=6, avelen=37, farmate=328,651 (0.014), mated=27,902,922 (1.19), mismate=38,088,029 (1.62), na=68,272,911 (2.91), poor=5,734,540 (0.244), stranded=669,985 (0.0285),
  • dmS2pe: nr=51,511,021; ni=9, avelen=37, farmate=879,550 (0.0171), mated=69,541,729 (1.35), mismate=23,744,008 (0.461), na=98,524,849 (1.91), poor=7,910,580 (0.154), stranded=1,643,909 (0.0319),
  • dmS2sr: nr=42,903,975; ni=7, avelen=53, farmate=0(0), mated=0(0), mismate=0(0), na=66,988,504 (1.56), poor=8,695,947 (0.203), stranded=593,781 (0.0138),
Value = sum (per-reads) Keys: nr = number of reads/group, ni = number of replicate runs/group, na = number of accepted_hits alignments (tophat/bowtie -k 40 max alignments/read). mated = alignments with valid pairs, mismate = alignments with invalid mate, poor = under 97% identity (2+ mismatches for 37bp), stranded = spliced alignments, farmate = mates excluded by distance (25Kb in this run, should be increased greatly).


Exons and introns from RNA-Seq read coverage

The count of reads overlapping any base gives read coverage, a value that is high for expressed exons, low or zero over introns and intergenic/unexpressed spans. Simply tabulating spans of high read coverage gives approximate exons. Finding reads that map to a genome in pieces identifies intron splices, rather precisely when coverage is high and many reads break at the same place.

Gene ends where Mate coverage drops

In the quest to find genes from this new data, I wanted to satisfy myself whether it is ready to replace EST data. Until recently I would have said no, the ESTs still were still matching genes best. After racking my brains over the question of what ESTs have that short read RNA seq doesn't, a few weeks back I found an RNA seq signal that makes a big difference:

Mate coverage from paired end reads, versus read coverage, is a key signal. Mates cross introns but stop at the gene ends.

That is simple enough, what surprised me is how reliable this is as a gene-end signal. The signal graph of mate pair spans is high over the gene span, and drops at gene end points. This is especially helpful where genes are smooshed together, with overlapping UTRs (often on reverse strand), and the read coverage may be (equally) high across a span of several genes. In these cases often there is a drop in mate coverage at gene ends.

The simpe useful statistic is of this is the ratio of mate count / read count. This is roughly 1 over exons, much higher than 1 over introns (mates stay high while reads drop), and lower than 1 at gene ends (where mates stop). Also useful is log-mate-ratio = log((matecount+1)/(readcount+1)). The log-mate-ratio cut-offs I found that work well are +2 for intron, and -1 for gene end.

A small trick in calculating this: the mate span here is the inner span between mates, excluding their read length. This count will drop at read-length bases before gene end points. If you include the read span in mate count, their ratio tends to be 1 up to the gene end. Though you can still use the drop in mate count, the ratio isn't as helpful. Program matecovtab.pl counts this way in subroutine readSam().

In these examples, reference gene models are at top (blue), and RNA seq coverage graphs at bottom. Orange track "read_cover" is average read count for four cell lines, a usual expression signal graph. Above that, green track "mate_cover" is the paired mate count graph. The key differece from read cover is there are no drops in signal under introns. These green mountains cover entire gene spans, with drops at gene ends. The intermediate red track "mate_ratio" is the log ratio of these two, with high peaks under introns, and negative troughs at gene ends. The track "stranded RNA seq" at bottom is extra special data from strand-specific RNA seq, showing reverse strand expression below baseline.

dm3L1330k-matecovgene2.png dm3L1329k-matecovgene2.png dm3L1772k-matecovgene2.png

Intermediate tracks "rnaexon", "rnageneend", "rnaintron", "rnalintron", "rnanoscore" are simple run length spans derived from those two read_cover and mate_cover graphs. These match roughly, but well, the reference genes, including gene breaks at overlapped genes.

Software notes

The following Perl programs will produce these results. These are still in development, and may be best used by incorporating in other gene finding programs.

Step 2. matecovtab.pl will produce a mate coverage table from SAM format mapped reads, "accepted_hits.sam" from Tophat.

The table has genome location (chr, start, end) and counts for these signals:

mate_u   : mate coverage (short spans < average exon length, -nearmate=250 default) 
matefar_u: mate coverage over long introns (these can contain other genes)
pread_u  : paired read count 
sread_u  : single read count -- pread_u and sread_u should sum to tophat's coverage.wig table
strand_f : forward strand from intron splices
strand_r : reverse strand from intron splice 
introns  : intron location ID:count
pairs    : mate pair location ID:count 
The strand columns are from spliced reads, where strand is detected by intron splice orientation. These span reads up to splice points, and its mate read. Note one can find both forward and reverse strands in same base span. "read_u" denotes unstranded.

Step 2b. matecovmerge.pl merges 2 or more of these tables. It should be working perfectly, that is a merged table will be identical to results if all hits.sam are processed by matecovtab.pl. Useful for parallel processing and combining subsets.

Step 3. matecov2gene.pl reads that matecov.tab and generates rough gene models (exons, introns, gene ends and noscore) in GFF format.

Step 4. macovsup.pl refines the rough models from matecov2gene.pl

Developed at the Genome Informatics Lab of Indiana University Biology Department