Dear aphid genomicists,
Best gene models can be chosen computationally, in a
replicable way that others can apply to get the same results
without special software. The result is much the same as an
expert looking individually will choose, by scoring models
with the primary gene evidence, and picking the highest
score per locus.
Here is the method I'm working on to choose the best model
per locus from among a few good gene sets (NCBI Refseq v2,
EviGene v2, ACYPI v1). It is based on using gene evidence
to score models, then choosing the highest scored model per
locus. It is simple in basics, the complexity are with
details.
There are two forms of gene evidence we have and can agree
on, protein homology and transcript expression from
EST/RNA-seq. I'm using simplified evidence scoring here to
make sure it can be explained clearly, and especially,
replicated by others of you without special software.
This is a subset of the methods I applied to create this
aphid2_evigene8f gene set.
Protein homology scoring is standard now, using blastp.
Expression scoring can be done in too many ways, with
differing results. We have noted that the aphid expression
assemblies from 4 methods give mostly 4 different results,
with some agreement but not complete agreement. Skip those
as uncertain.
I looked at mapping Rnaseq/EST to transcripts of each
gene set, which has certain values (independent of genome
assembly), but this doesn't work due to transcriptome size
discrepancies (read mapping to identical models is
different).
The expression score that I settled on is precise and
understandable: the span of perfect mate paired RNA-seq
reads that map to in each model, minus perfect mate mistakes
(transcript is missing one of a mapped pair, ..). Using perfect pairs
ensures that gene-model spans are validated, i.e. exon parts
must be paired together to have a high score, and spurious
exon models have no or minus scores.
Don Gilbert, 2011 July, gilbertd at indiana edu
-----------------------------------------------------------
Steps:
1. For valid homology, blastp protein models x Uniprot arthropod genes (may 2011)
+ human + 3 bacteria, using e-value-cut = 1e-5, and take
gene score = highest bitscore.
2. For valid expression, use perfect pairs from aligned reads in aphidpe_*.bam (5 expts)
- score exons for perfect mate pairs (in exon and b/n exons)
- negative score for mate errors: dangles (1 of pair) and wrong strands
- score is read span (bases) covered by perfect or error pairs,
i.e. expression intensity (n.reads/base) doesn't count
gene score = sum over exons (perfect pairs - pair errors)
See program overjoinspan.pl
Each gene model has 2 numeric scores, homology (bitscore) and
perfect mate expression (bases spanned)
3. Gene models are assigned as same locus or different using CDS_exon
location overlap, with equvalence scores of Identical (all same exons +/- 8 bp),
CDS_Same (>90% identical over CDS exons), or partial equivalence. Partial equiv.
includes things like 2 genes from one set span 1 gene of another set.
Percent exon overlap is given.
For 20856 NCBI Refseq x Evigene shared loci, a median of 78% of bases are equivalent for
overlapping models, 16663 have >90% same CDS. For 25309 shared ACYPI1 and Evigene loci,
the median is 46 % of bases, 9850 have >90% same CDS.
See program overgenedup.pl
4. Choose best model per locus as highest joint score (homology,expression),
a large difference in one score outweights small difference of other score.
Categories of best are modelA, modelB, same score , discrepancy (scores disagree)
Evigene x NCBI Refseq, valid loci with both (including alttr) n=9303
(preliminary)
Best model, homology only:
804 Evigene
1655 NCBI Refseq
6844 same score
Best model, homology + expression:
621 discrepancy in scores
3062 Evigene
3281 NCBI Refseq
2339 same
Note there are curated, split-scaffold genes that don't
score properly, in these:
~90 in evigene from ACYPI1 chimeric models
~15 in NCBI Refseq
Example best models (h=homology, x=expression):
-- 1st 10 Evigene, homology only:
acyp2eg0000091t1/C80 h=1858 x=7294 XM_003239995.1 h=1818 x=8026 0
acyp2eg0000130t1/67 h=261 x=650 XM_003240025.1 h=208 x=650 0
acyp2eg0000162t1/79 h=935 x=4683 XM_001952313.2 h=583 x=4683 0
acyp2eg0000168t1/C65 h=242 x=1139 XM_003240008.1/72 h=210 x=1077 0
acyp2eg0000172t1/60 h=266 x=1540 XM_003240010.1 h=146 x=1772 0
acyp2eg0000188t1 h=5360 x=15705 XM_001944527.2/90 h=4587 x=15600 0
acyp2eg0000188t2/C94 h=4563 x=15643 XM_003240012.1 h=3622 x=15286 0
acyp2eg0000287t1/45 h=154 x=447 XM_003240039.1 h=104 x=474 0
acyp2eg0000296t1/42 h=90.9 x=679 XM_003240077.1 h=61.6 x=303 0
acyp2eg0000301t1/C66 h=753 x=1364 XM_001951040.2 h=742 x=911 0
-- 1st 10 NCBI Refseq, homology only
acyp2eg0000028t2 h=559 x=1294 XM_003239985.1/C97 h=575 x=1423 1
acyp2eg0000092t1/C96 h=197 x=1038 NM_001162447.1 h=227 x=995 1
acyp2eg0000101t1/12 h=76.6 x=285 XM_003240022.1 h=112 x=766 1
acyp2eg0000122t1 h=90.1 x=94 XM_003240023.1/26 h=198 x=462 1
acyp2eg0000123t1/72 h=155 x=369 XM_003240023.1 h=198 x=462 1
acyp2eg0000135t1 h=67.0 x=43 XM_001947358.2/6 h=710 x=304 1
# ^ locus has acyp2eg0037010t1, h=946, better than XM_001947358 ; drop acyp2eg0000135t1
acyp2eg0000138t1 h=85.5 x=168 XM_001947258.2/24 h=578 x=488 1
acyp2eg0000139t1/70 h=498 x=342 XM_001947258.2 h=578 x=488 1
acyp2eg0000155t2 h=434 x=2874 XM_003240004.1/90 h=451 x=2868 1
acyp2eg0000160t1/13 h=129 x=235 XM_001952271.2/9 h=162 x=2908 1
# ^comp bug, no exon equivalence for acyp2eg0000160t1 and XM_001952271;
# .. acyp2eg0000160t1 is intronic to XM_001952271, but overlaps ACYPI50442-RA
# .. bug is due to 3way join of paired equivalences: ACYPI50442 = XM_001952271,
# and ACYPI50442 = acyp2eg0000160t1 thru diff. exons
However alt-transcripts of above give same Evigene/NCBI scores:
acyp2eg0000028t1/C98 h=575 x=1294 XM_003239985.1 h=575 x=1423 3
acyp2eg0000155t1/I100 h=451 x=2868 XM_003240004.1 h=451 x=2868 3
There are some large discrepancies:
acyp2eg0000188t1 vs XM_001944527.2 for fragile site-associated gene,
cover same exons with alternates, but NCBI models have lost 500 aminos
due to some computational or assembly mistake. Similarly with
acyp2eg0000162t1 vs XM_001952313.2, Carboxypeptidase D gene, missing 500 aa
but same exons shifted to UTR.
The /number on ID is percent equivalence score, with C=CDS_same, I=identical.
View on gene maps at
http://server7.eugenes.org:8091/gbrowse/cgi-bin/gbrowse/aphid2/?name=acyp2eg0000188t1
#----------------------------------------------------------------
2 example score expression paired gene models:
compare3_genes.gff (including alt-transcripts):
acyr1-ACYPImRNA.gmap.gff : n mrna=35722
acyr2_ncbirefgene.gff : n mrna=20402
aphid2_evigene8f.gff : n mrna=41559
allgenes=genes/compare3_genes.gff
bamset=bams/aphidpe_*.bam
overjoinspan.pl -over $allgenes -format sam -strand $bamset \
> compare3_genes.pespan.gff
3 find equivalent models per locus example:
cat ../acyr2_gnomonxref.gff ../acyr2_ncbirefgene.gff | sed 's/^#a.//;' | grep -v '^#' | \
$evigene/scripts/overgenedup.pl -in ../acyr1-ACYPImRNA.an7.gff -over stdin \
-slopexon=8 -wrongmrna -type similar -act markid -mark overg | grep mRNA | perl -ne \
'($g,$ov)= m/(?:ID|overg)=([^;\s]+)/g; $ov||="none"; print join("\t",$g,$og,$ov),"\n";'\
> equal/acypi1.overncbi2.tab2
4 choose best example:
cut -f2,3 compare3.overid.tab8a | egrep ' [XN][MR]' | perl -ne' chomp; @g=split"\t"; $dd=""; \
$hno=$xno=$i=0; foreach (@g) { ($d)= m|^([^/;,]+)|; $dd.=$d; do{ s/;/;h=0;/; $hno++;} unless(/;h=/); \
$xno++ unless(/;x=\d/); s/;/\t/g; s/,\S+//g; ($h[$i])=m/h=(\d+)/; ($x[$i])=m/x=(\d+)/; $i++; } \
$im=$xm=$hm=0; for $i (0,1){ $x=$x[$i]; $h=$h[$i]; $XXX=$x; if($i==0 or ($x>$xm and $h>$hm)) \
{ $im=$i; $xm=$x; $hm=$h; } elsif(abs($x-$xm)<9 and abs($h-$hm) < 5) { $im=3; }\
elsif($h > $hm and $x+20 > $xm) { $im=$i; $hm=$h;} elsif($x > $xm and $h+3 > $hm) \
{ $im=$i; $xm=$x; } elsif($x < $xm and $h > $hm) { $im=-1; } } $v=join"\t",@g; print "$v\t$im\n" unless($hno or $xno or $did{$dd}++); '\
| cut -f7 | sort | uniq -c
621 -1 discrepancy in scores
3062 0 Evigene
3281 1 NCBI Refseq
2339 3 same
#...................................
http://arthropods.eugenes.org/genes2/pea_aphid2/docs/aphid2_bestgene-compute.txt