See below a brief description of these data files for
Nasonia vit. NCBI Gnomon predictions, not in RefSeq or Glean6 gene sets
-- Don Gilbert
Name Last modified Size Description
notglean_gnomon.blasttab 05-Dec-2008 12:44 482k
notglean_gnomon.hmm 12-Nov-2008 18:50 206k
notglean_gnomon_est.hmm 05-Dec-2008 15:05 6k
notglean_gnomon_evch-phylo_dist.txt 15-Nov-2008 17:30 18k
notglean_gnomon_evch.hmm 05-Dec-2008 15:06 27k
notglean_gnomon_notevchormcl.hmm 05-Dec-2008 15:07 75k
notglean_gnomon_ortholog.e15.hmm 05-Dec-2008 15:07 8k
notglean_good.gene.gff 05-Dec-2008 16:46 537k
notglean_good_evidence.tab 05-Dec-2008 17:10 294k
notglean_omclgn2sum.hmm 05-Dec-2008 15:05 244k
notglean_tegene.hmm 05-Dec-2008 15:05 61k
notglean_wellknown.gnomon.gff 12-Nov-2008 20:30 9k
pasa_nasv.novelgenes.gff 05-Dec-2008 17:40 303k
pasa_nasv.novelgenes.protein.fa 05-Dec-2008 17:34 122k
pasa_nasv.novelgenes.transcript.fa 05-Dec-2008 17:37 289k
pasa_nasv.status.gno.tab 05-Dec-2008 14:00 721k
work/ 05-Dec-2008 17:08 -
08Dec05 update:
notglean_good_evidence.tab : table of additional gene predictions with evidence
notglean_good.gene.gff : NCBI Gnomon mRNA GFF lines with added evidence annotation
These include 2271 genes with protein homology and/or EST evidence,
but lacking Transposon (TEgene) annotation from Chris Smith's analysis,
out of 8906 total gene predictions not in RefSeq + Glean6 sets (file notglean_gnomon.hmm)
notglean_good_evidence.tab table lists Gnomon_ID, Evidence flags as EST, Homology,
Ortholog gene IDs with blast score, and Arthropod gene cluster id.
notglean_good.gene.gff has this same information pasted into attributes of Gnomon GFF
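A minimal sketch of how evidence could be pasted into GFF attributes, in Python. It assumes GFF3-style key=value attributes with an ID field on mRNA lines; the attribute names (evidence, ortholog, cluster), the gene IDs and the function name are hypothetical, not the actual layout of notglean_good.gene.gff.

```python
def add_evidence_to_gff(gff_lines, evidence):
    """Append evidence key=value pairs to column 9 of matching mRNA lines."""
    out = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "mRNA":
            # Parse the attribute column into a dict to find the gene ID.
            attrs = dict(
                field.split("=", 1) for field in cols[8].split(";") if "=" in field
            )
            extra = evidence.get(attrs.get("ID"))
            if extra:
                cols[8] = cols[8].rstrip(";") + ";" + ";".join(
                    f"{key}={value}" for key, value in extra.items()
                )
        out.append("\t".join(cols))
    return out


# Toy input: two mRNA lines, evidence for only the first (hypothetical IDs).
gff = [
    "SCAFFOLD1\tgnomon\tmRNA\t100\t900\t.\t+\t.\tID=hmm1001",
    "SCAFFOLD1\tgnomon\tmRNA\t2000\t2900\t.\t-\t.\tID=hmm1002",
]
evidence = {
    "hmm1001": {
        "evidence": "EST,Homology",
        "ortholog": "apime_g9/250",
        "cluster": "ARP1_G548",
    }
}
annotated = add_evidence_to_gff(gff, evidence)
```

Lines without a matching evidence entry pass through unchanged.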
==================== Background =================
Exon data:
% wc -l *exons
90614 nasonia_glean6.exons
119571 nasv_pred_gnomon.exons
33644 notglean_gnomon.exons < subset with no glean6 gene overlap
(removed all exons of any Gnomon gene with an exon overlapping a glean6 gene)
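A minimal sketch of this overlap filter, in Python, assuming the removal is done per gene (a Gnomon gene is dropped when any of its exons overlaps a glean6 gene). The tuple layouts, IDs and function names are illustrative only, not the format of the .exons files.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def split_by_glean_overlap(gnomon_exons, glean_genes):
    """Return (has_glean_ids, not_glean_ids) for Gnomon genes.

    gnomon_exons: iterable of (gene_id, scaffold, start, end)
    glean_genes:  iterable of (scaffold, start, end)
    """
    spans = defaultdict(list)
    for scaffold, start, end in glean_genes:
        spans[scaffold].append((start, end))

    all_ids, hit_ids = set(), set()
    for gene_id, scaffold, start, end in gnomon_exons:
        all_ids.add(gene_id)
        # One overlapping exon is enough to flag the whole gene.
        if any(overlaps(start, end, s, e) for s, e in spans[scaffold]):
            hit_ids.add(gene_id)
    return hit_ids, all_ids - hit_ids


# Toy data (hypothetical IDs and coordinates).
glean = [("SCAFFOLD1", 100, 500)]
exons = [
    ("hmm1", "SCAFFOLD1", 50, 120),    # overlaps the glean6 gene
    ("hmm1", "SCAFFOLD1", 600, 700),   # same gene, no overlap
    ("hmm2", "SCAFFOLD1", 800, 900),
    ("hmm3", "SCAFFOLD2", 100, 500),   # same coordinates, other scaffold
]
has_glean, not_glean = split_by_glean_overlap(exons, glean)
```

The linear scan over spans is fine for a sketch; a real run over ~120k exons would more likely sort the intervals or use an interval tree.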
Gene ID lists from exons:
% wc -l *gnomon.gids
17386 hasglean_gnomon.gids < overlap glean6
9665 notglean_gnomon.gids < no overlap
RefSeq genes in above:
% grep -c '^LOC' *gnomon.gids
hasglean_gnomon.gids:8395 < RefSeq genes in glean6
notglean_gnomon.gids:759 < RefSeq genes not in glean6
Remainder: Gnomon predictions not in Glean6 or RefSeq:
8906 notglean_notrefseq_gnomon.gids
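The remainder follows from the two counts above: 9665 - 759 = 8906. A small Python sketch of the same split, treating IDs that start with LOC as RefSeq (as the grep above does); the example IDs and the function name are made up for illustration.

```python
def split_refseq(gene_ids):
    """Separate IDs starting with 'LOC' (RefSeq) from all other IDs."""
    refseq = [g for g in gene_ids if g.startswith("LOC")]
    other = [g for g in gene_ids if not g.startswith("LOC")]
    return refseq, other


# Counts reported in this README.
notglean_total = 9665    # notglean_gnomon.gids
notglean_refseq = 759    # of which RefSeq (LOC) genes
remainder = notglean_total - notglean_refseq

# Toy ID list (hypothetical IDs).
ids = ["LOC100113456", "hmm10234", "hmm20001", "LOC100120000"]
refseq_ids, other_ids = split_refseq(ids)
```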
Orthology from BlastP against 12 arthropod (10 insect) proteomes:
8055 notglean_notrefseq_gnomon match some other gene (same or different species),
using BlastP evalue <= 1e-5 reciprocal matches from arthropod orthology
242 notglean_notrefseq_gnomon match another species, at evalue <= 1e-15
= file: notglean_gnomon_ortholog.e15.hmm
4920 notglean_notrefseq_gnomon match other Nasonia gene at evalue <= 1e-15;
many/most of these seem to be transposon genes (they form the largest clusters, of 100s)
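A rough Python sketch of sorting genes into these two categories from tabular BlastP output. It assumes the standard 12-column tabular layout (e-value in column 11) and, purely for illustration, that the species can be read from an ID prefix such as nasvi_; it does not check that matches are reciprocal.

```python
def parse_blasttab(lines):
    """Yield (query, subject, evalue) from 12-column tabular BLAST output."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        yield cols[0], cols[1], float(cols[10])


def classify_hits(rows, species_of, query_species, max_evalue=1e-15):
    """Return (cross_species, same_species_only) sets of query gene IDs."""
    cross, same = set(), set()
    for query, subject, evalue in rows:
        if query == subject or evalue > max_evalue:
            continue  # skip self hits and weak hits
        if species_of(subject) != query_species:
            cross.add(query)
        else:
            same.add(query)
    # A gene with any cross-species hit is not counted as same-species only.
    return cross, same - cross


# Toy input (hypothetical IDs; species prefix before the underscore).
lines = [
    "nasvi_g1\tapime_g9\t80.0\t200\t40\t0\t1\t200\t1\t200\t1e-50\t180",
    "nasvi_g2\tnasvi_g3\t75.0\t150\t37\t0\t1\t150\t1\t150\t1e-30\t120",
    "nasvi_g2\tdrome_g5\t40.0\t100\t60\t0\t1\t100\t1\t100\t1e-6\t50",
    "nasvi_g4\ttrica_g7\t42.0\t110\t63\t0\t1\t110\t1\t110\t1e-8\t55",
]
cross, same_only = classify_hits(
    parse_blasttab(lines), lambda gid: gid.split("_")[0], "nasvi"
)
```

Here nasvi_g4 lands in neither set, since its only hit is weaker than 1e-15.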
Arthropod gene cluster descriptions for notglean_gnomon_ortholog
file: notglean_omclgn2sum.e15.arpdesc
Count of these nas-notglean genes with insect orthologs:
Nas genes   No. taxa
85 ntaxa: 2
33 ntaxa: 3
22 ntaxa: 4
13 ntaxa: 5
9 ntaxa: 6
4 ntaxa: 7
1 ntaxa: 8
4 ntaxa: 9
9 ntaxa: 10
11 ntaxa: 11 | These 50 are well-known insect/arthropod orthologs.
19 ntaxa: 12 | They should be in RefSeq but are not.
21 ntaxa: 13 | 36 are called Pseudogene; an expert should look at these, as some could be non-pseudo.
file: notglean_wellknown.gnomon.gff
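A tally like the one above could be produced along these lines. This is a Python sketch, assuming clusters are available as a mapping from cluster ID to member gene IDs and that the species is encoded as an ID prefix; both are assumptions for illustration, as are the cluster and gene names.

```python
from collections import Counter


def taxa_histogram(clusters, genes_of_interest, species_of):
    """Count genes of interest by number of distinct taxa in their cluster."""
    hist = Counter()
    for members in clusters.values():
        ntaxa = len({species_of(m) for m in members})
        n_genes = sum(1 for m in members if m in genes_of_interest)
        if n_genes:
            hist[ntaxa] += n_genes
    return hist


# Toy clusters (hypothetical cluster and gene IDs).
clusters = {
    "ARP_X1": ["nasvi_g1", "apime_g9"],
    "ARP_X2": ["nasvi_g2", "nasvi_g3", "apime_g1", "drome_g2"],
    "ARP_X3": ["nasvi_g5", "nasvi_g6"],
}
notglean = {"nasvi_g1", "nasvi_g2", "nasvi_g3", "nasvi_g5"}
hist = taxa_histogram(clusters, notglean, lambda gid: gid.split("_")[0])
```

Note that ntaxa here counts the gene's own species too, so a Nasonia-only cluster has ntaxa 1.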
-----------------------
Jack,
The count by your criteria is lower than what I find when I look with other criteria,
but you win this bet: 242 of these are missed orthologous genes.
Find here these gene lists and supporting data (let me know what is unclear):
http://insects.eugenes.org/arthropods/data/nasonia/notglean/
This is the Gnomon gene file for the not-glean-not-refseq with orthologs, evalue < e-15:
notglean_ortholog.e15.gnomon.gff
About 50 of these are valuable 1-1 orthologs across 10 other insects, genes
you don't want to miss in Nasonia. However, Gnomon calls 36 of these
pseudogenes. An expert should look at them, as Gnomon and others can mistake
the end of a scaffold or an NNN error for a pseudogene.
notglean_wellknown.gnomon.gff
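One quick way to shortlist pseudogene calls worth a second look is to flag gene models that sit next to a scaffold end or a run of Ns. A minimal Python sketch, assuming 1-based inclusive gene coordinates and an arbitrary 100-base margin (both are illustrative choices, not anything Gnomon uses):

```python
def near_gap_or_end(start, end, scaffold_seq, margin=100):
    """True if the gene span lies within `margin` bases of a scaffold end or of an N."""
    lo = max(0, start - 1 - margin)
    hi = min(len(scaffold_seq), end + margin)
    touches_end = lo == 0 or hi == len(scaffold_seq)
    window = scaffold_seq[lo:hi].upper()
    return touches_end or "N" in window


# Toy scaffold: 1000 bases, a 50 N gap, then 1000 more bases.
scaffold = "A" * 1000 + "N" * 50 + "A" * 1000

clean = near_gap_or_end(500, 900, scaffold)       # well away from gap and ends
by_gap = near_gap_or_end(500, 950, scaffold)      # window reaches the N run
by_start = near_gap_or_end(50, 300, scaffold)     # close to the scaffold start
```

A flagged gene is only a candidate for manual review, not evidence that the pseudogene call is wrong.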
This is a table of orthologs with my ARP ID. Some have useful
descriptions, others no description or hypothetical protein:
notglean_omclgn2sum.e15.arpdesc
This table summarizes all the blastp matches for these as gene pairs with e-values:
notglean_omclgn2sum.tab
The 242 misses are a small enough count that you can work them into your current gene set
w/o much effort. The majority of those that turn up as matching other genes
are matching other Nasonia genes, presumably many of them transposons;
about 5000 of the 8000 non-glean set are of this variety. Some, possibly many, are likely real wasp
genes with paralogs.
I can get a higher count of possible orthologs among these notglean predictions:
many of the Nasonia genes that have significant paralogs but not themselves cross-species matches
fall in orthology clusters with other species. There are about 1000 notglean genes with
this possible orthology.
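As a sketch of that selection in Python: take the genes that only match other Nasonia genes by blastp, and keep those whose cluster also holds a gene from another species. The cluster mapping, the ID prefixes and the names below are assumptions for illustration, loosely shaped like a cluster with one Apis gene and several Nasonia genes.

```python
def cluster_supported(paralog_only_genes, clusters, species_of, own_species):
    """Return paralog-only genes whose cluster includes another species."""
    supported = set()
    for members in clusters.values():
        if any(species_of(m) != own_species for m in members):
            supported.update(m for m in members if m in paralog_only_genes)
    return supported


# Toy clusters (hypothetical cluster and gene IDs).
clusters = {
    "ARP_X10": ["apime_g1", "nasvi_g1", "nasvi_g2", "nasvi_g3"],
    "ARP_X11": ["nasvi_g7", "nasvi_g8"],   # Nasonia-only cluster
}
paralog_only = {"nasvi_g2", "nasvi_g3", "nasvi_g7"}
possible = cluster_supported(
    paralog_only, clusters, lambda gid: gid.split("_")[0], "nasvi"
)
```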
The OrthoMCL clustering says there is common homology here.
You may or may not agree with that, but it gives a basis for thinking these may be
real but derived genes. Just to pick one at random, ARP1_G548 "Odorant receptor 30aCG13106-PA;"
is a cluster of one Apis gene and 23 Nasonia genes, 3 of which are in this Nasonia-only
blastp matching category. Which of the 23 listed here would seem false positives?
http://insects.eugenes.org/genepage/arthropod/ARP1_G548
You can read more about use of OrthoMCL for detecting orthology here:
OrthoMCL: http://www.orthomcl.org/
Li Li, Christian J. Stoeckert, Jr., and David S. Roos
OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes. Genome Res. 2003 13: 2178-2189.
Feng Chen, Aaron J. Mackey, Jeroen K. Vermunt, and David S. Roos
Assessing Performance of Orthology Detection Strategies Applied to Eukaryotic Genomes.
PLoS ONE 2007 2(4): e383.
I applied the methods as described in these papers.
- Don