Index of /species/data/dsec/gff
Name                         Last modified       Size   Description
Parent Directory             13-Nov-2006 12:28   -
dsec-dmel-algn.gff.gz        11-Nov-2005 15:13   520k
dsec-dmel-dna.gff.gz         11-Nov-2005 15:10   13.2M
dsec-markers.gff.gz          10-Nov-2005 23:18   14k
dsec-prot9-strong2.gff.gz    10-Nov-2005 15:33   11.2M
dsec-prot9-weak2.gff.gz      09-Mar-2006 17:04   13.4M
dsec-scaffolds.gff           07-Nov-2005 18:10   3.1M
dsec_caf1_DGIL-snap.hmm      08-Feb-2006 01:27   45k
dsec_caf1_DGIL.aa.gz         02-May-2006 23:09   5.5M
dsec_caf1_DGIL.gff.gz        14-Aug-2006 12:59   1.7M
dsec_caf1_DGIL.tr.gz         02-May-2006 23:09   8.7M
dsec_caf1_DGIL_SNO.aa.gz     30-May-2006 11:54   5.3M
dsec_caf1_DGIL_SNO.gff.gz    13-Aug-2006 22:15   1.8M
dsec_caf1_DGIL_SNO.hmm       22-May-2006 22:10   47k
dsec_caf1_DGIL_SNO.tr.gz     30-May-2006 11:54   8.5M
dsecprot9-hsp.gff.gz         16-May-2006 14:00   3.0M
gff.toplevel                 07-Nov-2005 18:10   3.1M
Drosophila species genomes URL: ftp://eugenes.org/eugenes/genomes/caf1a/
for drosophila species GFF annotations, CAF1 recommended reformatting
http://rana.lbl.gov/drosophila/wiki/index.php/Annotation_Coordination
don gilbert, apr06, gilbertd@indiana.edu
File sets for twelve drosophila species (dana .. dyak):
1. SNAP gene predictions (no homology guidance; pure ab-initio; DGIL_SNP is source tag)
dana_caf1_DGIL.gff.gz (GFF features), .aa.gz (FastA predicted proteins), .tr.gz (Fasta transcripts)
dana_caf1_DGIL.gff.md5 is MD5 checksum of uncompressed file
2. Model organism Protein matches (Dmel and eight other proteomes; see below notes)
danaprot9-hsp.gff.gz ... prot9-hsp.gff.gz
3. SNAP gene predictions with Dmel protein homology guidance (DGIL_SNO is source tag)
The SNO set is of better quality than the SNP set in (1), as measured by sensitivity & specificity statistics.
dana_caf1_DGIL_SNO.gff.gz (GFF features), .aa.gz (FastA predicted proteins), .tr.gz (FastA transcripts)
4. D.melanogaster new gene and exon predictions, outside of and inside of known genes
that have been validated with a high synonymous substitution rate (low Ka_Ks)
in other Drosophila species. See http://insects.eugenes.org/species/news/newgenes-dmel/
all_caf1_DGIL_TEX.gff
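The .gff.md5 files in set (1) hold the MD5 checksum of the uncompressed GFF, so a downloaded .gff.gz has to be decompressed before hashing. A minimal Python sketch of that check, assuming the .gz file is already on local disk. The function name md5_of_uncompressed is illustrative, and the layout of the .md5 file itself is not described here, so the final comparison is left to the reader.

```python
import gzip
import hashlib


def md5_of_uncompressed(gz_path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the decompressed contents of a .gz file."""
    digest = hashlib.md5()
    # Stream the decompressed bytes in chunks so large GFF files fit in memory.
    with gzip.open(gz_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, md5_of_uncompressed("dana_caf1_DGIL.gff.gz") gives the value to compare against the checksum stored in dana_caf1_DGIL.gff.md5.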
NOTES:
* 23 Aug 06: added all_caf1_DGIL_TEX.gff, new Dmel gene predictions
For each predicted, phylogenetically validated exon (texon1..9999, outside of known dmel genes),
the location in each species is given. The "all_" prefix refers to Drosophila_melanogaster, D_sechellia,
D_yakuba, D_erecta, D_ananassae, D_pseudoobscura, D_mojavensis
* 14 Aug 06: added phase values to coding exons in GFF (column 8),
and attribute 'partial_gene=true' when it is a partial prediction.
Phase only matters for partial gene predictions when making the aa translation:
use the phase of the 1st exon (0,1,2) to offset where translation starts.
Partial gene predictions occur at the ends of scaffolds (where the predictor
ran out of bases). Dmel, Dsim, Dyak, Dpse have 0 to 10 partials, while
assemblies with 1000s of small scaffolds have 1-2K partials.
* 9 Jun 06: corrected Dpse chromosome names to CAF1 standard by adding 'Ch' prefix.
Note that the Dpse DGIL sets use the flybase standard for dpse, in which the "U" chromosome
is the collection of 'Unknown..' parts.
* 30 may 06: added SNO Snap + homology predictions (produces better gene models)
* 16 may 06: added model organism protein tblastn matches (see below), as *-hsp.gff.gz
* 14 may 06: Corrected the GFF for dana,dere,dgri,dmoj,dvir to use correct CAF1(b) scaffold
locations (spacer offset error). For further info see
http://insects.eugenes.org/species/news/oops-data-error.html
See the wrong-caf1/ folder for AGP files of the old and new CAF1 assemblies, and perl to
adjust GFF locations from old to new.
* 2 May 06: updated with FastA of predicted transcripts (.tr) and proteins (.aa)
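The 14 Aug 06 note above says the phase of the first exon sets where translation starts in a partial prediction. A minimal Python sketch of that offset, assuming the GFF3 reading of phase (the number of leading bases to skip before the first complete codon), the standard genetic code, and a coding sequence that has already been spliced in transcript order and reverse-complemented for minus-strand genes. The name translate_partial is illustrative and is not part of the SNAP output.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate_partial(cds, phase):
    """Translate a spliced coding sequence, skipping `phase` leading bases."""
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1 or 2")
    seq = cds.upper()[phase:]
    # Ignore a trailing incomplete codon, as can happen at a scaffold end.
    usable = len(seq) - len(seq) % 3
    # Codons with ambiguous bases (e.g. N) are reported as X.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))
```

With phase 0 the whole sequence is read from its first base; with phase 1 or 2 the first one or two bases belong to a codon cut off by the scaffold end and are dropped.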
Example GFF file format:
==> dana_caf1_DGIL.gff <==
##gff-version 3
#species: Drosophila_ananassae
#assembly-id: dana_caf1
#annotation-group-id: DGIL_SNP
#algorithm: DGIL_SNP = SNAP gene predictor, version 2005-07-27,
# : SNAP ref=http://www.biomedcentral.com/1471-2105/5/59/abstract
# : bootstrapped HMM predictor from Dmelanogaster.hmm on Drosophila_ananassae assembly dna
# : http://insects.eugenes.org/species/data/dana/gff/snap-dana_caf051209.hmm
#authors: gilbertd @ indiana.edu
#more-info: http://insects.eugenes.org/DroSpeGe/data/dana/
#date: 20060210
scaffold_1 DGIL_SNP gene 863 1326 . - . ID=GF_DGIL_SNP_28000001
scaffold_1 DGIL_SNP exon 1282 1326 10.913 - . Parent=GF_DGIL_SNP_28000001
scaffold_1 DGIL_SNP exon 1093 1137 14.326 - . Parent=GF_DGIL_SNP_28000001
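To show how the nine GFF columns and the ID/Parent attributes in the example fit together, here is a minimal Python sketch that parses feature lines and collects exons under their parent gene. It assumes whitespace-separated columns with no spaces inside the attribute column, as in the lines shown; the function names are illustrative.

```python
from collections import defaultdict


def parse_gff_line(line):
    """Split one GFF feature line into a dict; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    cols = line.split(None, 8)
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = cols
    # Attributes are key=value pairs separated by semicolons.
    attrs = dict(pair.split("=", 1) for pair in attr_text.split(";") if "=" in pair)
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attrs": attrs,
    }


def exons_by_gene(lines):
    """Map each Parent gene ID to its exon (start, end) pairs, sorted by start."""
    exons = defaultdict(list)
    for line in lines:
        feat = parse_gff_line(line)
        if feat and feat["type"] == "exon" and "Parent" in feat["attrs"]:
            exons[feat["attrs"]["Parent"]].append((feat["start"], feat["end"]))
    return {gene: sorted(spans) for gene, spans in exons.items()}
```

Applied to the three feature lines above, exons_by_gene groups both exons under GF_DGIL_SNP_28000001.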
#----------------------------------
Model organism protein blast matches (HSP groups):
Source proteomes are taken from model organism databases.
modDM = Drosophila melanogaster (flybase) proteome
modMM = Mus musculus (MGI) proteome
modCE = C. elegans (WormBase) proteome
modSC = Sacc. cer. (SGD) proteome
Protein tBLASTn matches are grouped by HSP overlap. Names are protein ID + HSP group.
ID Key: CG0000_G1 = primary Gene match (best hit), CG0000_G2 = secondary Gene match on
same scaffold, CG0000_S[3..n] = tertiary and further matches of same protein on same scaffold as _G1,
CG0000_o[1..n] = further matches on other scaffolds. Matches at p <= 1e-3 are collected,
secondary matches that mostly overlap better matches are removed. HSPs are grouped by both
protein fragment overlap (same part of protein) and target genome overlap. HSP groups
include several gene events: alternate splice exons in same gene, tandem and distant duplications,
new genes composed of parts of several other genes, as well as computational artifacts.
See /species/news/genome-summaries/gene-GO-function-association/
for further details and use in Gene Ontology groups. D. Gilbert, may 06
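The ID key above can be decoded mechanically. A minimal Python sketch, assuming the group suffix is always the part after the last underscore and takes one of the forms G1, G2, S<n> or o<n>; the kind labels and the function name are illustrative, not names used in the files.

```python
import re

# Suffix forms taken from the ID key: _G1, _G2, _S3.._Sn, _o1.._on
SUFFIX = re.compile(r"^(G1|G2|S\d+|o\d+)$")


def parse_hsp_group_id(group_id):
    """Split an HSP group ID into (protein_id, kind, rank)."""
    protein, sep, suffix = group_id.rpartition("_")
    if not sep or not SUFFIX.match(suffix):
        raise ValueError(f"not an HSP group ID: {group_id!r}")
    if suffix == "G1":
        kind = "primary"          # best hit
    elif suffix == "G2":
        kind = "secondary"        # second match on the same scaffold
    elif suffix[0] == "S":
        kind = "same_scaffold"    # tertiary and further, same scaffold as _G1
    else:
        kind = "other_scaffold"   # further matches on other scaffolds
    return protein, kind, int(suffix[1:])
```

Splitting on the last underscore keeps protein IDs that contain a colon, such as MGI:88491, intact.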
Example HSP GFF:
source field = MOD database; score = blast bitscore; HSP features carry only Parent= (no ID), whose value is the group ID as above.
tkey = protein target HSP group (location subset); tloc = protein target location; align = no. aa residues aligned
##gff-version 3
scaffold_13340 modMM HSP 13833673 13833942 79.0 - . Parent=MGI:88491_G1;tkey=MGI:88491-HSP:23-120;tloc=23-120;align=98
scaffold_13340 modDM HSP 11004286 11004747 230 + . Parent=CG8236_G1;tkey=CG8236-PA-HSP:1-153;tloc=1-153;align=154
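The HSP attributes can be pulled apart to compare matches on the protein side. A minimal Python sketch that reads Parent, tkey, tloc and align from the attribute column and counts the residues shared by two tloc ranges. The overlap threshold used to form HSP groups is not stated here, so protein_overlap only reports the shared length; both function names are illustrative.

```python
def hsp_attributes(attr_text):
    """Parse the attribute column of an HSP feature line."""
    attrs = dict(pair.split("=", 1) for pair in attr_text.split(";") if "=" in pair)
    start, end = attrs["tloc"].split("-")
    return {
        "Parent": attrs["Parent"],
        "tkey": attrs["tkey"],
        "tloc": (int(start), int(end)),   # 1-based inclusive range on the protein
        "align": int(attrs["align"]),     # number of aa residues aligned
    }


def protein_overlap(tloc_a, tloc_b):
    """Number of protein residues shared by two inclusive tloc ranges."""
    shared = min(tloc_a[1], tloc_b[1]) - max(tloc_a[0], tloc_b[0]) + 1
    return max(shared, 0)
```

Two HSPs of the same protein with a large protein_overlap cover the same part of that protein, which is one of the two criteria named above for grouping; the other is overlap of their genome locations.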
#-------------------------------
SNAP + homology guidance notes:
With Ian Korf's kind help, I've learned how SNAP can use protein homologies to
train and guide gene calls. This produces a much closer gene mapping where
there is homology, yet retains unique gene calls in non-homologous regions.
Use 'snap -xdef protein-hsp.zff' both for bootstrap training of the hmm and for prediction.
General prediction script:
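# csh script; $ZOE, $scd, $dpid, $dspp, $dptrain, $chrs and $chrtr are set elsewhere (not shown).
# $dptrain is assumed to be set to $dpid-train.
# -ACoding -0.001 -AStart -2 -AStop -2 is used to lower hmmtrain influence relative to hsp.zff
# for training only, not for prediction.
set hmmtrain=fly
echo "bootstrap snap $hmmtrain $dpid-train.fa .."

# start with an empty training fasta
/bin/cp /dev/null snaptrain/$dpid-train.fa

# per chromosome/scaffold: extract its HSPs, add its dna to the training set,
# and make bootstrap predictions guided by the HSPs
foreach chr ($chrs)
  cat $dspp-hsp.zff | perl -ne "print if(/$chr\b/);" > ! snapout/hsp-$chr.zff
  cat $scd/perchr/$chr.fa >> snaptrain/$dptrain.fa
  $ZOE/snap -quiet -xdef snapout/hsp-$chr.zff \
    -ACoding -0.001 -AStart -2 -AStop -2 \
    $hmmtrain $scd/perchr/$chr.fa > snaptrain/$dpid-$chr-$hmmtrain.zff
end

# collect the bootstrap predictions of the training chromosomes
cd snaptrain
cp /dev/null $dptrain.zff
foreach chr ($chrtr)
  cat $dpid-$chr-$hmmtrain.zff >> $dptrain.zff
end

# build the new hmm from the bootstrap predictions
$ZOE/fathom $dptrain.zff $dptrain.fa -categorize 1000
$ZOE/fathom uni.ann uni.dna -export 1000 -plus
$ZOE/forge export.ann export.dna
cd ../
$ZOE/hmm-assembler.pl -o ${dpid}-snapho snaptrain > snapho-$dpid.hmm

# predict genes per chromosome/scaffold with the new hmm, again guided by the HSPs
set hmm=snapho-$dpid.hmm
foreach chr ($chrs)
  $ZOE/snap -quiet -name 'snapho' -gff3 -xdef snapout/hsp-$chr.zff \
    -aa snapout/$chr.aa -tx snapout/$chr.tr \
    $hmm $scd/perchr/$chr.fa > snapout/$chr.gff
end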
#-----------------------------------------------------