
Index of /species/data/dsec/gff

      Name                       Last modified       Size  Description

[DIR] Parent Directory           13-Nov-2006 12:28      -
[   ] dsec-dmel-algn.gff.gz      11-Nov-2005 15:13   520k
[   ] dsec-dmel-dna.gff.gz       11-Nov-2005 15:10  13.2M
[   ] dsec-markers.gff.gz        10-Nov-2005 23:18    14k
[   ] dsec-prot9-strong2.gff.gz  10-Nov-2005 15:33  11.2M
[   ] dsec-prot9-weak2.gff.gz    09-Mar-2006 17:04  13.4M
[TXT] dsec-scaffolds.gff         07-Nov-2005 18:10   3.1M
[TXT] dsec_caf1_DGIL-snap.hmm    08-Feb-2006 01:27    45k
[   ] dsec_caf1_DGIL.aa.gz       02-May-2006 23:09   5.5M
[   ] dsec_caf1_DGIL.gff.gz      14-Aug-2006 12:59   1.7M
[   ] dsec_caf1_DGIL.tr.gz       02-May-2006 23:09   8.7M
[   ] dsec_caf1_DGIL_SNO.aa.gz   30-May-2006 11:54   5.3M
[   ] dsec_caf1_DGIL_SNO.gff.gz  13-Aug-2006 22:15   1.8M
[TXT] dsec_caf1_DGIL_SNO.hmm     22-May-2006 22:10    47k
[   ] dsec_caf1_DGIL_SNO.tr.gz   30-May-2006 11:54   8.5M
[   ] dsecprot9-hsp.gff.gz       16-May-2006 14:00   3.0M
[TXT] gff.toplevel               07-Nov-2005 18:10   3.1M



 Drosophila species genomes  URL: ftp://eugenes.org/eugenes/genomes/caf1a/
 Drosophila species GFF annotations, reformatted as recommended for CAF1; see
 http://rana.lbl.gov/drosophila/wiki/index.php/Annotation_Coordination

 don gilbert, apr06, gilbertd@indiana.edu

 File sets for twelve Drosophila species (dana .. dyak):
 1. SNAP gene predictions (pure ab initio, no homology guidance; source tag is DGIL_SNP)
    dana_caf1_DGIL.gff.gz (GFF features), .aa.gz (FASTA predicted proteins), .tr.gz (FASTA transcripts)
    dana_caf1_DGIL.gff.md5 is the MD5 checksum of the uncompressed file

 2. Model organism protein matches (Dmel and eight other proteomes; see notes below)
    danaprot9-hsp.gff.gz  ... prot9-hsp.gff.gz

 3. SNAP gene predictions with Dmel protein homology guidance (source tag is DGIL_SNO)
    By sensitivity and specificity statistics, this SNO version is of better quality than the SNP set (1).
    dana_caf1_DGIL_SNO.gff.gz (GFF features), .aa.gz (FASTA predicted proteins), .tr.gz (FASTA transcripts)

 4. New D. melanogaster gene and exon predictions, both outside and inside known genes,
    that have been validated by a high proportion of synonymous substitutions (low Ka/Ks)
    in other Drosophila species.  See http://insects.eugenes.org/species/news/newgenes-dmel/
    all_caf1_DGIL_TEX.gff
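
 Because the .md5 file is described as the checksum of the uncompressed GFF, a check
 has to hash the decompressed bytes rather than the .gz file itself. A minimal Python
 sketch follows. The function names and chunk size are illustrative, and the layout of
 the .md5 file (hex digest as the first whitespace-separated token, as md5sum writes it)
 is an assumption, not something this README states.

```python
import gzip
import hashlib


def md5_of_uncompressed(gz_path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the decompressed contents of a .gz file."""
    digest = hashlib.md5()
    with gzip.open(gz_path, "rb") as handle:
        # Read in chunks so large GFF files are never held fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(gz_path, md5_path):
    """Compare against the first token of the .md5 file (assumed md5sum-style layout)."""
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_uncompressed(gz_path) == expected
```

 For example, matches_md5_file("dana_caf1_DGIL.gff.gz", "dana_caf1_DGIL.gff.md5") would
 return True when the download is intact, provided the .md5 layout assumption holds.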

 NOTES:
 * 23 Aug 06: added all_caf1_DGIL_TEX.gff, new Dmel gene predictions.
    For each predicted, phylogenetically validated exon (texon1..9999, outside of known Dmel genes),
    the location in each species is given. The "all_" prefix refers to Drosophila_melanogaster, D_sechellia,
    D_yakuba, D_erecta, D_ananassae, D_pseudoobscura, D_mojavensis.

 * 14 Aug 06: added phase values to coding exons in GFF (column 8),
              and the attribute 'partial_gene=true' for partial predictions.
              Phase only matters for partial gene predictions when making the aa translation:
              use the phase of the 1st exon (0, 1, 2) as the offset at which to start translation.
              Partial gene predictions occur at the ends of scaffolds (where the predictor
              ran out of bases).  Dmel, Dsim, Dyak, Dpse have 0 to 10 partials, while
              assemblies with 1000s of small scaffolds have 1-2K partials.
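
 The phase rule in the 14 Aug 06 note can be sketched in Python as below. This is a
 minimal illustration, not the code used to build the .aa files. It assumes the coding
 sequence has already been assembled in transcript orientation (exons joined in order,
 reverse-complemented for minus-strand genes), that phase has the GFF3 meaning (number
 of bases to skip before the first complete codon), and that the standard genetic code
 applies. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_partial(cds, first_exon_phase=0):
    """Translate a coding sequence, skipping `first_exon_phase` leading bases."""
    if first_exon_phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1 or 2")
    seq = cds.upper()[first_exon_phase:]
    usable = len(seq) - len(seq) % 3  # drop a trailing incomplete codon
    # Codons containing anything other than A, C, G, T are reported as X.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))
```

 With phase 1, translate_partial("GATGGCC", 1) skips the leading G and reads ATG GCC.
 For complete gene predictions the phase of the first exon is 0 and nothing is skipped.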

 *  9 Jun 06: corrected Dpse chromosome names to the CAF1 standard by adding the 'Ch' prefix.
            Note that the Dpse DGIL sets use the FlyBase standard for Dpse, in which the "U" chromosome
            is the collection of 'Unknown..' parts.

 * 30 May 06:  added SNO (SNAP + homology) predictions, which produce better gene models

 * 16 May 06:  added model organism protein tblastn matches (see below), as *-hsp.gff.gz

 * 14 May 06: corrected the GFF for dana, dere, dgri, dmoj, dvir to use the correct CAF1(b) scaffold
    locations (spacer offset error). For further information see
    http://insects.eugenes.org/species/news/oops-data-error.html
    See the wrong-caf1/ folder for AGP files of the old and new CAF1 assemblies and a perl script to
    adjust GFF locations from old to new.

 * 2 May 06:  updated with FASTA files of predicted transcripts (.tr) and proteins (.aa)

Example GFF file format:
==> dana_caf1_DGIL.gff <==
##gff-version   3
#species: Drosophila_ananassae
#assembly-id: dana_caf1
#annotation-group-id: DGIL_SNP
#algorithm: DGIL_SNP = SNAP gene predictor, version 2005-07-27, 
#         : SNAP ref=http://www.biomedcentral.com/1471-2105/5/59/abstract
#         : bootstrapped HMM predictor from Dmelanogaster.hmm on Drosophila_ananassae assembly dna
#         : http://insects.eugenes.org/species/data/dana/gff/snap-dana_caf051209.hmm
#authors: gilbertd @ indiana.edu
#more-info: http://insects.eugenes.org/DroSpeGe/data/dana/
#date: 20060210

scaffold_1      DGIL_SNP        gene    863     1326    .       -       .       ID=GF_DGIL_SNP_28000001
scaffold_1      DGIL_SNP        exon    1282    1326    10.913  -       .       Parent=GF_DGIL_SNP_28000001
scaffold_1      DGIL_SNP        exon    1093    1137    14.326  -       .       Parent=GF_DGIL_SNP_28000001
#----------------------------------
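
Reading the feature lines shown above can be sketched in Python as follows. This is an
illustrative reader, not part of the distribution. It assumes the files use tab-separated
columns as GFF3 requires (the example above is displayed with spaces), and it does not
decode URL-escaped attribute values. The names Feature, parse_gff3_line and exons_by_gene
are hypothetical.

```python
from collections import namedtuple

Feature = namedtuple(
    "Feature", "seqid source type start end score strand phase attributes"
)


def parse_gff3_line(line):
    """Parse one tab-separated GFF3 feature line; return None for comments and blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    attrs = {}
    for pair in cols[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key.strip()] = value.strip()
    return Feature(
        cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
        None if cols[5] == "." else float(cols[5]),  # "." means no score
        cols[6],
        None if cols[7] == "." else int(cols[7]),    # "." means no phase
        attrs,
    )


def exons_by_gene(lines):
    """Group exon features by the value of their Parent attribute."""
    genes = {}
    for line in lines:
        feat = parse_gff3_line(line)
        if feat is not None and feat.type == "exon":
            genes.setdefault(feat.attributes.get("Parent"), []).append(feat)
    return genes
```

Passing an open file handle to exons_by_gene yields a dictionary keyed by gene ID, for
example GF_DGIL_SNP_28000001 with its two exons in the sample above.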


Model organism protein BLAST matches (HSP groups):
Source proteomes are taken from model organism databases.
  modDM = Drosophila melanogaster (FlyBase) proteome
  modMM = Mus musculus (MGI) proteome
  modCE = C. elegans (WormBase) proteome
  modSC = Sacc. cer. (SGD) proteome
Protein tBLASTn matches are grouped by HSP overlap.  Names are protein ID + HSP group.
  ID key:
    CG00000_G1      = primary gene match (best hit)
    CG00000_G2      = secondary gene match on the same scaffold
    CG00000_S[3..n] = tertiary and further matches of the same protein on the same scaffold as _G1
    CG00000_o[1..n] = further matches on other scaffolds
  Matches at p <= 1e-3 are collected; secondary matches that mostly overlap better matches are removed.
  HSPs are grouped by both protein fragment overlap (same part of protein) and target genome overlap.
  HSP groups include several kinds of gene event: alternate splice exons in the same gene, tandem and
  distant duplications, new genes composed of parts of several other genes, as well as computational artifacts.
  For further details and use in Gene Ontology groups, see
  /species/news/genome-summaries/gene-GO-function-association/
  D. Gilbert, May 06
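
The ID key above lends itself to a small classifier. The Python sketch below splits a
group ID into the protein ID and the kind of match. The function name and the label
strings are hypothetical, and the suffix pattern is inferred from the key as written
(_G1, _G2, _S followed by a number, _o followed by a number).

```python
import re

# Suffixes as given in the ID key: _G1, _G2, _S<n>, _o<n> (case-sensitive).
_GROUP_SUFFIX = re.compile(r"^(?P<protein>.+)_(?P<kind>G1|G2|S\d+|o\d+)$")


def classify_hsp_group(group_id):
    """Return (protein_id, label) for an HSP group ID such as 'CG8236_G1'."""
    match = _GROUP_SUFFIX.match(group_id)
    if match is None:
        raise ValueError("not an HSP group ID: %r" % group_id)
    kind = match.group("kind")
    if kind == "G1":
        label = "primary"
    elif kind == "G2":
        label = "secondary_same_scaffold"
    elif kind.startswith("S"):
        label = "further_same_scaffold"
    else:
        label = "other_scaffold"
    return match.group("protein"), label
```

Keeping only the groups labelled "primary" would give one best-hit locus per protein.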

Example HSP GFF:
  source field = MOD database; score = blast bitscore; HSP rows carry only a Parent= attribute (no ID),
  whose value is the group ID described above.
  tkey = protein target HSP group (location subset); tloc = protein target location; align = no. aa residues aligned
##gff-version 3
scaffold_13340  modMM   HSP     13833673        13833942        79.0    -       .       Parent=MGI:88491_G1;tkey=MGI:88491-HSP:23-120;tloc=23-120;align=98
scaffold_13340  modDM   HSP     11004286        11004747         230    +       .       Parent=CG8236_G1;tkey=CG8236-PA-HSP:1-153;tloc=1-153;align=154

#-------------------------------
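
The attribute column of an HSP row can be unpacked as in the Python sketch below. It
assumes every HSP row carries Parent, tkey, tloc and align, as both rows in the example
do, and that tloc is always written as start-end. The function name and the keys of the
returned dictionary are hypothetical.

```python
def parse_hsp_attributes(column9):
    """Unpack column 9 of an HSP row into typed fields."""
    attrs = dict(
        pair.split("=", 1) for pair in column9.strip().split(";") if pair
    )
    # tloc is the matched range on the protein, written as start-end.
    protein_start, protein_end = (int(x) for x in attrs["tloc"].split("-"))
    return {
        "group": attrs["Parent"],
        "tkey": attrs["tkey"],
        "protein_start": protein_start,
        "protein_end": protein_end,
        "aligned_residues": int(attrs["align"]),
    }
```

Applied to the first example row, this gives group MGI:88491_G1 with protein range 23 to
120 and 98 aligned residues.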

SNAP + homology guidance notes:

With Ian Korf's kind help, I've learned how SNAP can use protein homologies to
train and guide gene calls.  This produces a much closer gene mapping where
there is homology, yet retains unique gene calls in non-homologous regions.
Use 'snap -xdef protein-hsp.zff' for bootstrap training hmm as well as prediction.

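General prediction script (csh):
# Assumed to be set beforehand: $ZOE (SNAP program directory), $scd (species data directory),
# $dspp and $dpid (species and assembly ids), $chrs (all scaffolds), $chrtr (scaffolds used for training).
# -ACoding -0.001 -AStart -2 -AStop -2 lowers the influence of the $hmmtrain HMM relative to hsp.zff;
# these options are for training only, not for prediction.

set hmmtrain=fly
# Assumption: the original uses $dpid-train.fa and $dptrain.fa interchangeably, so they are equated here.
set dptrain=$dpid-train
echo "bootstrap snap $hmmtrain $dptrain.fa .."

# First pass: per-scaffold HSP hints, plus gene calls from the starting HMM to use as training data
/bin/cp /dev/null snaptrain/$dptrain.fa
foreach chr ($chrs)
  cat $dspp-hsp.zff | perl -ne "print if(/$chr\b/);" > ! snapout/hsp-$chr.zff
  cat $scd/perchr/$chr.fa >> snaptrain/$dptrain.fa
  $ZOE/snap -quiet -xdef snapout/hsp-$chr.zff \
   -ACoding -0.001 -AStart -2 -AStop -2 \
   $hmmtrain $scd/perchr/$chr.fa > snaptrain/$dpid-$chr-$hmmtrain.zff
end

# Collect the first-pass calls into one training annotation file
cd snaptrain
cp /dev/null $dptrain.zff
foreach chr ($chrtr)
  cat $dpid-$chr-$hmmtrain.zff >> $dptrain.zff
end

# Build the training set and estimate the HMM parameters
$ZOE/fathom $dptrain.zff $dptrain.fa -categorize 1000
$ZOE/fathom uni.ann uni.dna -export 1000 -plus
$ZOE/forge export.ann export.dna

# Assemble the retrained HMM
cd ../
$ZOE/hmm-assembler.pl -o ${dpid}-snapho snaptrain > snapho-$dpid.hmm
set hmm=snapho-$dpid.hmm

# Final prediction with the retrained HMM, still guided by the HSP hints.
# The foreach loop is added here; the original shows the command for a single $chr.
foreach chr ($chrs)
  $ZOE/snap -quiet -name 'snapho' -gff3 -xdef snapout/hsp-$chr.zff \
   -aa snapout/$chr.aa -tx snapout/$chr.tr \
   $hmm $scd/perchr/$chr.fa > snapout/$chr.gff
end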
#-----------------------------------------------------
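
The per-scaffold filter in the script (perl -ne "print if(/$chr\b/);") relies on the
word boundary \b so that scaffold_1 does not also select scaffold_13. A rough Python
equivalent is sketched below. The function name is hypothetical, and unlike the perl
one-liner it escapes the scaffold name before building the pattern.

```python
import re


def lines_for_scaffold(lines, scaffold):
    """Keep lines that mention `scaffold` followed by a word boundary."""
    pattern = re.compile(re.escape(scaffold) + r"\b")
    return [line for line in lines if pattern.search(line)]
```

Like the perl version, this matches the name anywhere in the line, so it is only safe
when scaffold names do not occur in other columns.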


Developed at the Genome Informatics Lab of Indiana University Biology Department