
2006 June: data here now use correct CAF1 assemblies.  New SNAP predictions
    using protein homology have been generated which produce closer
    matches to Dmel genes.  See the data/{species}/gff sections for caf1_DGIL_SNO

2006 May 14: The assembly version error below has been corrected
   for DroSpeGe services including GBrowse, BLAST and submitted annotations.
   Still to do: BioMart database, data/{species}/gff/ file access.
   See ftp://eugenes.org/eugenes/genomes/caf1a/ for corrected SNAP predictions,
   and ftp://eugenes.org/eugenes/genomes/caf1a/wrong-caf1/ for correction methods.

2006 May 13:

Ooops!  The prize for funniest annotation mistake goes to ...

*** Don Gilbert, who used the wrong assemblies ***

I used the wrong **CAF1 assemblies**.  Yes, there were two versions
for dana, dere, dgri, dmoj, dvir.  How I missed this obvious change I
don't know. I knew of and was working with these various assemblies,
and "checked and verified" that I was working with the corrected
assemblies.  I did use the right assemblies for dper, dsec, dpse,
dmel, dsim, dwil, dyak.

My apologies to those of you who have used these annotations, and
other DroSpeGe services labelled as Comparative Annotation Freeze 1.
I'm correcting this now, and should have these available in a few
days, including GFF annotations, BLAST, GBrowse and BioMart services.

It is a scaffold offset error due to different numbers of "N" spacers
inserted between contigs to join a scaffold in the final CAF1. I'm
correcting by changing the locations in GFF files of BLAST and gene
finding results. For any of you who may have data derived from this
and want to correct locations, I am providing the old and new changes
as AGP files (already part of current CAF1 fileset), and a small perl
script analogous to UCSC's "liftAgp" for updating GFF files.
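
A minimal sketch of this kind of coordinate lift, written in Python rather
than perl and not the script mentioned above. It assumes standard
tab-separated AGP rows, the same scaffold and contig names in the old and
new AGP files, and unchanged contig orientation, so that only the N spacer
lengths differ. The function names and the toy data are hypothetical.

```python
import bisect

def parse_agp(text):
    """Map each scaffold to sorted contig rows: (scaffold_beg, scaffold_end, contig_id)."""
    placements = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if cols[4] in ("N", "U"):  # gap rows: the N spacers between contigs
            continue
        placements.setdefault(cols[0], []).append((int(cols[1]), int(cols[2]), cols[5]))
    for rows in placements.values():
        rows.sort()
    return placements

def build_shifts(old_agp, new_agp):
    """For each old contig placement, record how far it moved in the new AGP."""
    new_beg = {(scaf, contig): beg
               for scaf, rows in new_agp.items() for beg, _end, contig in rows}
    shifts = {}
    for scaf, rows in old_agp.items():
        shifts[scaf] = [(beg, end, new_beg[(scaf, contig)] - beg)
                        for beg, end, contig in rows if (scaf, contig) in new_beg]
    return shifts

def lift_position(pos, rows):
    """Lift one 1-based scaffold position; None if it falls inside a gap."""
    i = bisect.bisect_right([r[0] for r in rows], pos) - 1
    if i < 0 or pos > rows[i][1]:
        return None
    return pos + rows[i][2]

def lift_gff_line(line, shifts):
    """Return the GFF line with lifted start/end, or None if it cannot be lifted."""
    cols = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(cols) < 9:
        return line.rstrip("\n")  # pass comments and directives through unchanged
    rows = shifts.get(cols[0], [])
    start = lift_position(int(cols[3]), rows)
    end = lift_position(int(cols[4]), rows)
    if start is None or end is None:
        return None
    cols[3], cols[4] = str(start), str(end)
    return "\t".join(cols)

# Toy data (made up): the spacer between two contigs shrinks from 100 to 50 Ns.
OLD_AGP = "\n".join([
    "scaffold_1\t1\t100\t1\tW\tcontig_1\t1\t100\t+",
    "scaffold_1\t101\t200\t2\tN\t100\tfragment\tyes",
    "scaffold_1\t201\t300\t3\tW\tcontig_2\t1\t100\t+",
])
NEW_AGP = "\n".join([
    "scaffold_1\t1\t100\t1\tW\tcontig_1\t1\t100\t+",
    "scaffold_1\t101\t150\t2\tN\t50\tfragment\tyes",
    "scaffold_1\t151\t250\t3\tW\tcontig_2\t1\t100\t+",
])
shifts = build_shifts(parse_agp(OLD_AGP), parse_agp(NEW_AGP))
old_line = "scaffold_1\tSNAP\tCDS\t210\t220\t.\t+\t0\tID=x"
lifted = lift_gff_line(old_line, shifts)
```

Start and end are lifted independently, so a feature that spans a spacer
keeps both ends anchored to their own contigs. A feature whose start or end
lies inside a spacer returns None and needs a manual look.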

I've loaded the various CAF1 annotations to view in GBrowse 
and other dxxx-hsg/ now labelled "Annot2".

*** And my prize for best new gene predictions goes to ***

SNAP provides the closest call to new Drosophila reproductive genes
identified by lab work of David Begun and colleagues than the other

Someone at UC Davis should give Ian Korf a pat on the back for this.
You say, since I created the contest, and awarded Ian and myself the
prize ... see here for your own conclusions:

NCBI Gnomon, combining ab initio and homology tests, came in second
for matching these new repro genes.

Fly gene HSP annotations (secondary Dmel protein tblastn matches) did
well at matching the new genes. These are used also in estimating
gene gain/loss via Gene Ontology groupings, and probably would make a
useful addition to public annotations of these species.

SNAP's good performance surprised me more than finding I'd used the wrong
assemblies (this result isn't confounded by the assembly error). I ran
SNAP following out-of-the-box directions, with no special effort to
tune its HMM models beyond feeding it species-specific genome DNA for
training.  This is a useful prediction set if you want to locate new
genes, have some alternate evidence, and don't mind wading thru
spurious predictions.

Keep in mind that SNAP predictions use no homology information (beyond
starting with a D.melanogaster trained HMM), while I gather the
other annotation methods used homology to Dmel in various ways.

In good-Dmel regions (high Dmel homology) SNAP and the others pretty
much called all the same CDS-exons.  SNAP's gene models leave
something to be desired in a number of cases - not matching Dmel ones.
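
One way to put a number on "called the same CDS-exons" is to count exons
with identical coordinates in two prediction sets. This is only a sketch,
not the comparison actually used here: it assumes both sets are GFF with
feature type "CDS" on the same assembly, and the function names and toy
rows are hypothetical.

```python
def cds_exons(gff_text):
    """Collect CDS exons as (seqid, start, end, strand) tuples."""
    exons = set()
    for line in gff_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) < 9 or cols[2] != "CDS":
            continue
        exons.add((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return exons

def shared_fraction(gff_a, gff_b):
    """Fraction of CDS exons in set A that set B calls with identical coordinates."""
    a, b = cds_exons(gff_a), cds_exons(gff_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)

# Toy rows (made up): the two methods agree on one exon and differ on the other.
SET_A = "\n".join([
    "s1\tmethodA\tCDS\t10\t50\t.\t+\t0\tID=a1",
    "s1\tmethodA\tCDS\t100\t150\t.\t+\t0\tID=a2",
])
SET_B = "\n".join([
    "s1\tmethodB\tCDS\t10\t50\t.\t+\t0\tID=b1",
    "s1\tmethodB\tCDS\t100\t160\t.\t+\t0\tID=b2",
])
fraction = shared_fraction(SET_A, SET_B)
```

Exact matching is strict: an exon that differs by a single base at one
splice site counts as a miss, so an overlap-based measure may be the fairer
choice in weak-homology regions.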

It is in the weak-homology regions that the various predictors have
differing results.  My eyeball test, using Dmel protein matches as
an independent reference, says each method made mistakes the others didn't,
and no one method seemed a lot better than others.  I've a favorite
for the good-Dmel and not-Dmel region predictors, but will reserve

In summary of annotations at
* DGIL: the older CAF1 assemblies I used are labelled
dana_caf051209, dere_caf051209, dgri_caf051209, dmoj_caf051209,
dvir_caf051209, after their release date (2005-12-09).  They have the
same sequence (contigs.bases), produced via reconciliation by J. Yorke
and UMD colleagues, and the same scaffold assembly order as the current
CAF1 assemblies, which are dated 2006-02-10.  What changed between the
1st and 2nd CAF1 release of these is the number of N spacers added
between contigs in creating scaffolds.  So the error isn't as bad as it
might be: those who made use of these via BLAST, GBrowse, etc. will find
the results meaningful in terms of found annotations and relative
locations.  However, the insertion/deletion of extra N's between contigs
in a scaffold offsets the absolute locations as I provided them,
compared to other CAF1 annotations.

There are a few minor changes I made to other Annotation_Submission
GFF, but mostly these data are clean:
*  Variation in naming chromosome/scaffold:
  - Some added 'species' prefixes to these.
  - Broad's use of 'super_' led some of us to change to 'scaffold_' for
    consistency, and some not.
  - Chromosome 'Chr', 'Ch' or no leading prefix.
*  Many EISE_CEX Exonerate predictions have the wrong GFF strand,
   but the correct strand is in the GFF embedded in the 'localid' field.
*  NCBI, OXFD used 'PARENT=' where GFF3 requires the mixed-case "Parent="
*  For local use, I changed GFF source field to GROUP_METHOD
     EISE/genemapper:  s/GeneMapper/EISE_CGM/;
     EISE/exonerate/:  s/Exonerate/EISE_CEX/;
     PACH_GMP: s/GeneMapper/PACH_GMP/;

-- Don Gilbert, 13 May 2006

Developed at the Genome Informatics Lab of Indiana University Biology Department