2006 June: Data here now use the correct CAF1 assemblies. New SNAP predictions using protein homology have been generated, which produce closer matches to Dmel genes. See the data/{species}/gff sections for caf1_DGIL_SNO files.

2006 May 14: The assembly version error described below has been corrected for DroSpeGe services including GBrowse, BLAST and submitted annotations. Still to do: BioMart database, data/{species}/gff/ file access. See ftp://eugenes.org/eugenes/genomes/caf1a/ for corrected SNAP predictions, and ftp://eugenes.org/eugenes/genomes/caf1a/wrong-caf1/ for correction methods.

2006 May 13: Ooops! The prize for funniest annotation mistake goes to ...

*** Don Gilbert, who used the wrong assemblies ***

I used the wrong **CAF1 assemblies**. Yes, there were two versions for dana, dere, dgri, dmoj, dvir. How I missed this obvious change I don't know. I knew of and was working with these various assemblies, and "checked and verified" that I was working with the corrected assemblies. I did use the right assemblies for dper, dsec, dpse, dmel, dsim, dwil, dyak.

My apologies to those of you who have used these annotations, and other DroSpeGe services labelled as Comparative Annotation Freeze 1. I'm correcting this now, and should have the corrected data available in a few days, including GFF annotations, BLAST, GBrowse and BioMart services.

It is a scaffold offset error, due to different numbers of "N" spacers inserted between contigs to join a scaffold in the final CAF1. I'm correcting it by changing the locations in the GFF files of BLAST and gene finding results. For any of you who may have data derived from this and want to correct locations, I am providing the old and new changes as AGP files (already part of the current CAF1 fileset), and a small perl script analogous to UCSC's "liftAgp" for updating GFF files.

I've loaded the various CAF1 annotations to view in GBrowse at http://insects.eugenes.org/species/cgi-bin/gbrowse/dyak-hsg/ and the other dxxx-hsg/ databases, now labelled "Annot2".
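For readers who want to see what such a coordinate lift involves, here is a minimal Python sketch. It is not the perl script mentioned above. The function names (`parse_agp`, `build_lifter`, `lift_gff_line`) are hypothetical. It assumes tab-separated AGP files where gap lines have component type N or U, that each contig appears once per scaffold, and that contig orientation is unchanged between the old and new assemblies, so only the starting offset of each contig moves.

```python
import bisect


def parse_agp(text):
    """Return {scaffold: [(object_beg, object_end, contig_id), ...]}, skipping gap lines."""
    placements = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if fields[4] in ("N", "U"):  # gap line: no contig placed here
            continue
        placements.setdefault(fields[0], []).append(
            (int(fields[1]), int(fields[2]), fields[5])
        )
    for rows in placements.values():
        rows.sort()
    return placements


def build_lifter(old_agp_text, new_agp_text):
    """Build lift(scaffold, pos): old scaffold position -> new position, or None."""
    old = parse_agp(old_agp_text)
    new_start = {
        (scaffold, contig): beg
        for scaffold, rows in parse_agp(new_agp_text).items()
        for beg, _end, contig in rows
    }
    old_starts = {scaffold: [r[0] for r in rows] for scaffold, rows in old.items()}

    def lift(scaffold, pos):
        rows = old.get(scaffold)
        if not rows:
            return None
        # Find the last contig that starts at or before pos.
        i = bisect.bisect_right(old_starts[scaffold], pos) - 1
        if i < 0:
            return None
        beg, end, contig = rows[i]
        if pos > end:  # position falls inside an N spacer
            return None
        target = new_start.get((scaffold, contig))
        if target is None:
            return None
        return target + (pos - beg)  # same offset within the contig

    return lift


def lift_gff_line(line, lift):
    """Lift start/end of one GFF feature line; return None if it cannot be lifted."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return line
    fields = line.split("\t")
    if len(fields) < 9:
        return line
    start = lift(fields[0], int(fields[3]))
    end = lift(fields[0], int(fields[4]))
    if start is None or end is None:
        return None
    fields[3], fields[4] = str(start), str(end)
    return "\t".join(fields)
```

Start and end are lifted independently, so a feature that spans a spacer keeps both ends anchored to their own contigs. Features whose ends fall inside a spacer come back as `None` and would need manual review.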
*** And my prize for best new gene predictions goes to ***

SNAP provides a closer call than the other annotations to the new Drosophila reproductive genes identified by the lab work of David Begun and colleagues. Someone at UC Davis should give Ian Korf a pat on the back for this. You may say that I created the contest, and awarded Ian and myself the prizes ... see here to draw your own conclusions: http://insects.eugenes.org/species/news/genome-summaries/geneprediction-test/

NCBI Gnomon, combining ab initio and homology tests, came in second for matching these new repro genes. Fly gene HSP annotations (secondary Dmel protein tblastn matches) did well at matching the new genes. These are also used in estimating gene gain/loss via Gene Ontology groupings, and would probably make a useful addition to the public annotations of these species. http://insects.eugenes.org/species/news/genome-summaries/gene-GO-function-association/

SNAP's good performance surprised me more than finding I'd used the wrong assemblies (this result isn't confounded by the assembly error). I ran SNAP following out-of-the-box directions, with no special effort to tune its HMM models beyond feeding it species-specific genome DNA for training. This is a useful prediction set if you want to locate new genes, have some alternate evidence, and don't mind wading through spurious predictions.

Keep in mind that SNAP predictions use no homology information (beyond starting with a D. melanogaster trained HMM), while I gather the other annotation methods used homology to Dmel in various ways. In good-Dmel regions (high Dmel homology) SNAP and the others pretty much called all the same CDS exons. SNAP's gene models leave something to be desired in a number of cases, where they do not match the Dmel ones. It is in the weak-homology regions that the various predictors have differing results.
My eyeball test, using Dmel protein matches as an independent reference, says each method made mistakes the others didn't, and no one method seemed a lot better than the others. I've a favorite for the good-Dmel and not-Dmel region predictors, but will reserve that for now.

In summary of the annotations at http://rana.lbl.gov/drosophila/wiki/index.php/Annotation_Submission

* DGIL: the older CAF1 assemblies I used are labelled dana_caf051209, dere_caf051209, dgri_caf051209, dmoj_caf051209, dvir_caf051209, after their release date (2005-12-09). They have the same sequence (contigs.bases), produced via reconciliation by J. Yorke and UMD colleagues, and the same scaffold assembly order as the current CAF1 assemblies, which are dated 2006-02-10. What changed between the 1st and 2nd CAF1 releases of these is the number of N spacers added between contigs in creating scaffolds. So the error isn't as bad as it might be: those who made use of these via BLAST, GBrowse, etc. will find the results meaningful in terms of found annotations and relative locations. However, the insertion/deletion of extra N's between contigs in a scaffold offsets the absolute locations I provided, compared to other CAF1 annotations.

There are a few minor changes I made to the other Annotation_Submission GFF, but mostly these data are clean:

* Variation in naming chromosome/scaffold:
  - Some added 'species' prefixes to these.
  - Broad's use of 'super_' led some of us to change to 'scaffold_' for consistency, and some not.
  - Chromosome 'Chr', 'Ch' or no leading prefix.
* EISE_CEX Exonerate predictions have the wrong GFF strand for many, but the correct strand is in the GFF embedded in the 'localid' field.
* NCBI, OXFD used 'PARENT=' where GFF3 requires the capitalized form "Parent=".
* For local use, I changed the GFF source field to GROUP_METHOD:
  - EISE/genemapper: s/GeneMapper/EISE_CGM/;
  - EISE/exonerate/: s/Exonerate/EISE_CEX/;
  - PACH_GMP: s/GeneMapper/PACH_GMP/;

-- Don Gilbert, 13 May 2006
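The GFF cleanups listed above (scaffold naming, the 'PARENT=' attribute, and the source field) can be sketched in Python as below. This is an illustration under assumptions, not the commands actually used. The function name `normalize_gff_line` is hypothetical, and the source renaming is passed in per file because the same method name (GeneMapper) maps to a different GROUP_METHOD label depending on the submitting group.

```python
import re


def normalize_gff_line(line, source_renames):
    """Apply the three cleanups to one GFF line; pass comments and odd lines through.

    source_renames: mapping for the source column of this particular file,
    e.g. {"GeneMapper": "EISE_CGM"} for an EISE genemapper submission.
    """
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return line
    fields = line.split("\t")
    if len(fields) != 9:
        return line
    # Broad's 'super_' scaffold names become 'scaffold_' for consistency.
    if fields[0].startswith("super_"):
        fields[0] = "scaffold_" + fields[0][len("super_"):]
    # Source column becomes the GROUP_METHOD label.
    fields[1] = source_renames.get(fields[1], fields[1])
    # GFF3 attribute tags are case sensitive: 'Parent', not 'PARENT'.
    fields[8] = re.sub(r"(^|;)PARENT=", r"\1Parent=", fields[8])
    return "\t".join(fields)
```

Species prefixes and the 'Chr'/'Ch' chromosome variants are left alone here, since the target naming for those is not spelled out above.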