Preliminary comparison of Drosophila species gene predictions

This is a rough estimate of sensitivity at finding new and old genes, along
with specificity at calling genes only where there is evidence. The comparison
is for exon-level matches, using a 1 kb window on genomes (exons overlap if
they fall in the same 1 kb window). Don't read too much into these numbers, as
the comparison used is an approximation only, to suggest where to look further.

- Don Gilbert, updated 16 August 2006

Predictions from http://rana.lbl.gov/drosophila/wiki/index.php/Annotation_Submission

-------------------

Sensitivity for "Old genes"
        DL_SNP  DL_SNO  EE_CEX  EE_CGM  EE_CGW  NI_GNO  PH_GMP  BN_NSC  RI_GID  OD_GPI  BZ_CNA  EE_GLN  MD_JIG  EE_GLR  RANDOM
Dana    0.923   0.930   0.886   0.917   0.872   0.933   0.896   0.938   0.912   0.857   0.903   0.966   0.869   0.948   0.504
Dmoj    0.845   0.949   0.894   0.903   0.941   0.939   0.877   0.952   0.906   0.836   0.894   0.975   0.869   0.958   0.502
Dpse    0.941   0.942   0.899   0.892   0.929   0.932   0.867   0.940   0.925   0.831   0.922   0.959   0.869   0.946   0.500
Dvir    0.916   0.945   0.900   0.905   0.866   0.940   0.876   0.953   0.903   0.842   0.892   0.971   0.864   0.960   0.500
Dyak    0.926   0.935   0.914   0.931   0.862   0.935   0.897   0.927   0.916   0.905   0.912   0.967   0.882   0.953   0.500
Dsim    0.934   0.939   0.912   0.899   0.877   0.946   0.862   0.918   0.910   0.791   0.913   0.968   0.824   0.940   0.501
----
mean    0.914   0.940   0.901   0.908   0.891   0.938   0.879   0.938   0.912   0.844   0.906   0.968   0.863   0.951   0.501

Sensitivity for "New genes"
        DL_SNP  DL_SNO  EE_CEX  EE_CGM  EE_CGW  NI_GNO  PH_GMP  BN_NSC  RI_GID  OD_GPI  BZ_CNA  EE_GLN  MD_JIG  EE_GLR  RANDOM
Dana    0.827   0.811   0.066   0.068   0.062   0.463   0.083   0.650   0.647   0.082   0.628   0.730   0.069   0.636   0.509
Dmoj    0.801   0.854   0.132   0.135   0.147   0.543   0.166   0.660   0.667   0.121   0.603   0.702   0.144   0.647   0.497
Dpse    0.811   0.773   0.371   0.370   0.405   0.574   0.410   0.581   0.689   0.305   0.567   0.677   0.347   0.618   0.506
Dvir    0.786   0.843   0.139   0.143   0.135   0.625   0.177   0.657   0.613   0.141   0.582   0.659   0.145   0.655   0.502
Dyak    0.784   0.784   0.078   0.078   0.065   0.468   0.089   0.554   0.588   0.102   0.504   0.607   0.074   0.541   0.500
Dsim    0.768   0.742   0.276   0.255   0.224   0.572   0.286   0.543   0.560   0.187   0.414   0.602   0.227   0.542   0.493
----
mean    0.796   0.801   0.177   0.175   0.173   0.541   0.202   0.608   0.627   0.156   0.550   0.663   0.168   0.607   0.501

Specificity to exons with evidence
        DL_SNP  DL_SNO  EE_CEX  EE_CGM  EE_CGW  NI_GNO  PH_GMP  BN_NSC  RI_GID  OD_GPI  BZ_CNA  EE_GLN  MD_JIG  EE_GLR  RANDOM
Dana    0.548   0.565   0.945   0.958   0.970   0.795   0.947   0.798   0.746   0.943   0.833   0.760   0.959   0.818   0.278
Dmoj    0.531   0.513   0.941   0.951   0.964   0.868   0.941   0.821   0.786   0.946   0.891   0.837   0.950   0.872   0.272
Dpse    0.590   0.620   0.947   0.960   0.971   0.876   0.949   0.877   0.808   0.962   0.896   0.867   0.959   0.893   0.322
Dvir    0.516   0.541   0.941   0.951   0.967   0.837   0.941   0.816   0.786   0.947   0.885   0.830   0.950   0.858   0.261
Dyak    0.605   0.613   0.948   0.963   0.971   0.861   0.956   0.845   0.805   0.953   0.903   0.845   0.966   0.883   0.329
Dsim    0.595   0.602   0.947   0.963   0.969   0.818   0.955   0.811   0.801   0.942   0.870   0.823   0.963   0.837   0.317
----
mean    0.564   0.576   0.945   0.958   0.969   0.842   0.948   0.828   0.789   0.949   0.880   0.827   0.958   0.860   0.296

-------------------

* 31May: BATZ_CNA dataset updated.
* 1Jun: Changed measured prediction feature from exon to CDS for BREN_NSC,
  PACH_GMP, NCBI_GNO, as evidence is probably measuring only coding regions,
  and these three methods predict non-coding exons as well as CDS.
  Also removed cross-species co-prediction (PRED_CROSSMATCH) as new gene
  evidence, as it was a partial analysis and may be biased.
* 17Jul: Added UMD Jigsaw, EISE Glean prediction combiner statistics.
* 14Aug: Added EISE GleanR recombiner.

Statistics:
  Sensitivity to Old/New genes = ( predicted.overlaps.evidence / total.evidence.exons )
  Specificity = 1 - ( no_evidence.predicted.exons / total.predicted.exons )

  "Old genes" evidence = HSP_modDM (Dmel protein matches; this does however
                         include secondary matches to potential new genes)
  "New genes" evidence = OLIV_EXP - HSP_modDM
  Removing HSP_modDM matches from OLIV_EXP yields around 10,000 exon
  locations/species.

Evidence kinds:
  HSP_modDM       : Dmel protein blast match exons. These include secondary
                    matches, along with any best gene match.
  OLIV_EXP        : Oliver et al. expression data
  PRED_CROSSMATCH : reciprocal gene prediction match (not used in statistics)
                    (2+ species with same predicted exons using
                    transcript x genome megablast)

Prediction groups and data from
  exons of DGIL_SNP, DGIL_SNO, EISE_CEX, EISE_CGM, EISE_CGW ( == coding exons)
  CDS of NCBI_GNO, PACH_GMP, BREN_NSC, RGUI_GID, OXFD_GPI, BATZ_CNA, UMD_JIG, EISE_GLN
  RANDOM : randomly generated exon predictions (0.5 chance at any 1 kb genome window).
  from http://rana.lbl.gov/drosophila/wiki/index.php/Annotation_Submission
  Where predictions included both exon and CDS features, exon included all CDS.

The species shown here are those with OLIV_EXP data.
Predictions on small scaffolds under 20 kb were removed.

Predictor Key:
  DL_SNP: DGIL_SNP: SNAP predictor, no homology
  DL_SNO: DGIL_SNO: SNAP predictor using homology evidence
  EE_CEX: EISE_CEX: Exonerate (Dmel gene mapping)
  EE_CGM: EISE_CGM: GeneMapper (Dmel gene mapping)
  EE_CGW: EISE_CGW: GeneWise (Dmel gene mapping)
  NI_GNO: NCBI_GNO: GNOMON predictor using homology evidence
  PH_GMP: PACH_GMP: GeneMapper (Dmel gene mapping)
  BN_NSC: BREN_NSC: N-SCAN predictor using homology evidence
  RI_GID: RGUI_GID: geneid predictor
  OD_GPI: OXFD_GPI: Oxford gene pipeline (exonerate with Dmel gene mapping)
  BZ_CNA: BATZ_CNA: CONTRAST predictor
  MD_JIG: UMD_JIG : Jigsaw prediction combiner
  EE_GLN: EISE_GLN: Glean prediction combiner
  EE_GLR: EISE_GLR: Glean recombined with higher weight to gene mappers

Total gene prediction counts:
        ------ Predictors --------
Species DL_SNP  DL_SNO  EE_CEX  EE_CGM  EE_CGW  NI_GNO  PH_GMP  BN_NSC  RI_GID  OD_GPI  BZ_CNA  MD_JIG  EE_GLN  EE_GLR
dmel    20416   23937   NA      NA      NA      15538   NA      NA      12671   NA      14224   NA      NA      NA
dsim    28265   24623   35273   18707   19180   17733   12843   19004   13797   12422   15530   12632   19397   18273
dsec    36782   31358   34854   19110   19255   22787   13342   18428   29159   14423   15887   13246   29824   21332
dyak    32761   32029   34716   19057   19210   18313   13307   19513   15647   14290   16923   13163   22529   19430
dere    25198   24654   33587   19063   19211   16663   13401   16753   18654   13568   14863   12764   17880   16881
dana    41918   41007   33504   18602   18909   23496   12473   20519   30348   12920   18891   12018   33411   22551
dper    38209   36086   35069   18056   18625   24472   11725   19924   28798   11924   18611   11895   34794   23029
dpse    28911   26243   34828   18211   18690   18910   11932   17250   19060   11690   16158   11991   19894   17328
dwil    53625   50309   34987   18184   18583   24738   11546   16963   25279   10780   12930   11079   34112   20257
dmoj    34235   39387   33573   17958   18449   17785   11342   19052   19994   10902   14779   11264   22451   17739
dvir    35378   35285   33940   18054   18485   18438   11443   18444   26095   11117   14476   11142   19251   17684
dgri    34975   34420   34389   17840   18391   17758   11249   17566   29516   11052   14003   11350   26971   16901
----
mean    34223   33278   34429   18440   18817   19719   12237   18492   22418   12281   15606   12049   25501   19219

Species mean
        dmel    dsim    dsec    dyak    dere    dana    dper    dpse    dwil    dmoj    dvir    dgri
        17357   18933   22393   20458   18606   23810   23158   19162   24108   21192   20407   20931

Prediction set 31May06
Notes: EISE_CGM, EISE_CGW 'gene' feature is mRNA; using trans_id=FBpp as count
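
Worked sketch of the statistics:
The sensitivity and specificity formulas given under "Statistics", together
with the 1 kb window rule for exon overlap, can be illustrated with a short
Python sketch. This is not the script used to produce the tables above. The
function names, the (scaffold, start, end) tuple layout, the 0-based inclusive
coordinates, and the toy exon lists are all assumptions made for illustration.

```python
WINDOW = 1000  # 1 kb window, per the overlap rule described in the text


def exon_windows(exon):
    """Return the set of (scaffold, window index) bins an exon touches.

    Assumes exon = (scaffold, start, end) with 0-based inclusive coordinates.
    """
    scaffold, start, end = exon
    return {(scaffold, w) for w in range(start // WINDOW, end // WINDOW + 1)}


def occupied_windows(exons):
    """Union of the window bins touched by any exon in the list."""
    occupied = set()
    for exon in exons:
        occupied |= exon_windows(exon)
    return occupied


def sensitivity(predicted, evidence):
    """predicted.overlaps.evidence / total.evidence.exons"""
    pred_bins = occupied_windows(predicted)
    hits = sum(1 for e in evidence if exon_windows(e) & pred_bins)
    return hits / len(evidence)


def specificity(predicted, evidence):
    """1 - ( no_evidence.predicted.exons / total.predicted.exons )"""
    ev_bins = occupied_windows(evidence)
    unsupported = sum(1 for p in predicted if not (exon_windows(p) & ev_bins))
    return 1 - unsupported / len(predicted)


def new_gene_evidence(oliv_exp, hsp_moddm):
    """OLIV_EXP - HSP_modDM: keep expression exons that share no window with
    a Dmel protein match (one possible reading of the subtraction)."""
    old_bins = occupied_windows(hsp_moddm)
    return [e for e in oliv_exp if not (exon_windows(e) & old_bins)]


# Toy data (hypothetical, not taken from any real genome)
evidence = [("scaf1", 100, 400), ("scaf1", 2500, 2800),
            ("scaf1", 5100, 5300), ("scaf2", 100, 300)]
predicted = [("scaf1", 150, 380), ("scaf1", 2900, 2950),
             ("scaf1", 7200, 7400)]

print(sensitivity(predicted, evidence))            # 0.5
print(round(specificity(predicted, evidence), 3))  # 0.667
```

In the toy data, two of four evidence exons share a window with a prediction
(sensitivity 0.5), and one of three predictions falls in a window with no
evidence (specificity 0.667). Counting per exon instead of per window is a
choice made here to follow the wording of the formulas; the original analysis
may have counted differently.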