Pea aphid genome data release policy from Baylor College of Medicine
0. What's New
1. Introduction
2. Conditions for use
3. Description of files
4. Sequence and scaffold statistics
5. Read statistics
6. Other resources
7. History

0. What's New

Acyr_1.0 is a preliminary assembly of the pea aphid, Acyrthosiphon pisum, using whole genome shotgun (WGS) reads from small insert clones as well as BAC end sequences.

1. Introduction

This information is for the first release (Acyr_1.0) of the draft genome sequence of the pea aphid, Acyrthosiphon pisum. This is a draft sequence and may contain errors, so users should exercise caution. Typical errors in draft genome sequences include mis-assemblies of repeated sequences, collapses of repeated regions, and unmerged overlaps (e.g. due to polymorphisms) creating artificial duplications. However, base accuracy in contigs (contiguous blocks of sequence) is usually very high, with most errors near the ends of contigs.

The Acyr_1.0 release was produced by assembling whole genome shotgun reads with the Atlas genome assembly system at the Baylor College of Medicine Human Genome Sequencing Center. Several WGS libraries, with inserts of 2-3 kb, 4-5 kb and ~130 kb, were used to produce the data. Approximately 3.13 million reads were assembled, representing about 464 Mb of sequence and about 6.2X coverage of the (clonable) A. pisum genome. BAC end sequences were included for scaffolding. So far none of these sequences have been anchored to chromosomes, so it is unknown which chromosomes they come from.

Aphids for DNA isolation were from a clone, LSR1.AC.F1, resulting from a single generation of inbreeding of clone LSR1. The aphids were then treated with ampicillin to eliminate the facultative symbiont, Regiella insecticola. Prior to DNA preparation, aphids were heat treated to reduce the number of primary symbionts, Buchnera aphidicola. Entire aphid colonies on broad bean plants were placed in a 30°C incubator for 4 days.
Quantification of Buchnera DNA revealed a significant decrease in the level of Buchnera. Approximately 2% of the sequencing reads came from the Buchnera genome and were removed prior to assembly.

The products of the Atlas assembler are a set of contigs (contiguous blocks of sequence) and scaffolds. Scaffolds include sequence contigs that can be ordered and oriented with respect to each other as well as isolated contigs that could not be linked (single contig scaffolds or singletons). Reads which clustered into groups of 3 or fewer were not assembled and are found in the collection of reads called bin0. The N50 size of the contigs is 10.7 kb and the N50 of the scaffolds is 88.5 kb. The N50 size is the length such that 50% of the assembled genome lies in blocks of the N50 size or longer. The total length of all contigs is 464 Mb. When the gaps between contigs in scaffolds are included, the total span of the assembly is 466 Mb (some scaffolds with large gaps may have artificially increased the assembly size).

The assembly Acyr_1.0 was tested against available A. pisum sequence data sets (EST sequences and finished BAC sequences) for extent of coverage (completeness). When assembled contigs were tested, over 84% of the sequences in these data sets were found to be represented, indicating that the shotgun libraries used to sequence the genome were comprehensive and the assembly covered most of the cloned genome. Of 67,294 EST sequences, 85% were contained in the assembled contigs.

The quality of the assembly was also tested by aligning 13 finished BACs (a total of 1,371.6 kb) to the assembly. The assembled genomic coverage of the BACs was low, between 20% and 80%, with an average of approximately 70% of the BAC sequence in the assembly. The proportion of overlapping bases in all the aligned sequences was approximately 9%, possibly suggesting the existence of some polymorphism within the isolated DNA or other assembly problems.
We hope to rectify these quality issues with a future assembly.

2. Conditions for use

These data are made available before scientific publication with the following understanding:

- The data may be freely downloaded, used in analyses, and repackaged in databases.
- Users are free to use the data in scientific papers analyzing particular genes and regions if the providers of this data (Baylor College of Medicine Human Genome Sequencing Center) are properly acknowledged. Please cite the BCM-HGSC web site or publications from BCM-HGSC referring to the genome sequence.
- The BCM-HGSC plans to publish the assembly and genomic annotation of the dataset, including large-scale identification of regions of evolutionary conservation.
- This is in accordance with the understandings reached at the Fort Lauderdale meeting discussing Community Resource Projects (see http://www.genome.gov/Pages/Research/WellcomeReport0303.pdf) and with the resulting NHGRI policy statement (http://www.genome.gov/page.cfm?pageID=10506537).
- Any redistribution of the data should carry this notice.

3. Description of files

There are 4 directories and one file.

I. contigs/ directory

This directory has 3 files for assembled contigs:

    Acyr20071212-contigs.fa
    Acyr20071212-contigs.fa.qual
    Acyr20071212.agp

The agp file describes the positions and orientations of the contigs. It follows the standard NCBI format (http://www.ncbi.nlm.nih.gov/Genbank/WGS.agpformat.html).

II. linear_scaffolds/ directory

This directory has 2 files for AGP sequences:

    Acyr20071212-genome.fa
    Acyr20071212-genome.fa.qual

III. unassembled_reads/ directory

This directory has one fasta file and its corresponding quality file for bin0 and highly repetitive reads. Reads that were in clusters (created by Atlas_Overlapper) of 3 or fewer reads are collectively called bin0 reads. Some reads were too repetitive for assembly and so were also left unassembled.
The files are:

    Acyr20071212-unassembled_reads.fa
    Acyr20071212-unassembled_reads.fa.qual

IV. blast/ directory

This directory contains blast databases of the sequence files described above.

V. README file

This file.

4. Sequence and scaffold statistics

Scaffold and Contig statistics

Scaffolds/Contigs    Number    N50 (kb)    Total length (Mb)
Scaffolds            22,801    86.9        464.3
Contigs              72,844    10.8        446.6

5. Read statistics

insert size    raw reads    passed reads    asm reads    clone
2-5 kb         4,325,313    3,955,990       3,044,414    plasmid
35 kb             24,673        8,158           5,294    fosmid
110-130 kb        56,246       45,140           2,286    BAC
Total          4,406,232    4,009,288       3,051,994

6. Other resources

In addition to the files described in section 3, the HGSC website has BLAST servers for searching the assembled contigs, linear scaffolds and unassembled reads (http://www.hgsc.bcm.tmc.edu/blast.hgsc?organism=2).

7. History

Acyr_1.0 (December 2007) is the first draft assembly of the genome of the pea aphid, Acyrthosiphon pisum.
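
Note on N50: the Introduction defines the N50 size as the length such that 50% of the assembled genome lies in blocks of that size or longer. The following Python sketch shows one way to compute that figure from a FASTA file of contigs or scaffolds. It is an illustration only and is not part of the Atlas pipeline; the function names n50 and fasta_lengths are hypothetical, and the exact convention used to produce the numbers in this README (for example, whether gap bases are counted) is not stated here, so results may differ slightly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are taken from longest to shortest until at least half
    of the total length is covered; the length of the last sequence
    taken is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequence lengths given")


def fasta_lengths(lines):
    """Yield the length of each record in an iterable of FASTA lines."""
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                yield current
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        yield current


# Possible usage (assumes the contigs file has been downloaded locally):
#     with open("contigs/Acyr20071212-contigs.fa") as handle:
#         print(n50(fasta_lengths(handle)))
```

Sorting the lengths once keeps the calculation simple, and reading the FASTA file line by line avoids holding the sequences themselves in memory.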