Pea aphid genome data release policy from Baylor College of Medicine
0. What's New
1. Introduction
2. Conditions for use
3. Description of files
4. Sequence and scaffold statistics
5. Read statistics
6. Other resources
7. History

0. What's New

Acyr_1.0 is a preliminary assembly of the pea aphid, Acyrthosiphon pisum,
using whole genome shotgun (WGS) reads from small insert clones as well
as BAC end sequences.

1. Introduction

This information is for the first release (Acyr_1.0) of the draft
genome sequence of the pea aphid, Acyrthosiphon pisum. This is a
draft sequence and may contain errors so users should exercise
caution. Typical errors in draft genome sequences include
mis-assemblies of repeated sequences, collapses of repeated regions,
and unmerged overlaps (e.g. due to polymorphisms) creating artificial
duplications. However, base accuracy in contigs (contiguous blocks of
sequence) is usually very high with most errors near the ends of
contigs.

The Acyr_1.0 release was produced by assembling whole genome shotgun
reads with the Atlas genome assembly system at the Baylor College of
Medicine Human Genome Sequencing Center. Several WGS libraries, with
inserts of 2-3 kb, 4-5 kb and ~130 kb, were used to produce the data.
Approximately 3.13 million reads were assembled, representing about
464 Mb of sequence and about 6.2X coverage of the (clonable) A. pisum
genome. BAC end sequences were included for scaffolding. So far none of
these sequences have been anchored to chromosomes, so it is unknown
which chromosomes they come from.

Aphids for DNA isolation were from a clone, LSR1.AC.F1, resulting from a 
single generation of inbreeding of clone LSR1. The aphids were then treated 
with ampicillin to eliminate the facultative symbiont, Regiella insecticola. 
Prior to DNA preparation aphids were heat treated to reduce the number of 
primary symbionts, Buchnera aphidicola. Entire aphid colonies on broad bean 
plants were placed in a 30°C incubator for 4 days. Quantification of levels 
of Buchnera DNA revealed a significant decrease in the level of Buchnera. 
Approximately 2% of the sequencing reads came from the Buchnera genome and
were removed prior to assembly.


The products of the Atlas assembler are a set of contigs (contiguous
blocks of sequence) and scaffolds. Scaffolds include sequence contigs
that can be ordered and oriented with respect to each other as well as
isolated contigs that could not be linked (single contig scaffolds or
singletons). Reads which clustered into groups of 3 or fewer were not
assembled and are found in the collection of reads called bin0. The
N50 size of the contigs is 10.7 kb and the N50 of the scaffolds is
88.5 kb. The N50 size is the length such that 50% of the assembled
genome lies in blocks of the N50 size or longer.
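
The N50 definition above can be illustrated with a short Python sketch.
This is not part of the Atlas system; the function name n50 and the toy
lengths are assumptions made for illustration only.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    The N50 is the length L such that blocks of length L or longer
    together contain at least 50% of the total assembled sequence.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no lengths given")


# Toy example (hypothetical lengths, not taken from Acyr_1.0).
toy_lengths = [2, 3, 4, 5, 6]
print(n50(toy_lengths))
```

In the toy example the total is 20, and the two longest blocks (6 and 5)
already cover half of it, so the N50 is 5.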

The total length of all contigs is 464 Mb. When the gaps between
contigs in scaffolds are included, the total span of the assembly is
466 Mb (some scaffolds with large gaps may have artificially increased
the assembly size).

The Acyr_1.0 assembly was tested against available A. pisum sequence
data sets (EST sequences and finished BAC sequences) for extent of coverage 
(completeness). When assembled contigs were tested, over 84% of
the sequences in these data sets were found to be represented,
indicating that the shotgun libraries used to sequence the genome were
comprehensive and the assembly covered most of the cloned genome.

Of 67,294 EST sequences 85% were contained in the assembled contigs. 
The quality of the assembly was also tested by aligning 13 finished BACs (a 
total of 1,371.6 kb) to the assembly. The assembled genomic coverage of the 
BACs was low, between 20% and 80% with an average of approximately 70% of the 
BAC sequence in the assembly.  The proportion of overlapping bases in all the 
aligned sequences was approximately 9%, possibly suggesting the existence of 
some polymorphism within the isolated DNA or other assembly problems. We hope 
to rectify these quality issues with a future assembly.

2. Conditions for use

These data are made available before scientific publication with the
following understanding:

- The data may be freely downloaded, used in analyses, and repackaged
in databases.

- Users are free to use the data in scientific papers analyzing
particular genes and regions if the providers of this data (Baylor
College of Medicine Human Genome Sequencing Center) are properly
acknowledged.  Please cite the BCM-HGSC web site or publications from
BCM-HGSC referring to the genome sequence.

- The BCM-HGSC plans to publish the assembly and genomic annotation of
the dataset, including large-scale identification of regions of
evolutionary conservation.

- This is in accordance with the understandings reached at the Fort
Lauderdale meeting on Community Resource Projects (see
http://www.genome.gov/Pages/Research/WellcomeReport0303.pdf) and the resulting
NHGRI policy statement
(http://www.genome.gov/page.cfm?pageID=10506537).

- Any redistribution of the data should carry this notice.

3. Description of files

There are 4 directories and one file.

I. contigs/ directory

This directory has 3 files for assembled contigs:

Acyr20071212-contigs.fa
Acyr20071212-contigs.fa.qual
Acyr20071212.agp

The agp file describes the positions and orientations of
the contigs.  It takes the standard NCBI format
(http://www.ncbi.nlm.nih.gov/Genbank/WGS.agpformat.html).
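
As a rough guide to reading the agp file, the following Python sketch
splits one tab-delimited AGP line into named fields. It assumes the
standard NCBI column layout (object, start, end, part number, component
type, then either component or gap columns). The function name
parse_agp_line and the scaffold and contig names in the example are
hypothetical and are not taken from Acyr20071212.agp.

```python
def parse_agp_line(line):
    """Parse one tab-delimited AGP line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "object": fields[0],            # scaffold being described
        "object_beg": int(fields[1]),   # 1-based start on the scaffold
        "object_end": int(fields[2]),   # 1-based end on the scaffold
        "part_number": int(fields[3]),
        "component_type": fields[4],    # e.g. W (WGS contig), N or U (gap)
    }
    if fields[4] in ("N", "U"):
        # Gap lines carry a gap length, gap type and linkage flag.
        record["gap_length"] = int(fields[5])
        record["gap_type"] = fields[6]
        record["linkage"] = fields[7]
    else:
        # Component lines carry the contig id, its span and orientation.
        record["component_id"] = fields[5]
        record["component_beg"] = int(fields[6])
        record["component_end"] = int(fields[7])
        record["orientation"] = fields[8]
    return record


# Hypothetical example lines, for illustration only.
contig_line = "Scaffold1\t1\t1000\t1\tW\tContig1\t1\t1000\t+"
gap_line = "Scaffold1\t1001\t1100\t2\tN\t100\tfragment\tyes"
print(parse_agp_line(contig_line))
print(parse_agp_line(gap_line))
```

Comment lines (starting with #) and malformed lines are not handled in
this sketch and would need to be filtered out first.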


II. linear_scaffolds/ directory

This directory has 2 files for the AGP (linear scaffold) sequences:

Acyr20071212-genome.fa 
Acyr20071212-genome.fa.qual


III. unassembled_reads/ directory

This directory has one fasta file and its corresponding quality file
for bin0 and highly repetitive reads. Reads that fell into clusters
(created by Atlas_Overlapper) of 3 or fewer reads are collectively called
bin0 reads. Some other reads were too repetitive for assembly and so were
also left unassembled. The files are:

Acyr20071212-unassembled_reads.fa
Acyr20071212-unassembled_reads.fa.qual
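
The sequence (.fa) and quality (.fa.qual) files in this release come in
pairs. The Python sketch below shows one way such a pair could be read,
assuming the common convention that a .qual file uses FASTA-style headers
followed by whitespace-separated integer quality scores. That layout is
an assumption and has not been checked against these particular files;
the function names and the example records are hypothetical.

```python
def _join(chunks, numeric):
    """Combine the body lines of one record."""
    if numeric:
        # Quality file: flatten whitespace-separated integer scores.
        return [int(q) for chunk in chunks for q in chunk.split()]
    # Sequence file: concatenate the wrapped sequence lines.
    return "".join(chunks)


def read_records(lines, numeric=False):
    """Yield (name, body) pairs from FASTA or .qual formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, _join(chunks, numeric)
            # Keep only the first word of the header as the record name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, _join(chunks, numeric)


# Hypothetical in-memory example; real use would pass open file handles.
fasta_lines = [">read1 example", "ACGT", "AC", ">read2", "GG"]
qual_lines = [">read1 example", "30 30 20 20", "10 10", ">read2", "40 40"]
sequences = dict(read_records(fasta_lines))
qualities = dict(read_records(qual_lines, numeric=True))
for read_name, seq in sequences.items():
    # Each base should have exactly one quality score.
    print(read_name, len(seq) == len(qualities[read_name]))
```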

IV. blast/ directory

This directory contains blast databases of the sequence files described above.

V. README - file
This file.



4. Sequence and scaffold statistics

Scaffold and Contig statistics

Scaffolds/Contigs    Number    N50 (kb)    Total length (Mb)
Scaffolds            22,801    86.9        464.3
Contigs              72,844    10.8        446.6

5. Read statistics

Insert size    Raw reads    Passed reads    Assembled reads    Clone type
2-5 kb         4,325,313    3,955,990       3,044,414          plasmid
35 kb             24,673        8,158           5,294          fosmid
110-130 kb        56,246       45,140           2,286          BAC
Total          4,406,232    4,009,288       3,051,994
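
The totals in the read statistics table can be cross-checked with a few
lines of Python. The numbers below are copied from the table; the
variable names are illustrative only.

```python
# (raw reads, passed reads, assembled reads) per clone type, from the table.
reads = {
    "plasmid": (4325313, 3955990, 3044414),
    "fosmid": (24673, 8158, 5294),
    "BAC": (56246, 45140, 2286),
}

# Sum each column across the three library types.
totals = tuple(sum(column) for column in zip(*reads.values()))
print(totals)

# Fraction of raw reads that ended up in the assembly, per clone type.
for clone_type, (raw, passed, assembled) in reads.items():
    print(clone_type, round(assembled / raw, 3))
```

The column sums agree with the Total row of the table.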

6. Other resources

In addition to the files described in section 3, the HGSC website has BLAST
servers for searching the assembled contigs, linear scaffolds and unassembled
reads (http://www.hgsc.bcm.tmc.edu/blast.hgsc?organism=2).

7. History

Acyr_1.0 (December 2007) is the first draft assembly of the pea aphid
(Acyrthosiphon pisum) genome.