::::::::::::::::::::::::::::::::::::: dspp assembly blast annotation steps: ===================================== d.gilbert, gilbertd@indiana.edu, 2005, for insects.eugenes.org work 1. data prep: scaffold/chr fasta to -- add ncbi fa header w/ blast_id = genoblastformat.pl -- shred, randomize fasta for grid = genoblastformat.pl ./genoblastformat.pl -chromosome -flattenfasta -in=dana_ag01aug05.fa.gz -blid=dana_ag0508 -- make ncbi db for grid (with blast ids -o T) -- make chromosome/scaff blast index (for drospege; -o F) genoblastformat.pl -dnablast -in=dana_ag01aug05.fa.gz -blid=dana_ag0508 -out=dana-chromosome 2. blasts: dxxx-dmelchr dna blastn: local grid, ~4 hr on 8cpu; dxxxdmelc.rsl dxxx-prot9 prot tblastn: tgrid, 6-12 hr on 64 cpu, ~300k query prots; dxxxprot9.rsl dmicsat-dxxx dna blastn: local grid, <1 hr 8cpu; dmicsatdxxxc.rsl .. other dros+ source data sets? cdna/est, transposons, oligos, ... 3. gff outputs: -- need to filter, rewrite, etc. blastout a. dxxx-dmel-dna.gff from dxxxdmelc.blout (not used for map views) a1. dxxx-dmel-algn.gff subset of dxxx-dmel-dna.gff (for map views) a2. dmel-dxxx-dna.gff -> dmel-dxxx-algn.gff (swap query,subject of a.) b. dxxx-prot9.gff from dxxxprot9.blout b1. dxxx-markers.gff subset of marker dmel genes from dxxx-prot9.gff c. dxxx-scaffolds.gff -> gff.toplevel ( from source fasta ) d. dros-dxxx-micsat.gff from dmicsat-dxxx.blout Transforms to gff: set soc1=/bio/bio-grid/shared/out/chrs1 set sc=/bio/bio-grid/shared/chrs/ set dp=dxxx; set dpid=dxxx_ag0000... 3a. # per dxxx species genome # add -swap to get source == dspp (scaffolds); -out=dxxx-dmel foreach dp ($dpp) echo ${dp}-dmel-dna.gff ... gtar -Ozxf $soc1/${dp}dmelc.blout.tgz |\ ../blast92gff.pl -swap_q2s -in=stdin -gff -deshredquery -add_match -uniq -nodebug \ -chrin=$sc/dmel/dmel-chromosomes.gff \ -qdb=${dpid} -out=${dp}-dmel-dna.gff end # for dmel genome # the source is dmel (chr arms); -noswap -out=dmel-dxxx gtar -Ozxf $soc1/${dp}dmelc.blout.tgz |\ ../blast92gff.pl -noswap -in=stdin -gff -deshredquery -add_match -uniq -nodebug \ -chrin=$sc/$dp/${dp}-scaffolds.gff \ -qdb=${dpid} -out=dmel-${dp}-dna.gff 3a1. $bg/blast/chrgff2align.pl -stats -in=${dp}-dmel-dna.gff -out=${dp}-dmel-algn.gff $bg/blast/chrgff2align.pl -stats -in=dmel-${dp}-dna.gff -out=dmel-${dp}-algn.gff # -stats added to give summary of dmel dna cover 3b. gtar -Ozxf $soc1/${dp}prot9.blout.tgz |\ ../blast92gff.pl -gff -in=stdin -out=${dp}-prot9.gff -nodebug # problem here, filter out low-qual 2ndary matches if have best match prot? or not? # look at match bitscore; filter if best >> 2nd; cat $dp-prot9.gff | perl ../refilterprot9b.pl > $dp-prot9b.gff 3b1. egrep 'modDM|#' $dp-prot9.gff | grep -v match_part | perl -ne\ 'BEGIN{open(F,"'$sc/dmel/dmel-best-gene3.ids'") or die "dmel-best-gene3.ids";\ %best=map{chomp;($d,$n)=split;$d,$n;} ;}\ ($id)=m/ID=(\w+)/; if(/^#/){print} elsif($best{$id}){ s/tblastn:modDM/marker:modDM/; \ s/ID=[^;]+;/ID=$best{$id};/; s/;loc=[^;\n]+//; s/[\n]/;id2=$id\n/ unless(/$id/); print;}' \ > $dp-markers.gff ## NOTE: GFF MUST HAVE '##gff-version:3' metacomment Check dups - got a few valid (bitscore ~ same): cat $dp-markers.gff | perl -ne'm/ID=([^;]+)/;$id{$1}++;END{foreach $d (sort{$id{$b}<=>$id{$a}} keys %id){print "$d\t$id{$d}\n";}}' | more 3b2. use stringent match filter of blast hits: gtar -Ozxf $soc1/${dp}prot9.blout.tgz |\ ../blast92gff.pl -evalmin=1e-10 -stringent -gff -in=stdin \ -out=${dp}-prot9-strong2.gff -nodebug & 3c. gunzip -c ${dp}_ag01aug05.fa.gz |\ blast92gff.pl -chrgff -qdb=${dpid} -in=stdin -out=${dp}-scaffold.gff 3d. gtar -Ozxf $soc1/dmicsat${dp}c.blout.tgz | \ ../blast92gff.pl -in=stdin -gff -out=stdout -nodebug -evalmin=1e-10 | \ perl -pe's,blastn:\w+,blastn:drosmicsat,;' > dros-${dp}-micsat.gff =====================================