Question

How to remove contigs below 201 nucleotides from a genome assembly?

7.8 years ago

jt358 ▴ 10

Hi all,

Can anyone tell me how to remove contigs below 201 nucleotides in length from a genome assembly for submission to NCBI? The genome is currently in FASTA format and was created with SOAPdenovo2.

Many thanks

genome Assembly • 3.2k views

updated 7.8 years ago by arnstrm ★ 1.9k • written 7.8 years ago by jt358 ▴ 10

score 2 · Answer 1 · 2017-03-05

7.8 years ago

arnstrm ★ 1.9k

If you want to do it on the command line, there is a program called bioawk (see here) that can do this job easily!

 bioawk -c fastx 'length($seq) >200 {print $name"\n"$seq}' scaffolds.fasta

If you want to print it pretty, pipe it through the fold command!

Here, -c specifies the input file type (fastx covers both FASTA and FASTQ), the filtering criterion length($seq) >200 is straightforward, and print $name"\n"$seq prints the FASTA sequence back out.
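One caveat with fold: it wraps every line longer than the width (80 columns by default), headers included. Below is a minimal sketch of a wrapper that only breaks sequence lines. The function name wrap_fasta and the default width of 60 are assumptions for illustration, and it assumes header lines begin with ">", as in standard FASTA.

```shell
# Hypothetical helper: wrap sequence lines read from stdin to a fixed width,
# leaving header lines (those starting with ">") untouched.
wrap_fasta() {
    width="${1:-60}"   # assumed default; pass another width as the first argument
    awk -v w="$width" '
        /^>/ { print; next }              # header line: print unchanged
        {
            # sequence line: emit it in chunks of w characters
            for (i = 1; i <= length($0); i += w)
                print substr($0, i, w)
        }
    '
}
```

Used at the end of a pipeline, it would look like `... | wrap_fasta 60 > wrapped.fasta`, where wrapped.fasta is just a placeholder name.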

Hope this helps!

This command line leaves the sequence names without ">", which corrupts the file. Just modify it according to the bioawk example:

bioawk -c fastx 'length($seq) >200 {print ">"$name"\n"$seq}' scaffolds.fasta
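If bioawk is not installed, the same filter can be approximated with plain awk. This is only a sketch under stated assumptions: the function name filter_fasta_min_len is made up for illustration, and it assumes an ordinary FASTA file with ">" headers, joining wrapped sequence lines before measuring the length.

```shell
# Hypothetical helper: print only FASTA records whose sequence is longer
# than min_len bases. Multi-line sequences are joined into a single line.
filter_fasta_min_len() {
    min_len="$1"
    fasta="$2"
    awk -v min="$min_len" '
        # print the buffered record if it passes the length cut-off
        function emit() {
            if (header != "" && length(seq) > min) {
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
        { seq = seq $0 }                               # accumulate sequence
        END { emit() }                                 # last record
    ' "$fasta"
}
```

For the NCBI case in the question this would be called as `filter_fasta_min_len 200 scaffolds.fasta > filtered.fasta`, where filtered.fasta is a placeholder output name. The header line is kept as-is, so the ">" is preserved.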

6.5 years ago by veljokisand • 0

score 0 · Answer 2 · 2017-03-04

7.8 years ago

fhsantanna ▴ 620

If you have limited bioinformatics skills, I recommend using BioEdit. With it you can sort your contigs by length and manually exclude the undesirable ones. Otherwise, you can use this fantastic toolkit: FAST (https://github.com/tlawrence3/FAST).
