How to remove contigs below 201 nucleotides from a genome assembly?
7.7 years ago
jt358 ▴ 10

Hi all,

Can anyone tell me how to remove contigs shorter than 201 nucleotides from a genome assembly for submission to NCBI? The genome is currently in FASTA format and was assembled with SOAPdenovo2.

Many thanks

genome Assembly • 3.2k views
7.7 years ago
arnstrm ★ 1.9k

If you want to do this on the command line, there is a program called bioawk (see here) that can do the job easily:

 bioawk -c fastx 'length($seq) >200 {print $name"\n"$seq}' scaffolds.fasta

If you want the output wrapped to a fixed line width, pipe it through the fold command.

Here, -c specifies the input file type (fastx covers both FASTA and FASTQ). The filtering criterion is length($seq) >200, which keeps only sequences of 201 nucleotides or more, and print $name"\n"$seq writes each retained record back out as a name line followed by its sequence.

Hope this helps!


This command leaves the sequence names without the leading ">", which corrupts the FASTA file. Just modify it according to the bioawk example:

bioawk -c fastx 'length($seq) >200 {print ">"$name"\n"$seq}' scaffolds.fasta
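If bioawk is not available, the same filter can be sketched in plain Python using only the standard library. This is an illustration rather than a tested pipeline: the function names `read_fasta` and `filter_by_length`, the 201-nucleotide default, and the 60-column line width are assumptions chosen for this example. Unlike `$name` in bioawk, which holds only the first word of the header, this sketch keeps the full header line.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(in_handle, out_handle, min_len=201, width=60):
    """Copy records with at least min_len nucleotides; return how many were kept."""
    kept = 0
    for header, seq in read_fasta(in_handle):
        if len(seq) >= min_len:
            # Write the ">" explicitly so the output stays valid FASTA.
            out_handle.write(f">{header}\n")
            # Wrap the sequence to a fixed width, similar to piping through fold.
            for start in range(0, len(seq), width):
                out_handle.write(seq[start:start + width] + "\n")
            kept += 1
    return kept
```

To use it, open the assembly for reading and a new file for writing, then call `filter_by_length(in_handle, out_handle)`. The return value is the number of contigs that passed the length cutoff, which is handy for a quick sanity check against the original contig count.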

7.7 years ago
fhsantanna ▴ 620

If you have limited bioinformatics skills, I recommend using BioEdit. With it you can sort your contigs by length and manually remove the ones you do not want. Otherwise, you can use this fantastic toolkit: FAST (https://github.com/tlawrence3/FAST).
