Hi all,
Can anyone tell me how to remove contigs shorter than 201 nucleotides from a genome assembly for submission to NCBI? The genome is currently in FASTA format and was assembled with SOAPdenovo2.
Many thanks
If you want to do it on the command line, there is a program called bioawk that can do this job easily:
bioawk -c fastx 'length($seq) >200 {print $name"\n"$seq}' scaffolds.fasta
If you want the output pretty-printed, pipe it through the fold command.
Here, -c specifies the input file type (fastx covers both FASTA and FASTQ). The filtering criterion is length($seq) >200, which is straightforward, and print $name"\n"$seq writes each sequence that passes the filter back out.
Hope this helps!
If you have limited bioinformatics skills, I recommend using BioEdit. With it you can sort your contigs by length and manually exclude the ones you do not want. Otherwise, you can use this fantastic toolkit: FAST (https://github.com/tlawrence3/FAST).
This command line leaves the sequence names without ">", which corrupts the file. Just modify the bioawk example accordingly:
bioawk -c fastx 'length($seq) >200 {print ">"$name"\n"$seq}' scaffolds.fasta
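If bioawk is not installed, the same filter can be sketched in plain Python with no extra dependencies. This is only a minimal illustration, not a replacement for the bioawk command: the function names read_fasta and filter_by_length, the min_len default of 201, and the 60-column line width are all assumptions made for the example. Unlike $name in bioawk, it keeps the whole header line rather than only the first word.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            # Sequence may be wrapped over several lines; collect the pieces.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(in_handle, out_handle, min_len=201, width=60):
    """Write records with at least min_len bases; return how many were kept.

    min_len=201 mirrors the 'length($seq) > 200' test in the bioawk command.
    width is an assumed line length for wrapping the sequence.
    """
    kept = 0
    for header, seq in read_fasta(in_handle):
        if len(seq) >= min_len:
            # Write the ">" explicitly so the output stays valid FASTA.
            out_handle.write(f">{header}\n")
            for start in range(0, len(seq), width):
                out_handle.write(seq[start:start + width] + "\n")
            kept += 1
    return kept
```

To use it, open scaffolds.fasta for reading and an output file for writing, then pass both handles to filter_by_length. Only the sequence lines are wrapped here, so long header lines are left intact.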