I have a list of >100 circular contigs that I would like to remove from my de novo genome assembly.fasta. How can I remove these contigs from the assembly.fasta using a text file with the contig names/numbers? Or is there another way?
If you already know which contigs are circular, you can use the really cool seqkit tool. The grep subcommand is the perfect tool for this job.
seqkit grep assembly.fasta -n -v -f circular_contigs.txt > assembly_clean.fasta
-n matches by full name rather than just the ID, so each entry in the list must be identical to the whole header line (without the leading ">")
-v inverts the match (i.e. keeps anything that's not in the list of circular contigs)
-f specifies the file containing the patterns to look for (in this case the circular contig header names)
circular_contigs.txt is a list (one header per line) that identifies the circular contigs to be removed
> assembly_clean.fasta seqkit writes to the terminal (stdout), so this last bit redirects the output into a new file
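Since the question also asks about another way: here is a minimal sketch that does the same filtering with plain awk, in case seqkit isn't installed. The function name remove_contigs is made up for illustration, and it assumes the same inputs as above (a list with one full header per line, without the leading ">"). Like -n, it only drops a contig when the whole header matches exactly.

```shell
# Hypothetical helper (not part of seqkit): print every FASTA record whose
# full header (the text after ">") is NOT listed in the names file.
# Usage: remove_contigs circular_contigs.txt assembly.fasta > assembly_clean.fasta
remove_contigs() {
    awk '
        # First file (the list): remember each non-empty line as a header to drop.
        FILENAME == ARGV[1] { if ($0 != "") drop[$0]; next }
        # Header line in the FASTA: decide whether to keep this record.
        /^>/ { keep = !(substr($0, 2) in drop) }
        # Print the header and sequence lines of records being kept.
        keep
    ' "$1" "$2"
}
```

Counting headers before and after with grep -c '>' is a quick way to check that the expected number of contigs was removed.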
More info here: https://bioinf.shenwei.me/seqkit/
Hope that helps