I would like to extract a single contig from a fasta file, and I have many fasta files and contigs I need to do this with. Note that the fasta files have different names, and the contig name is different in each case. I know I can use seqtk with a list, but building a list for each assembly is a pain because there are so many, and I am only looking to pull one contig from each assembly. Does anyone know of an easy way to do this without having to make a separate one-contig list for each assembly? I just want to name the single contig in the code. Any help is appreciated!
Please provide input and output examples. Probably samtools faidx is the answer. See for some ideas: C: How do I extract Fasta Sequences based on a list of IDs?
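To make the samtools faidx suggestion concrete, here is a minimal sketch. It assumes samtools is installed and that the contig name matches the first word of the FASTA header exactly. The wrapper name extract_with_faidx is hypothetical, used only for illustration.

```shell
# extract_with_faidx is a hypothetical wrapper name; samtools must be installed.
extract_with_faidx() {
    fasta=$1
    contig=$2
    # samtools faidx creates "$fasta.fai" next to the input if it is missing,
    # then prints the record for the named sequence to stdout.
    samtools faidx "$fasta" "$contig"
}
```

It would be called once per assembly, for example: extract_with_faidx in1.fa contig00001 > in1.contig00001.fa. Note that the index file is written next to the input, so the directory needs to be writable.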
One way is to linearise all the contigs so that each sequence sits on a single line (in case they are not already):
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' | grep "\S"
and then use grep -A1 with the contig name to grab the line with the name and then the contig sequence that follows on the next line.
If I was able to extract a single contig based on its name with seqtk it would look like this:
seqtk subseq in.fq contig00001 > out.fq
but I cannot do that because it actually requires a list, and must look like this:
seqtk subseq in.fq name.lst > out.fq
Given that I have hundreds of fasta files that all need a single contig extracted, making a list for each is a pain. So, assuming subseq worked as in the first example, I would want something like:
seqtk subseq in1.fq contig00001 > out.fq
seqtk subseq in2.fq contig00004 > out.fq
seqtk subseq in3.fq contig00008 > out.fq
etc.
Make sense?
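One way to get that behaviour without seqtk or a list file is a small awk wrapper. This is only a sketch: extract_contig is a hypothetical helper name, and it assumes the contig name is the first whitespace-delimited word of the header line, directly after the ">".

```shell
# extract_contig FASTA NAME
# Prints the one record whose ID (first word of the header) equals NAME exactly.
# Hypothetical helper for illustration; not part of seqtk or samtools.
extract_contig() {
    fasta=$1
    name=$2
    awk -v id="$name" '
        # On each header line, decide whether this record is the wanted one.
        /^>/ { keep = (substr($1, 2) == id) }
        # Print header and sequence lines while inside the wanted record.
        keep
    ' "$fasta"
}
```

With that defined, the calls look like the ones wished for above, e.g. extract_contig in1.fa contig00001 > in1.contig00001.fa. The exact comparison means contig00001 will not also pull contig000010, which a plain grep on the name would.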
No, it makes no sense, because you still do not provide any example data. We have no idea what your contig file looks like.
It's not a file, it's just the name of the contig. I am looking for a way to do this with just the name of a contig rather than using a file; that is the whole point of this post.
Considering you are looking for contigs I am going to assume that you mean '.fa' and not '.fq' (as stated in the original question).
How do you know which contig you want from each multi-fasta file? Do you have a file with the desired contig and its corresponding multi-fasta file?
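If such a mapping does exist, the whole job can be driven from it in one loop. The sketch below assumes a hypothetical two-column file (FASTA path, then contig name, separated by a tab or spaces, with no spaces inside the paths). The function name extract_from_map and the output naming scheme are also made up for illustration.

```shell
# extract_from_map MAPFILE OUTDIR
# Assumed MAPFILE format: one "fasta_path <whitespace> contig_name" pair per line.
# Writes OUTDIR/<fasta basename without extension>.<contig>.fa for each pair.
extract_from_map() {
    map=$1
    outdir=$2
    mkdir -p "$outdir"
    # The "|| [ -n ... ]" part handles a last line without a trailing newline.
    while read -r fasta contig || [ -n "$fasta" ]; do
        [ -n "$fasta" ] || continue    # skip blank lines
        base=$(basename "$fasta")
        # Keep only the record whose header ID matches the contig name exactly.
        awk -v id="$contig" '/^>/ { keep = (substr($1, 2) == id) } keep' "$fasta" \
            > "$outdir/${base%.*}.$contig.fa"
    done < "$map"
}
```

For example, extract_from_map contigs.tsv extracted would write one file per assembly into the extracted directory, so the outputs do not overwrite each other.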