Extract a single contig from a FASTA file based on name?
4.1 years ago

I would like to extract a single contig from a FASTA file, and I have many FASTA files and contigs to do this with. Note that the FASTA files and the contigs have different names in each case. I know I can use seqtk with a list, but building a list for each assembly is a pain because there are so many, and I am only looking to pull one contig from each assembly. Does anyone know of an easy way to do this without having to make a separate one-contig list for each assembly? I just want to name the single contig in the command. Any help is appreciated!

Assembly • 6.1k views

Please provide input and output examples. Probably samtools faidx is the answer.


One way is to linearise all the contigs so that each sequence is contained within a single line (in case they are not already):

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' in.fa | grep "\S"

Then use grep -A1 with the contig name to grab the header line with the name and the sequence that follows on the next line.
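
The two steps described above can be wrapped into one small shell function. This is only a sketch: the function name grab_contig is made up for illustration, it assumes a grep that supports -A (GNU and BSD grep do), and it assumes the contig name contains no regular-expression metacharacters. The name is anchored to the start of the header so that contig00001 does not also match contig000012.

```shell
# Hypothetical helper: print one contig from a (possibly line-wrapped) FASTA file.
# Usage: grab_contig <fasta file> <contig name>
grab_contig() {
    # Step 1: join wrapped sequence lines so each record becomes
    # one header line followed by one sequence line.
    awk '/^>/ { printf("\n%s\n", $0); next } { printf("%s", $0) } END { printf("\n") }' "$1" |
    # Drop the empty line that the awk step emits before the first header.
    grep . |
    # Step 2: match the header whose first word is the contig name
    # and keep the single sequence line after it.
    grep -A1 -E "^>$2([[:space:]]|\$)"
}
```

For example, grab_contig in1.fa contig00001 > out.fa writes the header and the joined sequence of that contig; the file names here are placeholders.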


If I was able to extract a single contig based on its name with seqtk it would look like this:

seqtk subseq in.fq contig00001 > out.fq

but I cannot do that because it actually requires a list, and must look like this:

seqtk subseq in.fq name.lst > out.fq

Given that I have hundreds of FASTA files that each need a single contig extracted, making a list for each is a pain. Assuming subseq worked as in the first example, I would want something like:

seqtk subseq in1.fq contig00001 > out.fq

seqtk subseq in2.fq contig00004 > out.fq

seqtk subseq in3.fq contig00008 > out.fq

etc.

Make sense?


No, it makes no sense, because you still do not provide any example data. We have no idea what your contig file looks like.


It's not a file, it's just the name of the contig. I am looking for a way to do this with just the name of a contig rather than a file; that is the whole point of this post.


Considering you are looking for contigs I am going to assume that you mean '.fa' and not '.fq' (as stated in the original question).

How do you know which contig you want from each multi-FASTA file? Do you have a file listing each desired contig and its corresponding multi-FASTA file?

4.1 years ago
harishk0201 ▴ 130

The easiest way is the following, although of course, as ATpoint points out, we don't know what your contig headers look like, so that may be an issue:

printf "contigid\n" | seqtk subseq contigs.fasta - > contigid.fasta
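
For hundreds of assemblies, the same stdin trick could be driven from a single two-column table instead of one list file per assembly. The sketch below assumes a tab-separated file (called targets.tsv here purely for illustration) with the FASTA path in the first column and the contig name in the second; the function name extract_listed_contigs and the output naming scheme are likewise made up. It relies on seqtk subseq accepting "-" as the name list, as in the command above.

```shell
# Hypothetical helper: extract one named contig per assembly.
# Table format (tab-separated, illustrative): <fasta path><TAB><contig name>
# Usage: extract_listed_contigs <table> <output directory>
extract_listed_contigs() {
    table=$1
    outdir=$2
    mkdir -p "$outdir"
    tab=$(printf '\t')
    while IFS="$tab" read -r fasta contig; do
        # Skip blank lines in the table.
        [ -n "$fasta" ] || continue
        base=$(basename "$fasta")
        # "-" makes seqtk subseq read the contig name from standard input,
        # so no per-assembly list file is needed.
        printf '%s\n' "$contig" |
            seqtk subseq "$fasta" - > "$outdir/${base%.*}.$contig.fasta"
    done < "$table"
}
```

Calling extract_listed_contigs targets.tsv extracted would then write one file per row, for example extracted/in1.contig00001.fasta for a row containing in1.fa and contig00001. Each output is named after both the assembly and the contig, so results from different assemblies do not overwrite one another.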
