Length of FASTA lines
8.8 years ago

I was hoping to run a quick cut -f2 on the index written by samtools faidx to get the length of each FASTA sequence, but realized that it (for good reasons) ignores duplicate sequences. Is there a way to force samtools faidx to report the length of every sequence in a big FASTA file where some sequences and header names are identical? If not, is there a tool in samtools, or prepackaged elsewhere, that can do this for me?

fasta samtools • 3.3k views

I'm fairly confused about what you're trying to do here and why it doesn't work. Why won't the second column of output from faidx give you the length of each contig? http://www.htslib.org/doc/faidx.html Can you give an example of input and expected output?


I'm just trying to get the length of each sequence in a big FASTA file, where some sequences and header names are identical. For example, running samtools faidx my.fasta prints:

[fai_build_core] ignoring duplicate sequence "locus12_sample142" at byte offset 25308298


The correct answer then seems to be that you should alter your FASTA file so it no longer has duplicate sequences/headers. Alternatively, you can create your index file, then loop back through your FASTA headers and print the length for each one, since every header name will have a matching entry in the index file.
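To make the second suggestion concrete, here is a minimal Python sketch. It assumes the index was already written by samtools faidx (name in column 1, length in column 2) and that header names are compared on the text before the first whitespace. The function names read_fai_lengths and header_lengths are made up for illustration. One caveat: this only gives the right answer when records sharing a name are truly identical, because the index keeps the length of the first one only.

```python
def read_fai_lengths(fai_path):
    """Map sequence name -> length using columns 1 and 2 of a .fai index."""
    lengths = {}
    with open(fai_path) as fai:
        for line in fai:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                lengths[fields[0]] = int(fields[1])
    return lengths


def header_lengths(fasta_path, lengths):
    """Yield (name, length) for every header line, duplicates included.

    The length is looked up in the index, so a duplicated name reports the
    length of the first record with that name. Names missing from the
    index yield None.
    """
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # Assumption: the name is the header text up to the
                # first whitespace, as in the first .fai column.
                parts = line[1:].split()
                name = parts[0] if parts else ""
                yield name, lengths.get(name)
```

With my.fasta and my.fasta.fai in place, something like `for name, n in header_lengths("my.fasta", read_fai_lengths("my.fasta.fai")): print(name, n, sep="\t")` would print one line per header, repeats and all.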


I don't use this forum much, should I update the question with my response?


Pierre is asking you to go back to old questions and either upvote and accept the answer that solved your problem, or to add an answer saying what worked for you. If you don't provide that feedback, it's less useful to other people that may come across your answers later. Same goes here.

8.8 years ago

The BBMap package has a tool called "readlength.sh" which will generate a histogram of the lengths of sequences from a fasta file. It does not care whether names are duplicated. Perhaps that's what you're looking for?
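If you would rather not install anything, the same idea (lengths computed straight from the records, ignoring names entirely) can be sketched in a few lines of Python. This is not how readlength.sh is implemented, just a rough stand-in; fasta_lengths and length_histogram are hypothetical names, and the parser assumes a plain uncompressed FASTA with ">" header lines.

```python
from collections import Counter


def fasta_lengths(lines):
    """Yield (name, length) for each record in an iterable of FASTA lines.

    Records are never looked up by name, so duplicated names are reported
    once per occurrence.
    """
    name, length = None, 0
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name = parts[0] if parts else ""
            length = 0
        elif name is not None:
            # Sequence line: count residues, ignoring stray whitespace.
            length += len(line.strip())
    if name is not None:
        yield name, length


def length_histogram(records):
    """Count how many records have each length."""
    return Counter(length for _, length in records)
```

Passing an open file handle works, e.g. `with open("my.fasta") as fh: records = list(fasta_lengths(fh))`, after which records holds one (name, length) pair per sequence and `length_histogram(records)` gives the counts per length.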
