I was hoping to run a quick cut -f2 on the output of the samtools faidx command to get the length of each FASTA sequence, but realized that it ignores duplicate sequences (for good reasons). Is there a way to force samtools faidx to report the length of every sequence in a big FASTA file where some sequences and header names are identical? If not, is there a tool in samtools, or prepackaged elsewhere, that can do this for me?
I'm fairly confused about what you're trying to do here and why it doesn't work. Why won't the second column of output from faidx give you the length of each contig? http://www.htslib.org/doc/faidx.html Can you give an example of input and expected output?
I'm just trying to get the length of each sequence in a big FASTA file where some sequences and header names are identical. For example, running samtools faidx my.fasta prints:
[fai_build_core] ignoring duplicate sequence "locus12_sample142" at byte offset 25308298
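The correct answer, then, seems to be that you should alter your FASTA file so that it no longer has duplicate sequences/headers. Alternatively, you can create your index file, then loop back through your FASTA headers and print the length for each one, since every header name will have a matching entry in the index file. This works here because the duplicated records are identical, so the one entry that faidx keeps for a name has the same length as its duplicates.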
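The second suggestion can be sketched in a few lines of Python. This is only a minimal illustration, not a samtools feature: the function names read_fai_lengths and header_lengths are made up for this example, and it assumes the .fai file is tab-separated with the sequence name in column 1 and the length in column 2, and that the sequence name is the header text up to the first whitespace.

```python
def read_fai_lengths(fai_path):
    """Map sequence name -> length using the first two columns of a .fai file."""
    lengths = {}
    with open(fai_path) as fai:
        for line in fai:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                lengths[fields[0]] = int(fields[1])
    return lengths


def header_lengths(fasta_path, fai_path):
    """Yield (name, length) for every header in the FASTA, duplicates included."""
    lengths = read_fai_lengths(fai_path)
    with open(fasta_path) as fasta:
        for line in fasta:
            if not line.startswith(">"):
                continue
            # Assumption: the name is the header text up to the first whitespace.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            # Raises KeyError if a header has no entry in the index.
            yield name, lengths[name]
```

After running samtools faidx my.fasta, which writes my.fasta.fai next to the FASTA file, you could loop over header_lengths("my.fasta", "my.fasta.fai") and print each name and length. Keep in mind that the reported length comes from the single index entry for that name, so it is only right for duplicates that really are identical.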
I don't use this forum much. Should I update the question with my response?
Pierre is asking you to go back to your old questions and either upvote and accept the answer that solved your problem, or add an answer saying what worked for you. If you don't provide that feedback, those threads are less useful to other people who may come across them later. The same goes here.
Please validate or comment on your previous questions: