Hello,
I have a FASTA file, and I want to extract a subsequence from it.
>first
AGCATGCATCGTACGATCGTACGATCATCATC
GTACGATCGATCGATCGATCGATCGTGTGCTA
GACGATCGATCGATCGA
>second
GTACGATGCATCGATCGATCGATCGATCGATC
GTGATCGTGTCGTCGTCGTCGATAAGCATCGT
AGTACGATCGATCGTA
I know I can do this with samtools:
samtools faidx seq.fa first:1-5
output is:
>first:1-5
AGCAT
However, my actual sequence names are much more complicated, and anything other than one word doesn't seem to work.
If seq2.fa
is:
>first seq
AGCATGCATCGTACGATCGTACGATCATCATC
GTACGATCGATCGATCGATCGATCGTGTGCTA
GACGATCGATCGATCGA
>second seq
GTACGATGCATCGATCGATCGATCGATCGATC
GTGATCGTGTCGTCGTCGTCGATAAGCATCGT
AGTACGATCGATCGTA
Then the output of:
samtools faidx seq2.fa first:1-5
>first:1-5
seqAG
``` samtools faidx seq2.fa first seq:1-5
first seqAGCATGCATCGTACGATCGTACGATCATCATCGTACGATCGATCGATCGATCGATCG TGTGCTAGACGATCGATCGAT seq:1-5 ```
samtools faidx seq2.fa 'first seq':1-5
>first seq:1-5
Also, if my sequence description contains a number (like seq3.fa
):
>seq123
AGCATGCATCGTACGATCGTACGATCATCATC
GTACGATCGATCGATCGATCGATCGTGTGCTA
GACGATCGATCGATCGA
>seq456
GTACGATGCATCGATCGATCGATCGATCGATC
GTGATCGTGTCGTCGTCGTCGATAAGCATCGT
AGTACGATCGATCGTA
samtools faidx seq3.fa seq123:1-5
>seq123:1-5
Is there a way to use samtools to extract a subsequence when the descriptor is more complicated than just a word? I've tried with samtools 0.1.18 and 0.1.19.
Thanks
I tried that, but no luck:
Please show me the output of
grep first_seq seq4.*
Just:
If you really ran
grep first_seq seq4.*
, then you didn't correctly indexed the fasta sequence.I think the blank lines are the issue, and samtools should have warned when the index creation was attempted. The output of
samtools faidx seq4.fa
may be informative.Yep--I didn't index it at all. Silly error by me. Thanks for the help.