I have a file with lots of FASTA sequences that I downloaded from NCBI.
I want to confirm the number of sequences in the file. To do that, I tried counting the ">" characters using the following command:
fgrep -o > sequence.fasta | wc -l
However, I am not getting the desired result. Can somebody suggest a way to do this?
Secondly, I want to work with Human sequences only, but my file also contains sequences from other organisms. Can somebody share a command that retains only the sequences with Human in the title and deletes the rest?
This doesn't work because the > is a special character on the command-line; it redirects standard output into a file. You have to use single or double quotations as proposed below.
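To see the difference without touching the real download, here is a small sketch on a throwaway file (demo.fasta and its two records are made up for illustration). It also shows a side effect worth knowing: because the shell sets up the redirection before the command runs, the unquoted form empties the target file.

```shell
# demo.fasta is a hypothetical scratch file used only for this illustration
printf '>seq1 Homo sapiens\nACGT\n>seq2 Mus musculus\nGGCC\n' > demo.fasta

# Quoted: '>' reaches grep as the pattern; -c counts the matching lines
quoted_count=$(grep -c '>' demo.fasta)
echo "quoted count: $quoted_count"

# Unquoted: the shell treats > as a redirection, so grep gets no pattern
# (it fails with a usage error) and demo.fasta is truncated to zero bytes
grep -o > demo.fasta 2>/dev/null || true
bytes_left=$(wc -c < demo.fasta | tr -d ' ')
echo "bytes left in demo.fasta: $bytes_left"

rm -f demo.fasta
```

If the original unquoted command was run against sequence.fasta, it is worth checking the file size and downloading the file again if it is now empty.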
Thanks a lot, Wouter. Your answer solved my query. However, for my second query, I tried googling "Filtering FASTA" but was unable to find an answer, so I am reposting my query with example sequences.
Please use ADD COMMENT or ADD REPLY to respond to earlier posts, so that the thread remains logically structured and easy to follow. I have now moved your post, but as you can see it's not optimal.
Adding an answer should only be used for providing a solution to the question asked.
If my answer was helpful, you should upvote it; if it resolved your question, you should mark it as accepted.
Welcome to Biostars. Interesting guidelines for posting can be found in the following posts:
I'm not sure why you tagged 'R', but I would solve your first problem using:
grep -c '^>' sequence.fasta
You wrote: "However, I am not getting the desired result."
Please be more specific about the result you get and how it differs from what you expect.
For your second question, can you use the FASTA identifiers for filtering? I don't know what they look like. But there are tons of solutions on Biostars for that, so googling a bit should get you some options.
Replace Human with your desired regular expression pattern. This approach should scale well to large inputs, I'd think, and be agnostic to variations in FASTA input formats.
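One way to do this kind of filtering, as a sketch rather than the exact command referred to above, is a short awk filter. It assumes the organism name appears somewhere in the header line; filter_fasta, example.fasta, and human_only.fasta are hypothetical names chosen for the example.

```shell
# filter_fasta is a hypothetical helper; usage: filter_fasta PATTERN FILE
filter_fasta() {
    # On each header line (starting with ">"), decide whether to keep the record;
    # the sequence lines that follow inherit the decision of that header.
    awk -v pat="$1" '/^>/ { keep = ($0 ~ pat) } keep' "$2"
}

# Small made-up input with two Human records and one Mouse record
printf '>sp|P1 Human protein A\nMKV\nLLA\n>sp|P2 Mouse protein B\nMAA\n>sp|P3 Human protein C\nMGG\n' > example.fasta

# Write the matching records to a new file instead of editing in place
filter_fasta 'Human' example.fasta > human_only.fasta

# Count the retained records
grep -c '^>' human_only.fasta
```

The match is case-sensitive and uses awk's extended regular expressions. Headers downloaded from NCBI often name the species as Homo sapiens rather than Human, so a pattern such as 'Human|Homo sapiens' may be needed, depending on what the headers actually look like.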
seqkit seq input.fa -n | wc -l
or
grep \> input.fa | wc -l