Hello everybody,
Here i my issue, I have a Trinity.fasta file like this
>TRINITY_DN5631_c0_g2_i1 len=947 path=[0:0-946]
TACAACTTGAACATCAACAATGGTTGCGCAGCTATTGCCATCCGCGACGTTCGAGGACTGCGTGCGAA
>TRINITY_DN62279_c1_g1_i1 len=298 path=[0:0-297]
TATTACCATTATTATTATTATCATATTTATGTTCATTATTATCATTATCATAATCATTATCATCTTGATA
...
And I also have a list of id in a id.txt file :
TRINITY_DN16359_c0_g1_i4
TRINITY_DN62279_c1_g1_i1
...
I am trying to extract for my fasta file the sequences that have their id in my txt.file. I am using seqkit for that, but with no success :
seqkit grep -n -f id.txt Trinity.fasta -o result.fa
Does anyone know how to fix it ?
This has been asked multiple times. Here is one thread: Extract fasta sequnces not matching a list Answers include extracting both matching and non-matching sequences. If you want to keep using
seqkit
try: C: Extract specific sequence from FASTA fileYes I have checked biostars first that's how I found the seqkit command I am trying to use. My goal is not to remove sequence from my fasta but to have a file with all the sequences that are in my list
Did you check the links I included above? Second link has
seqkit
command, if you want to stay with that program.-nrif
option mentioned there should cover your use case, I would think.seqkit grep -nrif ids.txt test.fa totally did the job thank you very much @genomax !
finally it does not work so great since as shenwei says, " it may produce some unwanted results. For example, seq_1 matches seq_10 with -nri"
If your sequences are only one line you can use this command:
If your sequences are multi-lines you can convert them to a one-liner fasta file first and then use the above command:
Thank for your answer I have tried this but there is 6835 sequences in outputfile.fasta whereas I have a liste of 6750 ID. Is it possible to fix that ?
I usually use this for accession numbers that are unique, and ${line} works well for them, but maybe you can try
"${line} "
, because this one requires a space after ID, and 10 won't be picked when looking for 1.Alternatively, you can use
grep --perl-regex "${line}\s"
, but I haven't tried mixing this with -A1 or-A 1
option. You can try!Option
-w
should be added togrep
for exact matching.Hi, i want to extract seq from list of id but Seqkit gives this below error and empty output file generated
[WARN] symbol ">" detected, it should not be a part of the sequence ID/name: >CP077971.1_4943 P_aeruginosa_ZPPH33resistance-nodulation-cell division (RND) antibiotic efflux pumpType B NfxB
Thank you
You need to remove
>
from ID's as the warning tells you to do.Thank you....It has been done.