Extract sequences from a list of ID
4
2
Entering edit mode
4.1 years ago

Hello everybody,

Here i my issue, I have a Trinity.fasta file like this

>TRINITY_DN5631_c0_g2_i1 len=947 path=[0:0-946]
TACAACTTGAACATCAACAATGGTTGCGCAGCTATTGCCATCCGCGACGTTCGAGGACTGCGTGCGAA
>TRINITY_DN62279_c1_g1_i1 len=298 path=[0:0-297]
TATTACCATTATTATTATTATCATATTTATGTTCATTATTATCATTATCATAATCATTATCATCTTGATA
 ...

And I also have a list of id in a id.txt file :

TRINITY_DN16359_c0_g1_i4
TRINITY_DN62279_c1_g1_i1
...

I am trying to extract for my fasta file the sequences that have their id in my txt.file. I am using seqkit for that, but with no success :

 seqkit grep -n -f id.txt Trinity.fasta -o result.fa

Does anyone know how to fix it ?

software error sequence • 18k views
ADD COMMENT
0
Entering edit mode

This has been asked multiple times. Here is one thread: Extract fasta sequnces not matching a list Answers include extracting both matching and non-matching sequences. If you want to keep using seqkit try: C: Extract specific sequence from FASTA file

ADD REPLY
0
Entering edit mode

Yes I have checked biostars first that's how I found the seqkit command I am trying to use. My goal is not to remove sequence from my fasta but to have a file with all the sequences that are in my list

ADD REPLY
1
Entering edit mode

Did you check the links I included above? Second link has seqkit command, if you want to stay with that program. -nrif option mentioned there should cover your use case, I would think.

ADD REPLY
1
Entering edit mode

seqkit grep -nrif ids.txt test.fa totally did the job thank you very much @genomax !

ADD REPLY
0
Entering edit mode

finally it does not work so great since as shenwei says, " it may produce some unwanted results. For example, seq_1 matches seq_10 with -nri"

ADD REPLY
0
Entering edit mode

If your sequences are only one line you can use this command:

cat id.txt | while read line ; do grep -A 1 ${line} file.fasta >> outputfile.fasta ; done

If your sequences are multi-lines you can convert them to a one-liner fasta file first and then use the above command:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' <  Trinity.fasta >> file.fasta
ADD REPLY
0
Entering edit mode

Thank for your answer I have tried this but there is 6835 sequences in outputfile.fasta whereas I have a liste of 6750 ID. Is it possible to fix that ?

ADD REPLY
0
Entering edit mode

I usually use this for accession numbers that are unique, and ${line} works well for them, but maybe you can try "${line} ", because this one requires a space after ID, and 10 won't be picked when looking for 1.

Alternatively, you can use grep --perl-regex "${line}\s", but I haven't tried mixing this with -A1 or -A 1 option. You can try!

ADD REPLY
0
Entering edit mode

Option -w should be added to grep for exact matching.

ADD REPLY
0
Entering edit mode

Hi, i want to extract seq from list of id but Seqkit gives this below error and empty output file generated

[WARN] symbol ">" detected, it should not be a part of the sequence ID/name: >CP077971.1_4943 P_aeruginosa_ZPPH33resistance-nodulation-cell division (RND) antibiotic efflux pumpType B NfxB

Thank you

ADD REPLY
0
Entering edit mode

You need to remove > from ID's as the warning tells you to do.

ADD REPLY
0
Entering edit mode

Thank you....It has been done.

ADD REPLY
7
Entering edit mode
4.1 years ago

Referring to the usage, you should not switch on -n , just use seqkit grep -f id.txt seqs.fa.

-n, --by-name               match by full name instead of just id
-i, --ignore-case           ignore case
-r, --use-regexp            patterns are regular expression

For -nrif, it's partly matching full FASTA header by regular expression and case-ignored, with patterns from file. This may produce some unwanted results. For example, seq_1 matches seq_10 with -nri.

ADD COMMENT
0
Entering edit mode

Hmm unwanted results, it's exactly what I got since I ended up with 20 mysterious sequences more than expected (out of 7000) wit -nrif.

I have tried seqkit grep -f id.txt Trinity.fasta > contaminants.fasta but unsuccesfully ...

ADD REPLY
1
Entering edit mode

Hmm... I've tested with your sample data, everything is OK.

You may check the input data (check extra whitespace or tab at front/end of IDs, file encoding, even manually copy and search in the file with text editor.), and try with additional options, e.g., switch on -i to ignore cases.

ADD REPLY
0
Entering edit mode

My bad it did the job, thanks again !

ADD REPLY
0
Entering edit mode

thanks

seqkit grep -r -f id.txt Cucumis_melo.Melonv4.cds.all.fa -o result.fa

it works!

ADD REPLY
1
Entering edit mode
4.1 years ago
Renesh ★ 2.2k

You can also try Python package bioinfokit (v1.0.2 or later) for extracting the sequences from the FASTA file (check how to install https://reneshbedre.github.io/blog/howtoinstall.html#how-to-install)

from bioinfokit.analys import fasta

fasta.extract_seq(file='Trinity.fasta', id='id.txt')

Check detailed usage at https://reneshbedre.github.io/blog/howtoinstall.html#extract-sequences-from-fasta-file

ADD COMMENT
1
Entering edit mode
4.1 years ago
hxj5 ▴ 10

Could also use seqtk or AWK from an old thread Tool: Retrieve a subset of FASTA from large Illumina multi-FASTA file:
seqtk subseq large.fa id.txt
or
awk 'BEGIN{while((getline<"id.txt")>0)l[">"$1]=1}/^>/{f=l[$1]}f' large.fa

ADD COMMENT
0
Entering edit mode
2.9 years ago
Md • 0

you can use samtools command :

samtools faidx Trinity.fasta TRINITY_DN16359_c0_g1_i4 TRINITY_DN62279_c1_g1_i1
ADD COMMENT

Login before adding your answer.

Traffic: 1773 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6