Question

Extract sequences from a list of ID

2

Entering edit mode

4.7 years ago

doinelpierrot ▴ 50

Hello everybody,

Here i my issue, I have a Trinity.fasta file like this

>TRINITY_DN5631_c0_g2_i1 len=947 path=[0:0-946]
TACAACTTGAACATCAACAATGGTTGCGCAGCTATTGCCATCCGCGACGTTCGAGGACTGCGTGCGAA
>TRINITY_DN62279_c1_g1_i1 len=298 path=[0:0-297]
TATTACCATTATTATTATTATCATATTTATGTTCATTATTATCATTATCATAATCATTATCATCTTGATA
 ...

And I also have a list of id in a id.txt file :

TRINITY_DN16359_c0_g1_i4
TRINITY_DN62279_c1_g1_i1
...

I am trying to extract for my fasta file the sequences that have their id in my txt.file. I am using seqkit for that, but with no success :

 seqkit grep -n -f id.txt Trinity.fasta -o result.fa

Does anyone know how to fix it ?

software error sequence • 24k views

ADD COMMENT • link written 4.7 years ago by doinelpierrot ▴ 50

0

Entering edit mode

This has been asked multiple times. Here is one thread: Extract fasta sequnces not matching a list Answers include extracting both matching and non-matching sequences. If you want to keep using seqkit try: C: Extract specific sequence from FASTA file

ADD REPLY • link 4.7 years ago by GenoMax 152k

0

Entering edit mode

Yes I have checked biostars first that's how I found the seqkit command I am trying to use. My goal is not to remove sequence from my fasta but to have a file with all the sequences that are in my list

ADD REPLY • link 4.7 years ago by doinelpierrot ▴ 50

1

Entering edit mode

Did you check the links I included above? Second link has seqkit command, if you want to stay with that program. -nrif option mentioned there should cover your use case, I would think.

ADD REPLY • link 4.7 years ago by GenoMax 152k

1

Entering edit mode

seqkit grep -nrif ids.txt test.fa totally did the job thank you very much @genomax !

ADD REPLY • link 4.7 years ago by doinelpierrot ▴ 50

0

Entering edit mode

finally it does not work so great since as shenwei says, " it may produce some unwanted results. For example, seq_1 matches seq_10 with -nri"

ADD REPLY • link 4.7 years ago by doinelpierrot ▴ 50

0

Entering edit mode

If your sequences are only one line you can use this command:

cat id.txt | while read line ; do grep -A 1 ${line} file.fasta >> outputfile.fasta ; done

If your sequences are multi-lines you can convert them to a one-liner fasta file first and then use the above command:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' <  Trinity.fasta >> file.fasta

ADD REPLY • link 4.7 years ago by Fatima ▴ 1000

0

Entering edit mode

Thank for your answer I have tried this but there is 6835 sequences in outputfile.fasta whereas I have a liste of 6750 ID. Is it possible to fix that ?

ADD REPLY • link 4.7 years ago by doinelpierrot ▴ 50

0

Entering edit mode

I usually use this for accession numbers that are unique, and ${line} works well for them, but maybe you can try "${line} ", because this one requires a space after ID, and 10 won't be picked when looking for 1.

Alternatively, you can use grep --perl-regex "${line}\s", but I haven't tried mixing this with -A1 or -A 1 option. You can try!

ADD REPLY • link 4.7 years ago by Fatima ▴ 1000

0

Entering edit mode

Option -w should be added to grep for exact matching.

ADD REPLY • link 4.7 years ago by shenwei356 8.7k

0

Entering edit mode

Hi, i want to extract seq from list of id but Seqkit gives this below error and empty output file generated

[WARN] symbol ">" detected, it should not be a part of the sequence ID/name: >CP077971.1_4943 P_aeruginosa_ZPPH33resistance-nodulation-cell division (RND) antibiotic efflux pumpType B NfxB

Thank you

ADD REPLY • link 2.9 years ago by Neel ▴ 20

0

Entering edit mode

You need to remove > from ID's as the warning tells you to do.

ADD REPLY • link 2.9 years ago by GenoMax 152k

0

Entering edit mode

Thank you....It has been done.

ADD REPLY • link 2.9 years ago by Neel ▴ 20

1

Entering edit mode

4.7 years ago

Renesh ★ 2.2k

You can also try Python package bioinfokit (v1.0.2 or later) for extracting the sequences from the FASTA file (check how to install https://reneshbedre.github.io/blog/howtoinstall.html#how-to-install)

from bioinfokit.analys import fasta

fasta.extract_seq(file='Trinity.fasta', id='id.txt')

Check detailed usage at https://reneshbedre.github.io/blog/howtoinstall.html#extract-sequences-from-fasta-file

ADD COMMENT • link 4.7 years ago by Renesh ★ 2.2k

1

Entering edit mode

4.7 years ago

hxj5 ▴ 10

Could also use seqtk or AWK from an old thread Tool: Retrieve a subset of FASTA from large Illumina multi-FASTA file:
seqtk subseq large.fa id.txt
or
awk 'BEGIN{while((getline<"id.txt")>0)l[">"$1]=1}/^>/{f=l[$1]}f' large.fa

ADD COMMENT • link 4.7 years ago by hxj5 ▴ 10

0

Entering edit mode

3.6 years ago

Md • 0

you can use samtools command :

samtools faidx Trinity.fasta TRINITY_DN16359_c0_g1_i4 TRINITY_DN62279_c1_g1_i1

ADD COMMENT • link 3.6 years ago by Md • 0

score 7 · Accepted Answer · 2020-10-26

7

Entering edit mode

4.7 years ago

shenwei356 8.7k

Referring to the usage, you should not switch on -n , just use seqkit grep -f id.txt seqs.fa.

-n, --by-name               match by full name instead of just id
-i, --ignore-case           ignore case
-r, --use-regexp            patterns are regular expression

For -nrif, it's partly matching full FASTA header by regular expression and case-ignored, with patterns from file. This may produce some unwanted results. For example, seq_1 matches seq_10 with -nri.

ADD COMMENT • link 4.7 years ago by shenwei356 8.7k

0

Entering edit mode

Hmm unwanted results, it's exactly what I got since I ended up with 20 mysterious sequences more than expected (out of 7000) wit -nrif.

I have tried seqkit grep -f id.txt Trinity.fasta > contaminants.fasta but unsuccesfully ...

ADD REPLY • link 4.7 years ago by doinelpierrot ▴ 50

1

Entering edit mode

Hmm... I've tested with your sample data, everything is OK.

You may check the input data (check extra whitespace or tab at front/end of IDs, file encoding, even manually copy and search in the file with text editor.), and try with additional options, e.g., switch on -i to ignore cases.

ADD REPLY • link 4.7 years ago by shenwei356 8.7k

0

Entering edit mode

My bad it did the job, thanks again !

ADD REPLY • link 4.7 years ago by doinelpierrot ▴ 50

0

Entering edit mode

thanks

seqkit grep -r -f id.txt Cucumis_melo.Melonv4.cds.all.fa -o result.fa

it works!

ADD REPLY • link 3.1 years ago by Ishak • 0