extract sequences from multi fasta using partial ID
2
0
Entering edit mode
10.0 years ago
dago ★ 2.8k

I have a file containing IDs:

YP_615060
YP_615061
YP_615062

and a multifasta file with IDs:

>gi|15604718|ref|NP_219502.1| hypothetical protein CT_875 [Chlamydia trachomatis D/UW-3/CX]

Now, I would like to extract all seq that contain the IDs in my IDs_file.

I tried:

cat prot_id.txt | xargs -n 1 samtools faidx all_bact_prot.faa
>YP_615060
[fai_fetch] Warning - Reference YP_615060 not found in FASTA file, returning empty sequence
xargs: samtools: terminated by signal 11

Of course it did not work because the IDs do not match. Any idea how I can extract sequences from multi fasta matching only a part of the ID?

Update

I tried this great python code

f2 = open('accessionids.txt','r')
f1 = open('fasta.txt','r')
f3 = open('fasta_parsed.txt','w')</code>

AI_DICT = {}
for line in f2:
    AI_DICT[line[:-1]] = 1

skip = 0
for line in f1:
    if line[0] == '>':
        _splitline = line.split('|')
        accessorIDWithArrow = _splitline[0]
        accessorID = accessorIDWithArrow[1:-1]
        # print accessorID
        if accessorID in AI_DICT:
            f3.write(line)
            skip = 0
        else:
            skip = 1
    else:
        if not skip:
            f3.write(line)

f1.close()
f2.close()
f3.close()

that I found here, but I do not understand why I get an empty file as file 3

sequence fasta python • 9.0k views
ADD COMMENT
0
Entering edit mode

Adding a question to an existing post is not a good practice. It's always better to create a new post with the new question, so we can focus on one issue at a time.

Check out my updated answer - it has the reason why the python code doesn't work.

ADD REPLY
0
Entering edit mode

Hi,

I am trying to understand RamRS explanation but I am a bit lost.

I have also an input fasta file with header such as

>gi|125995253|dbj|AB270792.1|:125515-126741 Malus x domestica MdSFBB9-alpha, S9-RNase, MdSFBB9-beta genes, complete cds, Psi-SFBB9-alpha, Psi-SFBB9-beta pseudogene
>gi|316996529|dbj|AB545981.1|:415480-416630 Pyrus pyrifolia genes for F-box proteins, S ribonuclease, complete cds, haplotype: S4
>gi|316996537|dbj|AB545982.1|:302712-303934 Pyrus pyrifolia genes for F-box proteins and S2-RNase, complete cds, haplotype: S2

and I want to retrieve only specific sequences, from an ID.txt:

AB270792.1
AB545982.1

At the end I would like a file like this

>gi|125995253|dbj|AB270792.1|:125515-126741 Malus x domestica MdSFBB9-alpha, S9-RNase, MdSFBB9-beta genes, complete cds, Psi-SFBB9-alpha, Psi-SFBB9-beta pseudogenes
>gi|316996537|dbj|AB545982.1|:302712-303934 Pyrus pyrifolia genes for F-box proteins and S2-RNase, complete cds, haplotype: S2

But I am not sure if I understand how you modify the python code above (message from Dago) with the command line using xargs faidx. Could you please show update the Python script?

I tried the xargs faidx -d '|' all_bact_prot.faa < prot_id.txt with my infile, it generated a .fai, and I therefore don't understand the link with the python code...

thanks a lot for your help!!!

cheers

Amandine

ADD REPLY
0
Entering edit mode

This should be a comment on the relevant post, not an answer.

ADD REPLY
4
Entering edit mode
10.0 years ago

Direct drop-in for your current command. Install pyfaidx:

pip install --user pyfaidx
xargs faidx -d '|' all_bact_prot.faa < prot_id.txt
ADD COMMENT
1
Entering edit mode
10.0 years ago
Ram 44k

bioawk.

bioawk -c fastx '$name ~ /pattern/ { print ">"$name"\n"$seq"\n"; }'

Your part is to find out how to loop this with the contents from the ID file.

EDIT: Let's use xargs to capture input

xargs -0 -I bioawk -c fastx '$name ~ /{}/ { print ">"$name"\n"$seq"\n"; }' < IDs.txt

Your added steps do not work because the assumption of the code you used is that your accession IDs file has full IDs separated by |, whereas they do not. Your FASTA file is the one that has IDs with |

ADD COMMENT
0
Entering edit mode

We should change the name of the site from biostars to bioawk.

ADD REPLY
0
Entering edit mode

I just find bioawk really useful :) Apologies if I'm a little biased.

ADD REPLY

Login before adding your answer.

Traffic: 1480 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6