I have a file containing IDs:
YP_615060
YP_615061
YP_615062
and a multifasta file with IDs:
>gi|15604718|ref|NP_219502.1| hypothetical protein CT_875 [Chlamydia trachomatis D/UW-3/CX]
Now, I would like to extract all sequences whose headers contain the IDs listed in my ID file.
I tried:
cat prot_id.txt | xargs -n 1 samtools faidx all_bact_prot.faa
>YP_615060
[fai_fetch] Warning - Reference YP_615060 not found in FASTA file, returning empty sequence
xargs: samtools: terminated by signal 11
Of course it did not work, because the IDs do not match the FASTA headers exactly. Any idea how I can extract sequences from a multi-FASTA file by matching only part of the ID?
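One way to match on only part of the header is to parse the accession out of each header and compare it, minus the version suffix, against the ID list. The following is a minimal sketch rather than a tested pipeline: it assumes NCBI-style headers of the form `>gi|number|ref|accession.version| description`, the function names `accession_from_header` and `filter_fasta` are made up for illustration, and the second record and both sequences in the demo data are invented.

```python
def accession_from_header(header):
    """Return the accession without its version, or None if the header
    does not follow the assumed '>gi|<number>|<db>|<accession.version>|' layout."""
    fields = header.lstrip(">").split("|")
    if len(fields) < 4:
        return None
    return fields[3].split(".")[0]


def filter_fasta(fasta_lines, wanted_ids):
    """Yield every line of each record whose accession is in wanted_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Decide once per record; sequence lines inherit the decision.
            keep = accession_from_header(line) in wanted_ids
        if keep:
            yield line


# In-memory demo; the sequences and the second header are invented.
records = [
    ">gi|15604718|ref|NP_219502.1| hypothetical protein CT_875 [Chlamydia trachomatis D/UW-3/CX]\n",
    "MSIRGV\n",
    ">gi|00000000|ref|YP_615060.1| made-up record for illustration\n",
    "MKTAYI\n",
]
wanted = {"YP_615060", "YP_615061", "YP_615062"}
selected = list(filter_fasta(records, wanted))
print("".join(selected), end="")
```

With real files, `records` could be an open file handle for all_bact_prot.faa and `wanted` a set built from the stripped lines of prot_id.txt; since `filter_fasta` is a generator, the FASTA file is streamed rather than loaded into memory.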
Update
I tried this great Python code
f2 = open('accessionids.txt','r')
f1 = open('fasta.txt','r')
f3 = open('fasta_parsed.txt','w')
AI_DICT = {}
for line in f2:
    AI_DICT[line[:-1]] = 1
skip = 0
for line in f1:
    if line[0] == '>':
        _splitline = line.split('|')
        accessorIDWithArrow = _splitline[0]
        accessorID = accessorIDWithArrow[1:-1]
        # print accessorID
        if accessorID in AI_DICT:
            f3.write(line)
            skip = 0
        else:
            skip = 1
    else:
        if not skip:
            f3.write(line)
f1.close()
f2.close()
f3.close()
that I found here, but I do not understand why the output file (f3, fasta_parsed.txt) comes out empty.
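A likely explanation, traced on the example header from the question: `line.split('|')[0]` is `>gi`, so `accessorID` becomes `g`, which is never a key in `AI_DICT`, and nothing gets written. The snippet below reproduces that step and, assuming the accession sits in the fourth pipe-delimited field, shows a key that could match IDs listed without a version. The names `original_key` and `candidate_key` are illustrative and not part of the script.

```python
header = ">gi|15604718|ref|NP_219502.1| hypothetical protein CT_875 [Chlamydia trachomatis D/UW-3/CX]\n"

fields = header.split("|")

# What the script computes: the first field is '>gi', and [1:-1]
# drops the leading '>' and the last character.
original_key = fields[0][1:-1]

# Assumed alternative: the fourth field holds 'accession.version';
# dropping the version leaves the bare accession.
candidate_key = fields[3].split(".")[0]

print(repr(original_key))   # 'g'
print(repr(candidate_key))  # 'NP_219502'
```

The script seems to expect the accession in the first field of the header, which is not the case for `>gi|...|ref|...|` headers. Separately, `line[:-1]` removes the last character of every ID line unconditionally, so an ID file whose final line lacks a newline would lose the last character of that ID; `line.strip()` is a safer choice.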
Adding a question to an existing post is not a good practice. It's always better to create a new post with the new question, so we can focus on one issue at a time.
Check out my updated answer; it explains why the Python code doesn't work.
Hi,
I am trying to understand RamRS's explanation, but I am a bit lost.
I also have an input FASTA file with headers such as
and I want to retrieve only specific sequences, listed in an ID.txt:
At the end I would like a file like this
But I am not sure I understand how you modified the Python code above (message from Dago) with the command line using xargs faidx. Could you please show the updated Python script? I tried xargs faidx -d '|' all_bact_prot.faa < prot_id.txt with my input file; it generated a .fai file, so I don't understand the link with the Python code. Thanks a lot for your help!
cheers
Amandine
This should be a comment on the relevant post, not an answer.