Locating a sequence in multiple fasta files
4
1
Entering edit mode
8.6 years ago
chups519 ▴ 10

Hi! I have some fasta files with a couple of sequences each. They are like this:

>WP_012937914.1 NADH-dependent butanol dehydrogenase [Acidaminococcus fermentans]
MLGNFTYCNPTKLYFGKDALQNLPAELAKYGKKVVLVYGSGSIKKNGIYDAVTKILAEAGKDVAEIAGVMSNPTIHKLRE
GIAIARKHGADFILAVGGGSVIDYSKALSVSVNCDEDPWEKYFVRFEEPTCETIPVGTVLTMVGTGSEMNAGAVITYPEK
KLKMGKVFADEKIMPRFSILNPVYTMTLPKYQMVAGIYDIFNHICEQYFSGTDDNTSDYLAEGLMRNVVEASRTAVKNPQ
DYEARSNLMWDATWALNTLIAKGKSTDWMVHMLGQAAGGVTNATHGMTLAAVSLPYYRFILKDGLPKFVRFAKVVWDVRP
EGKTDEEIAKEGLEKMEDWMQQLGLVLHLKDLGATEDMLDDLVNGTLILTGGYRVLTKEEIREIFRRAM

How can I extract,from this files, only the sequences that have in that first line "NADH" ? I need that first line as well as the sequence into another fasta file.

python sequence fasta • 3.2k views
ADD COMMENT
3
Entering edit mode
8.6 years ago

This would be a fine use case for bioawk, or pyfaidx:

$ pip install --user pyfaidx
$ faidx proteins.fa --regex "*NADH*" > NADH_proteins.fa
ADD COMMENT
1
Entering edit mode

I just tried this out, but although the python module installs just fine, I'm getting "command not found" when I try and run the script. Maybe it's getting copied to somewhere that isn't in my path (or not getting copied anywhere)?

OK, when I try it without the --user flag, it works, so I reckon it's that.

ADD REPLY
2
Entering edit mode

Yes, I should have mentioned that the --user flag places everything in $HOME/.local/bin, and so that's where the faidx script would go, instead of /usr/local/bin. Glad you've figured it out!

ADD REPLY
3
Entering edit mode
8.6 years ago
Markus ▴ 320

With Biopython you can do this as follows:

from Bio import SeqIO

your_fasta_files = ['your_first_file', 'your_second_file', ..., 'your_last_file']

nadh_entries = []
for fasta_file in your_fasta_files:
    entries = SeqIO.parse(fasta_file, 'fasta')
    for entry in entries:
        if 'NADH' in entry.description:
            nadh_entries.append(entry)

SeqIO.write(nadh_entries, 'your_nadh_fasta_output_file', 'fasta')
ADD COMMENT
0
Entering edit mode

to be on the safe side for capitalization, I would go for:

if 'nadh' in entry.descripton.lower():
ADD REPLY
0
Entering edit mode

@decosterwouter: actually I had it this way ('NADH' in ...upper()), but was unsure if there exist other English words in the fasta descriptions that contain 'nadh', like 'nonadherent'

ADD REPLY
1
Entering edit mode
ADD COMMENT
1
Entering edit mode

Does samtools faidx work with protein fasta?

ADD REPLY
0
Entering edit mode

There are more suggestions in that post!

ADD REPLY
1
Entering edit mode

That was a real question (not a snarky comment). Because if it does it would be useful.

EDIT: Samtools (v.1.3) faidx does work with protein fasta to bring back full sequences from a file.

ADD REPLY
0
Entering edit mode

For now the suggestions in the posto don't work, but thank you so much for the help. I'll keep trying

ADD REPLY
1
Entering edit mode

Since you have only posted a snippet from your file I am not sure if the following will work but try it. This assumes your file is in multi-fasta format.

 $ awk -v 'RS=>' '{if ($0 ~ /NADH/) print ">"$0}' your_file > NADH_file
ADD REPLY
0
Entering edit mode
8.6 years ago

I know the windows solution.

  1. Combine all fasta to a single fasta.

a. Keep all fasta in single folder (I assume they have .fasta extension

b. Open cmd and go to that folder

c. run

copy *.fasta merged.txt

let the commands run and close cmd. rename the merged.txt file to merged.fasta

  1. Convert to tab and open in excel. Convert it in fabox or fasta2tab

  2. Do the math in excel

a. open file in excel

b. assuming headers are in col a and sequences in col b, type this in col c1(including "=" sign

= if(countif(b1,"*NADH*")=0,0,1)

This will give 1 if the sequence initiates with NADH. 4 represents count for search string in this case 4 as NADH makes 4 alphabets.

c. Filter col c for value 1

d. Copy the first 2 columns and make a txt file out of it. Convert this tab file to fasta using Tab to fasta sequence converter.

ADD COMMENT
3
Entering edit mode

AAAAAAHHHHH! Not only does this not work for the OP's example (the "NADH" is not at the beginning of the definition line) but it involves way too many manual manipulations of the data. If you want to work with FASTA and answer this question on Windows, you can install Python for Windows and just run pyfaidx from the command prompt. There's no need to open your file in Excel.

Now, since I've stated my opinion, thanks for giving your answer, and it's obvious from your solution that you have something that works, and sometimes that's all that matters. However in this case there are better tools.

ADD REPLY
0
Entering edit mode

My apology. The markdown language is new to me. I didnt realise the asterik sign has been ommited from the final answer. I have modified equation to show the formula to detect "NADH" anywhere in sentence"

Python will be a great solution. We do it in PHP. But sometimes, learning a program is too tidious and if jobs simply can be done with office tools like excel and word, whats the harm in it!!

ADD REPLY
1
Entering edit mode

This is really a very bad way to solve a problem.

ADD REPLY
0
Entering edit mode

FABOX has an option to detect extract sequences containing spoecific headers or part of it. its slow and probably hangs for >200 sequences. Excel on 4 gb ram can very well handle a tab sequence file upto 33000 sequences averaging to 500 aa average length. A formula and filter in such case takes approx 2-5 min. I dont think, a beginner can program anything in python in this much short time. Please do consider the time in learning the software.

ADD REPLY

Login before adding your answer.

Traffic: 1986 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6