Question

Locating a sequence in multiple fasta files

1

Entering edit mode

9.3 years ago

chups519 ▴ 10

Hi! I have some fasta files with a couple of sequences each. They are like this:

>WP_012937914.1 NADH-dependent butanol dehydrogenase [Acidaminococcus fermentans]
MLGNFTYCNPTKLYFGKDALQNLPAELAKYGKKVVLVYGSGSIKKNGIYDAVTKILAEAGKDVAEIAGVMSNPTIHKLRE
GIAIARKHGADFILAVGGGSVIDYSKALSVSVNCDEDPWEKYFVRFEEPTCETIPVGTVLTMVGTGSEMNAGAVITYPEK
KLKMGKVFADEKIMPRFSILNPVYTMTLPKYQMVAGIYDIFNHICEQYFSGTDDNTSDYLAEGLMRNVVEASRTAVKNPQ
DYEARSNLMWDATWALNTLIAKGKSTDWMVHMLGQAAGGVTNATHGMTLAAVSLPYYRFILKDGLPKFVRFAKVVWDVRP
EGKTDEEIAKEGLEKMEDWMQQLGLVLHLKDLGATEDMLDDLVNGTLILTGGYRVLTKEEIREIFRRAM

How can I extract,from this files, only the sequences that have in that first line "NADH" ? I need that first line as well as the sequence into another fasta file.

python sequence fasta • 3.9k views

ADD COMMENT • link updated 9.3 years ago by Markus ▴ 320 • written 9.3 years ago by chups519 ▴ 10

score 3 · Answer 1 · 2016-04-19

3

Entering edit mode

9.3 years ago

Matt Shirley 10k

This would be a fine use case for bioawk, or pyfaidx:

$ pip install --user pyfaidx
$ faidx proteins.fa --regex "*NADH*" > NADH_proteins.fa

ADD COMMENT • link 9.3 years ago by Matt Shirley 10k

1

Entering edit mode

I just tried this out, but although the python module installs just fine, I'm getting "command not found" when I try and run the script. Maybe it's getting copied to somewhere that isn't in my path (or not getting copied anywhere)?

OK, when I try it without the --user flag, it works, so I reckon it's that.

ADD REPLY • link 9.3 years ago by gearoid ▴ 200

2

Entering edit mode

Yes, I should have mentioned that the --user flag places everything in $HOME/.local/bin, and so that's where the faidx script would go, instead of /usr/local/bin. Glad you've figured it out!

ADD REPLY • link 9.3 years ago by Matt Shirley 10k

score 3 · Answer 2 · 2016-04-21

3

Entering edit mode

9.3 years ago

Markus ▴ 320

With Biopython you can do this as follows:

from Bio import SeqIO

your_fasta_files = ['your_first_file', 'your_second_file', ..., 'your_last_file']

nadh_entries = []
for fasta_file in your_fasta_files:
    entries = SeqIO.parse(fasta_file, 'fasta')
    for entry in entries:
        if 'NADH' in entry.description:
            nadh_entries.append(entry)

SeqIO.write(nadh_entries, 'your_nadh_fasta_output_file', 'fasta')

ADD COMMENT • link 9.3 years ago by Markus ▴ 320

0

Entering edit mode

to be on the safe side for capitalization, I would go for:

if 'nadh' in entry.descripton.lower():

ADD REPLY • link 9.3 years ago by WouterDeCoster 48k

0

Entering edit mode

@decosterwouter: actually I had it this way ('NADH' in ...upper()), but was unsure if there exist other English words in the fasta descriptions that contain 'nadh', like 'nonadherent'

ADD REPLY • link 9.3 years ago by Markus ▴ 320

score 1 · Answer 3 · 2016-04-19

1

Entering edit mode

9.3 years ago

Benn 8.4k

Maybe this post helps

How To Extract A Sequence From A Big (6Gb) Multifasta File ?

ADD COMMENT • link 9.3 years ago by Benn 8.4k

1

Entering edit mode

Does samtools faidx work with protein fasta?

ADD REPLY • link 9.3 years ago by GenoMax 153k

0

Entering edit mode

There are more suggestions in that post!

ADD REPLY • link 9.3 years ago by Benn 8.4k

1

Entering edit mode

That was a real question (not a snarky comment). Because if it does it would be useful.

EDIT: Samtools (v.1.3) faidx does work with protein fasta to bring back full sequences from a file.

ADD REPLY • link 9.3 years ago by GenoMax 153k

0

Entering edit mode

For now the suggestions in the posto don't work, but thank you so much for the help. I'll keep trying

ADD REPLY • link 9.3 years ago by chups519 ▴ 10

1

Entering edit mode

Since you have only posted a snippet from your file I am not sure if the following will work but try it. This assumes your file is in multi-fasta format.

 $ awk -v 'RS=>' '{if ($0 ~ /NADH/) print ">"$0}' your_file > NADH_file

ADD REPLY • link 9.3 years ago by GenoMax 153k

score 0 · Answer 4 · 2016-04-20

0

Entering edit mode

9.3 years ago

saurabh.gayali ▴ 30

I know the windows solution.

Combine all fasta to a single fasta.

a. Keep all fasta in single folder (I assume they have .fasta extension

b. Open cmd and go to that folder

c. run

copy *.fasta merged.txt

let the commands run and close cmd. rename the merged.txt file to merged.fasta

Convert to tab and open in excel. Convert it in fabox or fasta2tab
Do the math in excel

a. open file in excel

b. assuming headers are in col a and sequences in col b, type this in col c1(including "=" sign

= if(countif(b1,"*NADH*")=0,0,1)

This will give 1 if the sequence initiates with NADH. 4 represents count for search string in this case 4 as NADH makes 4 alphabets.

c. Filter col c for value 1

d. Copy the first 2 columns and make a txt file out of it. Convert this tab file to fasta using Tab to fasta sequence converter.

ADD COMMENT • link 8.4 years ago by saurabh.gayali ▴ 30

3

Entering edit mode

AAAAAAHHHHH! Not only does this not work for the OP's example (the "NADH" is not at the beginning of the definition line) but it involves way too many manual manipulations of the data. If you want to work with FASTA and answer this question on Windows, you can install Python for Windows and just run pyfaidx from the command prompt. There's no need to open your file in Excel.

Now, since I've stated my opinion, thanks for giving your answer, and it's obvious from your solution that you have something that works, and sometimes that's all that matters. However in this case there are better tools.

ADD REPLY • link 9.3 years ago by Matt Shirley 10k

0

Entering edit mode

My apology. The markdown language is new to me. I didnt realise the asterik sign has been ommited from the final answer. I have modified equation to show the formula to detect "NADH" anywhere in sentence"

Python will be a great solution. We do it in PHP. But sometimes, learning a program is too tidious and if jobs simply can be done with office tools like excel and word, whats the harm in it!!

ADD REPLY • link 8.4 years ago by saurabh.gayali ▴ 30

1

Entering edit mode

This is really a very bad way to solve a problem.

ADD REPLY • link 9.3 years ago by WouterDeCoster 48k

0

Entering edit mode

FABOX has an option to detect extract sequences containing spoecific headers or part of it. its slow and probably hangs for >200 sequences. Excel on 4 gb ram can very well handle a tab sequence file upto 33000 sequences averaging to 500 aa average length. A formula and filter in such case takes approx 2-5 min. I dont think, a beginner can program anything in python in this much short time. Please do consider the time in learning the software.

ADD REPLY • link 8.4 years ago by saurabh.gayali ▴ 30