Question

A Script To Find Each Seq In A Multiple Fasta Files

0

Entering edit mode

11.3 years ago

liran0921 ▴ 150

Hi all,

I have a list of 90 sequences and 35 datasets in fasta format (data1.fasta, data2.fasta, ...., data35.fasta).

    The list of the 90 sequences is in text format seperated by Tab:

   un1 aggcgaa

   un2 caagacc

    ...

   un90 gccagc

Now I would like to find out whether each sequence from the list is present in all the datasets (present in which dataset).

Basically, I think grep command would work:

  grep "aggcgaa" data1.fasta

  grep "aggcgaa" data2.fasta

  ...

  grep "aggcgaa" data35.fasta

If it is present in data1.fasta, print "data1.fasta"...

But it is too time consuming to do taht for each of the sequences.

Could anybody tell me how to use a loop to do this job? I have no idea how to use loop for this kind of work, but i suppose there should be a way.

Thanks.

fasta • 5.6k views

ADD COMMENT • link updated 11.3 years ago by Vivek Krishnakumar ▴ 400 • written 11.3 years ago by liran0921 ▴ 150

score 2 · Answer 1 · 2013-08-19

2

Entering edit mode

11.3 years ago

Ashutosh Pandey 12k

Put all the .fasta files in one folder.

grep -l "aggcgaa" fasta_folder/* will print all the file names that have that string. '*' is to search all the files.

If you pipe it to wc -l then you can get the number of files that have that string.

grep -l "aggcgaa" fasta_folder | wc -l

I will leave on you to how to put all the sequences in the tab file into a loop and execute the above command. it should be simple to do.

ADD COMMENT • link 11.3 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Hi, ashutoshmits, thanks for your reply.

The script should be like

for i in $(? )
do

         grep -l i fasta_folder/*.fasta | wc -l

done

But i am not familiar with linux loop. I am stuck in how to extract the seqs by using "for i in $(? ) " at the beginning of the loop. Could you show me how me the script here? Thanks again for your help.

ADD REPLY • link 11.3 years ago by liran0921 ▴ 150

0

Entering edit mode

If you have replaced ? in sample script with cat file_with_fasta_to_search then replace grep -l i fasta_folder/*.fasta | wc -l with grep $i fasta_folder/*.fasta this should give you the the sequences in your fasta file

ADD REPLY • link 11.3 years ago by Sudeep ★ 1.7k

score 1 · Answer 2 · 2013-08-20

You can try this. This is assuming that your sequences are in sequences.txt

#!/bin/sh  
awk -F "\t" ' { print $2 } ' sequences.txt > input_sequences.txt
ls data*.fasta > filelist

for sequence in `cat input_sequences.txt`
do
  for file in `cat filelist`
  do
    count=`grep -c "$sequence" $file`
    if [ $count -gt 0 ]; then
      echo "$sequence is present in $file"
    fi
  done
done

It is straight-forward. The first loop is for going through your sequences and the second-inner loop is for going through the files and grep-ing each sequence.

score 0 · Answer 3 · 2013-08-20

Extending @ashutoshmits' solution, and taking into account the fact that your OP mentions "The list of the 90 sequences is in text format seperated by Tab" (I'm going to assume that this file is named sequences.txt), you could do something like this:

cut -f2 sequences.txt | xargs -I{} grep -l "{}" data*.fasta

The above method cuts the sequences out from the 2nd column and then uses the unix xargs command to iterate through every element from the first pipe and pass it to the grep command (which uses the -l flag to only print the names of the FILEs that contain matches).

In order to speed this up, you can use the GNU Parallel tool, which acts in a fashion similar to xargs, but allows for naive parallelization (making use of as many available processors on your system), running several grep commands at the same time. Here is a example:

cut -f2 sequences.txt | parallel -j5 -k 'grep -l "{}" data*.fasta'

Here the -j5 parameter is instructnig parallel to run 5 grep jobs at a time (adjust according to your liking or let it just use all available cores on your machine) and the -k parameter is instructing it to retain the order of input passed to the command (so that the output is also in the same order; without this parameter, the output order will be jumbled depending on whichever process finishes earlier)