Question

Get Filenames Of All Files With 'Atg' In Fasta Files

3

Entering edit mode

12.9 years ago

Sam ▴ 70

I have a bunch of files which are similar to following:

File00
>A18178 1
atgcaccaataaaaaaacaagcttaacctaattc

File01
>A21196 1
cggccagatcta

File02
>A21197 1
agcttagatctggccgggg

File03
>AX557348 1
gcggatttactcaggggagagcccagataa
atggagtctgtgcgtccaca

I need to look into each file and check if the 2nd line starts with 'atg'. for example, only File00 starts with atg. This needs to be done as a shell script.

To be honest, this is a school assignment but I have been thinking about this for last 2-3 days and I cant seem to get it. I have put together following command:

grep -H -m 1 '^atg' ./mrna_split/File* |awk -F ":" '{print $1}'

The problem with the above command is that it also gives file names that have atg on the second line. For example, File03 will also be outputed. The grep outputs: ./mrna_split/File00:atgcaccaataaaaaaacaagcttaacctaattc. The awk command then gets only the filename.

Any input is appreciated.

Thank you.

awk script • 2.8k views

ADD COMMENT • link updated 12.9 years ago by Neilfws 49k • written 12.9 years ago by Sam ▴ 70

score 5 · Answer 1 · 2012-02-14

5

Entering edit mode

12.9 years ago

Pierre Lindenbaum 164k

try:

find ./ -name "File[0-9]*" -exec awk '/^>/ {name=$0; next;} NR==2 {if(substr($0,1,3)=="atg") print name;}' '{}' ';'

(loop over each files, execute awk, get the fasta header, print it if the 2nd line starts with "atg")

ADD COMMENT • link 12.9 years ago by Pierre Lindenbaum 164k

2

Entering edit mode

I think he only needs the filenames, not the sequence ids:

find ./ -name "File[0-9]*" -exec awk 'NR==2 {if(substr($0,1,3)=="atg") print FILENAME; exit}' '{}' ';'

should work

ADD REPLY • link 12.9 years ago by Andreas ★ 2.5k

1

Entering edit mode

I suppose it's possible that the first line of sequence be "AT" or even just "A", since the specification states only that lines should be shorter than 80 characters. However, I'd regard that as a very odd FASTA file.

ADD REPLY • link 12.9 years ago by Neilfws 49k

0

Entering edit mode

what if one atg sequence is splitted over two lines? e.g. ...ATnG...

ADD REPLY • link 12.9 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Hey guys. Thanks for the reply. The above command (both andreas and Pierre) works perfectly. In the end, I am supposed to make a atg.fa file which has all the sequence that start with atg. right now, after i get the filename from the above script, i use for loop in shell script to cat the matched file with atg.fa . Is there a better way?

ADD REPLY • link 12.9 years ago by Sam ▴ 70

score 4 · Answer 2 · 2012-02-14

4

Entering edit mode

12.9 years ago

Neilfws 49k

Bioperl solution using Seq::IO. Save code as findatg.pl. Assumes 1 sequence per file (otherwise will print filename for every matching occurrence in file).

Now go learn some Bioperl, you won't regret it.

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my $file = shift || die("Usage: findatg.pl <fasta file>\n");
my $seqio = Bio::SeqIO->new(-file => $file, -format => "fasta");

while(my $fa = $seqio->next_seq) {
  print "$file\n" if($fa->seq =~ /^atg/i);
}

Then run something like:

find ./ -name "File*" -exec perl findatg.pl {} \;

ADD COMMENT • link 12.9 years ago by Neilfws 49k

7

Entering edit mode

Yes, I see your question tags now. There are 2 kinds of solutions in bioinformatics: "whatever works" and "whatever I was told I had to use." As a researcher, I favour the first one :)

ADD REPLY • link 12.9 years ago by Neilfws 49k

0

Entering edit mode

I know bioperl. However, I needed to do this as a shell script as i mentioned in my OP. Otherwise it was an easy task even with just perl.Thank you for your reply.

ADD REPLY • link 12.9 years ago by Sam ▴ 70