I have a bunch of files similar to the following:
File00
>A18178 1
atgcaccaataaaaaaacaagcttaacctaattc
File01
>A21196 1
cggccagatcta
File02
>A21197 1
agcttagatctggccgggg
File03
>AX557348 1
gcggatttactcaggggagagcccagataa
atggagtctgtgcgtccaca
I need to look into each file and check whether the 2nd line starts with 'atg'. For example, of the files above only File00 has a 2nd line that starts with atg. This needs to be done as a shell script.
To be honest, this is a school assignment, but I have been thinking about it for the last 2-3 days and I can't seem to get it. I have put together the following command:
grep -H -m 1 '^atg' ./mrna_split/File* |awk -F ":" '{print $1}'
The problem with the above command is that it also reports files in which a later sequence line, rather than the 2nd line of the file, starts with atg. For example, File03 is also output, because its second sequence line starts with atg. For File00 the grep outputs: ./mrna_split/File00:atgcaccaataaaaaaacaagcttaacctaattc. The awk command then keeps only the filename.
Any input is appreciated.
Thank you.
I think he only needs the filenames, not the sequence ids:
find ./ -name "File[0-9]*" -exec awk 'NR==2 {if(substr($0,1,3)=="atg") print FILENAME; exit}' '{}' ';'
should work
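Since the question asks for a shell script, the same 2nd-line check can also be written as a plain loop. This is only a minimal sketch: it assumes the files sit in ./mrna_split as in the question and that the sequence is lowercase, and the function name list_atg_files is made up for illustration.

```shell
#!/bin/sh
# Print the names of files whose 2nd line starts with "atg".
# list_atg_files is a hypothetical helper; the default directory is an assumption.
list_atg_files() {
    dir=$1
    for f in "$dir"/File*; do
        [ -f "$f" ] || continue              # skip if the glob matched nothing
        # sed prints line 2 and quits; grep -q tests it without printing
        if sed -n '2{p;q;}' "$f" | grep -q '^atg'; then
            printf '%s\n' "$f"
        fi
    done
}

list_atg_files "${1:-./mrna_split}"
```

Unlike grep -m 1 '^atg', this never looks past line 2, so a file like File03 is not reported.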
I suppose it's possible that the first line of sequence could be "AT" or even just "A", since the specification states only that lines should be shorter than 80 characters. However, I'd regard that as a very odd FASTA file.
What if an atg is split over two lines? e.g. ...AT\nG...
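If that case matters, one option is to join the sequence lines before testing the first three bases. A rough sketch, assuming each file holds a single record (one header line starting with '>') and lowercase sequence; seq_starts_with_atg is a hypothetical name.

```shell
#!/bin/sh
# Succeed (exit status 0) if the sequence in file $1 starts with "atg",
# even when those three bases are spread over more than one line.
seq_starts_with_atg() {
    awk '
        /^>/ { next }                      # skip the header line
        { seq = seq $0                     # append this sequence line
          if (length(seq) >= 3) exit }     # three bases are enough to decide
        END { exit ((substr(seq, 1, 3) == "atg") ? 0 : 1) }
    ' "$1"
}

for f in ./mrna_split/File*; do
    [ -f "$f" ] || continue                # skip if the glob matched nothing
    seq_starts_with_atg "$f" && printf '%s\n' "$f"
done
```

For ordinary files, where the first sequence line has at least three bases, this gives the same answer as checking only line 2.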
Hey guys, thanks for the replies. Both commands above (Andreas's and Pierre's) work perfectly. In the end, I am supposed to make an atg.fa file containing all the sequences that start with atg. Right now, after I get the filenames from the script above, I use a for loop in a shell script to cat each matched file into atg.fa. Is there a better way?
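One possible way to skip the intermediate list of filenames is to let a single awk call both test line 2 and copy the matching records. This is a sketch, not a definitive answer: it assumes one record per file, lowercase sequence, and at least one File* in the directory; build_atg_fasta is a made-up name.

```shell
#!/bin/sh
# Write every record whose 2nd line starts with "atg" to one FASTA file.
# $1 = directory holding the File* pieces, $2 = output file (e.g. atg.fa)
build_atg_fasta() {
    awk '
        FNR == 1 { keep = 0; header = $0; next }   # new file: remember its header
        FNR == 2 { keep = ($0 ~ /^atg/)            # decide on the first sequence line
                   if (keep) print header }
        keep                                       # print sequence lines of kept records
    ' "$1"/File* > "$2"
}
```

Calling build_atg_fasta ./mrna_split atg.fa would then produce the combined file in one pass. FNR restarts at 1 for each input file, which is what lets one awk process handle all of them.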