Hello all,
I have about 900 .fasta files, each comprised of nucleotide alignments for various genes. The files look something like the following (truncated for the sake of illustration):
>fig|395491.5.peg.776
ATCGCGGCGGCCCGCCGCGGCGATATCCGTCTGCACAAACGCATCGTCGCCGGCCTCTAT
TTCGGCGGCATCATCCTCGCCGGCCTGTTCACTTTCGTGCCTGGCCGGATCATGCACGCG
GTCGTCTTCACCGGAGCGGAATGGCCGGCTGCCCTTGCTGCGGCGGTGATG---------
---------------------------------------------------
>fig|395491.28.peg.1692
ATCGCGGCGGCCCGCCGCGGCGATATCCGTCTGCACAAACGCATCGTCGCCGGCCTTTAT
TTCGGCGGCATCGTCCTTGCCGGCCTGTTCACTTTTGTGCCTGGCCGGATCATGCACGCG
GTCGTCTTCACCGGAACGGAATCGCCGGCTGCCCTTGCTGTGGCGGTGATG---------
---------------------------------------------------
>fig|395491.777.peg.2918
ATCGCGGCGGCCCGCCGCGGCGATATCCGTCTGCACAAGCGCATCGTCGCCGGCCTTTAT
TTCGGCGGCATCGTCCTTGCCGGCCTGTTCACTTTCGTGCCTGGCCGGATCATGCATGCG
GTCATCTTCACCGGCGCGGAATGGCCGGCTGTCCTTGCTGCGGCGGTGATCGGCTCGATC
CTGATCGCAATCGCG---CTGCGTCGCCGACGCGGGCGCCTGATTGCCCGA
What I want to do is search through each file and remove the DNA sequence associated with genome "fig|395491.777." Each file, however, has a different ".peg." number added onto the "fig|395491.777" title.
So basically, if the sequence name begins with "fig|395491.777," I want it to be removed. I'm sure there is a relatively straightforward way to complete this task, but it's a mystery to me. I'd appreciate any assistance with this issue.
Thanks in advance!
Have you tried anything yet? People in this forum would appreciate if you tell them that you tried an awk or grep command or some perl script. Look at this and try: http://stackoverflow.com/questions/15848713/how-to-extract-fasta-sequences-in-a-file-which-header-line-matches-with-list-in
I am using a linux server and I am very new to utilizing command line functions and/or scripting, so I tried a few scripts that I found while browsing various pages on this forum (though it seems most of the approaches required an exact [i.e. all characters included] header name for a sequence to be removed). For example, I tried running the following python script (though it doesn't seem to work even if I give it an exact header name):
I looked into the thread you suggested, though admittedly I'm not sure how I would adapt that to my particular problem.
cat *.fasta > big_file.fasta
)awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' big_file.fasta > big_file.tab
(This will give you a tab delimited fasta file with header as the first column and sequence as the second column)grep "fig|395491.777." big_file.tab > small_file.tab
)Meanwhile keep working on the script.
I just read your question again and I realized that you want to remove those sequences. You can use
grep -v
for individual files than concatenating everything into one. You will have to write a loop that goes through all the files. Can be simple done in shell.You'll find python to be a pretty useful language. The line you want to change is
if nuc not in remove and len(nuc) > 0:
. That's looking for an exact match. You could instead look to see ifnuc
starts withremove
withif nuc.startswith(remove,1) and len(nuc) > 0 :
. This would only work with a single sequence to remove (you could just hard code that in rather than reading it from a file). If you had a list, then just usenuc.find("peg")
to find the numeric index to"peg"e;
and then subsetnuc
(then using theif...
clause as is).Biopython is installed on your server?
What have you tried? This is doable in pretty much any scripting language.
As shown in my response to Ashutosh, I tried running a python script but it doesn't seem to work. I also tried to utilize the Perl code found in this thread (but I don't fully understand how I might use it for my particular question).