grep/awk issue: can't extract sequences from a fasta file
0
0
Entering edit mode
6.5 years ago

Hey everybody,

I have an issue and I'm completely running out of idea here.

I have a fasta file with mutiple sequences. I have an ID file with IDs corresponding to some of the sequences in the fasta file and I want to extract those sequences.

Typical task, I've done it many times either using grep or awk like:

 grep -A1 -w -f id_file.txt Input.fasta > output.fasta

Thing is this time, It doesn't work. Even though I can find the ID when using ctrl F look for in notepad. When I used grep or awk it doesn't give me anything. I've suspected hidden character or "to the next line" character issue but none of those were the issue.

Does any of you have an idea on what could prevent grep to recognize my sequences and to extract them?

Thank you very much for any help. Best,

Ben

fasta sequence extraction grep shell • 3.2k views
ADD COMMENT
2
Entering edit mode

Since you mention notepad I assume this file has been opened in that program on Windows? I suggest using dos2unix on your server to change the line endings to unix format to see if that fixes the problem.

Generally for the task you are referring to, faSomeRecords utility from Jim Kent (UCSC) works wonders. That tool is described in this recent thread along with several other options: Extract fasta sequences from a file using a list in another file.

ADD REPLY
0
Entering edit mode

Sorry, have to go to the dry cleaners.

Paste some sample data for the others. Also mention the system (operating system) and shell that you're using.

ADD REPLY
0
Entering edit mode

try

 tr -d '\r' < Input.fasta | grep -A1 -w -f <(tr -d '\r' < id_file.txt) > output.fasta
ADD REPLY
0
Entering edit mode

Hello benoit.kunath,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Additionally, please use appropriate tags. Your question is about "fasta sequence extraction". That should have been a tag when you created the question. When you add appropriate tags, users that follow the tag (usually experts interested in helping others in that subject matter) get notified of your question, and this means you stand a better chance at getting a relevant, useful response faster. Thank you!

ADD REPLY
0
Entering edit mode

Hello,

Thanks everyone for your help. I used notepad++ to change the line endings on both the ID and the fasta file. I've also made sure that my sequences were properly on one line each.

Now, grep gives me a output, but some of the sequences are not complete. It looks like something like that:

>Protein52886 yypfschkkGQSGGPTSVINASAAGVIQEALKQDCITAVYGAAHGIKpfiisartltvtplffrilhhrrpqwkafllltaqyfllyprslllsiphkkhnailssrfrktl
 --

The sequences from notepad after conversion of the ending looks like that:

>Protein33753
RTTAEEIWRDTDGQVDIFVAGVGTGGTISGVGEVLKlvirdlvfcnytlphlrgkmrpsickaykdfydeiedifgmlds
nevntywnylknykfeggmlq
>Protein48634
IGYAMIKDAEEKGIINKDTVIIEPTSGNTGIALAFVAAARyymrfknrnfimirraynmftkaeeiyskfneeniqimip
kkllftllqqvdrllellsneevasnfaiydyisnaemlmvklyilsaepynqkeviletsiaeflvirdlvfcnytlph
lrgkmrpsickaykdfydeiedifgmldsnevntywnylknykfeggmlq

Any Idea why some of the sequences are not complete? I've had that issue before I think, but it was because my sequences were not "really" on 1 line. But this time I made sure it was using this command line:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' input.fasta | awk '{if($0~/^>/)print $1; else print}' > output.fasta

Thanks a lot for your help. (and thanks for the edit on the post. I'll pay more attention!

ADD REPLY
0
Entering edit mode

Please post problematic example input sequences and output sequences and your final code.

ADD REPLY
0
Entering edit mode

I posted that already, it's in my previous post

ADD REPLY
0
Entering edit mode

Use an established fasta parsing library or program. They've already ironed out all kinds of quirks of fasta files.

ADD REPLY
0
Entering edit mode

any example? Thanks

ADD REPLY
1
Entering edit mode

Perl: Bioperl module Bio::SeqIO
R: function read.fasta() in package seqinr
C: Li Heng's parser
Python: there should be an equivalent to SeqIO in biopython

ADD REPLY

Login before adding your answer.

Traffic: 1982 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6