Printing specific sequence and ID from combined fasta file using bash commands
7.0 years ago
SaltedPork ▴ 170

I have a combined FASTA file with all my sequences. I want to print the ID lines that end in .4, together with their DNA sequences.

So far I have awk '/>*.4/ {getline;print}' combined.fasta, which prints the sequences that I want. How do I get the ID lines as well?

fasta bash • 3.5k views

Can you show an example of your input data and ideal output?

Thanks for responding. I have IDs that end in .1, .2 and so on up to .8. I just want the .4s and their sequences.

Input:

>16U035667-26_S26_L001.1
AGCTACGT
>16U035667-26_S26_L001.2
AGCTAACGTAC
>16U035667-26_S26_L001.4
ACGTACGTACTGAC

Output:

>16U035667-26_S26_L001.4
ACGTACGTACTGAC
7.0 years ago
Joe 21k

Since you requested bash only:

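#!/bin/bash
# Usage: bash parseheaders.sh <fasta_file> <header_suffix>
# Assumes each sequence sits on a single line.

string="$2"

# IFS= and -r keep each line exactly as it appears in the file
while IFS= read -r line ; do
    if [[ ${line:0:1} == ">" ]] ; then
        header="$line"
    else
        seq="$line"
        # Print the record if the header ends with the search string
        if [[ "$header" == *"$string" ]]; then
            printf '%s\n%s\n' "$header" "$seq"
        fi
    fi
done < "$1"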

Put it in a script file and run it like:

$ bash parseheaders.sh seqs.fasta .4

Some of my own test data:

$ cat seqs.fasta
>tpg|Magnaporthiopsis_incrustans|JF414846
ACTGTAGTAGCTACGATCGATCAGATGATCACGTAGCATCGATCGATCATCGACTAGTAGATCACTCGACATAGATCCACATCAATAGATCATCATCATCATAATCGATCACTAGCAGC
>tpg|Pyricularia_pennisetigena|AB818016
GCAAGNTTCATGACGATGTAGAATGGCTTATCGAAGGGAGCAGGCCAGGGATTGAGGTCCGTCTCACGGGTTGGCTTCACTCCCCCACTGCCAGCCCTCTTGCTGCAACTCCACCAGAA
>tpg|Inocybe_sororia|EU525947
AACCANGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGC

$ bash parseheaders.sh seqs.fasta 947
>tpg|Inocybe_sororia|EU525947
AACCANGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGC

This works very well, many thanks! P.S. I also didn't need to put quotes around the .4.

You can accept multiple answers (green check mark), so you should accept this one as well.

Just as a final comment though, I'd advise following some of the other suggestions here and using proper parsers like BioPerl, Biopython, bioawk and so on. My own personal go-to script for pulling out sequences is here, using Biopython (though it finds the key of interest anywhere in the header, not just at the end, so it wasn't directly applicable to this case).

It shouldn't, but in case the . doesn't play nicely with your terminal, just enclose the string to look for in quotes: ".4"

7.0 years ago
grep -A 1 --no-group-separator '^>.*\.4$' in.fa

Fantastic one-liner, many thanks!

FYI, I believe this will only work as long as each of your sequences is on a single line.
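
For files where sequences wrap over several lines, one possible workaround is to let awk set a flag at each header line and print every line while the flag is on. This is only a sketch, assuming the IDs end in a literal .4; the function name filter_suffix4 and the sample records are made up for illustration.

```shell
# Hypothetical helper: print every record whose header ends in ".4",
# including sequences that wrap over several lines.
filter_suffix4() {
    # At each header line, set keep to 1 if the line ends in ".4", else 0.
    # The bare pattern "keep" then prints every line while keep is non-zero.
    awk '/^>/ { keep = ($0 ~ /\.4$/) } keep' "$1"
}

# Made-up sample data containing one wrapped ".4" record.
sample=$(mktemp)
printf '%s\n' '>id_L001.1' 'AGCT' 'ACGT' \
              '>id_L001.4' 'ACGT' 'ACTG' \
              '>id_L004.2' 'TTTT' > "$sample"

filter_suffix4 "$sample"
```

Unlike grep -A 1, this does not depend on how many lines each sequence occupies, because the flag only changes when the next header is reached.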

7.0 years ago

How about:

awk '/>*.4/ {print $0; getline; print}' combined.fasta

hmmmm, I just get everything from combined.fasta printed to terminal.

I'm on Mac OS. You may need to adjust the syntax a bit. Sorry....

Odd. Works fine on CentOS.
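
One possible explanation for the differing results, which would depend on the data rather than the operating system: in the regex >*.4, the >* part matches zero or more > characters and the unescaped . matches any character, so the pattern matches any line that contains a 4 preceded by at least one other character. If every ID happens to contain a 4 somewhere, every record is printed. A small sketch with made-up IDs, comparing the original pattern with an anchored one:

```shell
# Made-up IDs: both headers contain a "4" before the suffix.
demo=$(mktemp)
printf '%s\n' '>16U035664-26_L001.1' 'AGCTACGT' \
              '>16U035664-26_L001.4' 'ACGTACGTACTGAC' > "$demo"

# Unanchored pattern: matches both headers, because each contains "64".
loose=$(awk '/>*.4/ {print; getline; print}' "$demo")

# Anchored pattern: a header line (^>) that ends in a literal ".4" (\.4$).
strict=$(awk '/^>.*\.4$/ {print; getline; print}' "$demo")

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

Like the original command, the anchored version still assumes each sequence is on a single line, since getline only fetches the one line after the header.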

7.0 years ago
Ram 44k

Check out bioawk - that should address your use case very well.
