How to match an exact string in a fasta header whilst excluding matches followed by hyphen?
3.1 years ago

Hi,

I have a FASTA file containing influenza protein sequences (PB2, PB1, PB1-F2, PA, PA-X, HA, NP, NA, M1, M2, NS1 and NS2). I'm trying to use grep to pull out the headers for individual proteins, e.g. grep "PB2" fastafile

This works fine for most of the proteins, but for PB1 and PA (grep "PB1" fastafile or grep "PA" fastafile) it returns not only the PB1 or PA headers but also the PB1-F2 and PA-X headers.

I've tried playing around with regexes (e.g. "PB1$"), but that doesn't appear to solve the issue either.

Does anyone have an idea of how to solve this?

grep hyphen header fasta • 1.5k views

Please post example input headers and expected output headers. In the absence of any data to process, I would suggest trying grep -w "PA" fastafile. But grep may not be sufficient for multi-line fasta.


Hi, thanks for getting back to me.

The headers look something like this:

A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1

A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1-F2

A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA

A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA-X

I want to be able to pull out the PB1/PA headers separately from the PB1-F2/PA-X headers. If I try:

grep -w "PB1" fastafile

it returns both the PB1 and the PB1-F2 headers, and the same happens with "PA" (both PA and PA-X are returned). Any ideas?

3.1 years ago
$ cat test.txt          
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1-F2
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA-X

$ grep -w "PA" test.txt
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA-X

$ grep -w "PA$" test.txt
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA

$ grep -w "PB1" test.txt 
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1-F2

$ grep -w "PB1$" test.txt
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1
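Why -w alone is not enough: grep counts only letters, digits and underscores as word characters, so the hyphen in PB1-F2 acts as a word boundary and PB1 still matches as a whole word. A minimal sketch of a stricter pattern, assuming the protein name is always the last |-delimited field of the header as in the examples above (the file name test.txt is taken from the listing above and recreated here so the snippet is self-contained):

```shell
# Recreate the example headers from the thread.
cat > test.txt <<'EOF'
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1-F2
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA-X
EOF

# Require a literal "|" directly before the name and end of line directly after it.
# Writing the pipe as [|] keeps it literal in both basic and extended regex modes.
grep '[|]PB1$' test.txt
grep '[|]PA$' test.txt
```

Anchoring on both sides also avoids accidental hits where the name appears inside the strain part of the header.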

For this job, I would suggest using seqkit, which returns the matching records together with their sequences:

$ seqkit -w 0 grep -irp "\|PA$" test.fa

If you want a separate FASTA file for each protein (PA, PA-X, PB1, PB1-F2), look at the following example:

$ tree .               
.
└── test.fa

0 directories, 1 file

$ seqkit -w 0 split -i --id-regexp ".*\|(.*)$" -2 test.fa -O out --quiet
[INFO] create FASTA index for test.fa

$ tree .
.
├── out
│   ├── test.id_PA.fasta
│   ├── test.id_PA-X.fasta
│   ├── test.id_PB1-F2.fasta
│   └── test.id_PB1.fasta
├── test.fa
└── test.fa.seqkit.fai

1 directory, 6 files

$ cd out 

$ rename -n 's/test\.id_//g' *.fasta                                    
'test.id_PA.fasta' would be renamed to 'PA.fasta'
'test.id_PA-X.fasta' would be renamed to 'PA-X.fasta'
'test.id_PB1-F2.fasta' would be renamed to 'PB1-F2.fasta'
'test.id_PB1.fasta' would be renamed to 'PB1.fasta'

You can also use awk for this, but the awk code is simpler if you first flatten the FASTA file so that each sequence sits on a single line.
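A rough sketch of that awk route, under the assumption that the protein name is the last |-delimited field of each header. The file names (toy.fa, flat.fa, PA.fa) and the sequences are made-up placeholders for illustration, not data from this thread:

```shell
# Toy multi-line FASTA; the sequences are invented placeholders.
cat > toy.fa <<'EOF'
>A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA
MEDFVRQC
FNPMIVEL
>A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA-X
MEDFVRQC
EOF

# Step 1: flatten, so each record is one header line followed by one sequence line.
awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
     { seq = seq $0 }
     END { if (seq != "") print seq }' toy.fa > flat.fa

# Step 2: split headers on "|" and keep records whose last field is exactly "PA".
# The keep flag is set on each header line and carries over to the sequence line.
awk -F'|' -v want="PA" '
    /^>/ { keep = ($NF == want) }
    keep' flat.fa > PA.fa
```

Because the comparison is an exact string match on the last field, PA-X is not selected when want is PA. Changing want to another protein name selects that protein instead.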


Thanks for that. I couldn't get the grep -w "PA$" option to work, but seqkit split worked a treat!
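One possible explanation for grep -w "PA$" failing on a real file, which this thread does not confirm, is Windows (CRLF) line endings: a carriage return then sits between PA and the end of the line, so the $ anchor never matches. A small sketch of how to check for this, using a hypothetical file crlf.txt with shortened placeholder headers:

```shell
# Simulate a header file saved with Windows (CRLF) line endings.
printf 'x|PA\r\nx|PA-X\r\n' > crlf.txt

# The end anchor fails because a carriage return follows "PA" on the line.
# grep -c prints 0 here and exits non-zero, hence the "|| true".
grep -c '[|]PA$' crlf.txt || true

# Strip the carriage returns first; the anchored pattern then matches again.
tr -d '\r' < crlf.txt | grep '[|]PA$'
```

Trailing spaces after the protein name would cause the same symptom and can be checked in the same way.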
