So I have a fasta file that contains a list of influenza proteins (PB2, PB1, PB1-F2, PA-X, HA, NP, NA, M1, M2, NS1 and NS2). I'm trying to use grep to pull out the headers containing individual proteins e.g. grep "PB2" fastafile
This works fine for most of the proteins, but for PB1 and PA (grep "PB1" fastafile or grep "PA" fastafile) it returns not only the headers containing PB1 or PA but also the headers containing PB1-F2 and PA-X.
I've tried playing around with regexes (e.g. "PB1$") but that doesn't appear to solve the issue either.
Please post example input headers and expected output headers. In the absence of any data to process, I would suggest trying grep -w "PA" fastafile. But grep may not be sufficient for multi-line fasta.
For this job, I would suggest using seqkit, as below:
$ seqkit -w 0 grep -irp "\|PA$" test.fa
If you want a separate fasta file for each entry (PA, PA-X, PB1, PB1-F2), look at the following example code:
$ tree .
.
└── test.fa
0 directories, 1 file
$ seqkit -w 0 split -i --id-regexp ".*\|(.*)$" -2 test.fa -O out --quiet
[INFO] create FASTA index for test.fa
$ tree .
.
├── out
│ ├── test.id_PA.fasta
│ ├── test.id_PA-X.fasta
│ ├── test.id_PB1-F2.fasta
│ └── test.id_PB1.fasta
├── test.fa
└── test.fa.seqkit.fai
1 directory, 6 files
$ cd out
$ rename -n 's/test\.id_//g' *.fasta
'test.id_PA.fasta' would be renamed to 'PA.fasta'
'test.id_PA-X.fasta' would be renamed to 'PA-X.fasta'
'test.id_PB1-F2.fasta' would be renamed to 'PB1-F2.fasta'
'test.id_PB1.fasta' would be renamed to 'PB1.fasta'
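The rename call above uses -n, so it only reports what would be renamed. The rename command also differs between distributions, so here is a minimal sketch of the same prefix removal with a plain shell loop. The temporary directory and the empty placeholder files are made up for illustration; in practice you would run the loop inside the out directory.

```shell
# Hypothetical stand-ins for the files that seqkit split wrote to "out".
tmpdir=$(mktemp -d)
touch "$tmpdir/test.id_PA.fasta" "$tmpdir/test.id_PA-X.fasta" \
      "$tmpdir/test.id_PB1-F2.fasta" "$tmpdir/test.id_PB1.fasta"

# Strip the leading "test.id_" from each file name.
# ${f#test.id_} removes that prefix; "--" protects against odd file names.
(
    cd "$tmpdir" || exit 1
    for f in test.id_*.fasta; do
        mv -- "$f" "${f#test.id_}"
    done
)

ls "$tmpdir"
```

This leaves PA.fasta, PA-X.fasta, PB1-F2.fasta and PB1.fasta in the directory.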
You can also use awk for this, but the awk code is simpler if you first flatten the fasta file so that each record has a single sequence line.
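A rough sketch of the awk route, under some assumptions: the headers start with ">" and end in "|PROTEIN" as in the seqkit examples, and the file names, sequences and the target variable are made up for illustration. The first awk call flattens the file and the second keeps the records whose last "|"-separated field is exactly the requested protein.

```shell
# Hypothetical multi-line fasta; the sequences are invented.
tmpdir=$(mktemp -d)
cat > "$tmpdir/test.fa" <<'EOF'
>A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1
ATGC
GGTT
>A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1-F2
AAAA
CCCC
>A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA
CCGG
TTAA
>A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA-X
GGGG
TTTT
EOF

# Step 1: flatten, so each record is one header line plus one sequence line.
awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
     { seq = seq $0 }
     END { if (seq != "") print seq }' "$tmpdir/test.fa" > "$tmpdir/flat.fa"

# Step 2: split headers on "|" and keep a record only when the last field
# equals the target exactly, so "PA" does not pick up "PA-X".
awk -F'|' -v target="PA" '
    /^>/ { keep = ($NF == target) }
    keep' "$tmpdir/flat.fa" > "$tmpdir/PA.fa"

cat "$tmpdir/PA.fa"
```

Changing target to "PB1" gives the PB1 records without PB1-F2. Because the comparison is an exact string match on the last field, no regex anchoring is needed.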
Hi, thanks for getting back to me.
The headers look something like this:
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1-F2
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA-X
I want to be able to pull the PB1/PA headers separately to the PB1-F2/PA-X headers. If I try:
It returns both the PB1 and PB1-F2 header and the same for "PA". Any ideas?
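For what it's worth, here is a small sketch of why a whole-word match still returns both headers, and one pattern that should separate them. Assumptions: the headers start with ">", the file name and sequences are made up, and the tr step is only a guess at why "PB1$" failed (Windows line endings would leave a carriage return before the end of the line).

```shell
# Hypothetical sample file modelled on the headers posted above.
tmpdir=$(mktemp -d)
cat > "$tmpdir/test.fa" <<'EOF'
>A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1
ATGC
>A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1-F2
ATGC
>A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA
ATGC
>A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA-X
ATGC
EOF

# "-" is not a word character, so -w still sees "PA" as a whole word in "PA-X".
grep -c -w "PA" "$tmpdir/test.fa"    # prints 2

# Anchor on the literal "|" before the name and the end of line after it.
# "[|]" is used because "\|" means alternation in GNU grep's basic regex syntax.
# tr removes any carriage returns so that "$" can match (harmless if none exist).
pa_headers=$(tr -d '\r' < "$tmpdir/test.fa" | grep '[|]PA$')
pb1_headers=$(tr -d '\r' < "$tmpdir/test.fa" | grep '[|]PB1$')

printf '%s\n' "$pa_headers" "$pb1_headers"
```

This only pulls the header lines. To get the sequences as well, a fasta-aware tool such as the seqkit command above is the safer option.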