Fasta headers column split or selection
2.2 years ago

How can I extract a specific column from the sequence header identifiers of a FASTA file?

My headers look like this:

>PGM0100236.1 [Candida]  scaffold00238
>PGM0100236.1 [Candida]  scaffold00239
>PGM0100236.1 [Candida]  scaffold00240
>PGM0100236.1 [Candida]  scaffold00241

I would like to keep only the third column, i.e. scaffold00238, for every header in my FASTA file. Please suggest a simple command. I am new to bioinformatics and Linux scripting.

Thank you.

awk '{print $3}' input > output

This solution prints only the scaffold names and loses all other information.

What OP wants:

"I would like to take my third column alone i.e scaffold00238 for all the headers in my fasta file"


If your file contains only the headers and not the sequences, another easy solution is:

cat my_file | cut -f3 > my_new_filtered_file

If it does contain the sequences, then:

cat my_file | grep ">" | cut -f3 > my_new_filtered_file

This assumes that the delimiter between columns is a tab (\t). If it is a space, you need to set the delimiter explicitly with cut -d " " -f3
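A minimal sketch of how the delimiter choice plays out, assuming a hypothetical one-line file headers.txt with two spaces before the scaffold name (check the spacing in your own file). cut treats every single delimiter character as a field boundary, so consecutive spaces yield an empty field, while awk's default splitting collapses runs of spaces and tabs.

```shell
# Hypothetical sample header with TWO spaces before the scaffold name
# (an assumption for illustration only).
tmpdir=$(mktemp -d)
printf '>PGM0100236.1 [Candida]  scaffold00238\n' > "$tmpdir/headers.txt"

# cut counts each single space as a delimiter, so field 3 is empty here
# and the scaffold name lands in field 4.
cut_f3=$(cut -d ' ' -f3 "$tmpdir/headers.txt")
cut_f4=$(cut -d ' ' -f4 "$tmpdir/headers.txt")

# awk splits on runs of whitespace by default, so $3 is the scaffold name
# no matter how many spaces or tabs separate the columns.
awk_f3=$(awk '{print $3}' "$tmpdir/headers.txt")

printf 'cut field 3: [%s]\n' "$cut_f3"
printf 'cut field 4: [%s]\n' "$cut_f4"
printf 'awk field 3: [%s]\n' "$awk_f3"

rm -r "$tmpdir"
```

If the columns in your file are separated by exactly one space or one tab, cut and awk agree; the difference only shows up with repeated or mixed separators.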


Neither of these solutions does what OP wants, as far as I can tell.

OP wants to replace each header of a multi-fasta file with a single word taken from that header.


palani: Please confirm that you want to change

>PGM0100236.1 [Candida] scaffold00238
AGCATCG

to

>scaffold00238
AGCATCG

Yes, exactly like that. Thanks for all the responses. This is my first time on Biostars. I am grateful for all the suggestions. Thank you all.


Thank you all for your suggestions; I will try them. I am grateful for all your support.

2.2 years ago
antmantras ▴ 80

Edit: Apologies, I thought OP wanted only the names of the scaffolds. In that case, a solution could be:

awk '/^>/{$0=">"$NF}1' myfile.fasta > output.fasta

This replaces each fasta header with ">" followed by its last whitespace-separated field and leaves the sequence lines unchanged.
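A small worked run of that command, as a sketch: the two-record file below is made up for illustration (the second sequence, TTGACCA, is hypothetical), and the temporary directory is only there to keep the example self-contained.

```shell
# Build a tiny multi-fasta file (sample data; the second sequence is invented).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '>PGM0100236.1 [Candida]  scaffold00238' \
  'AGCATCG' \
  '>PGM0100236.1 [Candida]  scaffold00239' \
  'TTGACCA' > "$tmpdir/myfile.fasta"

# /^>/ matches header lines; $0=">"$NF overwrites the line with ">" plus
# its last field. The trailing 1 is an always-true pattern whose default
# action prints every line, so sequence lines pass through untouched.
awk '/^>/{$0=">"$NF}1' "$tmpdir/myfile.fasta" > "$tmpdir/output.fasta"

result=$(cat "$tmpdir/output.fasta")
printf '%s\n' "$result"

rm -r "$tmpdir"
```

If some headers carry extra columns after the scaffold name, using $3 in place of $NF selects the third column explicitly rather than the last one.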


Congratulations, 2/3 of your commands qualify for the UUOC award!


Yeah, I know it can be written with:

grep ">" myfile.fasta | awk '{print $3}' > output.txt

if one is only looking for the names in the third column. However, I think that starting with cat makes it easier for someone new to Unix to understand what is going on in the command sequence. Anyway, since that is not what OP wanted, I removed that part.
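For comparison, a sketch showing that the grep-plus-awk pipeline can also be collapsed into a single awk call, since awk can select the header lines itself. The sample file is hypothetical and exists only to make the example runnable.

```shell
# Hypothetical two-record fasta file for illustration.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '>PGM0100236.1 [Candida]  scaffold00238' \
  'AGCATCG' \
  '>PGM0100236.1 [Candida]  scaffold00239' \
  'AGCATCG' > "$tmpdir/myfile.fasta"

# Two-process version from the thread: grep keeps header lines,
# awk prints their third column.
piped=$(grep ">" "$tmpdir/myfile.fasta" | awk '{print $3}')

# Single-process version: the /^>/ pattern restricts the print action
# to lines that start with ">".
single=$(awk '/^>/{print $3}' "$tmpdir/myfile.fasta")

printf '%s\n' "$single"

rm -r "$tmpdir"
```

Both variants produce the same list of scaffold names; which one reads more clearly is a matter of taste.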


That's a good reason to use cat where it's not required (as the Wiki page says). I also use it when I'm "building" a piped command sequence: I often start out with head file | ... and then go back to the working command and replace head with cat. Here on the forum, though, you can skip the cat. Ultimately, people should learn better ways of using commands, and while we don't need to be Perl-like in complexity, we can avoid over-simplification as well.

2.2 years ago

A seqkit answer.

seqkit replace -p ".+(scaffold[0-9]+$)" -r "\$1" file.fasta