Removing part of the FASTA header from a multifasta file
10.4 years ago
2kg2523 ▴ 10

I am trying to delete some strings from the sequence headers of a multifasta file. I want to remove len= and path=[...] so that only the sequence identifier and the length remain. The headers currently look like this:

>comp2_c0_seq1 len=589 path=[1:0-588]
>comp2_c1_seq1 len=352 path=[1462:0-351]

What I want is the following, in two columns:

>comp2_c0_seq1 589
>comp2_c1_seq1 352

Thank you very much

sequence rna-seq • 9.6k views
10.4 years ago
Juke34 8.9k

Hey,

If you are on a Mac or Linux, you can do that with a bash one-liner:

while IFS= read -r line || [[ -n $line ]]; do if [[ $line == ">"* ]]; then id=${line%% *}; len=${line#*len=}; len=${len%% *}; printf '%s\n' "$id $len"; else printf '%s\n' "$line"; fi; done < YOURFILE > outputFile

Just replace YOURFILE with the name of the multifasta file that you want to modify. The result will be written to outputFile.

This is not the most efficient approach, but it should work.

sed -e '/^>/s/len= //' -e '/^>/s/path.*//'

Thank you for the recommendation, but it does not work; it gave me an error message. Here is what I did:

sed -e '^>/s/len= //' -e '^>/s/path.*//' MultiFasta.txt > OutLength.txt

The error message:

sed: -e expression #1, char 1: unknown command: `^'

Where did I go wrong? Thank you again.


I didn't even check my sed expression. I forgot the '/' prefix before '^': the address has to be written as /^>/.


It worked with the following modifications.

The original, which does not work:

sed -e '^>/s/len= //' -e '^>/s/path.*//' MultiFasta.txt > OutLength.txt

I deleted ^> and the space after len=, and added \ after s/. Note that the second expression needs its closing ' as well. The working syntax is the following:

sed -e 's/\len=//' -e 's/\path.*//' MultiFasta.txt > OutLength.txt
10.4 years ago

I like Pierre's solution using sed. Here is a Python solution:

You can install pyfaidx using "pip install --user pyfaidx"
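
The code that belonged to this answer does not appear above. As a rough stand-in, and explicitly not the pyfaidx-based solution the answer refers to, the same header rewrite can be sketched with only the Python standard library. The names simplify_header and simplify_fasta are made up for illustration, and the sketch assumes every header has the form ">id len=N path=[...]" shown in the question.

```python
import re

# Matches headers such as ">comp2_c0_seq1 len=589 path=[1:0-588]".
# Assumption: the identifier contains no spaces and len= is followed by digits.
HEADER_RE = re.compile(r"^(>\S+)\s+len=(\d+)\b.*$")


def simplify_header(line):
    """Return '>id length' for a matching header; return other lines unchanged."""
    match = HEADER_RE.match(line)
    if match is None:
        return line
    return f"{match.group(1)} {match.group(2)}"


def simplify_fasta(lines):
    """Yield each line of a multi-FASTA input with its header simplified.

    `lines` can be any iterable of strings, for example an open file handle.
    Trailing newlines are stripped, so the caller adds them back when writing.
    """
    for line in lines:
        yield simplify_header(line.rstrip("\n"))


# Small in-memory demonstration using the headers from the question.
example = [
    ">comp2_c0_seq1 len=589 path=[1:0-588]\n",
    "ACGTACGT\n",
    ">comp2_c1_seq1 len=352 path=[1462:0-351]\n",
    "TTGACA\n",
]
for out_line in simplify_fasta(example):
    print(out_line)
```

To process a real file, open it and pass the file handle to simplify_fasta, writing each yielded line followed by a newline to the output file. Lines that do not match the assumed header pattern, including sequence lines, pass through untouched.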
