Extracting all the headers in a FASTA file using R
5.6 years ago
xuenanwang • 0

Hi, I want to extract all the headers from my fasta file. Here is my example:

>Eukaryota;Alveolata;Dinoflagellata;Dinophyceae;Peridiniales;Kryptoperidiniaceae;Unruhdinium;Unruhdinium_kevei;
ATGCTTGTCTCAAAGATTAAGCCA......

All I want is to extract the lines starting with ">", split each name (the text before each ";") into separate columns, and write them to a CSV file.

I don't really know how to do this, and I really need some help!

R sequence • 3.8k views
5.6 years ago
ggman ▴ 90
grep "^>" <filename> | sed 's/;/,/g' > <newfilename>

Command-line answer. grep selects the lines that start with ">", and sed substitutes each ";" with a comma, creating new columns. The final ">" redirects the results to the new file name you indicated.
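To see what each stage contributes, here is a minimal sketch run on a throwaway file. The file names example.fa and headers.csv are placeholders made up for illustration, and the shortened headers and sequences are invented test data, not the OP's file.

```shell
# Build a tiny two-record FASTA file (hypothetical content, for illustration only).
printf '%s\n' \
  '>Eukaryota;Alveolata;Dinoflagellata;' \
  'ATGCTTGTCTCAAAGATTAAGCCA' \
  '>Eukaryota;Alveolata;Ciliophora;' \
  'ATGCATGCATGC' > example.fa

# Stage 1: keep only the lines that begin with ">" (the headers).
grep '^>' example.fa

# Stage 2: turn every ";" into "," and redirect the result to a new file.
grep '^>' example.fa | sed 's/;/,/g' > headers.csv
cat headers.csv
```

With this input, each line of headers.csv still starts with ">" and ends with a comma, because the headers end with ";".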

OP wants comma-separated output. You may want to amend your solution accordingly.

Yikes, amended to be CSV

This leaves the initial > in. If that is not wanted, it can be removed by extending the solution above.

$ grep "^>" <filename> | sed -e 's/>//' -e 's/;/,/g' > <new_file>
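Since the example header ends with ";", the converted line ends with a comma, which most CSV readers treat as an extra empty column. A possible variation, assuming you want that last separator dropped, adds one more sed expression. The file names here are placeholders.

```shell
# One-record test file with a trailing ";" in the header (made-up data).
printf '%s\n' '>Eukaryota;Alveolata;Dinoflagellata;' 'ATGC' > example_trailing.fa

# Remove the leading ">", then the trailing ";", then convert the rest to commas.
grep '^>' example_trailing.fa \
  | sed -e 's/^>//' -e 's/;$//' -e 's/;/,/g' > headers_clean.csv
cat headers_clean.csv
```

The order matters: the trailing ";" has to go before the global substitution, otherwise it has already become a comma.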
5.6 years ago
GenoMax 147k

Not a solution in R, but you can simply do

$ grep "^>" your_file.fa | awk -F ">|;" '{for(i=2;i<NF;i++){printf "%s,", $i}; printf "\n"}'
Eukaryota,Alveolata,Dinoflagellata,Dinophyceae,Peridiniales,Kryptoperidiniaceae,Unruhdinium,Unruhdinium_kevei,
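If the headers do not all have the same number of names, the rows of the CSV will have different lengths. One way to handle that, sketched here as an assumption about what you might want rather than part of the answer above, is to read the file twice with awk and pad short rows with empty fields. The file names ragged.fa and ragged.csv and the test records are made up.

```shell
# Test file whose headers have 3, 2 and 1 fields (made-up data).
printf '%s\n' \
  '>seq0;x1;y1' \
  'FQTWEEFSRAAEKLYLADPMKV' \
  '>seq1;x22' \
  'KYRTWEEFTRAAEKLYQADPMK' \
  '>seq2' \
  'EEYQTWEEFARAAEKLYLTDPM' > ragged.fa

awk -F';' '
  # Pass 1 (NR == FNR holds only while reading the first copy of the file):
  # record the largest number of fields found in any header.
  NR == FNR {
    if (/^>/) {
      n = NF
      if ($NF == "") n--      # a trailing ";" creates one empty last field
      if (n > max) max = n
    }
    next
  }
  # Pass 2: print every header as exactly max comma-separated fields.
  /^>/ {
    sub(/^>/, "")             # drop the leading ">" (awk re-splits the fields)
    for (i = 1; i <= max; i++)
      printf "%s%s", $i, (i < max ? "," : "\n")
  }
' ragged.fa ragged.fa > ragged.csv
cat ragged.csv
```

The file is passed twice on purpose, so the widest header is known before any row is printed.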
5.6 years ago

sed

 sed '/^>/s/;/\t/g;/^[^>]/d;s/^>//' in.fasta

OP wants comma-separated output, so

$ sed '/^>/s/;/,/g;/^[^>]/d;s/^>//' < in.fasta > out.header
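One caveat: /^[^>]/d only deletes lines that contain at least one character, so any blank lines in the FASTA file pass through to the output. A sketch that avoids this, assuming blank lines should be dropped, uses sed -n and prints header lines only. The file names and records are made up.

```shell
# Test file with a blank line between the records (made-up data).
printf '%s\n' '>seq0;x1;y1' 'FQTWEEF' '' '>seq1;x22' 'KYRTWEEF' > blank.fa

# -n turns off automatic printing; only lines starting with ">" are
# edited (strip ">", turn ";" into ",") and then printed with p.
sed -n '/^>/{s/^>//;s/;/,/g;p;}' blank.fa > blank.csv
cat blank.csv
```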
5.6 years ago
zx8754 12k

Good bash solutions; they could be wrapped inside R, for example as below:

library(data.table)
x <- fread(cmd = "grep ... myFilename.fasta")

Or do all within R:

#example input fasta
x <- read.table(text = "
>seq0;x1;y1
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1;x22
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
", sep = ";", fill = TRUE, header = FALSE)

# keep only header rows
x <- x[ grep("^>", x$V1),  ]

# remove ">"
x$V1 <- gsub(">", "", x$V1, fixed = TRUE)

# output
write.csv(x, "myFile.csv")

myFile.csv

"","V1","V2","V3"
"1","seq0","x1","y1"
"3","seq1","x22",""
"6","seq2","",""