Question

Extracting all the headers in a FASTA file using R

6.3 years ago

xuenanwang • 0

Hi, I want to extract all the headers from my fasta file. Here is my example:

>Eukaryota;Alveolata;Dinoflagellata;Dinophyceae;Peridiniales;Kryptoperidiniaceae;Unruhdinium;Unruhdinium_kevei;
ATGCTTGTCTCAAAGATTAAGCCA......

All I want is to extract each line starting with ">", split it into the names separated by ";" so that each name goes into its own column, and write the result to a CSV file.

I don't really know how to do this, and I really need some help!

R sequence • 4.3k views

Updated 6.3 years ago by zx8754 12k • written 6.3 years ago by xuenanwang • 0

score 1 · Answer 1 · 2019-04-19

6.3 years ago

ggman ▴ 90

grep "^>" <filename> | sed 's/;/,/g' > <newfilename>

Command line answer. grep keeps only the lines that start with ">", and sed substitutes every ";" with a comma, creating the CSV columns. The final ">" redirects the results to the new file name you indicated.
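Since the question asks for R, the same two steps (keep the header lines, then swap the separator) can be sketched in base R. This is only a minimal sketch under stated assumptions: the file names `in.fasta` and `headers.csv` are placeholders, and a small demo file is written first so the snippet runs on its own.

```r
# Demo input; the path and file name are hypothetical placeholders
in_file <- file.path(tempdir(), "in.fasta")
writeLines(c(">seq0;x1;y1", "FQTWEEFSRAAEKLYLADPMKV",
             ">seq1;x22",   "KYRTWEEFTRAAEKLYQADPMKV"), in_file)

lines   <- readLines(in_file)
headers <- grep("^>", lines, value = TRUE)        # like: grep "^>"
headers <- gsub(";", ",", headers, fixed = TRUE)  # like: sed 's/;/,/g'

# Write one comma-separated header per line
out_file <- file.path(tempdir(), "headers.csv")
writeLines(headers, out_file)
```

As with the shell version, the leading ">" is kept; `sub("^>", "", headers)` would drop it before writing.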

OP wants comma-separated output. You may want to amend your solution accordingly.

Reply • 6.3 years ago by GenoMax 153k

Yikes, amended to be CSV

Reply • 6.3 years ago by ggman ▴ 90

This leaves the initial > in. If that is not wanted, it can be removed by extending the solution above:

$ grep "^>" <filename> | sed -e 's/>//' -e 's/;/,/g' > <new_file>

Reply • 6.3 years ago by GenoMax 153k

score 1 · Answer 2 · 2019-04-19

6.3 years ago

GenoMax 153k

Not a solution in R, but you can simply do:

$ grep "^>" your_file.fa | awk -F ">|;" '{for(i=2;i<NF;i++){printf "%s,", $i}; printf "\n"}'
Eukaryota,Alveolata,Dinoflagellata,Dinophyceae,Peridiniales,Kryptoperidiniaceae,Unruhdinium,Unruhdinium_kevei,

score 0 · Answer 3 · 2019-04-19

6.3 years ago

Pierre Lindenbaum 166k

sed

 sed '/^>/s/;/\t/g;/^[^>]/d;s/^>//' in.fasta

OP wants comma-separated output, so:

$ sed '/^>/s/;/,/g;/^[^>]/d;s/^>//' < in.fasta > out.header

Reply • 6.3 years ago by GenoMax 153k

score 0 · Answer 4 · 2019-04-23

Good bash solutions; they could be wrapped inside R, for example as below:

library(data.table)
# cmd = tells fread to run the shell command and parse its output
x <- fread(cmd = "grep ... myFilename.fasta")

Or do all within R:

#example input fasta
x <- read.table(text = "
>seq0;x1;y1
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1;x22
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
", sep = ";", fill = TRUE, header = FALSE)

# keep only header rows
x <- x[ grep("^>", x$V1),  ]

# remove ">"
x$V1 <- gsub(">", "", x$V1, fixed = TRUE)

# output
write.csv(x, "myFile.csv")

myFile.csv

"","V1","V2","V3"
"1","seq0","x1","y1"
"3","seq1","x22",""
"6","seq2","",""
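For a real file, where headers can have different numbers of fields (and, as in the question's example, a trailing ";"), splitting the header lines directly avoids relying on `read.table` to infer the column count. The following is a minimal base-R sketch, assuming the whole file fits in memory. The file names and the column names `V1`, `V2`, ... are illustrative choices, and `row.names = FALSE` is used so the CSV has no index column.

```r
# Demo input; the path and file name are hypothetical placeholders
fasta <- file.path(tempdir(), "example.fasta")
writeLines(c(">Eukaryota;Alveolata;Dinoflagellata;", "ATGCTTGTCTCAAAGATTAAGCCA",
             ">seq1;x22", "KYRTWEEF",
             ">seq2",     "EEYQTWEEF"), fasta)

lines   <- readLines(fasta)
headers <- sub("^>", "", grep("^>", lines, value = TRUE))  # header lines, ">" removed

# Split on ";" (strsplit drops the empty field after a trailing ";")
fields <- strsplit(headers, ";", fixed = TRUE)

# Pad every header to the same number of fields, then bind into a table
n_col  <- max(lengths(fields))
padded <- lapply(fields, function(f) c(f, rep("", n_col - length(f))))
out    <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(out) <- paste0("V", seq_len(n_col))

# One row per header, one column per name
csv <- file.path(tempdir(), "headers.csv")
write.csv(out, csv, row.names = FALSE)
```

Padding with empty strings keeps short headers aligned to the left-most columns, which matches how `fill = TRUE` behaves in the `read.table` example.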