Question

Extract sequences from fastA files based on csv chart in R

0

Entering edit mode

5.4 years ago

shelley.w.peterson ▴ 10

I have 2 fasta files - we'll call them A and B, each with several thousand sequences labelled A1, A2, A3 or B1, B2, B3 etc.

I also have a .csv file that has 3 columns:

ST1   A1   B1
ST2   A1   B2
ST3   A3   B1

I need to combine these all into a single fasta file that looks like this:

ST1
"A1DNASEQUENCE""B1DNASEQUENCE"
ST2
"A1DNASEQUENCE""B2DNASEQUENCE"
ST3
"A3DNASEQUENCE""B1DNASEQUENCE"

I'm not even sure where to start with this one... anyone have any advice?

Thanks in advance!

R sequencing • 2.3k views

ADD COMMENT • link 5.4 years ago by shelley.w.peterson ▴ 10

0

Entering edit mode

I guess that should be solvable by writing a script in biopython utilising the seqIO library. Alternatively, you can try it with command line tools like awk, cut, fasgrep, and join to loop over you CSV file.

ADD REPLY • link 5.4 years ago by michael.ante ★ 4.0k

0

Entering edit mode

I realized I forgot to state that I specifically need this for R. I've actually already written this for perl and it works great, but I'm trying to rewrite it for R for some of my labmates who have trouble with command line programs. The ultimate goal is to rewrite all of my perl scripts to be used with R SHINY, but I'm not familiar enough with R yet to know all of the available methods of data manipulation and what I did in perl won't work in R.

ADD REPLY • link 5.4 years ago by shelley.w.peterson ▴ 10

0

Entering edit mode

Apparently, there is an R-package for reading and accessing fasta files seqinr.

Once having the fasta as object, you it should be easy to apply your Perl routines.

ADD REPLY • link 5.4 years ago by michael.ante ★ 4.0k

score 1 · Answer 1 · 2019-12-19

A way by shell command, the idea is first get the ST index for each AX and BX sequence, then concatenate them according to the ST index.

test1.fa:

>A1
A1SEQUENCE
>A2
A2SEQUENCE
>A3
A3SEQUENCE

test2.fa:

>B1
B1SEQUENCE
>B2
B2SEQUENCE
>B3
B3SEQUENCE

test.csv (tab-delimited):

ST1 A1  B2
ST2 A2  B1
ST3 A2  B2
ST4 A3  B3

script:

# Join each fasta file with csv file to get the "ST" index of each sequence
join -t $'\t' -1 1 -2 2 -o 2.1,1.2 <(awk 'NR%2  {sub(/>/, "", $0); printf "%s\t", $0; next}1' test1.fa | sort -k1) <(sort -k2 test.csv) | sort -k3 > join.intermediate1
join -t $'\t' -1 1 -2 3  -o 2.1,1.2 <(awk 'NR%2  {sub(/>/, "", $0); printf "%s\t", $0; next}1' test2.fa | sort -k1) <(sort -k3 test.csv)  | sort -k3 > join.intermediate2
# Join the sequence with the same "ST" index
join -t $'\t' -j 1 join.intermediate1  join.intermediate2 | sed 's/^/>/' |sed 's/\t/\n/' | sed 's/\t//' > output.fa
# rm intermediate file
rm join.intermediate1 join.intermediate2

output.fa:

>ST1
A1SEQUENCEB2SEQUENCE
>ST2
A2SEQUENCEB1SEQUENCE
>ST3
A2SEQUENCEB2SEQUENCE
>ST4
A3SEQUENCEB3SEQUENCE

In addition, I agree with @michael.ante that you can also do this by python, for example, you can generate a dictionary with name and sequence as key and value, then concatenate the corresponding AX BX sequence for each STX. And it will be more readable, you will quickly forget what you did if you use shell like the above