Question

Extract sequences consecutively from a large fasta files and save in new fasta files

2

Entering edit mode

5.1 years ago

Kumar ▴ 120

I would like to extract sequences consecutively from a large fasta files and save in new fasta files. For example, I have two fasta files namely species1.fasta and species2.fasta as given below,

species1.fasta
>tag1 
TGCAAGAG
>amylase 
ATCGAGAT

species2.fasta
>prot8
ATGTTTGCGA
>lip1
TGATATAGAT

and I need to extract first positioned sequences from both species1.fasta and species2.fasta and save in new file namely group1.fasta. Same manner, the second positioned sequences to be extracted and saved in group2.fasta, and the expected output should be like this,

group1.fasta
>tag1 
TGCAAGAG
>prot8
ATGTTTGCGA

group2.fasta
>amylase 
ATCGAGAT
>lip1
TGATATAGAT

Like this I have to do for 3000 sequences, manual process is impossible, therefore, I have tried the following script, wherein I could not loop the process. Therefore, please help me in this regard.

my script
for file in *.fasta do cat "$file" \   | awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} \  {printf("%s",$0);} \
 END {printf("\n");}' \   | awk '{if ( NR==1 ) print $0;}' \   | awk '{print $1"\n"$2}' | uniq > "new_${file}"

Thank you.

python perl shell r fasta • 2.3k views

ADD COMMENT • link updated 5.1 years ago by Chirag Parsania ★ 2.0k • written 5.1 years ago by Kumar ▴ 120

4

Entering edit mode

make a directory "output" in the same directory where all fasta files are located.
run following command:

$ parallel seqkit range -r {}:{} *.fasta -o output/group_{}.fasta ::: $(seq 1 $(grep -c ">" species1.fasta))

do not output fasta files in the same directory. Download seqkit tool from here: https://github.com/shenwei356/seqkit.

If you want to be careful, dry-run parallel command.

ADD REPLY • link 5.1 years ago by cpad0112 21k

1

Entering edit mode

Thank you @cpad0112. However, it shows error as follows,

ga@pls:~/Documents/data/xap95/ortho_cluster/orthogenes_dna/test$./seqkit range -r {}:{} *.fasta -o output/group_{}.fasta ::: $(seq 1 $(grep -c ">" species1.fasta))
[ERRO] invalid range: {}:{}. type "seqkit range -h" for more examples

ADD REPLY • link 5.1 years ago by Kumar ▴ 120

2

Entering edit mode

@ Dineshkumar K

I guess you forgot to use parallel. Try installing parallel and use the below function:

 parallel  'seqkit range -r {}:{} *.fasta -o output/group_{}.fasta' ::: $(seq 1 $(grep -c ">" species1.fasta))

ADD REPLY • link 5.1 years ago by cpad0112 21k

0

Entering edit mode

@ cpad0112 I created a output folder and then ran the command as you mentioned, but there is no file has been created in output directory. The terminal printed as follows,

ga@pls:~/Desktop/int/tolc$ parallel --dry-run 'seqkit range -r {}:{} *.fasta -o output/group_{}.fasta' ::: $(seq 1 $(grep -c ">" species1.fasta))

Academic tradition requires you to cite works you base your article on. When using programs that use GNU Parallel to process data for publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT. If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence this citation notice: run 'parallel --citation'.

seqkit range -r 1:1 *.fasta -o output/group_1.fasta seqkit range -r 2:2 *.fasta -o output/group_2.fasta

I think I have not properly installed seqkit, even I kept seqkit executive file in the concerned folder where the fasta files lies, despite it is not working.

ADD REPLY • link 5.1 years ago by Kumar ▴ 120

1

Entering edit mode

why did you append --dry-run to @cpad0112 s code? Could you retry running the command precisely as it was specified

ADD REPLY • link 5.1 years ago by russhh 5.7k

0

Entering edit mode

Thank you so much @cpad0112 and @russhh. It works awesome. I ran the following command,

ga@pls:~/Desktop/int/tolc$ parallel  './seqkit range -r {}:{} *.fasta -o output/group_{}.fasta' ::: $(seq 1 $(grep -c ">" species1.fasta))

and successfully got results in the output folder, The output is given below,

group1.fasta
>tag1 
TGCAAGAG
>prot8
ATGTTTGCGA

group2.fasta
>amylase 
ATCGAGAT
>lip1
TGATATAGAT

ADD REPLY • link 5.1 years ago by Kumar ▴ 120

0

Entering edit mode

Can you be sure there will be the same number of sequences in the two input fasta files each time you run this? Which of the languages that you've labelled this question do you have the most experience with?

ADD REPLY • link 5.1 years ago by russhh 5.7k

0

Entering edit mode

@russhh, Yes, I have the same number of sequences in the two fasta file ( In reality, I have 30 fasta files and each fasta file consist of 3000 sequences in it. I need to extract position wise 1,2,3...3000 sequences from all the 30 fasta file and save in new files like group1.fasta, group2.fasta.....group3000.fasta). I am not having much exposure in progromming, but I googled many shell scripts and tried to compile and serve my purpose and ultimately failed. My shell script can extract sequences one after another from the 30 fasta files by changing the sequence position each time in (NR==1), But, manually I can not do this for 3000 sequence positions. Moreover, I need to save the same in new file consecutively, that is why I seek help in this regard.

ADD REPLY • link 5.1 years ago by Kumar ▴ 120

0

Entering edit mode

It sounds like you understand the processes involved in doing this, but your code is way too complicated. Have a look at the biopython SeqIO webpage: https://biopython.org/wiki/SeqIO .

ADD REPLY • link 5.1 years ago by russhh 5.7k

0

Entering edit mode

Thank you russhh, for your suggestion. I would try to do the same.

ADD REPLY • link 5.1 years ago by Kumar ▴ 120

1

Entering edit mode

5.1 years ago

Chirag Parsania ★ 2.0k

R way

sp1_fa  <- Biostrings::readBStringSet("species1.fasta")

sp2_fa  <- Biostrings::readBStringSet("species2.fasta")


purrr::map(seq_along(sp1_fa) ,  ~  Biostrings::writeXStringSet( filepath = paste("group" ,..1,".fasta",sep = "") ,
                                                                c(sp1_fa[..1] , sp2_fa[..1]) ) )

^{Created on 2019-11-14 by the reprex package (v0.3.0)}

edit : general solution, can be applied to regardless of file numbers

fasta_files <- list.files(path = "." ,pattern = "*.fasta" ,full.names = T)
n_files = length(seq_list)

seq_list <- fasta_files %>% 
        tibble::enframe() %>% 
        dplyr::mutate(seqs = purrr::map(value, ~ Biostrings::readBStringSet(..1))) %>%
        dplyr::pull(seqs)


for(i in c(1:n_files)){

        seq_list %>% 
                purrr::map( ~ purrr::pluck(..1,function(x)x[i])) %>%  
                Biostrings::BStringSetList() %>% 
                unlist() %>% Biostrings::writeXStringSet(paste("group" ,i,".fasta",sep = ""))

}

ADD COMMENT • link 5.1 years ago by Chirag Parsania ★ 2.0k

score 2 · Accepted Answer · 2019-11-11

A fairly naive approach (code not tested as I'm not at a computer to do so at the moment, so it will likely be wrong in some places, but the general approach should work)

from Bio import SeqIO
import sys

for file in sys.argv[1:]:
    for i, rec in enumerate(SeqIO.parse(file, 'fasta')):
        file  = open('./sequences_' + str(i), 'a')
        file.write(rec.format('fasta'))
        file.close()

This should iterate all records in all files, and write a new file with the number. If the file doesnt yet exist, it will create it, if it does, it will append to it.

Run as:

python scriptname.py /path/to/files/*.fasta