I would like to extract sequences consecutively from a large fasta files and save in new fasta files. For example, I have two fasta files namely species1.fasta and species2.fasta as given below,
species1.fasta
>tag1
TGCAAGAG
>amylase
ATCGAGAT
species2.fasta
>prot8
ATGTTTGCGA
>lip1
TGATATAGAT
and I need to extract first positioned sequences from both species1.fasta
and species2.fasta
and save in new file namely group1.fasta
. Same manner, the second positioned sequences to be extracted and saved in group2.fasta
, and the expected output
should be like this,
group1.fasta
>tag1
TGCAAGAG
>prot8
ATGTTTGCGA
group2.fasta
>amylase
ATCGAGAT
>lip1
TGATATAGAT
Like this I have to do for 3000 sequences, manual process is impossible, therefore, I have tried the following script, wherein I could not loop the process. Therefore, please help me in this regard.
my script
for file in *.fasta do cat "$file" \ | awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} \ {printf("%s",$0);} \
END {printf("\n");}' \ | awk '{if ( NR==1 ) print $0;}' \ | awk '{print $1"\n"$2}' | uniq > "new_${file}"
Thank you.
run following command:
$ parallel seqkit range -r {}:{} *.fasta -o output/group_{}.fasta ::: $(seq 1 $(grep -c ">" species1.fasta))
do not output fasta files in the same directory. Download seqkit tool from here: https://github.com/shenwei356/seqkit.
If you want to be careful, dry-run parallel command.
Thank you @cpad0112. However, it shows error as follows,
@ Dineshkumar K
I guess you forgot to use parallel. Try installing parallel and use the below function:
@ cpad0112 I created a
output
folder and then ran the command as you mentioned, but there is no file has been created inoutput
directory. The terminal printed as follows,Academic tradition requires you to cite works you base your article on. When using programs that use GNU Parallel to process data for publication please cite:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.
This helps funding further development; AND IT WON'T COST YOU A CENT. If you pay 10000 EUR you should feel free to use GNU Parallel without citing.
To silence this citation notice: run 'parallel --citation'.
seqkit range -r 1:1 *.fasta -o output/group_1.fasta seqkit range -r 2:2 *.fasta -o output/group_2.fasta
I think I have not properly installed
seqkit
, even I keptseqkit
executive file in the concerned folder where the fasta files lies, despite it is not working.why did you append
--dry-run
to @cpad0112 s code? Could you retry running the command precisely as it was specifiedThank you so much
@cpad0112
and @russhh. It works awesome. I ran the following command,and successfully got results in the output folder, The output is given below,
Can you be sure there will be the same number of sequences in the two input fasta files each time you run this? Which of the languages that you've labelled this question do you have the most experience with?
@russhh, Yes, I have the same number of sequences in the two fasta file ( In reality, I have 30 fasta files and each fasta file consist of 3000 sequences in it. I need to extract position wise 1,2,3...3000 sequences from all the 30 fasta file and save in new files like group1.fasta, group2.fasta.....group3000.fasta). I am not having much exposure in progromming, but I googled many shell scripts and tried to compile and serve my purpose and ultimately failed. My shell script can extract sequences one after another from the 30 fasta files by changing the sequence position each time in
(NR==1)
, But, manually I can not do this for 3000 sequence positions. Moreover, I need to save the same in new file consecutively, that is why I seek help in this regard.It sounds like you understand the processes involved in doing this, but your code is way too complicated. Have a look at the biopython SeqIO webpage: https://biopython.org/wiki/SeqIO .
Thank you russhh, for your suggestion. I would try to do the same.