randomly sample sequences of fasta file per sample name
3
0
Entering edit mode
6.6 years ago
nmkn ▴ 10

I have a task that I'm sure has been done before but I can't find a simple solution. Given a fasta file, with multiple sequences per species/individual name, I want to randomly sample one sequence per species/individual. Not each species/individual will always have the same number of sequences.

I've found a few similar posts (https://www.biostars.org/p/18831/), but they randomly sample a subset based on a given percentage. I think I want to do something similar to this post, but that was never fully answered: Help with randomly sampling from a fasta file.

Here's my example:

>SpeciesA.seq0   CCACTTTA......
>SpeciesA.seq1   CCTCTTTA......
>SpeciesA.seq2   CCGCTTTA......
>SpeciesA.seq3   CCACTTTA......
>SpeciesB.seq0   GCCCTTTA......
>SpeciesB.seq1   GCCCTTTA.....
>SpeciesB.seq2   ACCCTTTA.....
>SpeciesB.seq3   GCCCTTTA.....
>SpeciesC.seq0   GCCCTTTA.....
>SpeciesC.seq1   GCCCTTTA.....

I want to randomly select one sequence per species/individual so the output will look like this:

>SpeciesA.seq3   CCACTTTA......
>SpeciesB.seq1   GCCCTTTA.....
>SpeciesC.seq1   GCCCTTTA.....

Ideally, the resulting sequences will be concatenated/pasted into a new file with the same name as the starting input fasta file (which corresponds to a gene name). I only have limited bash experience so help would be greatly appreciated!

fasta random sequence • 3.9k views
ADD COMMENT
0
Entering edit mode

If an answer below was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.
Upvote|Bookmark|Accept

ADD REPLY
3
Entering edit mode
6.6 years ago
JC 13k

Perl:

#!/usr/bin/perl

use strict;
use warnings;

my %seqs =();

$/="\n>";
while (<>) {
    s/>//g;
    my ($name, @seq) = split (/\n/, $_);
    my ($sp, $id) = split (/\./, $name);
    push @{ $seqs{$sp} }, join "\n", ">$name", @seq;
}

foreach my $sp (keys %seqs) {
    print $seqs{$sp}[ int( rand @{ $seqs{$sp} }) ], "\n";
}

Examples:

perl rand.pl < seqs.fa
>SpeciesC.seq0
GCCCTTTA.....
>SpeciesA.seq0
CCACTTTA......
>SpeciesB.seq1
GCCCTTTA.....
$ perl rand.pl < seqs.fa
>SpeciesB.seq0
GCCCTTTA......
>SpeciesC.seq1
GCCCTTTA.....
>SpeciesA.seq1
CCTCTTTA......

Explanation: use a hash of arrays, each species collects their sequences in an array, then just iterate over species getting one random value of the array length.

ADD COMMENT
0
Entering edit mode

Thank you very much! This perl option works well and I think will be the most efficient for many files. Just a follow up - how can I get this to work as a loop since I have hundreds of fasta files? Sorry if that wasn't clear in the original post. I'm having issues:

for file in *.fasta; do (perl rand.pl<"$file")>"$file_new"; done

Thanks

ADD REPLY
1
Entering edit mode

that bash line should work, just define "file_new" or change the name:

for file in *fasta; do perl rand.pl < $file > ${file%.fasta}_subset.fasta; done
ADD REPLY
0
Entering edit mode

That solves it and it was very fast - thank you very much! And thank you everyone else for your answers too.

ADD REPLY
0
Entering edit mode
6.6 years ago
  • shuffle your linearized fasta with shuf
  • convert the dot to tab with tr
  • use sort one the first column using --stable and -u (unique)
  • convert back the first tab to dot

     shuf linearized.tsv  | tr "." "\t" | sort -t $'\t' -k1,1 --stable -u | sed 's/\t/./'
    
ADD COMMENT
0
Entering edit mode

Thank you, but I'm not sure if this does what I want. First, I have standard fasta files not .tsv files. Also, how does this randomly select one sequence per sample name and paste into a new file? I apologize if i just don’t understand.

ADD REPLY
0
Entering edit mode

First, I have standard fasta files not .tsv files

linearize as described in the other posts. see https://gist.github.com/lindenb/2c0d4e11fd8a96d4c345

Also, how does this randomly select one sequence per sample name

this is the aim of the -u flag in sort

and paste into a new file?

do you want into a new file. use redirection: https://stackoverflow.com/questions/876239/how-can-i-redirect-and-append-both-stdout-and-stderr-to-a-file-with-bash

all in one:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' in.fa | shuf  | tr "." "\t" | sort -t $'\t' -k1,1 --stable -u | sed 's/\t/./' | tr "\t" "\n" > out.fa
ADD REPLY
0
Entering edit mode
6.6 years ago

with seqkit and bash (Assuming that input is in fasta format, not in tsv as in OP and fasta records are linearized):

for i in $(seqkit seq -n test.fa | cut -f1 -d"." | uniq);
    do seqkit grep -rp $i test.fa  | seqkit seq -n | shuf -n1 | grep -A1 -f - test.fa;
done;

with seqkit, datamash and sed:

 $ seqkit fx2tab test.fa | sed 's/\./\t/' | datamash -sg1 rand 2,3 | sed 's/\t/\./' | seqkit tab2fx

with awk, sed and datamash:

$ awk 'BEGIN{RS=">"}{print $1"\t"$2}' test.fa |sed -e '1d;s/\./\t/g'| datamash -sg1 rand 2,3 | sed 's/\t/\./' |  awk 'BEGIN{RS="\n"}{print ">"$1"\n"$2}
ADD COMMENT

Login before adding your answer.

Traffic: 2736 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6