Question

randomly sample sequences of fasta file per sample name

0

Entering edit mode

6.6 years ago

nmkn ▴ 10

I have a task that I'm sure has been done before but I can't find a simple solution. Given a fasta file, with multiple sequences per species/individual name, I want to randomly sample one sequence per species/individual. Not each species/individual will always have the same number of sequences.

I've found a few similar posts (https://www.biostars.org/p/18831/), but they randomly sample a subset based on a given percentage. I think I want to do something similar to this post, but that was never fully answered: Help with randomly sampling from a fasta file.

Here's my example:

>SpeciesA.seq0   CCACTTTA......
>SpeciesA.seq1   CCTCTTTA......
>SpeciesA.seq2   CCGCTTTA......
>SpeciesA.seq3   CCACTTTA......
>SpeciesB.seq0   GCCCTTTA......
>SpeciesB.seq1   GCCCTTTA.....
>SpeciesB.seq2   ACCCTTTA.....
>SpeciesB.seq3   GCCCTTTA.....
>SpeciesC.seq0   GCCCTTTA.....
>SpeciesC.seq1   GCCCTTTA.....

I want to randomly select one sequence per species/individual so the output will look like this:

>SpeciesA.seq3   CCACTTTA......
>SpeciesB.seq1   GCCCTTTA.....
>SpeciesC.seq1   GCCCTTTA.....

Ideally, the resulting sequences will be concatenated/pasted into a new file with the same name as the starting input fasta file (which corresponds to a gene name). I only have limited bash experience so help would be greatly appreciated!

fasta random sequence • 3.9k views

ADD COMMENT • link updated 6.6 years ago by cpad0112 21k • written 6.6 years ago by nmkn ▴ 10

0

Entering edit mode

If an answer below was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.
Upvote|Bookmark|Accept

ADD REPLY • link 6.6 years ago by GenoMax 147k

0

Entering edit mode

6.6 years ago

Pierre Lindenbaum 164k

shuffle your linearized fasta with shuf
convert the dot to tab with tr
use sort one the first column using --stable and -u (unique)

convert back the first tab to dot

 shuf linearized.tsv  | tr "." "\t" | sort -t $'\t' -k1,1 --stable -u | sed 's/\t/./'

ADD COMMENT • link 6.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thank you, but I'm not sure if this does what I want. First, I have standard fasta files not .tsv files. Also, how does this randomly select one sequence per sample name and paste into a new file? I apologize if i just don’t understand.

ADD REPLY • link 6.6 years ago by nmkn ▴ 10

0

Entering edit mode

First, I have standard fasta files not .tsv files

linearize as described in the other posts. see https://gist.github.com/lindenb/2c0d4e11fd8a96d4c345

Also, how does this randomly select one sequence per sample name

this is the aim of the -u flag in sort

and paste into a new file?

do you want into a new file. use redirection: https://stackoverflow.com/questions/876239/how-can-i-redirect-and-append-both-stdout-and-stderr-to-a-file-with-bash

all in one:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' in.fa | shuf  | tr "." "\t" | sort -t $'\t' -k1,1 --stable -u | sed 's/\t/./' | tr "\t" "\n" > out.fa

ADD REPLY • link 6.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

6.6 years ago

cpad0112 21k

with seqkit and bash (Assuming that input is in fasta format, not in tsv as in OP and fasta records are linearized):

for i in $(seqkit seq -n test.fa | cut -f1 -d"." | uniq);
    do seqkit grep -rp $i test.fa  | seqkit seq -n | shuf -n1 | grep -A1 -f - test.fa;
done;

with seqkit, datamash and sed:

 $ seqkit fx2tab test.fa | sed 's/\./\t/' | datamash -sg1 rand 2,3 | sed 's/\t/\./' | seqkit tab2fx

with awk, sed and datamash:

$ awk 'BEGIN{RS=">"}{print $1"\t"$2}' test.fa |sed -e '1d;s/\./\t/g'| datamash -sg1 rand 2,3 | sed 's/\t/\./' |  awk 'BEGIN{RS="\n"}{print ">"$1"\n"$2}

ADD COMMENT • link 6.6 years ago by cpad0112 21k

score 3 · Accepted Answer · 2018-04-09

3

Entering edit mode

6.6 years ago

JC 13k

Perl:

#!/usr/bin/perl

use strict;
use warnings;

my %seqs =();

$/="\n>";
while (<>) {
    s/>//g;
    my ($name, @seq) = split (/\n/, $_);
    my ($sp, $id) = split (/\./, $name);
    push @{ $seqs{$sp} }, join "\n", ">$name", @seq;
}

foreach my $sp (keys %seqs) {
    print $seqs{$sp}[ int( rand @{ $seqs{$sp} }) ], "\n";
}

Examples:

perl rand.pl < seqs.fa
>SpeciesC.seq0
GCCCTTTA.....
>SpeciesA.seq0
CCACTTTA......
>SpeciesB.seq1
GCCCTTTA.....
$ perl rand.pl < seqs.fa
>SpeciesB.seq0
GCCCTTTA......
>SpeciesC.seq1
GCCCTTTA.....
>SpeciesA.seq1
CCTCTTTA......

Explanation: use a hash of arrays, each species collects their sequences in an array, then just iterate over species getting one random value of the array length.

ADD COMMENT • link 6.6 years ago by JC 13k

0

Entering edit mode

Thank you very much! This perl option works well and I think will be the most efficient for many files. Just a follow up - how can I get this to work as a loop since I have hundreds of fasta files? Sorry if that wasn't clear in the original post. I'm having issues:

for file in *.fasta; do (perl rand.pl<"$file")>"$file_new"; done

Thanks

ADD REPLY • link 6.6 years ago by nmkn ▴ 10

1

Entering edit mode

that bash line should work, just define "file_new" or change the name:

for file in *fasta; do perl rand.pl < $file > ${file%.fasta}_subset.fasta; done

ADD REPLY • link 6.6 years ago by JC 13k

0

Entering edit mode

That solves it and it was very fast - thank you very much! And thank you everyone else for your answers too.

ADD REPLY • link 6.6 years ago by nmkn ▴ 10