Do you know any quick and dirty solution for choosing a random subset of sequences (with or without replacement) from a larger set in a fasta file?
Something like:
command -size 10 -in lotsofseqs.fasta -out sample.fasta
Do you know any quick and dirty solution for choosing a random subset of sequences (with or without replacement) from a larger set in a fasta file?
Something like:
command -size 10 -in lotsofseqs.fasta -out sample.fasta
linearize your sequences, generate a random order with shuf
and print the 'N' first lines.
cat input.fasta |\
awk '/^>/ { if(i>0) printf("\n"); i++; printf("%s\t",$0); next;} {printf("%s",$0);} END { printf("\n");}' |\
shuf |\
head -n 2 |\
awk '{printf("%s\n%s\n",$1,$2)}'
I used to provide an answer at seqanswers.
You need bioawk for it to work. With bioawk, you can:
awk -c fastx -v k=10 '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=">"$name"\n"$seq}END{for(z in a)print a[z]}' seq.fa.gz
This solution is based on reservoir sampling. It is optimal in terms of both time and space complexity and works with an arbitrarily large input file. Most other solutions need to read all data into memory.
Of course you can also use perl/python to re-implement the awk line. I use bioawk only because it has a built-in fasta/fastq parser.
No, this is not true. Reservoir sampling may seem too simplistic, but it guarantees uniform randomness. You can find the proof via google. It is theoretically the best solution for sampling a fixed number of records without replacement. You even do not need to know the size of reservior.
The first time I read about reservoir sampling is when I saw that seqanswers post. By intuition, I think there must be a solution working on a data stream - those who work on endless time series must be faced with such a situation all the time. Nonetheless, I did not know the answer, so I googled and found reservior sampling.
The thing about reservoir sampling is that there is 'decreasing probability', once the k first elements have been filled, that a new sequence will be retained, thus making sequences at the end of the file less likely to be taken and sequences from the beginning of the file more likely to be replaced. How close to random selection will the resulting list of sequences really be? I provide a solution that works with gigantic files, at the cost of having to scan the file twice and doing some extra trimming but ensuring that everybody gets equal chances.
This page has lots of useful sequence utilities, including a script named random_sequence_reads.pl (zip or tarball to download). It does sampling with/without replacement and seems to work, based on my quick test.
Note: the test.sh script samples all sequences from the input file which is confusing and not useful - edit it to change the parameters.
This can be done with Biopieces like this:
read_fasta -i lotsofseqs.fasta | random_records -n 10 | write_fasta -o sample.fasta -x
Here is something quick and dirty in python:
import sys,random
from Bio import SeqIO
num = int(sys.argv[2])
outArray = [record for record in SeqIO.parse(open(sys.argv[1],'r'),'fasta') if random.randint(0,1)]
while len(outArray)!= num:
newArray = []
for record in outArray:
if len(newArray) == num:
break
if random.randint(0,1):
newArray.append(record)
outArray = newArray
for record in outArray:
print ">" + record.id
print record.seq
Use by:
python myScript.py in.fasta 10 > subset10.fa
It basically iteratively randomly takes a subset (around half) until it gets to the number of sequences you want. So if you have 1000 sequences, it will take randomly around 500 of that, then 250 of that, 125....so on the so forth until it gets to the number if you want.
This script will probably only work well if you have a large amount of sequences (1000+) and the subset you want is well below half of the total amount.
A little late to the fray, but anyway...
Here's a solution for sampling fasta without replacement.
It uses bash functions, and GNU coreutils' shuf command, but does not suffer the memory requirements of keeping the sequences in memory since it shuffles only the byte offsets into the file where the sequences begin.
function fastaSubsample () {
## PURPOSE: choose [HTML] random records from a (presumably
## well-formed) fasta file in [HTML]. The output sequences
## are in same order as input (due to the `sort -n`)
n="$1"
path="$2"
shift 2
fastaBreaks ${path} | shuf -n ${n} | sort -n | fastaRecordsAt ${path}
}
function fastaBreaks () {
## PURPOSE: output the byte positions of each fasta record on STDIN
perl -ne 'BEGIN{$told=0}; print($told.qq{\n}) if m/^>/;$told=tell();' "$@"
}
function fastaRecordsAt () {
## PURPOSE: extract the fasta records from# [HTML] starting at
## each byte offset appearing on STDIN.
path="$1"
shift 1
perl -anse 'BEGIN{open(PATH,$path)}; chomp; $told = $_; seek(PATH,$told,0); print(scalar(readline(PATH))); while (($l=readline(PATH)) && ($l !~ m/^>/)) {print($l)} ' -- -path=$path
}
Here is another Python solution, useful for processing multiple file simultaneously (must be in same dir):
import random
import glob
inFiles = glob.glob('*.fas')
outFiles = []
num = int(raw_input("Enter number of random sequences to select:\n"))
for i in range(len(inFiles)):
for k in range(3):
fNames = []
fSeqs = []
outFiles.append(file(str(inFiles[i])+'_'+'Rand_'+str(num)+'-'+str(k+1)+'.fasta', 'wt'))
for line in open(inFiles[i]):
if (line[0] == '>'):
fNames.append(line)
else:
fSeqs.append(line)
curr = (len(outFiles)-1)
for j in range(num):
a = random.randint(0, (len(fNames)-1))
outFiles[curr].write(fNames.pop(a))
outFiles[curr].write(fSeqs.pop(a))
raw_input("Done.")
Basically you just enter how large the subsample should be, and it will give you three random subsamplings (without replacement). Of course if you only need one then only use one. If the file has less sequences than request this script will die.
I found this useful in a 16S tagged pyrosequencing project I've been working on.
I offer one solution here to sample without replacement, which does not require any subsampling and uses roughly 24 bytes per FASTA sequence.
The tradeoff is that one has to read through the file twice, the first time to read in byte offsets and a second and subsequent time for each random sample. But it isn't necessary to read in any of the FASTA set into memory, just line offsets.
You can use reservoir sampling which is O(N) and does it in one pass (quite handy for millions of Illumina paired-end reads). I have written one-liners to do that in awk:
import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein
# Use: python scriptname.py number_of_random_seq infile.fasta outfile.fasta
infile = sys.argv[2] #Name of the input file
seq = list(SeqIO.parse(infile,"fasta")) #Create a list with all the sequence records
print "Input fasta file = ", infile
totseq = len(seq) #Total number of sequences in the input file
print "Number of sequences in the original file = ", len(seq)
randseq = int(sys.argv[1]) #Number of random sequences desired
print "Number of random sequences desired = ", randseq
if randseq > totseq:
print "The requested number of random sequences is greater that the total number of input sequences. Exiting."
exit()
outfile = sys.argv[3] #Name of the output file
print "Output fasta file = ", outfile
outrandseq = []
outlist = []
print "Randomly chosen output sequences:"
for i in range(randseq):
choose = random.randint(1,totseq-1) #Choose a random sequence record number
for j in range(len(outrandseq)): #Test to see if the random sequence record number has already been chosen
if choose == outrandseq[j]:
choose = random.randint(1,totseq-1) #Choose a new random sequence record number if the current one has already been chosen
outrandseq.append(choose)
print choose
outseq = seq[choose]
outlist.append(outseq) #Append seq record to output list
SeqIO.write(outlist, outfile, "fasta") #Write the output list to the outfile
exit()
import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein
# I use this from small numbers of sequences (input file up to 10000 sequences) and it works fine.
# For very large sequence sets it may be too slow -- I just have not tried.
# Use: python scriptname.py number_of_random_seq infile.fasta outfile.fasta
infile = sys.argv[2] #Name of the input file
seq = list(SeqIO.parse(infile,"fasta")) #Create a list with all the sequence records
print "Input fasta file = ", infile
totseq = len(seq) #Total number of sequences in the input file
print "Number of sequences in the original file = ", totseq
numrandseq = int(sys.argv[1]) #Number of random sequences desired
print "Number of random sequences desired = ", numrandseq
if numrandseq > totseq:
print "The requested number of random sequences is greater that the total number of input sequences. Exiting."
exit()
outfile = sys.argv[3] #Name of the output file
print "Output fasta file = ", outfile
outrandseqset = []
i = 1
for i in range(numrandseq): #Create a list of random sequence record numbers for output
choice = random.randint(1,totseq)
outrandseqset.append(choice)
i = 1
j = 1
duplicate = 1
while duplicate: #Make sure no sequences are duplicated in the list
duplicate = 0
for i in range(numrandseq):
for j in range(i+1, numrandseq):
if outrandseqset[i] == outrandseqset[j]:
outrandseqset[j] = random.randint(1,totseq)
duplicate = 1
i = 1
print "Randomly chosen output sequences:"
for i in range(numrandseq):
print outrandseqset[i]
outlist = []
i = 1
for i in range(numrandseq): #Create the list of seq records to be written to the output file
seqnum = outrandseqset[i]
outseq = seq[seqnum]
outlist.append(outseq)
SeqIO.write(outlist, outfile, "fasta") #Write the output list to the outfile
exit()
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I know someone is going to give this shuf solution. :) A problem with it is that shuf loads all the data into memory, which makes it not suitable for very large files. If we only downsample a small file from large one, my solution is much better. In addition, if the input parameter is a fraction rather than a fixed number, hashing will be even better.
shuf uses reservoir sampling (bounded memory) since v8.22
I used it for one of my datasets and I've found that to make it more kind of "accurate" it would be better to add
BEGIN{FS="\t"}
in the last awk:...
That way if the fasta header is longer, with at least one space, like:
after shuffling, it won't become:
Hope it helps
shuf is not native to Macs.
maasha, if you want to install them, and use
brew
, then brew install xz coreutils will get youshuf
(defined as gshuf)that's it! no dependencies, bio packages, just basic commands. (somehow i never used shuf before.)
In place of shuf, you can do this on a Mac:
A slight variation of the linearization with
awk
, explained in more detail: https://riptutorial.com/bioinformatics/example/14692/linearize-a-fasta-sequence-with-awkDISCLAIMER: This code and its explanation is elaborated by someone else. I just add it as supplementary information to further understand the best answer provided to the question