I am trying to randomly remove 10%, 15% and 20% of the nucleotides in a fasta file.
So let's say I have a fasta file that looks like this...
>GCA_900186885_1_000000000001
ATGCAAACATTTGTAAAAAACTTAATCGAT
I want to randomly choose and delete 10% of the nucleotides, which in this case would be 3, resulting in a fasta file with the same header, but with 3 fewer nucleotides:
>GCA_900186885_1_000000000001
ATGAAACATTGTAAAAACTTAATCGAT
The above is a simple example, which could easily be done manually, but I have a large fasta file with 2132142 nucleotides and thus want to generate three new fasta files using the original, but with 1918928, 1812321, and 1705714 nucleotides, representing a 10%, 15% and 20% reduction.
Any insight would be much appreciated. Thank you!
EDIT: Solution found using bash
awk -v s=$RANDOM 'BEGIN {
if(s) # if seed provided as var s
srand(s) # use s as seed
else # or not
srand()
}
!/^>/ { # process non > starting records
i=int(0.1*(length($0))) # 10 % of length to remove
for(j=1;j<=i;j++) {
p=int(rand()*length($0))+1 # p the point from where to remove
$0=substr($0,1,(p-1)) substr($0,(p+1)) # ie. skip p:s char
}
}
1' file_in.fa >> file_out.fa
maybe you can use
mutate.sh
from the bbmap packageI tried to use mutate.sh and it seems to have not only delete a portion of nucleotides, but also randomly shuffle nucleotides around. Regardless, I found an elegant solution using bash. See my edit if you're interested.
from bbmap
from seqkit
Correct me if I am wrong, but I believe bbmap's reformat.sh script only works for subsetting reads, not bases. So the command that you suggested would not work because my fasta file is a single read with many bases.
I believe the same applies to seqkit sample, where it only outputs the specified proportion of reads, not bases.