Hi,
I want to generate a list of random genomic sequences that match my input DNA sequences in GC content and length. Note that my input sequences are not all the same length. The expected output is the same number of random sequences, each matching the length and GC content of the corresponding input sequence.
>seq1
AACAATGAGTTGTCATTTTATCCAAATCTAAAAAAAAACATACATACAGTATCAAGTTCTGGTTAAAGTAGTAGATGTACTTTACGTGGTACATAAGAGTATCTATACAGTTTGTGGG
>seq2
atacaacctgaaaaacagagagggaaaaataattaagaaaaatgaatagaagcttgggacatccttaatgagagtaccagaggagatgtgagagagaaagaagcagaaaaaaagttccaagcagaaaaaatgtccaagaaataatagctgaaaagctcccaaatttgctgaaaaatgttaacctacacattcaagaagcccaaaaagctctacacaaagagacacacctagacacacacacctggacacacacacctgga
>seq3
cacctaagctggagtgcagtggtgcaattacagcttactgcagcctcaatctcccaggcttaagggatcctcccatgtagctgagactacaggcatgagccactatgtccagctaatttttaaattttttgtagagacagggtctcgctaccttgaacgggctgaccttgaactcctgggctctggtggccctcctgtgttgacctcccaaagcattgggattacaggcatgagccactgcacccagccTAGAAGCTCTGTTCATATTTATTT
>seq4
CATGAGGCCCAGTCTGTGAACAGAGACCAGGTCTAACCCCTTCTTCCAGGAAAGCCTCGTAGGGCCTTCTGGCCAAGAGGCCACGagtggtgaagactgcagactctgaaatcagaaatacctgggctccactgtcagcatggcagctgaggaagagtgaaaattcctctaagttcttttagaagtcccagcctccccatgtaactggggaACTGATGGGAGGAGCAGAGCTGTCTGTGCACATAAGAAGTTCTCAGTAAATGGAGACAGTTACTATTTCTGTTATTATTGAATTTGAACAAATTCCCTGGGTATGTGTGGGGGGACACTTCAGGTGAAAACACGCCCCTCCTCCCCTGGTGCGGGGGCCTGTGCTGCCACCCTCTGGAAGCCTGCAGAGGGGCAGGGAAAACAGACCCTGAACAAAAGTGTGCACCCAGTGAGGAGGTGCAAGGGCACAAAGGTGGCACCAAGTGCCTCAAGGAGAGGCTGAAACGCGGCCTGGGGACCTCGCAGTGGTCTGGTCATATAGGC
>seq5
GGCATGTGGTGTCAGCAGAGGTGCCTCAAGGATAGAGTGAGTCCAGAGTCTAGAAAGGAGCAGATCACCAGGCTCTGGGAAGAGCACAGCATGGGTGCACACACTGCTCTACCCAGCATGGCTGCCGACCCAAAGACAGCAAAGCCAAGAAGGACACACAAGCGTGGCCAGATGCAGCCCTGTGAGGAAACTTACCCAAGAACGGGACGATGGGCTTGAGAAACCatccatctacaaggatggcgtttgctgcagcaatgtttataataaattgtgggaaactgtgaactgcctaaatgtctcacaataggaacaaattagtgcaccacaccatgaaactctctacagcTCCTGAGTTACAGAACGACAGTATAATACTA
Could anyone please suggest a proper tool for this purpose?
By the way, I have looked into NullSeq, GC_compo, and BiasAway, but none of them fits my needs.
Thanks,
-Xianjun
Take a look at randomreads.sh from the BBMap suite. It should produce FASTA files.
Thanks for the reply. I looked into the options of randomreads.sh. You are right that it can produce FASTA files, but it doesn't match the length distribution and GC content.
You may need to write something yourself, since you have a specific need.
Sure. I thought there might already be some tools available, since this is a common task when people are asked to compare against a length- and GC-matched background set, e.g. in TFBS enrichment analysis.
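For anyone who ends up writing this themselves, here is a minimal Python sketch of one possible approach. It is not taken from any of the tools mentioned above. It assumes that "random" can mean synthetic sequences rather than regions sampled from a reference genome, and it treats every non-G/C character (including N) as A/T. The function names (read_fasta, gc_matched_random, make_background) are hypothetical. For each input record it builds a sequence with exactly the same length and the same G+C count, then shuffles the base order.

```python
import random


def read_fasta(lines):
    """Parse an iterable of FASTA lines into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def gc_matched_random(seq, rng):
    """Return a random sequence with the same length and G+C count as seq."""
    seq = seq.upper()  # soft-masked (lowercase) bases count as well
    n_gc = seq.count("G") + seq.count("C")
    n_at = len(seq) - n_gc  # assumption: anything that is not G/C is treated as A/T
    bases = [rng.choice("GC") for _ in range(n_gc)]
    bases += [rng.choice("AT") for _ in range(n_at)]
    rng.shuffle(bases)  # randomise positions; composition stays fixed
    return "".join(bases)


def make_background(records, seed=0):
    """Produce one length- and GC-matched random sequence per input record."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    return [(name + "_random", gc_matched_random(seq, rng)) for name, seq in records]
```

To use it, open the input FASTA file, pass the file handle to read_fasta, pass the result to make_background, and print each pair as ">name" followed by the sequence. If the background has to come from real genomic regions instead, this sketch does not cover that. It also preserves only the G+C count, not dinucleotide frequencies.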