I am looking to generate artificial single-end Illumina sequencing data from a FASTA containing a number of short sequences representing unique molecular identifiers. However, I've found limited information on this online and need some help. The start of the input FASTA resembles this (in practice there will be many more sequences):
>1 dna:chromosome
ACCTTAGCAGGT
TCGAACCTGTGA
CGAGTCAGTCTA
And the desired output file is a FASTQ file resembling this:
@HWI-ST745_0097:7:1101:1001:1000#0/1
ACCTTAGCAGGT
+HWI-ST745_0097:7:1101:1001:1000#0/1
IIIIIIIIIIII
@HWI-ST745_0097:7:1101:1002:1000#0/1
TCGAACCTGTGA
+HWI-ST745_0097:7:1101:1002:1000#0/1
IIIIIIIIIIII
@HWI-ST745_0097:7:1101:1003:1000#0/1
CGAGTCAGTCTA
IIIIIIIIIIII
Importantly for me, each UMI sequence read needs to be 12 bases long, sequenced in order (i.e. starting with the first UMI in the FASTA) and without overlap, i.e. every UMI is read once and each read in the FASTQ is a mirror of the sequence in the FASTA. Basically, I need a version of the FASTA input file with Phred quality scores. I'm relatively new to bioinformatics and NGS and so far I've been looking at 2 programs to achieve this: Artificial fastq generator and ART.
My artfastgen input looks like this:
java -jar ArtificialFastqGenerator.jar -O output.fastq -R input.fasta -S ">1 dna:chromosome" -RL 12 -TLM 12 -GCC False
The (partial) output is as below:
@HWI-ST745_0097:7:1101:1001:1000#0/1
ACCTTAGCAGGT
+HWI-ST745_0097:7:1101:1001:1000#0/1
IIIIIIIIIIII
@HWI-ST745_0097:7:1101:1002:1000#0/1
CCTTAGCAGGTT
+HWI-ST745_0097:7:1101:1002:1000#0/1
IIIIIIIIIIII
@HWI-ST745_0097:7:1101:1003:1000#0/1
CTTAGCAGGTTC
This is very close to what I actually need. The only problem is that each read shifts across one base at a time, whereas I need it to shift 12 (or to the start of the next UMI in the input FASTA).
My ART input is this:
art_illumina -ss HS25 -i input.fasta -l 12 -f 1 -o output_dat
Example output varies with each run and is like this:
@1-ACC3
TAGCAGGTTCGA
+
CCCCCGGGGGGG
@1-ACC2
CAGGTTCGAACC
+
CCCBCGGGEGGG
@1-ACC1
CGTCACAGGTTC
+
BBBCCGBGGGGG
The issue here is that, unlike artfastqgen which allows you to specify the start of the read, ART chooses a random point in the FASTA as the starting point. Like artfastqgen, there is unwanted overlap in the read sequences.
Can anyone hear offer any suggestions on how to achieve this? Ideally using artfastqgen, but any other Linux-based program would also be fine. Thank you.
I don't believe that there are any fastq file simulation programs available that will generate UMI's along with the reads at this time.
Though I generally second genomax opinion, I would like to add a very comprehensive Nature Genetics Review on the simulation topic:
Escalona et al, 2018: A comparison of tools for the simulation of genomic next-generation sequencing data
Since posting I've found a way to achieve this using ART, but this looks like a very relevant and helpful review. Thanks for the input.