Dear all,
I am trying to generate simulated FASTQ files from a FASTA reference using ART. Following the manual, I entered the following:
art_illumina -ss HS25 -i ./input.fa -p -l 50 -f 20 -m 200 -s 10 -o ./output
In this case, the simulated instrument is an Illumina HiSeq 2500, paired-end reads are created, the length is 50 bp with a mean of 200 (not sure what the difference is here), and the coverage is 20. I then checked the quality of the output with FastQC: the reads are 50 bp long, but the quality scores are all skewed towards the maximum of 38. I therefore provided values for the maximum and minimum quality score:
art_illumina -ss HS25 -i ./input -p -l 36 -f 30 -m 50 -s 10 -qU 30 -qL 25 -o ./output
But in this case the quality score was not simply skewed; rather, it was uniform, with a single value of 30.
How can I obtain something more like the following plot? Thank you.
No, it is not. You can run sequencing lengths as long or as short as you want. The maximum length for a HiSeq 2500 rapid run can be 2 x 250 bp. To get sufficiently specific mapping, you probably don't want to go much below 36 bp (for a human-sized genome).
OK, I took the lower end of the scale. But how would I set a good range of quality scores? And what is the relation between -l and -m? Thanks.
You may want to look at the in-line help/manual for the specific options of ART.
Realistically, if your libraries are good, you are rarely going to see Q scores below 30 across the board, so anything between 30 and 40 would be fine. If you are artificially trying to achieve a different range, then you can choose those numbers.
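To see which scores a simulated file actually contains, independent of the FastQC plot, the quality strings can be tallied directly. This is only a minimal sketch, assuming standard four-line FASTQ records with Phred+33 encoding; the function name `quality_histogram` and the toy record are made up for illustration and are not part of ART.

```python
from collections import Counter
import io

def quality_histogram(handle, offset=33):
    """Count how often each Phred score occurs in a four-line-per-record FASTQ stream."""
    counts = Counter()
    for line_number, line in enumerate(handle):
        if line_number % 4 == 3:  # the fourth line of each record holds the qualities
            counts.update(ord(ch) - offset for ch in line.strip())
    return counts

# Toy record (hypothetical): in Phred+33, '?' encodes Q30 and 'I' encodes Q40.
example = io.StringIO("@read1\nACGT\n+\n??II\n")
hist = quality_histogram(example)
for score in sorted(hist):
    print(score, hist[score])
```

On real data, pass an open handle to one of the FASTQ files that ART wrote instead of the toy record. A histogram with a single key would confirm that every base received the same score.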
I am trying to artificially create some libraries that look like THIS. I therefore provided, based on the README file included with ART, the options
-qL --minQ the minimum base quality score
and -qU --maxQ the maximum base quality score
but the values were not randomly sampled between these boundaries; they were fixed at 30.
I think I got the difference between -l and -m: the former is the length of the read in the FASTQ file, the latter is the length of the DNA/RNA fragment being sequenced, so -m needs to be longer than -l.
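The relation between -l and -m described above can be checked with a little arithmetic. The sketch below is only an illustration, assuming -m is the mean length of the whole fragment, including both reads; the helper name `inner_distance` is hypothetical. It uses the two -l/-m pairs from the commands in this thread.

```python
def inner_distance(read_length, mean_fragment_length):
    """Unsequenced bases between the two mates; a negative value means the mates overlap."""
    return mean_fragment_length - 2 * read_length

# Values taken from the two commands in this thread.
for read_len, frag_len in [(50, 200), (36, 50)]:
    gap = inner_distance(read_len, frag_len)
    if frag_len < read_len:
        status = "invalid: fragment shorter than read"
    elif gap < 0:
        status = f"mates overlap by {-gap} bp"
    else:
        status = f"{gap} bp between mates"
    print(f"-l {read_len} -m {frag_len}: {status}")
```

With -l 50 -m 200 the mates are separated by 100 bp on average, while with -l 36 -m 50 they overlap by 22 bp on average. Since -s 10 spreads fragment sizes around the mean, some fragments in the second run could come close to or fall below the 36 bp read length.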
For better focus, I removed the part of the post dealing with the read length.