5.1 years ago
marongiu.luigi
Hello,
I have created a pair of fastq files with ART. I created a pair of fastq files for each human chromosome and then concatenated them. I have inserted some mutations in the chromosomes and generated two files for each chromosome in order to simulate two alleles:
cat ch_01.allA.fa ch_01.allB.fa > ch_01.fa
art_illumina -1 -p -f 100 -l 140 -m 300 -s 10 -i ch_01.fa -o ch-01_
gzip ch-01_1.fq ch-01_2.fq
[repeat for all chromosomes]
zcat ch-01_1.fq.gz ... > file_1.fq.gz
I used FastQValidator to check the consistency of the files but I get:
$ fastQValidator --file file_1.fq.gz
ERROR on Line 329301201: Repeated Sequence Identifier: 1-164667300/2 at Lines 1 and 329301201
ERROR on Line 329301205: Repeated Sequence Identifier: 1-164667298/2 at Lines 5 and 329301205
ERROR on Line 329301209: Repeated Sequence Identifier: 1-164667296/2 at Lines 9 and 329301209
ERROR on Line 329301213: Repeated Sequence Identifier: 1-164667294/2 at Lines 13 and 329301213
ERROR on Line 329301217: Repeated Sequence Identifier: 1-164667292/2 at Lines 17 and 329301217
...
The same happens for file_2.fq.gz.
What would be the cause? Can I fix these files?
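To see how widespread the problem is, one option is to list every identifier that occurs more than once. The following is only a minimal sketch on a tiny made-up file (demo_2.fq and dup_ids.txt are hypothetical names); for the real data you would pipe `zcat file_2.fq.gz` into the same awk command, and it assumes a strict four-line-per-record FASTQ layout.

```shell
# Hypothetical demo file with one repeated identifier; replace with your own FASTQ.
printf '@1-3/2\nACGT\n+\nIIII\n@1-2/2\nTTGA\n+\nIIII\n@1-3/2\nGGCC\n+\nIIII\n' > demo_2.fq

# Header lines are every 4th line starting at line 1.
# sort + uniq -d prints each identifier that appears more than once.
awk 'NR % 4 == 1' demo_2.fq | sort | uniq -d > dup_ids.txt

cat dup_ids.txt
```

Using sort rather than an in-memory table keeps this workable for files with hundreds of millions of reads, at the cost of temporary disk space.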
Perhaps this is some internal limitation of ART. I see that you have simulated 300 million reads? You could try mutate.sh from the BBMap suite as an alternative if ART has a limitation.
Why 300 M reads? I set for:
Which version of ART are you using?
You are right, there are 666,136,502 reads.
Why not generate the reads for the entire genome at one time? Since you generated them piecemeal, the read headers seem to have been duplicated.
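If regenerating everything is not practical, the existing files can probably be repaired by making the headers unique. Below is a rough sketch, not a tested recipe for the real data: it prefixes each header with a running read number, and it assumes both mate files list the reads in exactly the same order (so mates receive the same prefix). The file names demo_*.fq and fixed_*.fq and the `r<N>_` prefix are made up for illustration.

```shell
# Hypothetical paired demo files, each containing a repeated identifier.
printf '@1-3/1\nACGT\n+\nIIII\n@1-3/1\nGGCC\n+\nIIII\n' > demo_1.fq
printf '@1-3/2\nTTGA\n+\nIIII\n@1-3/2\nCCAA\n+\nIIII\n' > demo_2.fq

# Prefix every header line with a running read number (r1_, r2_, ...).
# Only lines 1, 5, 9, ... are touched, so quality strings starting with '@' are safe.
for mate in 1 2; do
    awk 'NR % 4 == 1 { c++; sub(/^@/, "@r" c "_") } { print }' \
        "demo_${mate}.fq" > "fixed_${mate}.fq"
done

# Show the new headers of the second mate file.
awk 'NR % 4 == 1' fixed_2.fq
```

For the gzipped files the same awk command could be fed from zcat and piped into gzip. It is worth re-running the validator afterwards to confirm that no repeated identifiers remain.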