Question

FASTQ validation error in files generated with ART

5.1 years ago

marongiu.luigi ▴ 730

Hello,

I have simulated paired-end reads with ART. I created a pair of fastq files for each human chromosome and then concatenated them. Before that, I inserted some mutations into the chromosomes and generated two FASTA files for each chromosome in order to simulate two alleles:

cat ch_01.allA.fa ch_01.allB.fa > ch_01.fa
art_illumina -1 -p -f 100 -l 140 -m 300 -s 10  -i ch_01.fa -o ch-01_
gzip ch-01_1.fq ch-01_2.fq
[repeat for all chromosomes]
cat ch-01_1.fq.gz ... > file_1.fq.gz

I used FastQValidator to check the consistency of the files but I get:

$ fastQValidator --file file_1.fq.gz
ERROR on Line 329301201: Repeated Sequence Identifier: 1-164667300/2 at Lines 1 and 329301201
ERROR on Line 329301205: Repeated Sequence Identifier: 1-164667298/2 at Lines 5 and 329301205
ERROR on Line 329301209: Repeated Sequence Identifier: 1-164667296/2 at Lines 9 and 329301209
ERROR on Line 329301213: Repeated Sequence Identifier: 1-164667294/2 at Lines 13 and 329301213
ERROR on Line 329301217: Repeated Sequence Identifier: 1-164667292/2 at Lines 17 and 329301217
...

The same errors appear for file_2.fq.gz.

What could be the cause? Can I fix these files?
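As a quick cross-check that does not depend on FastQValidator, the repeated identifiers can be counted directly. This is only a sketch: `count_duplicate_ids` is a hypothetical helper name, and it assumes strict four-line FASTQ records, so that every fourth line starting at line 1 is a header.

```shell
# count_duplicate_ids: hypothetical helper, not part of ART or FastQValidator.
# Reads FASTQ on stdin and prints how many distinct identifiers occur more than once.
# Assumes four lines per record (header, sequence, "+", qualities).
count_duplicate_ids() {
    awk 'NR % 4 == 1' | sort | uniq -d | wc -l
}

# Tiny demonstration: two records share the identifier @1-4/2.
printf '@1-4/2\nACGT\n+\nIIII\n@1-2/2\nTTTT\n+\nIIII\n@1-4/2\nGGGG\n+\nIIII\n' | count_duplicate_ids
```

On the real data this would be run as `zcat file_1.fq.gz | count_duplicate_ids`; sorting several hundred million identifiers takes time and temporary disk space.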

fastq fastqvalidator art • 1.1k views

Perhaps this is some internal limitation of ART. I see that you have simulated 300 million reads? If ART does have such a limitation, you could try mutate.sh from the BBMap suite as an alternative.

5.1 years ago by GenoMax 148k

Why 300 M reads? These are the settings I used:

-f read coverage = 100
-l length of reads = 140
-m mean size of DNA/RNA fragments for paired-end simulations = 300
-s standard deviation of the fragment length = 10
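For reference, the read count follows from these settings rather than being set directly: roughly coverage × sequence length / read length. The sketch below shows that arithmetic with round, made-up numbers. `expected_reads` is a hypothetical helper, and the exact count ART produces may differ (for example, depending on how it rounds or treats regions of N).

```shell
# expected_reads: hypothetical helper, not part of ART.
# $1 = fold coverage (-f), $2 = sequence length in bp, $3 = read length in bp (-l)
# Prints coverage * length / read_length, rounded to the nearest integer.
expected_reads() {
    awk -v f="$1" -v g="$2" -v l="$3" 'BEGIN { printf "%.0f\n", f * g / l }'
}

# Illustrative values only: 100x coverage of a 1.4 Mbp sequence with 140 bp reads.
expected_reads 100 1400000 140   # prints 1000000
```

With paired-end simulation, the total is split evenly between the two output files.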

5.0 years ago by marongiu.luigi ▴ 730

Which version of ART are you using?

5.1 years ago by Gabriel R. ★ 2.9k

You are right, there are 666,136,502 reads.

5.0 years ago by marongiu.luigi ▴ 730

I created a pair of fastq files for each human chromosome and then concatenated them.

Why not generate the reads for the entire genome in one run? Since you generated them piecemeal, the read headers seem to have been duplicated.
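If re-running the simulation is not practical, one possible repair is to make the existing names unique by appending the record number to each header. This is a minimal sketch, not a tested recipe for these files: `make_ids_unique` is a hypothetical helper, it assumes four-line FASTQ records, and it assumes both mate files list the same pairs in the same order so that mates receive the same number.

```shell
# make_ids_unique: hypothetical helper, not part of ART or BBMap.
# Reads FASTQ on stdin and writes FASTQ on stdout. Each header gets "_<record number>"
# appended to the read name, inserted before a trailing /1 or /2 mate suffix if present.
make_ids_unique() {
    awk 'NR % 4 == 1 {
             name = $0
             suffix = ""
             if (match(name, /\/[12]$/)) {
                 suffix = substr(name, RSTART)        # "/1" or "/2"
                 name = substr(name, 1, RSTART - 1)   # header without the mate suffix
             }
             printf "%s_%d%s\n", name, (NR + 3) / 4, suffix
             next
         }
         { print }'
}

# Tiny demonstration: two records with the same identifier become distinct.
printf '@1-5/2\nACGT\n+\nIIII\n@1-5/2\nTTTT\n+\nIIII\n' | make_ids_unique
```

For the files in question, this would be applied to both mates in the same way, for example `zcat file_1.fq.gz | make_ids_unique | gzip > file_1.renamed.fq.gz` and likewise for file_2.fq.gz (the output names are placeholders).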

5.0 years ago by GenoMax 148k