Question

Three fastq files in ENA for paired sequencing - PRJEB3381?

2.1 years ago

Dhana ▴ 110

Hi,

I am trying to run a benchmarking study using ENA datasets. I downloaded data from project PRJEB3381 (CEPH pedigree 1463) using the SRA Toolkit (prefetch ERR194146 && fasterq-dump ERR194146). This is supposed to be paired-end sequencing, so I expected two files, but I got three FASTQ files:

ERR194146_1.fastq

ERR194146_2.fastq

ERR194146.fastq

Someone asked a similar question earlier, but in my case the NCBI website does not describe ERR194146.fastq as a barcode file (the library names are given as ERR194146_2, ERR194146, unspecified). In ENA, under the sample accession (SAMEA1573614), it is also listed as unspecified.

I checked the number of reads in ERR194146_1.fastq and ERR194146_2.fastq and they are the same. Is it safe to ignore ERR194146.fastq and proceed with the other two?
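
Equal read counts are a good sign, but they do not show that the two files list the mates in the same order. A minimal sketch of a stricter check in Python, assuming uncompressed FASTQ with four lines per record and read IDs in the first whitespace-separated token of each header; the function names `read_ids` and `pairs_in_sync` are illustrative and not part of the SRA Toolkit:

```python
from itertools import zip_longest


def read_ids(path):
    """Yield the read ID from each FASTQ header line (every 4th line)."""
    with open(path) as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 0:  # header line of a 4-line record
                fields = line.rstrip("\n")[1:].split()
                read_id = fields[0] if fields else ""
                # Drop an optional /1 or /2 mate suffix so mates compare equal.
                if read_id.endswith(("/1", "/2")):
                    read_id = read_id[:-2]
                yield read_id


def pairs_in_sync(path_1, path_2):
    """Return (pairs_checked, all_matched) for two mate files.

    Stops at the first mismatch or when one file runs out before the other.
    """
    checked = 0
    for id_1, id_2 in zip_longest(read_ids(path_1), read_ids(path_2)):
        if id_1 is None or id_2 is None or id_1 != id_2:
            return checked, False
        checked += 1
    return checked, True
```

Calling `pairs_in_sync("ERR194146_1.fastq", "ERR194146_2.fastq")` streams both files once, so memory use stays small even for whole-genome data. A result whose second element is `True` means every record in the first file has its mate at the same position in the second.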

I also compared the files downloaded via the SRA Toolkit with those from a direct wget download. The read counts for ERR194146_1.fastq and ERR194146_2.fastq are the same in both cases, but ERR194146.fastq contains far fewer reads in the SRA version. What could be the cause?

ENA WGS PRJEB3381 Illumina • 693 views

score 2 · Accepted Answer · 2022-10-20

2.1 years ago

GenoMax 147k

"... ERR194146_1.fastq and ERR194146_2.fastq is same but the number of reads in ERR194146.fastq is much less in SRA. What could be cause of it?"

It is possible that the third file holds singleton reads, i.e. reads left without a mate after the paired-end (PE) reads were trimmed. If so, you can ignore it and proceed with ERR194146_1.fastq and ERR194146_2.fastq.
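
For context, fasterq-dump runs in split-3 mode by default, which writes reads that have no mate to the unsuffixed file. If the third file really holds such orphans, none of its read IDs should also occur in the paired files. A rough way to test that in Python, assuming uncompressed four-line FASTQ records and that the first token of each header identifies the spot; `fastq_ids` and `singleton_overlap` are hypothetical helper names:

```python
def fastq_ids(path):
    """Yield the first header token (without '@') of each 4-line FASTQ record."""
    with open(path) as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 0:  # header line
                fields = line[1:].split()
                yield fields[0] if fields else ""


def singleton_overlap(single_path, paired_path):
    """Return (unique_singleton_ids, paired_reads_sharing_an_id).

    Only the singleton IDs are held in memory; the paired file is streamed.
    """
    singles = set(fastq_ids(single_path))
    shared = sum(1 for read_id in fastq_ids(paired_path) if read_id in singles)
    return len(singles), shared
```

With the files from the question, `singleton_overlap("ERR194146.fastq", "ERR194146_1.fastq")` would be expected to report zero shared reads if the singleton explanation holds. Loading the smaller file into the set keeps memory use proportional to the number of singletons rather than to the full run.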
