Question

Why Are There 3 FASTQ Files in This Paired-End Data?

3

12.8 years ago

Hanfei Sun ▴ 60

Raw data: http://www.ebi.ac.uk/ena/data/view/SRR346373&display=html

Also on NCBI: http://www.ncbi.nlm.nih.gov/sra?term=%09SRR346373

I downloaded them, and the first 4 lines of each file look like this:

SRR346373$ head -4 S*fastq
==> SRR346373_1.fastq <==
@SRR346373.13045 0176_20090623_2_H3K4me3_28_21_20/1
T23133223302220122222232212322320332
+
!%(#$%#$%%####*%#%##&#$##$##&#&#$$,+

==> SRR346373_2.fastq <==
@SRR346373.13045 0176_20090623_2_H3K4me3_28_21_20/2
G0012130112
+
!*)&#$&'###

==> SRR346373.fastq <==
@SRR346373.1 0176_20090623_2_H3K4me3_3_25_119/1
T30200011130100000000000000000000000
+
!%/%%5)&4(%#(7&?2&'6&.,684;.6>',7A?1

It seems obvious that the _1 and _2 FASTQ files are the two halves of a paired-end dataset. But what does SRR346373.fastq stand for? It is much smaller than the other two FASTQ files (about 1/20 of their line count). Does anyone know what it means?

paired-end solid barcode • 9.1k views

updated 2.1 years ago by Ram 44k • written 12.8 years ago by Hanfei Sun ▴ 60

1

It looks like SRR346373 is the first read, SRR346373_1 is the second read and SRR346373_2 is the barcode. The NCBI page you link to has details associating each barcode sequence with the sample and replicate.
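One way to test a guess like this is to tabulate the read lengths in each file, since a barcode file should contain only short reads. The sketch below assumes uncompressed FASTQ with exactly four lines per record; `read_lengths` is a hypothetical helper name, not part of any toolkit.

```shell
# Print "length count" pairs for the sequence lines of a FASTQ file.
# Assumption: uncompressed input, exactly 4 lines per record.
read_lengths() {
    # The sequence is the 2nd line of every 4-line record.
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) print len, n[len] }' "$1" | sort -n
}
```

It could be run as `read_lengths SRR346373_2.fastq`. Note that SOLiD colorspace reads start with a primer base, so an 11-character read such as the one shown in the question corresponds to 10 color calls.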

12.8 years ago by Brad Chapman 9.7k

0

I don't think so. SRR346373_1.fastq and SRR346373_2.fastq both have 87354416 lines, while SRR346373.fastq has only 4213292 lines. It's possible that SRR346373_1.fastq is paired with SRR346373_2.fastq, but if SRR346373.fastq were the barcode file, how could it have so few lines?
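These line counts translate directly into read counts, because a FASTQ record normally spans four lines. A minimal sketch, assuming unwrapped four-line records (`count_reads` is a hypothetical helper):

```shell
# Report the number of reads in an uncompressed FASTQ file.
# Assumption: every record is exactly 4 lines (no wrapped sequences).
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}
```

By this arithmetic, 87354416 lines is 21838604 reads and 4213292 lines is 1053323 reads, so the third file holds roughly 1/20 as many reads as each of the other two.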

12.8 years ago by Hanfei Sun ▴ 60

0

I read the NCBI page about the barcodes and tried to split the barcode file, but if the barcode file can't be mapped to the paired-end files line by line, I don't think it makes sense.

12.8 years ago by Hanfei Sun ▴ 60

0

Hi all,

Sorry to bring back this old thread, but I noticed something new that is relevant to it. In the past, when I used wget and a local fastq-dump, I usually got only the _1.fastq.gz and _2.fastq.gz files, and sometimes also a third file for the single reads. However, in my recent direct use of fastq-dump (v2.6.3) against the NCBI server, with /fastq-dump --split-files --gzip sraID (I had no choice, as the FTP URL is no longer available), I got _1.fastq.gz and _3.fastq.gz (instead of _2), which seem to hold the paired-end sequences. In agreement with this, the SRA record indicates that the barcode is between the two reads. So I guess that in this case _1 and _3 are the paired-end sequences when --split-files is used. I haven't tried --split-3; perhaps it will produce _1, _2, and the third file. Below is the first read from both _1 and _3.

$ zcat SRR395614_1.fastq.gz |head -n 4
@SRR395614.1 D050VACXX110915:1:1101:6706:2140 length=101
AAAGAATGGAATCATCAAATGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGNNNNNNNCNTNGNNNNNNNTCCNNNNNAATNATNGNATAAAATCGAA
+SRR395614.1 D050VACXX110915:1:1101:6706:2140 length=101
<<<???@???@@?@?@@@??#################################################################################
$ zcat SRR395614_3.fastq.gz |head -n 4
@SRR395614.1 D050VACXX110915:1:1101:6706:2140 length=101
TCGAGTCAATTCGACGATTCTATTCCATTCCCTTCGATGATGATTCCATTTCACTCCATTAGATGATTCCATTCGACTCAATTTGGTGATGATTCAATTCG
+SRR395614.1 D050VACXX110915:1:1101:6706:2140 length=101
@@@FFBDEHHHHHGBGIJJGGGHGIGIIHJJJJJJJJGGCGIEHI@FIHIFHEGGDHBFICGEHIJJJEHGHIEHHIJHHCEHHFEBDFEEEFEECEEECD

I also noticed that this is much slower than wget, so I will try converting FASTQ to fastq.gz locally instead. Any comments or corrections are appreciated.

Thanks a lot.

Ping

updated 2.1 years ago by Ram 44k • written 8.2 years ago by liangp64 • 0

0

If in doubt, grab the FASTQ files from ENA directly.

8.2 years ago by GenoMax 147k

score 4 · Answer 1 · 2012-02-02

4

12.8 years ago

Jonathan Manning ▴ 630

I'd guess it is a file of the remaining unpaired reads.

The _1 and _2 files should have the same sequence IDs in the same order. The third file contains reads for which no paired sequence was generated, and it may contain reads labeled either /1 or /2.

Structuring the data this way avoids an uneven traversal of the two files: you can always assume that the 200th read in the _1 file corresponds to the 200th read in the _2 file.
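The same-IDs-in-the-same-order property can be checked directly. The sketch below assumes uncompressed FASTQ files with four lines per record and headers that begin with a shared read ID followed by a space, as in the headers shown in the question; `check_pairs` is a hypothetical helper.

```shell
# Walk two FASTQ files in lockstep and compare the read ID on each header.
# Assumptions: uncompressed input, 4 lines per record, no tabs in headers.
check_pairs() {
    paste "$1" "$2" | awk -F '\t' '
        NR % 4 == 1 {
            # Keep only the first whitespace-delimited token of each header.
            split($1, a, " "); split($2, b, " ")
            if (a[1] != b[1]) {
                printf "record %d differs: \"%s\" vs \"%s\"\n", (NR + 3) / 4, a[1], b[1]
                bad = 1
                exit
            }
        }
        END { if (!bad) print "IDs match"; exit (bad ? 1 : 0) }
    '
}
```

Called as `check_pairs SRR346373_1.fastq SRR346373_2.fastq`, it stops at the first record whose IDs disagree and returns a non-zero status. If one file is shorter, the missing header compares as empty and is reported as a mismatch.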

Because this is AB_SOLiD data, the _1 file is the forward [F3] read (T prefix) and the _2 file is the reverse [R3] read (G prefix).
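The primer-base prefix can be verified by tallying the first character of every sequence line. A rough sketch, assuming four-line FASTQ records in colorspace (`primer_bases` is a hypothetical helper):

```shell
# Count how often each leading character occurs on the sequence lines.
# Assumption: uncompressed colorspace FASTQ, exactly 4 lines per record.
primer_bases() {
    awk 'NR % 4 == 2 { n[substr($0, 1, 1)]++ }
         END { for (b in n) print b, n[b] }' "$1" | sort
}
```

Running it on the third file would also show whether that file mixes T-prefixed and G-prefixed reads.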

12.8 years ago by Jonathan Manning ▴ 630

0

I think that makes sense, thanks!

12.8 years ago by Hanfei Sun ▴ 60

score 0 · Answer 2 · 2012-02-08

0

12.8 years ago

Ahdf-Lell-Kocks ★ 1.6k

The third file is the barcode; the other two are the paired-end reads.

12.8 years ago by Ahdf-Lell-Kocks ★ 1.6k