Illumina Paired End Reads File Format
13.5 years ago
Travis ★ 2.8k

Hi all,

I should be receiving several million PE reads from multiple samples/lanes soon and I am wondering what format the files take.

I know they will be FASTQ but I am wondering do they generally come as one sample per file, one lane per file or something else? Also do the paired ends come in the same or different files?

I plan to align with BWA and it looks like it expects separate files for the paired ends. Is this correct? If samples/lanes/ends need to be separated into individual files, is there a standard way of doing this?

Thanks in advance.

next-gen sequencing paired • 9.3k views
13.5 years ago

For Illumina, you should receive two FASTQ files (_1.fastq and _2.fastq) with the same number of reads in each file. The two reads of a pair sit at the same position (index) in their respective files.
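The "same number of reads, same index" property can be checked before alignment. The following is only a minimal sketch, assuming plain 4-line FASTQ records and read names that may end in /1 or /2; the function names and the toy reads are made up for illustration.

```python
from itertools import zip_longest


def read_names(fastq_lines):
    """Yield the read name of each record, assuming 4 lines per record."""
    non_blank = (line for line in fastq_lines if line.strip())
    for i, line in enumerate(non_blank):
        if i % 4 == 0:  # header line of a record
            name = line.strip().split()[0][1:]  # drop '@' and any comment
            if name.endswith("/1") or name.endswith("/2"):
                name = name[:-2]  # drop the end suffix
            yield name


def pairs_in_sync(lines_1, lines_2):
    """True if both inputs hold the same read names in the same order."""
    for a, b in zip_longest(read_names(lines_1), read_names(lines_2)):
        if a != b:  # also catches unequal read counts (one side is None)
            return False
    return True


# Toy data standing in for the contents of _1.fastq and _2.fastq.
end_1 = ["@read1/1", "ACGT", "+", "IIII", "@read2/1", "TTGA", "+", "IIII"]
end_2 = ["@read1/2", "TGCA", "+", "IIII", "@read2/2", "CCAT", "+", "IIII"]
in_sync = pairs_in_sync(end_1, end_2)
```

Open file handles can be passed in place of the lists, since the functions only iterate over lines.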


Travis: yes, if you do paired-end sequencing, you get two files. The naming depends on the technology or company; e.g., our files are named 123456_s_N_[12]_lib.txt, where the first number is a serial number, N is the lane (I believe), [12] is 1 or 2, i.e. which end, and lib is a designation for the library used. But the contents are as Pierre says.
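A naming scheme like this can be picked apart with a regular expression. This is a rough sketch for the one pattern described above only; it assumes the lane N is numeric, and the example file name and the `parse_name` helper are hypothetical. Other centers will name their files differently.

```python
import re

# Pattern for names of the form 123456_s_N_[12]_lib.txt (lane assumed numeric).
NAME_PATTERN = re.compile(
    r"^(?P<serial>\d+)_s_(?P<lane>\d+)_(?P<end>[12])_(?P<library>.+)\.txt$"
)


def parse_name(filename):
    """Return the name's fields as a dict, or None if it does not match."""
    match = NAME_PATTERN.match(filename)
    return match.groupdict() if match else None


# Hypothetical file name following the scheme.
fields = parse_name("123456_s_3_1_libA.txt")
```

Files whose fields agree in everything except `end` would then be the two ends of the same lane and library.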


Thanks! Are there two files per sample?

13.5 years ago

Hi, Travis.

You'll want to be in touch with the sequencing center providing the sequencing service. They are unlikely to combine lanes of data into samples if samples are run in multiple lanes; if they do, you should ask them to split the data up again or do so yourself. If there are multiple samples per lane (multiplexed), you or they will need to split the reads based on the index barcode. In general, you will want to learn about the SAM/BAM format and read groups so that you can keep track of units of data such as library, sample, and lane as you move through downstream analyses. How much this matters depends on the scientific application (pretty important for variant calling, but perhaps less so for gene expression).

Sean
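To make the read group idea concrete, here is a small sketch that builds a SAM @RG header line from sample, library, and lane. The tags ID, SM, LB, PU, and PL come from the SAM specification; the way the ID and PU values are composed here, and the `read_group_line` helper itself, are assumptions for illustration rather than a required convention.

```python
def read_group_line(sample, library, lane, platform="ILLUMINA"):
    """Build a tab-separated SAM @RG header line for one sample/library/lane."""
    # Assumption: one read group per sample + library + lane combination.
    rg_id = f"{sample}.{library}.L{lane}"
    fields = [
        ("ID", rg_id),         # unique read group identifier
        ("SM", sample),        # sample name
        ("LB", library),       # library name
        ("PU", f"L{lane}"),    # platform unit; lane only, as a simplification
        ("PL", platform),      # sequencing platform
    ]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


# Hypothetical sample and library names.
rg = read_group_line("sampleA", "lib1", 3)
```

Aligners can usually attach such a line to their output (for example through the `-R` option of `bwa mem`), but check the documentation of the version you are using.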



A great help. Any good references for learning about SAM/BAM and tracing the data units through sample/lane/etc.?


The samtools site (http://samtools.sourceforge.net) is a good place to look for SAM-specific information. In particular, the SAM format is described here: http://samtools.sourceforge.net/SAM1.pdf. The GATK website is a good place to learn about DNA sequence data analysis, though those tools might not always be the best ones for the job.


I have already downloaded samtools and GATK and done some background reading but I keep getting myself hung up on small details :) I guess I really just need to generate some dummy reads and run through a couple of workflows with BWA/Samtools and GATK.


Sean, why do you need to keep the data from separate lanes in separate files? Isn't it OK to combine the data from the separate lane-based files into file.1.fastq and file.2.fastq (forward and reverse) and perform the alignment on those?
