Question

Bisulfite Analysis With Illumina Fastq From Different Lanes

1

13.0 years ago

Tonig ▴ 440

Hi everybody,

I'm a newbie in NGS bisulfite methylation analysis, so first, apologies if the question is too simple, but I couldn't find any related topic. We receive samples from BGI, who work with Illumina. The problem with Illumina, as you may know, is that the paired-end FASTQ files we get are split across several lanes, I mean:

NB_Lib1_L1_PE250_1.fq
NB_Lib1_L1_PE250_2.fq
NB_Lib1_L2_PE250_1.fq
NB_Lib1_L2_PE250_2.fq
NB_Lib1_L3_PE250_1.fq
NB_Lib1_L3_PE250_2.fq

instead of NB_Lib1_PE250_1.fq and NB_Lib1_PE250_2.fq.

I followed the advice from other threads on SEQanswers and Biostar and concatenated these 6 files into two (3 for the first mate and 3 for the second). The problem is that I got FASTQ files of 71 GB, and I don't know whether this is the proper way to do bisulfite analysis with Bismark or MethylCoder (basically because the MethylCoder analysis has still not finished after running for three days). As far as I know, Illumina has a pipeline with Bismark for bisulfite analysis, but it requires CASAVA, and I don't want to use that. So, my question is:

Do I have to follow this procedure, concatenating the files despite their size, or is there another way of working with these split Illumina FASTQ files (i.e., analyse the FASTQ files from one lane, then another lane, and so on)? In that case the problem is how to combine the final results.

Thanks

illumina next-gen sequencing fastq • 3.7k views

ADD COMMENT • link updated 4.7 years ago by iraia.munoa ▴ 130 • written 13.0 years ago by Tonig ▴ 440

0

Hi to all, maybe I am late to add a follow-up question on this problem. I have received the FASTQ files from my bisulfite methylation sequencing, and I have 16 files (different lanes) for each sample. So I was wondering which is the better approach for the analysis: 1) concatenate at the beginning, or 2) wait and merge the BAM files with samtools. I think the result will be the same, but I am not sure. Can someone help me?

And a second question: the command for the first option is as follows, right? cat x.fastq x.fastq x.fastq x.fastq > total.fastq

Thanks in advance!

ADD REPLY • link 4.7 years ago by iraia.munoa ▴ 130

0

The answer from Sean Davis answers exactly this ;-) If these are lane (so sequencing) replicates, then just merge them.

ADD REPLY • link 4.7 years ago by ATpoint 86k
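
For anyone taking the concatenation route, here is a minimal Python sketch of the same idea as the cat command. It assumes the lane naming from the original question (NB_Lib1_L1_PE250_1.fq and so on) and uncompressed files; the function names are made up for illustration. The one point that matters is that the lanes are appended in the same order for both mates, so that read pairs stay in sync between the two merged files.

```python
import shutil
from pathlib import Path


def concatenate_lanes(lane_files, merged_path):
    """Append each lane's FASTQ to merged_path, in the order given."""
    with open(merged_path, "wb") as merged:
        for lane_file in lane_files:
            with open(lane_file, "rb") as lane:
                shutil.copyfileobj(lane, merged)


def merge_paired_lanes(directory, sample="NB_Lib1", lanes=(1, 2, 3)):
    """Build one merged FASTQ per mate, using the same lane order for both.

    Assumes files named <sample>_L<lane>_PE250_<mate>.fq, as in the question.
    """
    directory = Path(directory)
    outputs = []
    for mate in (1, 2):
        inputs = [directory / f"{sample}_L{lane}_PE250_{mate}.fq" for lane in lanes]
        merged = directory / f"{sample}_PE250_{mate}.fq"
        concatenate_lanes(inputs, merged)
        outputs.append(merged)
    return outputs
```

Calling merge_paired_lanes(".") in the directory holding the six lane files would write NB_Lib1_PE250_1.fq and NB_Lib1_PE250_2.fq next to them.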

score 4 · Answer 1 · 2012-01-04

13.0 years ago

Sean Davis 27k

The fastq files from different lanes can be mapped independently. The output files, hopefully in SAM/BAM format, can then be easily merged using samtools.

ADD COMMENT • link 13.0 years ago by Sean Davis 27k

1

I should add that, since reads are mapped independently, you should get the same result when combining the FASTQ files or the resulting BAM files.

ADD REPLY • link 13.0 years ago by Sean Davis 27k
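
As a rough sketch of the per-lane route, the merge step could be wrapped in Python as below. The basic form of the command is samtools merge followed by the output BAM and then the input BAMs; the file names and function names here are hypothetical. samtools merge is documented as merging sorted alignment files, so it is worth checking which sort order your methylation caller expects before merging.

```python
import subprocess


def samtools_merge_command(merged_bam, lane_bams):
    """Return the argument list for: samtools merge <out.bam> <in1.bam> <in2.bam> ..."""
    return ["samtools", "merge", str(merged_bam), *(str(bam) for bam in lane_bams)]


def merge_lane_bams(merged_bam, lane_bams):
    """Run samtools merge; requires samtools to be installed and on the PATH."""
    subprocess.run(samtools_merge_command(merged_bam, lane_bams), check=True)


# Hypothetical per-lane outputs from the mapper:
# merge_lane_bams("NB_Lib1.bam", ["NB_Lib1_L1.bam", "NB_Lib1_L2.bam", "NB_Lib1_L3.bam"])
```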

score 2 · Answer 2 · 2012-01-04

Three days isn't necessarily surprising, given the size of your files and depending on how many CPU cores you're throwing at the problem.

If you have a cluster with many CPUs available, my advice would be to map the FASTQ files independently, then combine the results. (If you don't have this access and are doing this on your desktop, you might just be better off waiting for it to finish at this point.)

Combining could be done directly after mapping by using "samtools merge" on your bams. If you prefer to wait until later to merge, there's no reason why a simple little script couldn't be used to combine the methylation reports. Those should simply be a list of genomic positions along with counts of methylated Cs, unmethylated Cs, and a ratio between the two. A few lines of perl/python/whatever language should be able to merge those for you.
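
A minimal sketch of such a script in Python follows. It assumes a simplified, hypothetical report layout of four tab-separated columns with no header (chromosome, position, methylated count, unmethylated count); real Bismark or MethylCoder output will have different columns, so the parsing line would need adjusting. Counts are summed per position, and the ratio, taken here as methylated / (methylated + unmethylated), is recomputed from the summed counts, because averaging the per-lane ratios would weight lanes with low coverage too heavily.

```python
import csv
from collections import defaultdict


def merge_methylation_reports(report_paths):
    """Sum methylated/unmethylated counts per (chromosome, position) across reports."""
    totals = defaultdict(lambda: [0, 0])
    for path in report_paths:
        with open(path, newline="") as handle:
            # Assumed columns: chromosome, position, methylated, unmethylated
            for chrom, pos, meth, unmeth in csv.reader(handle, delimiter="\t"):
                counts = totals[(chrom, int(pos))]
                counts[0] += int(meth)
                counts[1] += int(unmeth)
    return dict(totals)


def write_merged_report(totals, out_path):
    """Write merged counts plus a methylation fraction recomputed from the sums."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for (chrom, pos), (meth, unmeth) in sorted(totals.items()):
            covered = meth + unmeth
            fraction = meth / covered if covered else 0.0
            writer.writerow([chrom, pos, meth, unmeth, f"{fraction:.4f}"])
```

Everything is held in memory, which is fine for a sketch but may need rethinking for genome-wide reports at single-base resolution.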