Question

Mapping Large Fastq Files With Bwa

2

12.4 years ago

Vikas Bansal ★ 2.4k

I have fastq files for 10 samples. Each sample has 2 fastq files (paired end); the average file size is 4 GB compressed and 16 GB uncompressed. That gives 20 fastq files totalling about 320 GB uncompressed. I want to map them using BWA. I have 10 folders containing 2 files each.

Is it possible to give compressed fastq files as input to BWA?

What method would you use to map all these files? (fast and easy)

Should I just split each file and then map it?

I have seen some posts like this one and a tutorial, but did not find an efficient solution, and I think there are a lot of people here who do this often. I would really appreciate your help.

bwa mapping • 16k views

ADD COMMENT • link updated 12.4 years ago by pinkiii1984v ▴ 20 • written 12.4 years ago by Vikas Bansal ★ 2.4k

0

What compute resources do you have available? A cluster or a single machine?

ADD REPLY • link 12.4 years ago by Sean Davis 27k

0

I have a single machine with 32 GB RAM. I was thinking of mapping all samples at the same time using "screen" (10 screens). Or should I map them one by one?

ADD REPLY • link 12.4 years ago by Vikas Bansal ★ 2.4k

1

You are probably better off running one-at-a-time and using multiple threads (approximately as many threads as you have cores), but you may need to experiment. The point, of course, is to have all the cores busy all the time.
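A minimal sketch of the one-at-a-time approach, offered as an illustration rather than anything posted in this thread. It assumes `bwa mem` (the thread does not say which BWA algorithm is used), a reference that has already been indexed with `bwa index`, and one folder per sample holding exactly two gzipped files matching `*.fastq.gz`. The function names, the file-name pattern, and the output naming are all hypothetical. The gzipped files are passed to BWA directly, since it reads compressed FASTQ input.

```python
import os
import subprocess
from pathlib import Path


def bwa_mem_command(reference, read1, read2, threads):
    """Build the argument list for one paired-end `bwa mem` run."""
    return ["bwa", "mem", "-t", str(threads),
            str(reference), str(read1), str(read2)]


def map_samples_sequentially(sample_dirs, reference, threads=None):
    """Map each sample folder in turn, giving every run all the threads."""
    # Default to roughly one thread per core, as suggested in the reply above.
    threads = threads or os.cpu_count() or 1
    for sample_dir in map(Path, sample_dirs):
        # Assumption: two gzipped FASTQ files per folder, and sorting by name
        # puts read 1 before read 2 (for example *_1.fastq.gz, *_2.fastq.gz).
        fastqs = sorted(sample_dir.glob("*.fastq.gz"))
        if len(fastqs) != 2:
            raise ValueError(
                f"expected 2 FASTQ files in {sample_dir}, found {len(fastqs)}"
            )
        read1, read2 = fastqs
        out_path = sample_dir / f"{sample_dir.name}.sam"
        with open(out_path, "w") as out:
            # bwa mem writes SAM to stdout; check=True stops at the first failure.
            subprocess.run(
                bwa_mem_command(reference, read1, read2, threads),
                stdout=out,
                check=True,
            )


# Hypothetical usage (requires bwa on PATH and an indexed reference):
# map_samples_sequentially(sorted(Path("samples").iterdir()), "ref.fa")
```

Because the samples run back to back, only one copy of the index is in memory at a time, and the `-t` option is what keeps all cores busy. The right thread count may still need some experimentation.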

ADD REPLY • link 12.4 years ago by Sean Davis 27k

0

Thanks for your reply. Could you please explain why running one by one is better? I thought that if I ran 10 screens, I could map all samples at the same time.

ADD REPLY • link 12.4 years ago by Vikas Bansal ★ 2.4k

2

You could run 10 samples at once, each using 1 core, or run the samples one-at-a-time using 10 threads (or more) for each sample. The advantage of the second approach is that its memory usage will be about 1/10 of that of the first. The time to complete all 10 samples should be similar.
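The memory argument can be sketched with some rough arithmetic. The 32 GB figure comes from this thread; the 5 GB index footprint per BWA process is purely an assumed, illustrative value (the real number depends on the reference genome and the BWA algorithm), and the function name is hypothetical. The sketch relies on the fact that threads inside one process share a single copy of the index, whereas separate processes each load their own.

```python
def peak_index_memory_gb(concurrent_processes, index_gb_per_process):
    """Each BWA process loads its own copy of the index; threads share one."""
    return concurrent_processes * index_gb_per_process


RAM_GB = 32              # machine described in this thread
ASSUMED_INDEX_GB = 5.0   # hypothetical per-process footprint, for illustration

ten_at_once = peak_index_memory_gb(10, ASSUMED_INDEX_GB)    # 10 screens, 1 core each
one_at_a_time = peak_index_memory_gb(1, ASSUMED_INDEX_GB)   # 1 run, many threads

print(f"10 processes at once: {ten_at_once} GB (RAM available: {RAM_GB} GB)")
print(f"1 multi-threaded run: {one_at_a_time} GB")
```

Under that assumed footprint, ten simultaneous processes would need more memory than the machine has, while a single multi-threaded run stays well within it.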

ADD REPLY • link 12.4 years ago by Sean Davis 27k

0

Thanks a lot. I will try it.

ADD REPLY • link 12.4 years ago by Vikas Bansal ★ 2.4k

score 4 · Answer 1 · 2012-07-04

4

12.4 years ago

Leonor Palmeira 3.9k

[Edited]

Regarding compressed input in BWA, you should find your answer here: http://www.biostars.org/post/show/5474/bwa-index-on-all-human-grch37-sequences

Apart from that, 2Gb files are not that big, so you could process them separately (i.e. parallelization by data), which shouldn't take too long on a multi-threaded machine.

ADD COMMENT • link 12.4 years ago by Leonor Palmeira 3.9k

0

Thanks. For compressed fastq files, it's clear now. Now I am looking for an efficient technique, as mentioned in my original post.

ADD REPLY • link 12.4 years ago by Vikas Bansal ★ 2.4k

0

Thanks a lot Sean and Leonor.

ADD REPLY • link 12.4 years ago by Vikas Bansal ★ 2.4k

score 0 · Answer 2 · 2012-07-06

0

12.4 years ago

pinkiii1984v ▴ 20

I also work with compressed files, and it is possible to use them directly with BWA.

ADD COMMENT • link 12.4 years ago by pinkiii1984v ▴ 20