Question

Extract 1M reads from paired end fastqs

9.2 years ago

acorella ▴ 30

Hi,

I have paired-end reads in 2 separate fastq files. I want to take a subset of these reads for a bowtie run to get the insert size. I am familiar with how to break up an individual file into chunks of 1 million reads (e.g. as described here: https://www.biostars.org/p/66864/).

My Question: Do I need to ensure my reads are in the same order in each file before I do this? If so, how do I do this?

Thanks!

RNA-Seq • 4.0k views

However, do I need to ensure my reads are in the same order in each file before I do this? If so, how do I do this?

If you have not done anything to the files (other than using a paired-end aware trimming program) then the reads should be in order in R1/R2 files.

If you suspect that the pairing is broken, the files can be repaired as follows (repair.sh is from the BBMap suite):

repair.sh in1=r1.fq.gz in2=r2.fq.gz out1=fixed1.fq.gz out2=fixed2.fq.gz outsingle=singletons.fq.gz
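If the two files are already in the same order, taking the same number of leading records from each keeps the pairs together. A minimal sketch with standard tools, assuming plain 4-line FASTQ records (no wrapped sequence lines); the file names and the tiny demo data are made up for illustration:

```shell
set -eu

# Tiny demo inputs (hypothetical names); real data would be your R1/R2 files.
workdir=$(mktemp -d)
for i in 1 2 3 4 5; do
  printf '@read%s/1\nACGT\n+\nIIII\n' "$i" >> "$workdir/r1.fq"
  printf '@read%s/2\nTGCA\n+\nIIII\n' "$i" >> "$workdir/r2.fq"
done

# One FASTQ record is 4 lines, so N reads correspond to 4*N lines.
n_reads=3                      # use 1000000 for 1M reads
n_lines=$((n_reads * 4))

# Take the same leading records from both files.
head -n "$n_lines" "$workdir/r1.fq" > "$workdir/sub1.fq"
head -n "$n_lines" "$workdir/r2.fq" > "$workdir/sub2.fq"

# Count the records that ended up in each subset.
count1=$(( $(wc -l < "$workdir/sub1.fq") / 4 ))
count2=$(( $(wc -l < "$workdir/sub2.fq") / 4 ))
echo "sub1: $count1 reads, sub2: $count2 reads"
```

For gzipped input, the same idea should work by decompressing first, e.g. `gzip -dc r1.fq.gz | head -n 4000000`. Note that this takes the first reads in the file rather than a random sample.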

ADD REPLY • link 9.2 years ago by GenoMax 152k

Thank you! That was indeed the question I was trying to ask!

Is there a quick way you can tell if the pairing is broken?

ADD REPLY • link 9.2 years ago by acorella ▴ 30

reformat.sh from the same package has an option to do that:

reformat.sh in1=r1.fq in2=r2.fq vpair

That will just verify that the names indicate the reads are in the same order in each file. Incidentally, you can also randomly sample 1M pairs from them, like this:

reformat.sh in1=r1.fq in2=r2.fq out1=sampled1.fq out2=sampled2.fq samplereadstarget=1m

If your reads are overlapping, you can discover the insert size with BBMerge; if not, you'll need to use mapping.
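If BBMap is not at hand, a similar name comparison can be approximated with awk and cmp. This is only a rough sketch, not a description of what vpair does internally: it assumes 4-line FASTQ records and read names that differ between mates only by a trailing /1 or /2 or by a comment after the first space. The helper names `read_names` and `check_pairing` and the demo files are hypothetical:

```shell
set -eu

workdir=$(mktemp -d)
# Demo data: r2.fq is in the same order as r1.fq, r2_shuffled.fq is not.
printf '@rA/1\nACGT\n+\nIIII\n@rB/1\nACGT\n+\nIIII\n' > "$workdir/r1.fq"
printf '@rA/2\nTTTT\n+\nIIII\n@rB/2\nTTTT\n+\nIIII\n' > "$workdir/r2.fq"
printf '@rB/2\nTTTT\n+\nIIII\n@rA/2\nTTTT\n+\nIIII\n' > "$workdir/r2_shuffled.fq"

# Print one read name per record: first field of the header line,
# without the leading '@' and without a trailing /1 or /2.
read_names() {
  awk 'NR % 4 == 1 { name = substr($1, 2); sub(/\/[12]$/, "", name); print name }' "$1"
}

# Print OK if both files list the same names in the same order, else BROKEN.
check_pairing() {
  read_names "$1" > "$workdir/names1.txt"
  read_names "$2" > "$workdir/names2.txt"
  if cmp -s "$workdir/names1.txt" "$workdir/names2.txt"; then
    echo "OK"
  else
    echo "BROKEN"
  fi
}

good=$(check_pairing "$workdir/r1.fq" "$workdir/r2.fq")
bad=$(check_pairing "$workdir/r1.fq" "$workdir/r2_shuffled.fq")
echo "in-order files: $good"
echo "shuffled files: $bad"
```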

ADD REPLY • link updated 9.2 years ago by GenoMax 152k • written 9.2 years ago by Brian Bushnell 20k

Duplicate of "Selecting Random Pairs From Fastq?"?

ADD REPLY • link 9.2 years ago by GouthamAtla 12k

And other similar threads (a subset)

How to randamly extract reads from a FASTQ file?
How To Randomly Select 20M Reads From A 200M Fastq File
Select sequences from fastq.gz file

ADD REPLY • link 9.2 years ago by GenoMax 152k

score 0 · Answer 1 · 2016-06-03

9.2 years ago

Biomonika (Noolean) 3.2k

seqtk sample with a fixed seed should work for you: using the same seed for both files keeps the pairs in sync. Take a look here:

Selecting Random Pairs From Fastq?
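To illustrate why a shared seed keeps mates together, here is a small awk sketch of the same idea. It is not how seqtk is implemented. It assumes 4-line FASTQ records, keeps each record with a given probability (so the output size is approximate, not an exact count), and the function name `sample_fastq` and demo files are hypothetical. With seqtk installed, the usual form is along the lines of `seqtk sample -s100 r1.fq 1000000 > sub1.fq`, repeated with the same `-s` value for r2.fq.

```shell
set -eu

workdir=$(mktemp -d)
# Demo data: 200 read pairs in matching order.
i=1
while [ "$i" -le 200 ]; do
  printf '@read%s/1\nACGT\n+\nIIII\n' "$i" >> "$workdir/r1.fq"
  printf '@read%s/2\nTGCA\n+\nIIII\n' "$i" >> "$workdir/r2.fq"
  i=$((i + 1))
done

# Keep each 4-line record with probability <fraction>.
# The same seed gives the same sequence of random draws, so the same
# record positions are kept in both files.
sample_fastq() {  # usage: sample_fastq <seed> <fraction> <file>
  awk -v seed="$1" -v frac="$2" '
    BEGIN { srand(seed) }
    NR % 4 == 1 { keep = (rand() < frac) }   # decide once per record
    keep                                     # print all 4 lines if kept
  ' "$3"
}

sample_fastq 100 0.25 "$workdir/r1.fq" > "$workdir/sub1.fq"
sample_fastq 100 0.25 "$workdir/r2.fq" > "$workdir/sub2.fq"

# Compare the sampled read names with the /1 and /2 suffixes removed.
names1=$(awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' "$workdir/sub1.fq")
names2=$(awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' "$workdir/sub2.fq")
lines1=$(wc -l < "$workdir/sub1.fq")
lines2=$(wc -l < "$workdir/sub2.fq")
echo "sampled $((lines1 / 4)) pairs"
```

Running the two calls with different seeds would select different records from each file and break the pairing, which is the failure mode the fixed seed avoids.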