8.5 years ago
acorella
Hi,
I have paired-end reads in 2 separate fastq files. I want to take a subset of these reads for a bowtie run to get the insert size. I am familiar with how to break up an individual file into chunks of 1 million reads (e.g. as described here: https://www.biostars.org/p/66864/).
My Question: Do I need to ensure my reads are in the same order in each file before I do this? If so, how do I do this?
Thanks!
If you have not done anything to the files (other than using a paired-end aware trimming program) then the reads should be in order in R1/R2 files.
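To make "in order" concrete, here is a minimal Python sketch that walks both files in lockstep and reports the first pair whose names disagree. It assumes plain 4-line FASTQ records and read names that match once a trailing /1 or /2 and any comment after the first whitespace are dropped; the function names `read_names` and `first_mismatch` are made up for this illustration and are not part of any package.

```python
import gzip
from itertools import zip_longest


def read_names(path):
    """Yield the read name of each 4-line FASTQ record (plain text or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 != 0:
                continue  # only header lines carry the read name
            fields = line[1:].split()  # drop '@' and any trailing comment
            if not fields:
                continue  # tolerate a blank line at the end of the file
            name = fields[0]
            if name.endswith(("/1", "/2")):
                name = name[:-2]  # strip the old-style mate suffix
            yield name


def first_mismatch(r1_path, r2_path):
    """Return the 1-based index of the first out-of-sync pair, or None."""
    names = zip_longest(read_names(r1_path), read_names(r2_path))
    for n, (name1, name2) in enumerate(names, start=1):
        # zip_longest pads the shorter file with None, so a length
        # difference is also reported as a mismatch.
        if name1 != name2:
            return n
    return None
```

Calling `first_mismatch("r1.fq", "r2.fq")` (file names are placeholders) returns `None` when every pair lines up. It only compares names, so it cannot tell you whether the sequences themselves are true mates.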
If you suspect that the pairing is broken, the files can be repaired with repair.sh, which is from the BBMap suite.
Thank you! That was indeed the question I was trying to ask!
Is there a quick way you can tell if the pairing is broken?
reformat.sh from the same package has an option to do that. It will just verify that the names indicate the reads are in the same order in each file. Incidentally, you can also use it to randomly sample 1M pairs from them.
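Independently of reformat.sh, the sampling idea can be sketched in plain Python: read both files in lockstep and reservoir-sample whole pairs, so mates always stay together. This is only an illustration under stated assumptions (uncompressed 4-line FASTQ, files already in the same order, and enough memory to hold the sampled pairs); `fastq_records` and `sample_pairs` are hypothetical names, not BBMap functions.

```python
import random
from itertools import islice


def fastq_records(handle):
    """Yield 4-line FASTQ records as lists of newline-terminated lines."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return  # end of file (or a truncated final record)
        # Normalise line endings so a last line without '\n' is safe to write.
        yield [line.rstrip("\n") + "\n" for line in record]


def sample_pairs(r1_path, r2_path, out1_path, out2_path, n_pairs, seed=0):
    """Reservoir-sample n_pairs read pairs and write them to two files.

    Returns the number of pairs written, which is smaller than n_pairs
    only when the input holds fewer pairs than requested.
    """
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    reservoir = []
    with open(r1_path) as f1, open(r2_path) as f2:
        pairs = zip(fastq_records(f1), fastq_records(f2))
        for i, pair in enumerate(pairs):
            if i < n_pairs:
                reservoir.append(pair)  # fill the reservoir first
            else:
                j = rng.randint(0, i)  # inclusive on both ends
                if j < n_pairs:
                    reservoir[j] = pair  # replace with probability n_pairs/(i+1)
    with open(out1_path, "w") as o1, open(out2_path, "w") as o2:
        for rec1, rec2 in reservoir:
            o1.writelines(rec1)
            o2.writelines(rec2)
    return len(reservoir)
```

For example, `sample_pairs("r1.fq", "r2.fq", "sub1.fq", "sub2.fq", 1_000_000)` would write one million pairs (file names are placeholders). The output is not in the original read order, but R1 and R2 stay in sync with each other, which is what a paired-end mapper needs.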
If your reads are overlapping, you can discover the insert size with BBMerge; if not, you'll need to use mapping.
Duplicate of "Selecting Random Pairs From Fastq?"
And other similar threads (a subset)
How to randamly extract reads from a FASTQ file?
How To Randomly Select 20M Reads From A 200M Fastq File
Select sequences from fastq.gz file