Question

Matching R1 and R2 in paired end reads from Illumina

2.5 years ago

Priyanka ▴ 10

I have a very basic curiosity.

How do tools match paired-end reads?

Suppose I have demultiplexed paired-end reads, have been treating R2 as single-end reads, and now want to find the matching mate in R1 for each of those reads. Do the headers of the two mates match somehow, or do tools that accept paired-end reads simply pair them in the order they appear in the two files?

And if I want to manually find the mate of a specific R2 read in R1, is there a way to do so?

illumina reads paired-end • 1.4k views

score 1 · Answer 1 · 2022-07-14

I assume you are asking about Illumina data, since that is the tech producing paired-end reads now (other technologies have done so, but I am not sure how common they are; BGI may be one). You have likely seen the spec for the FASTQ format. The read number is found in the FASTQ header, for example:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

In 1:Y:18:ATCACG, the leading 1 means read 1. The mate in the R2 file has the same ID before the space and a 2 in this position.
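To make the header layout concrete, here is a minimal Python sketch that splits the example header into the ID shared by both mates and the comment fields after the space. The names IlluminaHeader and parse_header are made up for illustration, and the sketch assumes the Casava 1.8+ style header shown above (four colon-separated fields after the space); older headers ending in /1 or /2 are not handled.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    read_id: str       # part before the space, identical in R1 and R2
    read_number: int   # 1 for R1, 2 for R2
    is_filtered: str   # "Y" or "N"
    control_bits: int
    index: str         # index (barcode) sequence or sample number


def parse_header(line: str) -> IlluminaHeader:
    """Split a Casava 1.8+ style FASTQ header into its fields."""
    if not line.startswith("@"):
        raise ValueError("a FASTQ header line must start with '@'")
    read_id, _, comment = line[1:].strip().partition(" ")
    fields = comment.split(":", 3)
    if len(fields) != 4:
        raise ValueError(f"unexpected comment field: {comment!r}")
    read_number, is_filtered, control_bits, index = fields
    return IlluminaHeader(read_id, int(read_number), is_filtered,
                          int(control_bits), index)


header = parse_header("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(header.read_id, header.read_number)
# prints: EAS139:136:FC706VJ:2:2104:15343:197393 1
```

Two mates can then be recognised by equal read_id values and read_number values of 1 and 2.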

Most tools do not check for 1:1 concordance of reads between the R1 and R2 files, and most aligners don't either. So don't assume tools are paying attention to headers. Tools generally expect the reads to be in sync, meaning the Nth record in R1 and the Nth record in R2 are mates.
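If you want to verify that assumption yourself, a rough Python sketch is below. It walks both files in lockstep and reports the first position where the IDs differ. The function names (open_fastq, core_id, read_ids, first_mismatch) are made up for illustration, and it assumes strict four-line FASTQ records with no blank lines.

```python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a plain or gzipped FASTQ file in text mode."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def core_id(header_line):
    """Return the read ID shared by both mates (no '@', no comment, no /1 or /2)."""
    name = header_line.strip()[1:].split(" ", 1)[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def read_ids(path):
    """Yield the core ID of every record, assuming four lines per record."""
    with open_fastq(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:
                yield core_id(line)


def first_mismatch(r1_path, r2_path):
    """Return (record_index, r1_id, r2_id) for the first out-of-sync pair, else None.

    A missing record in the shorter file shows up as None for that ID.
    """
    pairs = zip_longest(read_ids(r1_path), read_ids(r2_path))
    for index, (id1, id2) in enumerate(pairs):
        if id1 != id2:
            return index, id1, id2
    return None
```

Calling first_mismatch on your R1 and R2 files returns None when they are in sync. Any other result gives the zero-based record index at which they diverge.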

Note: If you suspect that your R1/R2 files are not in sync, a tool like repair.sh from the BBMap suite can be used to remove singleton reads and bring the files back in sync.
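To illustrate what re-syncing means, here is a small Python sketch of the idea. It is not how repair.sh is implemented, only a naive version: it keeps reads whose ID appears in both files and writes them in R1 order. All function names are hypothetical. It assumes four-line records, holds the entire R2 file in memory, and writes plain (uncompressed) output, so it only suits small files.

```python
import gzip


def open_fastq(path):
    """Open a plain or gzipped FASTQ file in text mode."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def core_id(header_line):
    """Return the read ID shared by both mates (no '@', no comment, no /1 or /2)."""
    name = header_line.strip()[1:].split(" ", 1)[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def read_records(path):
    """Yield (core_id, four_lines) for every record in the file."""
    with open_fastq(path) as handle:
        while True:
            lines = [handle.readline() for _ in range(4)]
            if not lines[0]:
                break
            # Normalise line endings so records can be written back safely.
            lines = [line.rstrip("\n") + "\n" for line in lines]
            yield core_id(lines[0]), lines


def repair_pairs(r1_path, r2_path, out1_path, out2_path):
    """Write only reads present in both inputs, in R1 order; return the pair count."""
    r2_records = dict(read_records(r2_path))  # whole R2 file held in memory
    kept = 0
    with open(out1_path, "w") as out1, open(out2_path, "w") as out2:
        for read_id, lines in read_records(r1_path):
            mate = r2_records.get(read_id)
            if mate is not None:  # reads without a mate (singletons) are dropped
                out1.writelines(lines)
                out2.writelines(mate)
                kept += 1
    return kept
```

For real data, a dedicated tool is the better choice, since it is built to handle large files and to save the singletons rather than discard them.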

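You can find a specific read by searching your files with zgrep -A 3 header_ID yourfile_R1/R2 (gzipped) or grep -A 3 header_ID yourfile_R1/R2 (plain FASTQ). The -A 3 option prints the three lines after the matching header, so you get the whole four-line record. To find the mate of an R2 read in R1, search for the part of the header before the space, since the read number after the space differs between the two files.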
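If you would rather do the lookup in a script, a rough Python equivalent of the grep approach is sketched below. The helper names are made up, and it assumes four-line records. It compares the ID exactly rather than by substring, and it scans the file from the top, so it is fine for a handful of lookups but slow for many.

```python
import gzip


def open_fastq(path):
    """Open a plain or gzipped FASTQ file in text mode."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def core_id(header_line):
    """Return the read ID shared by both mates (no '@', no comment, no /1 or /2)."""
    name = header_line.strip()[1:].split(" ", 1)[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def find_read(path, read_id):
    """Return the four lines of the record with this ID, or None if absent."""
    with open_fastq(path) as handle:
        while True:
            record = [handle.readline() for _ in range(4)]
            if not record[0]:
                return None  # reached end of file without a match
            if core_id(record[0]) == read_id:
                return [line.rstrip("\n") for line in record]
```

For the example header above, find_read(r1_path, "EAS139:136:FC706VJ:2:2104:15343:197393") would return the R1 mate as a list of four strings, where r1_path is a placeholder for your own R1 file.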