Question

"Valid" names for paired-end reads

1

10.6 years ago

Rob 7.1k

So, I realize the answer to this question might be "there is no standard; anything can happen." However, I'm curious about the valid ways in which paired-end reads can be named. For example, I know it's possible for both mates of a paired-end read to have exactly the same name. Sometimes they are named X/1 and X/2, where X is an identical prefix shared by both reads, and sometimes we get the lovely X_1 and X_2. What other variants are possible? Are there any restrictions on exactly what prefix must be shared and how the two reads in a pair must be named?

next-gen-sequencing RNA-Seq • 4.1k views

Updated 18 months ago by xiaoleiusc ▴ 140 • written 10.6 years ago by Rob 7.1k

Ram · Answer 1 · 2014-11-14

1

10.6 years ago

SES 8.6k

The most common variants you will see come from the Illumina identifiers, which are explained on the FASTQ format Wikipedia page. I have also seen people use "a" and "b" to denote forward and reverse, or simply leave off the identifier. Things get complicated when people start using their own identifiers, because then standard tools aren't guaranteed to work properly.
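
As a rough illustration of the conventions mentioned in this thread, here is a minimal Python sketch that splits a FASTQ header into a template name and a mate number. The function name `split_mate` and the set of patterns it recognizes are assumptions made for illustration, not part of any standard: it handles only the `/1`, `/2`, `_1`, `_2` suffixes and the newer Illumina style, where the read number is the first field of the comment after the space.

```python
import re

# Assumed suffix conventions: "/1", "/2", "_1", "_2".
# This is not an exhaustive or official list; custom schemes will not match.
_MATE_SUFFIX = re.compile(r"[/_]([12])$")

# Newer Illumina comment field: "<read>:<is filtered>:<control number>:<index>"
_ILLUMINA_COMMENT = re.compile(r"([12]):[YN]:\d+")


def split_mate(header):
    """Return (template_name, mate), where mate is 1, 2, or None if unknown."""
    name = header[1:] if header.startswith("@") else header
    # Everything after the first space is treated as a comment.
    name, _, comment = name.partition(" ")

    m = _ILLUMINA_COMMENT.match(comment)
    if m:
        return name, int(m.group(1))

    m = _MATE_SUFFIX.search(name)
    if m:
        return name[:m.start()], int(m.group(1))

    # Identical names for both mates, or a scheme this sketch does not know.
    return name, None
```

Two headers can then be treated as mates when `split_mate` returns the same template name for both. A name that legitimately ends in `_1` or `_2` for other reasons would be misread, so this should be checked against the actual data before relying on it.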

Updated 3.3 years ago by Ram 45k • written 10.6 years ago by SES 8.6k

1

So, as a follow-up: is it true that, when writing out SAM/BAM files, read mappers uniformly remove these extra identifiers? Specifically, is it valid to assume that in a BAM file, read1 and read2 will always have exactly the same QNAME?

Updated 3.3 years ago by Ram 45k • written 10.6 years ago by Rob 7.1k

0

If I understand correctly, that is part of the SAM format specification, which would explain why alignment programs write their output this way. Hopefully, someone who knows more than I do will comment to clarify this.
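
For reference, the SAM specification treats reads with an identical QNAME as coming from the same template, and it records which mate a record is in the FLAG field (bit 0x40 for the first segment, bit 0x80 for the last). A minimal sketch of reading that back, where the helper name `mate_from_flag` is hypothetical:

```python
# Bit values taken from the FLAG field of the SAM specification.
PAIRED = 0x1   # template has multiple segments (e.g. paired-end)
READ1 = 0x40   # first segment in the template
READ2 = 0x80   # last segment in the template


def mate_from_flag(flag):
    """Return 1 or 2 for the first/last read of a pair, else None."""
    if not flag & PAIRED:
        return None  # single-end record: the mate bits carry no meaning
    first = bool(flag & READ1)
    last = bool(flag & READ2)
    if first and not last:
        return 1
    if last and not first:
        return 2
    # Both or neither bit set: the position in the template is not a simple 1 or 2.
    return None
```

For example, the common paired-end flags 99 and 147 decode to mate 1 and mate 2 respectively.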

Written 10.6 years ago by SES 8.6k

0

I encountered the same problem. After STAR alignment, the BAM files give both mates of a paired-end read the same name (identifier). The removal of the extra identifiers that differentiate the mates is perplexing; I see no reason to remove them.
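
If a downstream tool needs distinct names, the mate information can still be recovered from the FLAG column of each record (bit 0x40 for read 1, bit 0x80 for read 2). Below is a minimal sketch that rebuilds a suffix from one line of SAM text. The function name `qname_with_mate` and the choice of `/1` and `/2` as the suffix are assumptions for illustration only.

```python
def qname_with_mate(sam_line):
    """Return the QNAME of a SAM text record with '/1' or '/2' appended.

    Header lines (starting with '@') and records that are not clearly
    read 1 or read 2 are returned without a suffix (None for headers).
    """
    if sam_line.startswith("@"):
        return None  # header line, no QNAME

    fields = sam_line.rstrip("\n").split("\t")
    qname, flag = fields[0], int(fields[1])  # columns 1 and 2: QNAME, FLAG

    paired = bool(flag & 0x1)
    is_read1 = bool(flag & 0x40)
    is_read2 = bool(flag & 0x80)

    if paired and is_read1 and not is_read2:
        return qname + "/1"
    if paired and is_read2 and not is_read1:
        return qname + "/2"
    return qname


# Usage on a made-up record (all values here are illustrative):
record = "read7\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII"
print(qname_with_mate(record))
```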

Written 18 months ago by xiaoleiusc ▴ 140