Question

RRBS and Illumina read size


9.6 years ago

pwh • 0

I'm new at RRBS and have just received some 100bp data sequenced on Illumina HiSeq.

The problem I have is this: in RRBS the DNA is digested to varying lengths, many below 100 bases. So how can 100bp reads be produced by a HiSeq? If, say, a 40bp fragment is sequenced on HiSeq to 100bp, what is actually sequenced from base 41 -> 100?

RRBS • 3.4k views

ADD COMMENT • link updated 9.6 years ago by Shicheng Guo ★ 9.6k • written 9.6 years ago by pwh • 0


Was the adapter removed from the reads?

ADD REPLY • link 9.6 years ago by Zaag ▴ 870


Adapter and quality trim with "Trim Galore!" and this will be taken care of.

ADD REPLY • link 9.6 years ago by Devon Ryan 105k

score 2 · Answer 1 · 2016-04-19


9.6 years ago

Shicheng Guo ★ 9.6k

Two adapters (P5 and P7) are ligated to the two ends of your fragment (say 40bp). There is also an index next to the adapter, and sometimes a UMI as well. So if the fragment is shorter than the read length (100bp), the sequencer keeps going past the end of the fragment into the P5 or P7 adapter, and those adapter bases are recorded in the FASTQ file. The universal adapter is 58bp and the multiplex (indexed) adapter is 63bp.

If the fragment plus the adapter is still shorter than the read length, there is no template left. The machine will then either give no signal or repeat the last base over and over. I think the sequencer might remove such repeated bases, in which case the read lengths in the FASTQ would differ. In any case, the read lengths will certainly differ once you have removed the adapters with Trim Galore.
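To make the read-through concrete, here is a minimal Python sketch of what a fixed-cycle read contains when the insert is shorter than the read. The function name `simulate_read`, the placeholder insert and the poly-A filler are illustrative assumptions. Only the first 13 bases of the adapter string are real sequence (the common Illumina adapter prefix that Trim Galore searches for by default), and what the instrument actually reports once the template runs out depends on the platform.

```python
ADAPTER_PREFIX = "AGATCGGAAGAGC"  # start of the standard Illumina adapter


def simulate_read(insert, adapter, cycles=100, filler="A"):
    """Return (read, n_insert, n_adapter, n_filler) for a fixed-cycle read.

    The read runs through the insert, then into the adapter. Any cycles left
    over are padded with `filler`, which is an assumption standing in for
    whatever low-quality calls the instrument makes without template.
    """
    template = insert + adapter
    read = template[:cycles]
    n_insert = min(len(insert), cycles)
    n_adapter = len(read) - n_insert
    n_filler = cycles - len(read)
    return read + filler * n_filler, n_insert, n_adapter, n_filler


# A hypothetical 40 bp insert followed by a 63 bp adapter, read for 100 cycles.
insert = "CGG" + "T" * 37
adapter = ADAPTER_PREFIX + "N" * 50  # padded to 63 bp; the Ns are placeholders
read, n_insert, n_adapter, n_filler = simulate_read(insert, adapter)
print(n_insert, n_adapter, n_filler)  # prints: 40 60 0
```

Under these assumptions, bases 41 to 100 of the read are all adapter, which is why adapter trimming is needed before alignment.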



Thank you so much for that explanation Shicheng. It was very helpful.

I should have mentioned in my initial post that I have trimmed with Trim Galore (--rrbs option) and removed the adapters, poor-quality bases, etc.

My mapping rate is poor though, at about 10-20% per sample. From the QC the reads are good quality, bisulfite conversion was almost complete (except for the methylated bases, presumably). When I run Bowtie2 with a local alignment, the mapping rate shoots up to about 90%, but with 80% of these being multiple alignments. Even with allowing for more mismatches, etc, the end-to-end mapping rate doesn't increase much, so it's not sequencing errors. So I'm unsure if there are still portions of the reads that have repeats, primers or some other non-template sequences.

Note that the trimming did change the length distribution of the reads. Before trimming they were all 100bp; afterwards they span a variety of sizes, as expected. However, the distribution is still heavily skewed toward 90+ bp, though a few reads go all the way down to 23bp. This size distribution seems wrong to me, and suggests that the trimming didn't remove all the anomalous bases.
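For anyone wanting to reproduce this kind of length check without a QC tool, a minimal Python sketch follows. The helper names `read_lengths` and `fraction_at_least` and the demo records are made up for illustration, and the code assumes plain 4-line FASTQ records (no wrapped sequence lines).

```python
from collections import Counter


def read_lengths(fastq_lines):
    """Count read lengths from an iterable of FASTQ lines (4 lines per record)."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence is the 2nd line of each record
            counts[len(line.strip())] += 1
    return counts


def fraction_at_least(counts, min_len):
    """Fraction of reads whose length is >= min_len."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for length, n in counts.items() if length >= min_len) / total


# Two made-up records standing in for a trimmed FASTQ file.
demo = [
    "@read1", "TGGATTTGA", "+", "IIIIIIIII",
    "@read2", "CGGTTAG", "+", "IIIIIII",
]
counts = read_lengths(demo)
for length in sorted(counts):
    print(length, counts[length])
print(fraction_at_least(counts, 8))
```

For a real file, the same function can be fed an open handle, for example `gzip.open(path, "rt")` for a gzipped FASTQ, and `fraction_at_least(counts, 90)` then gives the share of reads at 90bp or longer.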

[Figure: base content post trimming]

[Figure: read length distribution]

ADD REPLY • link 9.6 years ago by pwh • 0


What are you using for the alignment? Bismark, or are you wrapping bowtie2 yourself?

ADD REPLY • link 9.6 years ago by Devon Ryan 105k


I converted the genome (C->T and G->A) using Bismark and was mapping to the converted genome using Bowtie2 (which Bismark uses itself) for testing purposes. I've done this with both single and paired-end reads, almost no difference in mapping rate.
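As a small illustration of the in-silico conversion mentioned above, here is a Python sketch. The function names and the toy sequences are hypothetical and this is not Bismark's code. As far as I understand, Bismark applies the same kind of conversion to the reads before handing them to Bowtie2, so that methylated cytosines retained in a read do not count as mismatches against the converted genome.

```python
def c_to_t(seq):
    """Fully C->T convert a sequence (forward-strand conversion)."""
    return seq.upper().replace("C", "T")


def g_to_a(seq):
    """Fully G->A convert a sequence (reverse-strand conversion)."""
    return seq.upper().replace("G", "A")


def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


genome = "ACGTCCGGATCG"          # toy reference
print(c_to_t(genome))            # ATGTTTGGATTG
print(g_to_a(genome))            # ACATCCAAATCA

# A toy bisulfite read in which one methylated C (index 5) was retained.
read = "ATGTTCGGATTG"
print(mismatches(read, c_to_t(genome)))          # 1
print(mismatches(c_to_t(read), c_to_t(genome)))  # 0
```

In this toy case the unconverted read carries one mismatch per retained methylated C, while the converted read matches exactly.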

My trimming and mapping strategies tell me the poor mapping rate is not due to a few sequencing errors (or the increased mismatch parameters would have caught it), sequence quality (which is good), primers (Trim Galore should have taken care of that) or other issues like incomplete bisulfite conversion. I'm at a loss.

The read length distribution worries me though. Shouldn't the distribution be skewed toward 40bp, or at least contain far more short fragments? Perhaps there are still primers, fragments of primers or other multi-base contamination on the reads? I'm not sure what size selection was done by my service provider, but I've sent an email asking them.
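One way to test the suspicion that adapter fragments remain is to count the trimmed reads that still contain adapter sequence. The Python sketch below does exact matching only, against the 13-base Illumina adapter prefix that Trim Galore uses by default. The function names and the `min_overlap` threshold are assumptions made for illustration, and a clean result would not rule out other kinds of contamination.

```python
ADAPTER = "AGATCGGAAGAGC"  # standard Illumina adapter prefix


def find_adapter_start(read, adapter=ADAPTER, min_overlap=5):
    """Return the index where adapter sequence begins in `read`, or None.

    Exact matching only: either the full adapter anywhere in the read, or an
    adapter prefix of at least `min_overlap` bases at the 3' end of the read.
    """
    pos = read.find(adapter)
    if pos != -1:
        return pos
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return len(read) - k
    return None


def fraction_with_adapter(reads, **kwargs):
    """Fraction of reads in which find_adapter_start reports a hit."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(find_adapter_start(r, **kwargs) is not None for r in reads)
    return hits / len(reads)


# Made-up reads: full adapter, partial adapter at the 3' end, no adapter.
reads = [
    "TTTT" + ADAPTER + "AAAA",
    "ACGTACGTAGATCGG",
    "ACGTACGTACGT",
]
print([find_adapter_start(r) for r in reads])  # [4, 8, None]
print(fraction_with_adapter(reads))
```

A very low fraction on the trimmed reads would suggest the long reads are genuine inserts, which points back at the size selection rather than at leftover adapter.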