Question

Sudden quality drop in the middle of HiSeq R1 reads but not in R2?

2

Entering edit mode

9.7 years ago

sentausa ▴ 650

Dear all,

I'm analyzing eukaryotic metatranscriptomics HiSeq (2x100 bp) data from soil (the cDNAs are from eukaryotic poly-A mRNAs). The FastQC report of the R1 reads shows that there is a sudden sequence quality drop around the 50th nucleotide, even some unknown bases (Ns), and high Kmer content (see here). However, this drop in quality and unknown bases are absent in the R2 reads (there).

I wonder if it is normal to have something like this, and if it is not, what should I do to improve the read quality?

metatranscriptomics RNA-Seq Illumina • 6.5k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by sentausa ▴ 650

4

Entering edit mode

9.3 years ago

chen ★ 2.5k

A tool in github can remove these bubble effects

https://github.com/OpenGene/AfterQC

Automatic Filtering, Trimming, Error Removing and Quality Control for Fastq Data
Currently it supports Illumina 1.8 or newer format

AfterQC does following tasks automatically:

    Filters reads with too low quality, too short length or too many N
    Filters reads with abnormal PolyA/PolyT/PolyC/PolyG sequences
    Does per-base quality control and plots the figures
    Trims reads at front and tail, according to QC results
    For pair-end sequencing data, AfterQC automatically corrects low quality wrong bases in overlapped area of read1/read2
    Detects and eliminates bubble artifact caused by sequencer due to fluid dynamics issues
    Single molecule barcode sequencing support: if all reads have a single molecule barcode (see duplex sequencing), AfterQC shifts the barcodes from the reads to the fastq query names
    Support both single-end sequencing and pair-end sequencing data

ADD COMMENT • link 8.3 years ago by chen ★ 2.5k

1

Entering edit mode

Looks interesting, is there any documentation on how it works (just to save me going through the code)? In particular, information on how it does bubble detection would be nice.

ADD REPLY • link 9.3 years ago by Devon Ryan 104k

0

Entering edit mode

Emm.. no any document right now.

Just started writing a small paper, will share it to u when it is done :)

ADD REPLY • link 9.3 years ago by chen ★ 2.5k

0

Entering edit mode

Thanks for sharing. I'm analyzing public metagenomics data and encountered sudden drop in quality in 50th and 76th bases in both read1 and read2. After reading this thread I thought it is a bubble effect. So, I tried to apply AfterQC to fix that. It improved quality for sure, but the effect is still present, additionally, it seems that AfterQC doesn't trim nextera transposase sequence (should it?).

I've never encountered bubble before, so maybe it is not bubble at all. Could be great if you can advice smth.

I've analyzed raw fastq with fastqc and combined the results with multiqc, then applied afterqc, again analyzed with fastqc and combined with multiqc (to compare with previous). Check the screenshots.

multiqc_afterqc_adapters
multiqc_afterqc_quality
multiqc_raw_quality

multiqc_afterqc_adapters multiqc_afterqc_quality multiqc_raw_quality

ADD REPLY • link 7.6 years ago by aln ▴ 320

0

Entering edit mode

Hi @aln. I am seeing a similar pattern of sudden drops in sequence quality near the middle of my reads for my hiseq 4000 data. Have you had any luck yet with figuring this out? The only difference for my data is a single "blip" / decrease near the center of reads, rather than two as you have above.

Thanks.

ADD REPLY • link 7.5 years ago by davis.todt • 0

0

Entering edit mode

In my case it wasn't a bubble, but it could be in your case, so check with AfterQC. If you also have strange adapters content, which starts very early, please, read this issue - https://github.com/OpenGene/AfterQC/issues/16. It may have smth to do with this - https://www.ncbi.nlm.nih.gov/pubmed/24523726.

In the end I processed the data with AfterQC and additinally trimmed adapters with trim_galore (--nextera flag), while filtering for short read. I've got affordable quality, though the gaps are still present, but with quality score > 20.

ADD REPLY • link 7.5 years ago by aln ▴ 320

Ram · Accepted Answer · 2015-03-25

4

Entering edit mode

9.7 years ago

Devon Ryan 104k

This is likely due to a bubble going through as the image was being acquired. This actually used to be fairly common. Whether this will be an issue or not will depend a bit on exactly what you want to do with the data.

ADD COMMENT • link 9.7 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks for answering. Basically we'd like to know which genes are active and from which organism. So, I'm thinking to map the reads to some eukaryotic genomes that we have. Do you think it may cause a problem?

ADD REPLY • link 9.7 years ago by sentausa ▴ 650

0

Entering edit mode

This should not be an issue for count based analysis.

ADD REPLY • link 9.7 years ago by GouthamAtla 12k

0

Entering edit mode

And what about paired-read merging and finally assembly? Do you think it'll be problematic to merge the paired-reads and later to do assembly?

ADD REPLY • link 9.7 years ago by sentausa ▴ 650

0

Entering edit mode

Conceptually, if the overlap between two reads extends beyond the base where the quality is low, it would create a problem as the low quality base will be a mismatch. But you have to run it check how the results look.

ADD REPLY • link 9.7 years ago by GouthamAtla 12k

0

Entering edit mode

But now I wonder if the paired-read merging is really necessary. Any idea?

ADD REPLY • link 9.7 years ago by sentausa ▴ 650

0

Entering edit mode

Further question: is the high Kmer content coming from the same technical problem? I wonder how the reads that are supposed to be coming from total mRNAs (which I supposed are quite different from one another) have similar sequences around their 50th nucleotide. Thanks again.

ADD REPLY • link 9.7 years ago by sentausa ▴ 650

0

Entering edit mode

Quite possibly, since it occurs in the same spot. I expect that the 7mers are just what the base caller called due to the bubble (or whatever) being there. You might subset the file to look at only the tiles without the artifact and see if the jump in 7-mers in the middle of reads then goes away. Another possibility is that this is from a highly conserved region and there's biased fragmentation.

ADD REPLY • link 9.7 years ago by Devon Ryan 104k

1

Entering edit mode

And how to do "subset the file to look at only the tiles without the artifact"?

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by sentausa ▴ 650

1

Entering edit mode

By tile name. This is one of those poorly documented things, unfortunately. Let's take a look at a typical read name:

@HWI-ST1140:171:C5TKCACXX:1:1101:1212:2202 1:N:0:CGATGT
CTCATTTTCCGTGATTTTCATTTTTCTCGCCATATTCCAGGTCCTTCAGT
+
=@@FFFFFGHHACFHEHHEEGGHHIGGI=FHEAADHGEIJGDDGGGAFG0

The HWI-ST1140 part is a machine ID, 171 is the run number, C5TKCACXX is a flow cell ID, 1 is the lane number and 1101 the tile on the lane. It's this tile number that we want to subset by. If you look at the fastQC output, you'll see that only tiles 2101 and up are affected. So something like

zgrep -E -A 3 ":1:1[0-9]{3}:" file.fastq.gz | gzip > file.subset.fastq.gz

should extract all reads not on those tiles. You can then run that through fastQC and see what you get. If some of the k-mers aberrations go away then you know the cause.

ADD REPLY • link 9.7 years ago by Devon Ryan 104k

0

Entering edit mode

Actually, I guess this isn't so poorly documented.

ADD REPLY • link 9.7 years ago by Devon Ryan 104k

0

Entering edit mode

Ah, OK. I'l try it. Never knew about this before. Thanks!

ADD REPLY • link 9.7 years ago by sentausa ▴ 650

1

Entering edit mode

If you want even more obscure knowledge (I'd do well at bioinformatics pub quiz), the individual numbers in 1101 or 2308 also have meaning. The first digit (1 or 2) denotes the side of the flow cell (the camera just uses depth of field to distinguish between the sides). The second (1-3 I think) denotes the swath, and the last two the tile number within that swath. Currently there are 16 tiles per swath, though you'll see fewer (8 I think) in older datasets.

ADD REPLY • link 9.7 years ago by Devon Ryan 104k

2

Entering edit mode

I wonder how you acquired such knowledge....

ADD REPLY • link 9.7 years ago by sentausa ▴ 650

0

Entering edit mode

I wonder that too sometimes!

ADD REPLY • link 9.7 years ago by Devon Ryan 104k

0

Entering edit mode

Seriously, how did you learn all of these? Did you follow a course somewhere? Did you get it from books? Journal articles?

ADD REPLY • link 9.7 years ago by sentausa ▴ 650

1

Entering edit mode

If you hang around forums like this one for a while you'll pick up a lot of arcane knowledge.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Devon Ryan 104k

0

Entering edit mode

This is interesting. How do we subset data that is not effected? Based on x,y coordinates of clusters on read name?

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by GouthamAtla 12k

0

Entering edit mode

I would just use the tile number, since XY coordinates are within tiles (I think, don't quote me on that).

ADD REPLY • link 9.7 years ago by Devon Ryan 104k