Dear all,
I'm analyzing eukaryotic metatranscriptomics HiSeq (2x100 bp) data from soil (the cDNAs are from eukaryotic poly-A mRNAs). The FastQC report of the R1 reads shows that there is a sudden sequence quality drop around the 50th nucleotide, even some unknown bases (Ns), and high Kmer content (see here). However, this drop in quality and unknown bases are absent in the R2 reads (there).
I wonder if it is normal to have something like this, and if it is not, what should I do to improve the read quality?
This should not be an issue for count based analysis.
And what about paired-read merging and finally assembly? Do you think it'll be problematic to merge the paired-reads and later to do assembly?
Conceptually, if the overlap between the two reads extends beyond the base where the quality is low, it would create a problem, as the low-quality base will be a mismatch. But you have to run it and check how the results look.
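To make the overlap argument concrete, here is a minimal sketch of the arithmetic for 2x100 reads. It assumes an idealized fragment with no indels and ignores adapter read-through; the function names are hypothetical and this is not how any particular merging tool decides what to merge.

```python
def overlap_on_r1(fragment_len, read_len=100):
    """Return the 1-based (start, end) span of R1 that R2 also covers, or None.

    Assumes R1 covers fragment positions 1..read_len and R2 covers the
    last read_len positions of the fragment (idealized, no indels).
    """
    start = max(1, fragment_len - read_len + 1)
    end = min(read_len, fragment_len)
    if start > end:
        return None  # fragment too long: the two reads do not overlap
    return start, end


def position_in_overlap(position, fragment_len, read_len=100):
    """True if the given 1-based R1 position lies inside the R1/R2 overlap."""
    span = overlap_on_r1(fragment_len, read_len)
    return span is not None and span[0] <= position <= span[1]


# A quality drop near R1 position 50 only matters for merging when the
# fragment is short enough for R2 to reach back that far.
for fragment_len in (120, 149, 150, 250):
    print(fragment_len, overlap_on_r1(fragment_len),
          position_in_overlap(50, fragment_len))
```

Under these assumptions, position 50 of R1 falls inside the overlap only for fragments of roughly 50 to 149 bp, so how much the artifact hurts merging depends on the insert size distribution of the library.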
But now I wonder if the paired-read merging is really necessary. Any idea?
Further question: is the high Kmer content coming from the same technical problem? I wonder how reads that are supposed to come from total mRNAs (which I suppose are quite different from one another) can have similar sequences around their 50th nucleotide. Thanks again.
Quite possibly, since it occurs in the same spot. I expect that the 7mers are just what the base caller called due to the bubble (or whatever) being there. You might subset the file to look at only the tiles without the artifact and see if the jump in 7-mers in the middle of reads then goes away. Another possibility is that this is from a highly conserved region and there's biased fragmentation.
And how do I "subset the file to look at only the tiles without the artifact"?
By tile name. This is one of those poorly documented things, unfortunately. Let's take a look at a typical read name:
The HWI-ST1140 part is a machine ID, 171 is the run number, C5TKCACXX is a flow cell ID, 1 is the lane number and 1101 the tile on the lane. It's this tile number that we want to subset by. If you look at the FastQC output, you'll see that only tiles 2101 and up are affected. So something like
zgrep -E -A 3 --no-group-separator ":1:1[0-9]{3}:" file.fastq.gz | gzip > file.subset.fastq.gz
should extract all reads not on those tiles (with GNU grep, --no-group-separator keeps -A from writing "--" lines between non-adjacent matches, which would otherwise break the FASTQ format). You can then run that through FastQC and see what you get. If some of the k-mer aberrations go away then you know the cause.
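If a grep one-liner feels fragile, the same tile-based subsetting can be sketched in Python. This is only an illustration under stated assumptions: the read names are taken to follow the colon-separated layout described above (machine, run, flow cell, lane, tile, then coordinates), every record is assumed to span exactly four lines, and the function names are hypothetical. Because only header lines are inspected, a quality string that happens to contain a tile-like pattern cannot be mistaken for a read name.

```python
import gzip

# Assumed read-name layout: @machine:run:flowcell:lane:tile:x:y
TILE_FIELD = 4  # zero-based index of the tile after splitting on ":"


def tile_of(header):
    """Return the tile number from a FASTQ header line."""
    name = header.split()[0]  # drop any comment after whitespace
    return int(name.split(":")[TILE_FIELD])


def filter_by_tile(in_path, out_path, keep):
    """Copy 4-line records whose tile satisfies keep(tile).

    Returns a (kept, total) tuple of record counts.
    """
    kept = total = 0
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            total += 1
            if keep(tile_of(record[0])):
                fout.writelines(record)
                kept += 1
    return kept, total


# Example call (file names as in the thread), keeping tiles below 2101:
# filter_by_tile("file.fastq.gz", "file.subset.fastq.gz", lambda t: t < 2101)
```

Passing the selection rule as a function makes it easy to flip the test and look at only the affected tiles instead. If you subset R1 this way, apply the same filter to R2 so the pairs stay in sync.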
Actually, I guess this isn't so poorly documented.
Ah, OK. I'll try it. Never knew about this before. Thanks!
If you want even more obscure knowledge (I'd do well at bioinformatics pub quiz), the individual numbers in 1101 or 2308 also have meaning. The first digit (1 or 2) denotes the side of the flow cell (the camera just uses depth of field to distinguish between the sides). The second (1-3 I think) denotes the swath, and the last two the tile number within that swath. Currently there are 16 tiles per swath, though you'll see fewer (8 I think) in older datasets.
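As a small illustration of that digit scheme, here is a sketch that splits a four-digit tile number into its parts. It assumes exactly the layout described above (side, swath, two-digit tile within the swath); the names are hypothetical and other instruments may number tiles differently.

```python
from typing import NamedTuple


class TilePosition(NamedTuple):
    side: int   # 1 or 2: which side of the flow cell
    swath: int  # swath within that side
    tile: int   # tile number within the swath


def decode_tile(tile_number):
    """Split a four-digit tile number such as 2308 into its components."""
    digits = str(tile_number)
    if len(digits) != 4 or not digits.isdigit():
        raise ValueError(f"expected a four-digit tile number, got {tile_number!r}")
    return TilePosition(int(digits[0]), int(digits[1]), int(digits[2:]))


print(decode_tile(1101))  # TilePosition(side=1, swath=1, tile=1)
print(decode_tile(2308))  # TilePosition(side=2, swath=3, tile=8)
```

Grouping reads by side or swath this way can help show whether an artifact is confined to one side of the flow cell, as with tiles 2101 and up here.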
I wonder how you acquired such knowledge....
I wonder that too sometimes!
Seriously, how did you learn all of this? Did you follow a course somewhere? Did you get it from books? Journal articles?
If you hang around forums like this one for a while you'll pick up a lot of arcane knowledge.
This is interesting. How do we subset the data that is not affected? Based on the x,y coordinates of the clusters in the read name?
I would just use the tile number, since XY coordinates are within tiles (I think, don't quote me on that).