Hello,
I'm looking through some FastQC reports on public data I've downloaded. The reports show mediocre quality, and running trimmers doesn't make a difference.
I've discussed this with a colleague, who says my problem is that many reads show overrepresentation in the middle of the sequence. Because trimmers only work on the beginning and end of sequences, neither they nor the QC tools will be able to make a difference. My colleague used the phrase "micro-satellites" for what is seen around nucleotides 105-109, with a sharp drop and then rise in %A. I think this will have a very negative effect on the alignment process.
Are there any tools to correct such micro-satellites here? Is this the correct phrase to describe this error? Should I even worry about this?
How did you download this data? Did you split read 1 and read 2? Are you sure these should be 200bp reads?
fastq-dump has an option to split R1+R2, in case this is paired-end data.
Agree with you, it seems the data is a combination of R1+R2. The abnormal cycles in the middle are actually the beginning cycles of R2.
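If the file has already been dumped with R1 and R2 joined into one read, the idea above can be sketched in Python as cutting every record at the R1 length. This is only a rough sketch, not a replacement for re-downloading properly split files: it assumes every spot has the same fixed R1 length, and `r1_len`, the function names and the `/1`, `/2` suffixes are my own choices for illustration. Check the real read lengths in the run metadata before trusting any split point.

```python
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, not needed
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def split_record(header: str, seq: str, qual: str, r1_len: int) -> Tuple[str, str]:
    """Cut one concatenated record into R1 and R2 FASTQ records.

    r1_len is an assumption supplied by the caller: the number of cycles
    that belong to read 1. Everything after that is treated as read 2.
    """
    if len(seq) != len(qual):
        raise ValueError(f"sequence/quality length mismatch in {header}")
    if not 0 < r1_len < len(seq):
        raise ValueError(f"r1_len={r1_len} does not fall inside {header}")
    name = header.split()[0]  # keep '@id', drop the description
    r1 = f"{name}/1\n{seq[:r1_len]}\n+\n{qual[:r1_len]}\n"
    r2 = f"{name}/2\n{seq[r1_len:]}\n+\n{qual[r1_len:]}\n"
    return r1, r2


def split_fastq(in_path: str, r1_path: str, r2_path: str, r1_len: int) -> int:
    """Write R1 and R2 files from a concatenated FASTQ; return record count."""
    count = 0
    with open(in_path) as src, open(r1_path, "w") as out1, open(r2_path, "w") as out2:
        for header, seq, qual in read_fastq(src):
            r1, r2 = split_record(header, seq, qual, r1_len)
            out1.write(r1)
            out2.write(r2)
            count += 1
    return count
```

Running FastQC on the two output files separately should show whether the odd cycles in the middle really are the first cycles of R2.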
@chen and @h.mon I have written a program, because the NCBI page https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2532810 doesn't suggest anything about paired-end reads. Also, if I remember how paired-end reads work, there should be some indication of this in the SEQID lines of the file. I don't see it here when I show the top label lines:
This is a paired-end set of data. If you look at SRA you will see that. I suggest that you avoid GEO/SRA altogether and download the fastq files from ENA.
hi h.mon,
did you mean --split-spot or --split-3?
--split-files should be the option. While you are at it, use -F to recover the original Illumina-style fastq read headers.
That said, see my comment above about getting the fastq files directly from ENA.