The following FastQC report is common to most replicates of an mRNA-seq experiment:
There is bias at the beginning of the reads in the 'Per base sequence content' module, and there is 'Kmer Content' bias; however, the 'Adapter content' module shows no error.
a) Does this mean that although the reads are not contaminated with the most common adapters (like TruSeq2 or Nextera), they could be contaminated with other, less common adapters? Note: I'm not sure which adapters were used in library preparation.
a-1) Should I make a file with all types of adapters and use it to trim the reads, or could this cause problems if there is no adapter contamination?
b) The 'Sequence duplication levels' module also shows a warning, and we can see some reads at duplication levels of 10-50. However, if I choose to remove duplicates, I will lose ~45% of the library. Should I remove duplicates, or is this duplication level normal for highly expressed genes? Note: the total number of reads is ~15 million.
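For reference, the kind of figure quoted above (share of the library lost to deduplication) can be roughly reproduced by counting exact-sequence duplicates. This is only a minimal sketch: the function name `duplication_summary` is hypothetical, it treats two reads as duplicates only when their sequences are identical, and it is not how FastQC itself computes its estimate.

```python
from collections import Counter


def duplication_summary(sequences):
    """Summarise exact-sequence duplication for an iterable of read sequences.

    Hypothetical helper for illustration: reads count as duplicates only
    if their sequences match exactly.
    """
    counts = Counter(sequences)
    total = sum(counts.values())   # all reads seen
    unique = len(counts)           # distinct sequences, i.e. reads kept after dedup
    fraction_lost = (1 - unique / total) if total else 0.0
    return {"total": total, "unique": unique, "fraction_lost": fraction_lost}


if __name__ == "__main__":
    # Toy input: four reads, one sequence seen twice.
    print(duplication_summary(["ACGT", "ACGT", "TTGA", "CCAT"]))
```

In mRNA-seq, reads from highly expressed transcripts will naturally share sequences, so a high fraction here does not by itself indicate a technical artefact.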
Regarding Post #1, there are some sentences I do not understand:
-> "The question then arises as to whether this bias has any implications for downstream analyses. There are a couple of potential concerns: 1-It’s possible that there is increased mis-priming as part of the bias – introducing an increased number of mis-called bases at the start of the sequence."
-> "The bias at the start of the sequences appears to be the result of biased selection of fragments from the library, so high levels of predicted SNPs are not an issue. "
-> "People often suggest fixing this issue by 5′ trimming of the reads to remove the biased portion – this however is not a fix. Since the biased composition is created by the selection of sequencing fragments and not by base call errors the only effect of trimming would be to change from having a library which starts over biased positions, to having a library which starts slightly downstream of biased positions." Regarding this last passage, I understand that trimming won't solve the problem of some fragments (those to which primers bind more readily) being overrepresented relative to others, but doesn't it solve the alignment problem? I mean, although the reads are shorter after trimming, without the biased portion they should align better, shouldn't they?
In practice, this bias at the beginning of reads has been shown not to cause any problems with alignment. You can verify this yourself with your own data. Since the bias affects all samples equally, it should not cause any batch effect when you do the analysis.
If you feel comfortable losing 15 bp of good data at the beginning of each read, you are welcome to chop those off. Remember that shorter reads can mean less precise mapping (so your alignment results may actually suffer). This will depend on the length of the read left after you scan/trim for adapters and additionally remove the 15 bp at the front.
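If you do want to try it, the trimming itself is simple. The sketch below is only an illustration, assuming plain four-line FASTQ records; the names `read_fastq` and `trim_5prime` and the `min_len=20` cutoff are hypothetical choices, not taken from any particular trimming tool.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def trim_5prime(records, n=15, min_len=20):
    """Remove the first n bases (and qualities) of each read.

    Reads shorter than min_len after trimming are dropped, since very
    short reads tend to map less precisely. Both defaults are assumptions.
    """
    for header, seq, qual in records:
        seq, qual = seq[n:], qual[n:]
        if len(seq) >= min_len:
            yield header, seq, qual


if __name__ == "__main__":
    # Toy input: one 45 bp read (kept as 30 bp) and one 20 bp read (dropped).
    fastq = (
        "@r1\n" + "A" * 15 + "C" * 30 + "\n+\n" + "I" * 45 + "\n"
        "@r2\n" + "ACGT" * 5 + "\n+\n" + "I" * 20 + "\n"
    )
    for header, seq, qual in trim_5prime(read_fastq(io.StringIO(fastq))):
        print(header, len(seq))
```

Aligning both the trimmed and untrimmed versions of the same sample and comparing the mapping rates is one way to check whether the trimming helps or hurts with your data.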