Hello everyone, I am running a de novo RNA-seq experiment and am at the quality control step. I can't interpret two of the plots in the FastQC output, sequence duplication levels and per-sequence GC content, and I don't know how to trim the reads based on them. I have read in some articles that removing duplicates is not recommended for differential expression analysis. Here are some details of the plots.
Sequence duplication levels plot: percent of sequences remaining if deduplicated is 48.8%. The blue line shows two peaks: one between 9 and 50 on the X axis with a maximum of 15% on the Y axis, and a second between 50 and 500 on the X axis with a maximum of 8% on the Y axis.
Per-sequence GC content plot: the red line has two peaks, (1) at X = 45, Y = 500000 and (2) at X = 72, Y = 720000. The blue line has one peak at X = 72, Y = 720000.
Best regards
Thanks for your answer. Can I ignore all of this output, even if it shows two peaks in the GC content? With regards
That is hard to judge without more information and the picture. I have just looked at some of our data, and most samples have a single bell-shaped GC distribution with the mean very close to the GC content of all exons in the organism. If you have two peaks, you could either have contamination from a different organism, or some reads from this organism may have a very different GC content; ribosomal RNA is one example. You certainly need to understand what you are dealing with. For that, you can plot the distribution of GC content for all genes, including ribosomal RNA, and compare the distributions. In the end, however, the question is whether there is anything you can or need to do as a result of your findings. You should continue with your analysis and possibly check for contamination as well, but that can only be done by taking all the data forward, by making either (pseudo-)alignments or an assembly.
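To compare GC distributions as suggested above, something along these lines might help. It is only a minimal sketch in plain Python, assuming your sequences (for example assembled transcripts or rRNA references) are available in FASTA format. The function names `read_fasta`, `gc_percent` and `gc_histogram` are made up for illustration and are not part of FastQC or any library.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            # Keep only the first word of the header as the record name.
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_percent(seq):
    """GC percentage of a sequence, ignoring N and other ambiguous bases.

    Returns None when the sequence has no A/C/G/T/U bases at all.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T") + seq.count("U")
    total = gc + at
    return None if total == 0 else 100.0 * gc / total


def gc_histogram(sequences):
    """Count sequences per whole-number GC percentage bin (0 to 100)."""
    counts = Counter()
    for seq in sequences:
        gc = gc_percent(seq)
        if gc is not None:
            counts[int(gc + 0.5)] += 1  # round half up to the nearest bin
    return counts


if __name__ == "__main__":
    # Toy data only; replace with open("your_file.fa") for real input.
    demo = [">tx1 example", "GGCC", ">tx2", "AT", "AT", ">tx3", "ATGC"]
    hist = gc_histogram(seq for _, seq in read_fasta(demo))
    for gc_bin in sorted(hist):
        print(gc_bin, hist[gc_bin])
```

Running this separately on, say, a gene or transcript set and an rRNA set gives two histograms that can be compared with the positions of the two peaks in the FastQC plot. Keep in mind that FastQC works per read while this works per full sequence, so the shapes will not match exactly.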