Hello,
I'm performing my first bioinformatic analysis of shotgun sequencing data.
I have 150 samples that were sequenced paired-end, so I have 300 FASTQ files (forward + reverse).
I tried to remove the duplicates with Clumpify, from the Joint Genome Institute.
Of the 300 FASTQ files, 7 samples were flagged with a warning by FastQC (MultiQC output), which, according to the FastQC documentation, means that non-unique sequences make up more than 20% of the total library.
Interestingly, 3 of the 7 samples have the largest FASTQ files, so I hypothesize that the sequencing is very deep and that sequence duplication is therefore likely. The rest are the opposite (smaller FASTQ files), so they could be low-richness libraries in which duplication is more plausible because the library contains fewer distinct sequences.
In detail, I ran Clumpify several times, changing the subs parameter (subs=1 and subs=2) and running 2 passes and 6 passes; with these changes, the number of samples with a warning dropped from 7 to 5.
However, since this is my first time analysing shotgun data, I would like your opinion: should I use these 5-7 samples flagged for duplicate sequences in downstream analysis, or is it better to exclude them?
I attach an image of MultiQC output.
Thanks in advance for your comments,
Magí.
There are certainly regions of a bacterial genome that are likely to be identical within and between taxonomic groups (e.g., the 16S rRNA gene). If you sequenced a host-associated sample (e.g., bovine or human), it could be that you are picking up some repetitive host DNA. I think the next step would be to figure out which sequences are duplicated, BLAST them to see what they are, and then make a decision about whether to keep or exclude them.
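To find out which sequences are duplicated, something along these lines might be enough as a first look. It is only a rough sketch, assuming plain 4-line FASTQ records (optionally gzipped) and exact-match duplicates on a single read file. The function names, the `top_n` and `max_reads` parameters and the output file name are made up for illustration, and the percentage it reports is a simple count of reads whose sequence occurs more than once, not FastQC's own estimator, so the numbers will not match the MultiQC report exactly.

```python
import gzip
from collections import Counter
from itertools import islice


def read_sequences(path, max_reads=None):
    """Yield the sequence line of each record in a 4-line-per-record FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # In a 4-line FASTQ record the sequence is the 2nd line (index 1).
        seq_lines = islice(handle, 1, None, 4)
        # max_reads=None reads everything; a number limits memory use.
        for line in islice(seq_lines, max_reads):
            yield line.strip()


def duplicate_summary(sequences, top_n=20):
    """Return (percent of reads with a non-unique sequence, top duplicated sequences)."""
    counts = Counter(sequences)
    total = sum(counts.values())
    # Reads whose exact sequence is seen more than once in the input.
    non_unique = sum(c for c in counts.values() if c > 1)
    percent = 100.0 * non_unique / total if total else 0.0
    top = [(seq, c) for seq, c in counts.most_common(top_n) if c > 1]
    return percent, top


def write_fasta(top, path):
    """Write the duplicated sequences as FASTA so they can be pasted into BLAST."""
    with open(path, "w") as out:
        for i, (seq, count) in enumerate(top, start=1):
            out.write(f">dup{i}_count{count}\n{seq}\n")


# Hypothetical usage (file names are placeholders):
# percent, top = duplicate_summary(read_sequences("sample_R1.fastq.gz", max_reads=1_000_000))
# print(f"{percent:.1f}% of reads have a non-unique sequence")
# write_fasta(top, "top_duplicates.fasta")
```

Running this on one of the flagged samples and on one unflagged sample would show whether a handful of sequences dominate (for example rRNA or host repeats) or whether the duplication is spread thinly across many sequences. For the very deep samples, limiting `max_reads` keeps memory use manageable, since every distinct sequence is held in memory.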