Question

Coping with duplicated sequences in shotgun analysis

3.2 years ago

D ▴ 10

Hello,

I'm performing my first bioinformatic analysis of shotgun sequencing data.

I have 150 samples sequenced paired-end, so I have 300 FASTQ files (forward + reverse).

I tried to remove duplicates with Clumpify, from the Joint Genome Institute.

Of the 300 FASTQ files, 7 samples were flagged with a warning in FastQC (MultiQC output). According to the FastQC documentation, this means that non-unique sequences make up more than 20% of the total library.
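As a rough illustration of what that percentage measures, the sketch below counts exact duplicate reads in a FASTQ stream and reports the share of reads that repeat an already-seen sequence. It is a simplification and an assumption on my part, not FastQC's own procedure (FastQC estimates its figure from a subset of the file), so the numbers will not match exactly. The function names and the in-memory reads are hypothetical.

```python
import io
from collections import Counter


def read_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def duplication_percent(handle):
    """Percent of reads that are exact copies of an already-seen sequence."""
    counts = Counter(read_sequences(handle))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every read beyond the first occurrence of a sequence counts as a duplicate.
    return 100.0 * (total - len(counts)) / total


# Tiny in-memory example (hypothetical reads): 4 reads, 3 distinct sequences.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTTT\n+\nIIII\n"
    "@r4\nGGGG\n+\nIIII\n"
)
print(duplication_percent(example))  # 25.0
```

For a real file, the handle could come from `gzip.open(path, "rt")`. Note that this holds every distinct sequence in memory, which may be too much for the deepest libraries.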

Interestingly, 3 of the 7 samples had the largest FASTQ files, so I hypothesize that the sequencing was very deep and sequence duplication is therefore likely. The rest are the opposite (smaller FASTQ files), so they could be low-richness libraries, in which duplication is more plausible because the library contains fewer distinct sequences.

In detail, I ran Clumpify several times, changing the subs parameter (subs=1 and subs=2) and running 2 passes and 6 passes. With these changes, the number of samples with a warning fell from 7 to 5.

Since this is my first time analysing shotgun data, I would like to know your opinion: should I use these 5-7 samples flagged for duplicate sequences in downstream analysis, or is it better not to?
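For reference, one such run could be sketched as below. This only builds the `clumpify.sh` argument list for a single paired-end sample; the file names are hypothetical, and the use of `dedupe=t` together with `subs` and `passes` is an assumption about how the runs were set up, not a record of the exact commands.

```python
import shlex


def clumpify_command(r1, r2, out1, out2, subs=2, passes=2):
    """Build a clumpify.sh argument list for paired-end deduplication."""
    return [
        "clumpify.sh",
        f"in={r1}",
        f"in2={r2}",
        f"out={out1}",
        f"out2={out2}",
        "dedupe=t",          # remove duplicates instead of only clustering reads
        f"subs={subs}",      # substitutions tolerated between duplicates
        f"passes={passes}",  # number of passes
    ]


# Hypothetical sample names; subs=1 with 6 passes is one of the combinations tried.
cmd = clumpify_command(
    "S1_R1.fastq.gz", "S1_R2.fastq.gz",
    "S1_R1.dedup.fastq.gz", "S1_R2.dedup.fastq.gz",
    subs=1, passes=6,
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once BBTools is on the `PATH`.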

I attach an image of MultiQC output.

[Image: MultiQC output]

Thanks in advance for your comments,

Magí.

Clumpify • 1.1k views

updated 2.3 years ago by Ram 45k • written 3.2 years ago by D ▴ 10

There are certainly regions of a bacterial genome that are likely to be identical within and between taxonomic groups (e.g., the 16S rRNA gene). If you sequenced a host-associated sample (e.g., bovine or human), it could be that you are picking up some repetitive host DNA. I think the next step would be to figure out which sequences are duplicated, BLAST them to see what they are, and then decide whether to keep or exclude them.
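A minimal sketch of that first step, assuming plain 4-line FASTQ records and exact-match duplicates only: it ranks the most frequent read sequences and formats them as FASTA, which can then be pasted into BLAST. The function names, FASTA header format, and example reads are hypothetical.

```python
import io
from collections import Counter


def top_duplicated(handle, n=10, min_count=2):
    """Return the n most frequent read sequences seen at least min_count times."""
    counts = Counter(
        line.strip() for i, line in enumerate(handle) if i % 4 == 1
    )
    return [(seq, c) for seq, c in counts.most_common(n) if c >= min_count]


def to_fasta(ranked):
    """Format (sequence, count) pairs as FASTA text suitable for a BLAST query."""
    return "".join(
        f">dup{rank}_count{count}\n{seq}\n"
        for rank, (seq, count) in enumerate(ranked, start=1)
    )


# Tiny in-memory example (hypothetical reads): ACGT appears twice.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTTT\n+\nIIII\n"
)
print(to_fasta(top_duplicated(example)), end="")
```

Running this on one of the flagged samples before and after deduplication would show whether the remaining duplicates are dominated by a few sequences (for example rRNA or host repeats) or spread thinly across many.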

3.2 years ago by Chris Dean ▴ 420

score 0 · Answer 1 · 2022-02-18

3.2 years ago

D ▴ 10

Hi Chris Dean,

Thanks for your comments.

Yes, it was a human metagenomics study. However, before running Clumpify I removed human sequences by mapping with Bowtie2 against the GRCh38 human assembly, using the --very-sensitive-local option.

Thanks again,
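For context, a host-removal step of this kind could be sketched as below. It only builds the Bowtie2 argument list. The index name, file names, and thread count are hypothetical, and keeping the pairs that fail to align concordantly via `--un-conc-gz` is an assumption about how the non-human reads were collected, not necessarily what was done here.

```python
import shlex


def bowtie2_host_removal(index, r1, r2, unaligned_pattern, threads=4):
    """Build a bowtie2 command that writes non-host read pairs to gzipped FASTQ."""
    return [
        "bowtie2",
        "--very-sensitive-local",
        "-p", str(threads),
        "-x", index,                       # prefix of a prebuilt GRCh38 index
        "-1", r1,
        "-2", r2,
        "--un-conc-gz", unaligned_pattern,  # '%' is replaced by 1 and 2 for the mates
        "-S", "/dev/null",                 # discard the SAM alignments themselves
    ]


# Hypothetical index and sample names.
host_cmd = bowtie2_host_removal(
    "GRCh38", "S1_R1.fastq.gz", "S1_R2.fastq.gz", "S1_nohuman_R%.fastq.gz"
)
print(shlex.join(host_cmd))
```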

Magí

3.2 years ago by D ▴ 10