Coping with duplicated sequences in shotgun analysis
2.8 years ago
D ▴ 10

Hello,

I'm performing my first bioinformatic analysis of shotgun sequencing data.

I have 150 samples sequenced paired-end, so there are 300 FASTQ files (forward + reverse).

I tried to remove the duplicates with Clumpify, from the Joint Genome Institute.

Of the 300 FASTQ files, 7 samples were flagged with a warning by FastQC (MultiQC output), which means that non-unique sequences make up more than 20% of the total library, according to the FastQC documentation.
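
To make the 20% threshold concrete, here is a minimal Python sketch that estimates the fraction of non-unique reads in a FASTQ file. It is only an approximation under stated assumptions: it counts exact matches over every read, which is not necessarily how FastQC computes its own estimate, so the numbers may differ from the report. The function names and the file name in the usage note are hypothetical.

```python
import gzip
from collections import Counter


def read_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (plain or gzipped)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of every record holds the bases
                yield line.strip()


def duplication_fraction(sequences):
    """Return the fraction of reads that would be dropped by exact deduplication."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # len(counts) is the number of distinct sequences that would remain
    return 1.0 - len(counts) / total
```

For example, `duplication_fraction(read_sequences("sample_R1.fastq.gz"))` returns a value between 0 and 1, and anything above 0.2 would correspond roughly to the warning level mentioned above.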

Interestingly, 3 of the 7 samples have the largest FASTQ files, so I hypothesize that they were sequenced very deeply and some sequence duplication is to be expected. The rest are the opposite (smaller FASTQ files), so they could be low-complexity libraries, where duplication is more plausible because the library contains a smaller number of distinct sequences.

In detail, I ran Clumpify several times, changing the subs parameter (subs=1 and subs=2) and the number of passes (2 and 6). With these changes, the number of samples with a warning dropped from 7 to 5.

Since this is my first time analysing shotgun data, I would like your opinion: should I use these 5-7 samples with the duplicate-sequence warning flag in downstream analysis, or is it better to exclude them?
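
As an illustration of the runs described above, this Python sketch builds the Clumpify command lines for the subs and passes values I mentioned. It assumes `clumpify.sh` from BBTools is on the PATH and that deduplication was requested with the `dedupe` flag; the file names and the helper function are hypothetical, and combining every subs value with every passes value is just one way to lay out the runs.

```python
from itertools import product


def build_clumpify_command(in1, in2, out1, out2, subs, passes):
    """Return the argument list for one paired-end Clumpify deduplication run."""
    return [
        "clumpify.sh",
        f"in={in1}",
        f"in2={in2}",
        f"out={out1}",
        f"out2={out2}",
        "dedupe",            # remove duplicates instead of only reordering reads
        f"subs={subs}",      # substitutions allowed between duplicates
        f"passes={passes}",  # number of passes over the data
    ]


if __name__ == "__main__":
    # Hypothetical sample name; print each command instead of running it.
    for subs, passes in product((1, 2), (2, 6)):
        cmd = build_clumpify_command(
            "sample_R1.fastq.gz",
            "sample_R2.fastq.gz",
            f"sample_dedup_s{subs}_p{passes}_R1.fastq.gz",
            f"sample_dedup_s{subs}_p{passes}_R2.fastq.gz",
            subs,
            passes,
        )
        print(" ".join(cmd))
```

Each printed line could then be passed to `subprocess.run` or pasted into a shell, once the paths are adapted.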

I attach an image of MultiQC output.


Thanks in advance for your comments,

Magí.


There are certainly regions of a bacterial genome that are likely to be identical within and between taxonomic groups (e.g., the 16S rRNA gene). If you sequenced a host-associated sample (e.g., bovine or human), it could be that you are picking up some repetitive host DNA. I think the next step would be to figure out which sequences are duplicated, BLAST them to see what they are, and then make a decision about whether to keep or exclude them.
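
One possible way to do that first step, sketched in Python: count exact duplicate reads in a FASTQ file and write the most frequent ones as FASTA so they can be pasted into BLAST. This is a rough sketch, not a tuned tool; it keeps every distinct read in memory, so subsampling a very large file first may be necessary. The function names, FASTA header format, and file name in the usage note are made up for illustration.

```python
import gzip
from collections import Counter
from itertools import islice


def count_sequences(fastq_path):
    """Count how often each read sequence occurs in a plain or gzipped FASTQ file."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        # Start at the second line and take every fourth line: the sequence lines.
        return Counter(line.strip() for line in islice(handle, 1, None, 4))


def top_duplicates(counts, top_n=10, min_count=2):
    """Return up to top_n (sequence, count) pairs seen at least min_count times."""
    return [(seq, n) for seq, n in counts.most_common(top_n) if n >= min_count]


def to_fasta(duplicates):
    """Format (sequence, count) pairs as FASTA text, ready to paste into BLAST."""
    return "".join(
        f">dup{i}_count{n}\n{seq}\n"
        for i, (seq, n) in enumerate(duplicates, start=1)
    )
```

For example, `print(to_fasta(top_duplicates(count_sequences("sample_R1.fastq.gz"))))` would print the ten most duplicated reads with their counts in the headers.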

2.8 years ago
D ▴ 10

Hi Chris Dean,

Thanks for your comments.

Yes, it was a human metagenomics study. However, before running Clumpify I removed human sequences with Bowtie2 against the GRCh38 human assembly, using the --very-sensitive-local option.

Thanks again,

Magí
