Is it possible to estimate the proportion of contamination from GC content?

0

4.1 years ago

doinelpierrot ▴ 50

Hello all,

I have multiple FASTQ files from different samples. Two of them show a significantly different GC content plot. Is it possible to estimate the percentage of contaminant reads from that?

Thanks

RNA-Seq • 847 views

ADD COMMENT • link 4.1 years ago by doinelpierrot ▴ 50

1

Is it possible from there to estimate the percentage of contaminated reads?

I don't think so. You may get a hint that there is contamination, e.g. with rRNA or a different species, but you can't determine the % of contaminant reads that may be present unless you go looking for those contaminant reads.
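To make the point concrete, here is a minimal Python sketch of what a GC content plot summarizes: a histogram of per-read GC percentage. A second peak or shoulder hints at a mixture, but the GC distributions of target and contaminant generally overlap, so the bin counts alone do not give a trustworthy contamination percentage. The function names and the 5% bin width are illustrative choices, and the input is assumed to be plain 4-line FASTQ records.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def read_sequences(fastq_lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            yield line.strip()


def gc_percent(seq: str) -> float:
    """GC percentage of one read, ignoring N and other ambiguous bases."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(fastq_lines: Iterable[str], bin_width: int = 5) -> Dict[int, int]:
    """Count reads per GC bin; each key is the lower edge of its bin."""
    hist: Counter = Counter()
    for seq in read_sequences(fastq_lines):
        bin_start = int(gc_percent(seq) // bin_width) * bin_width
        # Reads at exactly 100% GC are folded into the last bin.
        hist[min(bin_start, 100 - bin_width)] += 1
    return dict(sorted(hist.items()))
```

Comparing the histogram of a suspicious sample with that of a clean one shows where the extra reads sit, which is a hint about what to look for, not a measurement of how much is there.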

ADD REPLY • link 4.1 years ago by GenoMax 147k

0

What do you mean by "a significant GC content plot"?

ADD REPLY • link 4.1 years ago by jared.andrews07 ★ 18k

0

A significantly different GC plot, my bad!

ADD REPLY • link 4.1 years ago by doinelpierrot ▴ 50

0

As GenoMax mentioned, no, you are not going to be able to determine this from GC content. If you have a large proportion of reads that don't map to the genome of your target organism, there are a few methods you could try.
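As a rough illustration of "a large proportion of reads that don't map", the sketch below computes the unmapped fraction from SAM-format alignment lines. It assumes the reads have already been aligned to a reference for the target organism and that the output is uncompressed SAM text; the function name is made up for this example. It relies only on the SAM FLAG bits 0x4 (unmapped), 0x100 (secondary) and 0x800 (supplementary).

```python
from typing import Iterable


def unmapped_fraction(sam_lines: Iterable[str]) -> float:
    """Fraction of primary SAM records whose FLAG has the unmapped bit set."""
    total = 0
    unmapped = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        if flag & 0x900:  # skip secondary (0x100) and supplementary (0x800)
            continue
        total += 1
        if flag & 0x4:  # read is unmapped
            unmapped += 1
    return unmapped / total if total else 0.0
```

For paired-end data each mate is counted as its own record. A high value is only an upper bound on contamination, since reads can also fail to map because of low quality or gaps in the reference.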

ADD REPLY • link 4.1 years ago by jared.andrews07 ★ 18k

0

I am doing a de novo assembly. So far I am thinking of doing a pre-assembly with the samples that have a good GC content, then BLASTing all the transcripts to remove the foreign ones, then mapping all my reads to this transcriptome, and eventually doing a final assembly with all the reads that map.
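The transcript-filtering step of such a plan could be sketched roughly as follows. This assumes BLAST was run with the default tabular output (`-outfmt 6`, whose twelve columns start with `qseqid` and `sseqid` and end with `bitscore`), and that you already have a set of subject IDs you consider contaminants; how that set is built (for example from taxonomy) is outside this sketch, and both function names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def best_hits(blast_lines: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Map each query transcript to its (subject, bitscore) with the highest bitscore."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Default outfmt 6 columns: qseqid is column 1, sseqid column 2, bitscore column 12.
        qseqid, sseqid, bitscore = cols[0], cols[1], float(cols[11])
        if qseqid not in best or bitscore > best[qseqid][1]:
            best[qseqid] = (sseqid, bitscore)
    return best


def foreign_transcripts(blast_lines: Iterable[str],
                        contaminant_subjects: Set[str]) -> Set[str]:
    """Transcripts whose best hit is a subject listed as a contaminant (assumed input)."""
    return {
        query
        for query, (subject, _) in best_hits(blast_lines).items()
        if subject in contaminant_subjects
    }
```

Transcripts with no hit at all are kept by this filter, which may or may not be what you want for a poorly characterized organism.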

ADD REPLY • link 4.1 years ago by doinelpierrot ▴ 50

0

If you know/suspect that there is contamination, it may be best to address it up front before doing the assembly.

ADD REPLY • link 4.1 years ago by GenoMax 147k

0

I have thought about it, but I can't BLAST 200 Gb of reads; the data is considerably reduced after assembly. Besides, it seems to be a multi-species contamination, and I don't have the full genomes/transcriptomes of these associated species. So the other alternative, identifying the contaminants from a subset and then mapping against their full genomes/transcriptomes, seems complicated.
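For the "identify the contaminants from a subset" idea, drawing the subset is the easy part. Below is a minimal sketch using reservoir sampling, which picks a uniform random sample of reads in one pass without holding the whole file in memory. It assumes plain 4-line FASTQ records; the function name and the `seed` parameter are illustrative. For paired-end data the two mates would need to be sampled together, which this sketch does not do.

```python
import random
from typing import Iterable, List, Tuple

FastqRecord = Tuple[str, ...]


def subsample_fastq(fastq_lines: Iterable[str], n: int, seed: int = 0) -> List[FastqRecord]:
    """Return up to n FASTQ records chosen uniformly at random (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir: List[FastqRecord] = []
    record: List[str] = []
    seen = 0
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) < 4:
            continue
        seen += 1
        if len(reservoir) < n:
            # Fill the reservoir with the first n records.
            reservoir.append(tuple(record))
        else:
            # Replace a random slot with probability n / seen.
            j = rng.randrange(seen)
            if j < n:
                reservoir[j] = tuple(record)
        record = []
    return reservoir
```

A sample of a few tens of thousands of reads is small enough to BLAST and usually enough to see which taxa dominate, though rare contaminants can be missed.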

ADD REPLY • link 4.1 years ago by doinelpierrot ▴ 50

1

Then you may want to treat your data as if it were a metagenomic dataset and use an assembler like metaSPAdes.

ADD REPLY • link 4.1 years ago by GenoMax 147k
