4.1 years ago
doinelpierrot
Hello all,
I have multiple FASTQ files from different samples. Two of them show a significantly different GC content plot. Is it possible to estimate the percentage of contaminated reads from that?
Thanks
I don't think so. You may get a hint that there is contamination, e.g. with rRNA or a different species, but you can't determine the percentage of contaminant reads unless you go looking for those reads.
What do you mean by "a significant GC content plot"?
A significantly different GC plot, my bad!
As genomax mentioned, no, you are not going to be able to determine this from GC content. If you have a large proportion of reads that don't map to the genome of your target organism, there are a few methods you could try.
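To illustrate why a GC plot can only hint at contamination, here is a minimal sketch of how a per-read GC distribution can be computed. The function names `gc_percent` and `gc_histogram` and the 5% bin width are assumptions made for illustration, not taken from any particular QC tool. A shifted or bimodal histogram tells you something differs between samples, but not which reads cause the shift or how many there are.

```python
from collections import Counter


def gc_percent(seq):
    """Return the GC percentage of a read, ignoring N and other ambiguous bases.

    Returns None when the read has no A/C/G/T calls at all.
    """
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(sequences, bin_width=5):
    """Count reads per GC bin; keys are the lower edge of each bin in percent.

    bin_width is a hypothetical default and is assumed to divide 100 evenly.
    """
    counts = Counter()
    for seq in sequences:
        gc = gc_percent(seq)
        if gc is None:
            continue
        # Reads with exactly 100% GC go into the last bin.
        start = min(int(gc // bin_width) * bin_width, 100 - bin_width)
        counts[start] += 1
    return dict(sorted(counts.items()))
```

Running `gc_histogram` on the sequence lines of each sample and comparing the resulting dictionaries gives roughly the same picture as the GC plot, which is why it cannot replace actually searching for the contaminant reads.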
I am doing a de novo assembly. So far I am thinking of doing a pre-assembly with the samples that have a good GC content, then BLASTing all my transcripts to remove the foreign ones, then mapping all my reads to this transcriptome, and finally doing a final assembly with all the mapped reads.
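The transcript-filtering step of this plan could be sketched roughly as follows. This assumes the pre-assembled transcripts were searched against a database of suspected contaminant sequences and the hits were saved in the standard 12-column BLAST tabular format (`-outfmt 6`). The function names and the identity/length thresholds are hypothetical and would need tuning for real data.

```python
import csv


def flagged_transcripts(blast_tsv, min_identity=90.0, min_length=100):
    """Return IDs of query transcripts with at least one strong hit.

    blast_tsv is an iterable of lines in BLAST tabular format, whose columns
    are: qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore. The thresholds are illustrative assumptions.
    """
    flagged = set()
    for row in csv.reader(blast_tsv, delimiter="\t"):
        # Skip blank lines, truncated rows and comment lines.
        if len(row) < 12 or row[0].startswith("#"):
            continue
        qseqid, pident, length = row[0], float(row[2]), int(row[3])
        if pident >= min_identity and length >= min_length:
            flagged.add(qseqid)
    return flagged


def filter_fasta(fasta_lines, exclude_ids):
    """Yield the FASTA lines of every record whose ID is not in exclude_ids."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            record_id = parts[0] if parts else ""
            keep = record_id not in exclude_ids
        if keep:
            yield line
```

With real files, both functions can be given open file handles, and the output of `filter_fasta` written to a new FASTA file that then serves as the mapping reference.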
If you know/suspect that there is contamination, it may be best to address it up front before doing the assembly.
I have thought about it, but I can't BLAST 200 Gb of reads; assembling first reduces the data considerably. Besides, it seems to be a multi-species contamination, and I don't have the full genomes/transcriptomes of these associated species. So the other alternative, identifying the contaminants from a subset and then mapping to the full genome/transcriptome, seems complicated.
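For the subset idea, one way to draw a uniform random sample of reads without loading the whole file into memory is reservoir sampling. The sketch below assumes plain 4-line FASTQ records (no wrapped sequence lines); `read_fastq`, `reservoir_sample` and the fixed seed are illustrative choices, not part of any existing tool.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as 4-tuples of lines (assumes 4 lines per record)."""
    while True:
        record = tuple(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def reservoir_sample(records, k, seed=0):
    """Return k records chosen uniformly at random in a single pass.

    If fewer than k records are available, all of them are returned.
    """
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            # Replace a stored record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample
```

The sampled records can then be written out and BLASTed, which keeps the search small enough to run even when the full data set is around 200 Gb.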
Then you may want to treat your data as if it was a metagenomic dataset and use an assembler like metaSPAdes.