Hi,
I have a question involving assemblies and checking the quality of the contigs in those assemblies.
The data I have consists of Pair ended reads, I merged them using PEAR. The reads that were left unassembled by PEAR because they did not have enough overlap were fed to SOAP denovo for assembling. The two assemblies were merged. This gave me a large set of contigs and to check their quality I mapped my reads back to them. The data then looks something like this;
contig_name contig_coverage
1 14
2 9
3 8
4 15
5 14
6 13
7 3.000
The distribution looks something like this
On the left is the distribution with the contig that has 3000x coverage, and on the right I used boxplot to find outliers and removed them. It is pretty apparent that contig 7 is an outlier, but the way I removed it is not a proper way because the boxplot method is only useful for a normally distributed dataset and this looks more like log normally distributed.
I have tried using fitdistr()
to find the exact type of distribution but it does not produce a definitive answer.
Our statistician also mentioned using generalized linear models, then take the deviance and divide by degrees of freedom to get the distribution. and the closest to 1 is the distribution which I can use to determine the type of outlier detection I need... but I can't get this to work properly (see example down below)
glm1 <- glm(contig_coverage~1,data=data_set, family = gaussian(log))
glm2 <- glm(contig_coverage~1,data=data_set, family = poisson(log))
glm3 <- glm(contig_coverage~1,data=data_set, family = Gamma(log))
Can anybody help me with a rule/method/function with which I can find out which contigs have improbable coverage in R so that I can remove them from the final assembly?
Some doubts:
Thanks for reading my question :D to answer your points
How did you merge the reads? If I understood you correctly, you assembled using only the unmerged reads, is this correct? Is this genome or transcriptome? Why do you want to remove them?
Anyway, if your outliers are like contig 7, finding them should be straightforward.
Sorry I wasn't clear on that part. We merged reads using PEAR (Pair Ended reAd mergeR), that way we end up with lots of contigs. The read pairs that couldn't be merged by PEAR were fed to SOAP and assembled. Both assemblies combined results in final assembly.
Contig 7 is straightforward, but this is just and example, as you can see in the graph there isn't a definitive threshold where correct contigs end and improbable ones start so we want some function that can set that threshold for us, preferabbly one that takes into account the distribution so that it not just removes contigs based on a predetermined static value.
PEAR is not an assembler as in "genome assembler", I would rather call it "read merger".