Hello everyone,
As most of you know, when assembling polymorphic genomes, assemblers sometimes build separate haplotypes for short regions. This inflates the assembly relative to the haploid genome size and leaves many small contigs that correspond to alternative haplotypes.
I have estimated the genome size of my organism using k-mer counting, and the estimated size was 5.0-6.0 Mb. Now I have two assemblies, one that is 7.9 Mb and one that is 5.6 Mb. Although the bigger assembly no doubt contains some unique sequence, is there any way to test whether it contains more redundancy than the smaller assembly?
I thought that splitting the assembly FASTA files into k-mers (k = 23) and counting their abundance might be a sound approach, so here is what I did with the two assemblies:
# Read the k-mer histograms (k = 23) for both assemblies.
# Assumes column 1 is the k-mer multiplicity and column 2 is the
# number of distinct k-mers seen at that multiplicity.
gdr18 <- read.table("gdr18_k75-85_NHC_kmer-k23.histo")
corn <- read.table("CornmanAssembly_kmer-k23.histo")
# Multiply the two columns to get the total number of k-mer
# instances at each multiplicity (vectorised, no loop needed).
tablegdr18 <- gdr18[,1] * gdr18[,2]
tablecorn <- corn[,1] * corn[,2]
head(tablegdr18)
[1] 5267262 244536 81888 30368 15220 11034
head(tablecorn)
[1] 5208491 1065576 650718 249380 142835 96012
sum(tablegdr18)
[1] 5678956
sum(tablecorn)
[1] 7739989
sum(tablegdr18[2:length(tablegdr18)])
[1] 411694
sum(tablecorn[2:length(tablecorn)])
[1] 2531498
So the Cornman assembly is 7.8 Mb and my assembly is 5.6 Mb. I was asked by a reviewer why my assembly predicts a genome size that is so much smaller, and I think it is because I used a very large k-mer size when assembling, to discourage haplotype building. From the above data, it looks to me as though a similar number of k-mers are present in single copy in both assemblies (about 5.2 Mb each), but the Cornman assembly has about 2.5 Mb of repetitive sequence compared with about 0.4 Mb in mine. To get these numbers I just ran a quick script to multiply the 2 columns from the .histo file.
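To put the two assemblies on the same scale, the multi-copy share can also be expressed as a fraction of all k-mer instances. The following is only a rough sketch: summarise_histo is a hypothetical helper name, the toy histogram is made up for illustration, and it assumes the same two-column layout as above (multiplicity, then number of distinct k-mers). The fractions at the end are computed directly from the sums printed above.

```r
# Hypothetical helper: summarise a two-column k-mer histogram.
# Assumes column 1 = multiplicity, column 2 = number of distinct k-mers.
summarise_histo <- function(histo) {
  instances <- histo[, 1] * histo[, 2]        # k-mer instances per multiplicity
  total <- sum(instances)
  single <- sum(instances[histo[, 1] == 1])   # select by multiplicity, not row position
  c(total = total,
    single = single,
    multi = total - single,
    multi_fraction = (total - single) / total)
}

# Toy histogram (made-up numbers, for illustration only)
toy <- data.frame(V1 = c(1, 2, 3), V2 = c(100, 10, 2))
toy_summary <- summarise_histo(toy)
print(toy_summary)

# Multi-copy fractions from the sums reported above
gdr18_multi_fraction <- 411694 / 5678956    # roughly 0.072
corn_multi_fraction <- 2531498 / 7739989    # roughly 0.327
print(c(gdr18 = gdr18_multi_fraction, corn = corn_multi_fraction))
```

Selecting the single-copy row by its multiplicity value rather than by position avoids relying on the first row of the file always being multiplicity 1. On the real data this would be called as summarise_histo(gdr18) and summarise_histo(corn).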
Any thoughts? Has this approach been used before?
Thank you,
Adrian
Maybe looking for known repeats in both assemblies would help. Running a tool such as censor on both assemblies would show which one contains more known repeats.
I would suggest running k-mer-based analyses.