Assembly and read comparison using kmers
5.2 years ago

I am comparing assemblies against their reads, and I have noticed something quite confusing.

I extracted kmers from an assembly file and from the corresponding reads (downloaded from the NCBI Assembly and SRA databases). When I compare the two sets, some kmers are present in the assembly but not in the reads.

So I am wondering if this is possible, and if yes, how?

Assembly sequence • 1.7k views

OK, yes, you are right about that. I did not think about this case.

However, I assumed modern assemblers only assemble highly covered regions, in which case the kmers in between (ormati in your example) should also be contained in at least one of the reads.


Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. This belongs under @Wouter's answer.

SUBMIT ANSWER is for new answers to the original question.


Well, I hope it does correspond to the same set of reads used for the assembly. I just searched through NCBI; let's take this one for example:

https://www.ncbi.nlm.nih.gov/biosample/SAMEA1705947/

I got the .fna files from the "Assembly" database and the read files from the SRA. Shouldn't these reads correspond to the ones used for the assembly?

5.2 years ago

I think that would be possible, yes. I can't tell you the odds, but it is possible.

As an extreme oversimplification, say that I have two reads: Bioinform and rmatics.

These overlap by "rm", so you could assemble them into Bioinformatics.

The assembly now contains the kmer ormati, which is in neither of the reads.
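
To make the toy example concrete, here is a minimal Python sketch that lists the kmers present in the assembly but in neither read. The `kmers` helper and the choice of k = 6 are my own, purely for illustration; this is not how any real assembler or kmer counter works internally.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


reads = ["Bioinform", "rmatics"]
assembly = "Bioinformatics"
k = 6  # assumed kmer size, chosen to match "ormati"

# Union of the kmers found in any read
read_kmers = set().union(*(kmers(r, k) for r in reads))

# Kmers that exist only because the two reads were joined
assembly_only = kmers(assembly, k) - read_kmers
print(sorted(assembly_only))  # ['format', 'nforma', 'ormati']
```

Every kmer that spans the junction between the two reads is new, so with an overlap shorter than k - 1 you would expect a handful of such kmers at each join.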

5.2 years ago
Corentin ▴ 610

In addition to what WouterDeCoster mentioned, it is also possible that these kmers correspond to misassemblies. In general, I try to keep the number of kmers found in the assembly but not in the reads as low as possible.

Are you using different sets of reads for your assembly? If so, these kmers could also correspond to a region assembled from other reads.
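
One way to track that number is as a fraction of the distinct assembly kmers. The sketch below is only an illustration under assumptions: the helper names (`revcomp`, `canonical_kmers`, `missing_fraction`) are hypothetical, sequences are plain uppercase ACGT strings, and kmers are compared in canonical form (the lexicographically smaller of a kmer and its reverse complement) so that reads from the opposite strand still count as support.

```python
def revcomp(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def canonical_kmers(seq, k):
    """Set of canonical kmers: min(kmer, reverse complement) per position."""
    out = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        out.add(min(kmer, revcomp(kmer)))
    return out


def missing_fraction(contigs, reads, k):
    """Fraction of distinct assembly kmers that occur in no read."""
    asm = set().union(*(canonical_kmers(c, k) for c in contigs))
    rd = set().union(*(canonical_kmers(r, k) for r in reads))
    if not asm:
        return 0.0
    return len(asm - rd) / len(asm)


# Toy data: the read is the reverse complement of the contig's first 5 bases
print(missing_fraction(["ACGGTCA"], ["ACCGT"], 4))  # 0.5
```

Without the canonical step, a read sequenced from the opposite strand would look like it shares no kmers with the contig. For real read sets, holding everything in Python sets will likely be too slow and memory hungry, so a dedicated kmer counter would be more practical.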


Well, I hope the reads used for the assembly correspond to this set. I found a BioSample on NCBI, like this one https://www.ncbi.nlm.nih.gov/biosample/SAMEA1705947/, and downloaded the linked assembly .fna file and the read files from the linked SRA entry. So these reads should have been the only ones used for the assembly, or am I wrong in this assumption?


Not necessarily. Depending on the genome size, its complexity, and the project's budget, more than one library may have been used for the assembly (sometimes from different technologies as well, for example Illumina + PacBio).

You should have a Bioproject ID associated with your reads and assembly ("PRJEA31233" in your example), which should give you more information about the project.

You can still use only one library if it covers most of the genome (but it also depends on what you want to do and how accurate you need to be).


Thank you for your reply. But if other read sets were used for the assembly, shouldn't they also be linked to the project? I checked all the links, but I usually find only one, or maybe two, read runs linked to a given sample.

I am very picky about this because I need to be as accurate as possible for my evaluation, so I want to limit the number of assembly kmers that do not exist in the read set.


Yes, everything should be linked to the project.

Don't forget WouterDeCoster's answer: not all of these kmers are necessarily misassemblies.
