Question

Assembly and read comparison using kmers

0

5.3 years ago

beginner_problem ▴ 10

I am comparing an assembly against its reads, and I have noticed something that is quite confusing.

I extracted kmers from an assembly file and from the corresponding reads (downloaded from the NCBI Assembly and SRA databases, respectively). When I compare the two sets, some kmers are present in the assembly but not in the reads.

So I am wondering if this is possible, and if yes, how?

Assembly sequence • 1.7k views

ADD COMMENT • link 5.3 years ago by beginner_problem ▴ 10

0

Ok yes, you are right about that. Did not think about this case.

However, I assumed modern assemblers would assemble only highly covered areas, in which case the kmers in between (ormati in your example) should also be contained in one of the reads.

ADD REPLY • link 5.3 years ago by beginner_problem ▴ 10

0

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. This belongs under @Wouter's answer.

SUBMIT ANSWER is for new answers to the original question.

ADD REPLY • link 5.3 years ago by GenoMax 148k

0

Well, I hope it does correspond to the same set of reads used for the assembly. I just searched through NCBI; let's take this one for example:

https://www.ncbi.nlm.nih.gov/biosample/SAMEA1705947/

From the "Assembly" database I got the .fna files, and from the SRA I got the read files. Shouldn't these reads correspond to the ones used for the assembly?

ADD REPLY • link 5.3 years ago by beginner_problem ▴ 10

score 1 · Answer 1 · 2019-09-20

1

5.3 years ago

WouterDeCoster 47k

I think that would be possible, yes. I can't tell you the odds, but it is possible.

As an extreme oversimplification, say that I have two reads: Bioinform and rmatics

You could assemble that to Bioinformatics

In the assembly, there is now the kmer ormati which is in neither of the reads.
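The toy example above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `kmers` and the choice of k = 6 (the length of ormati) are assumptions made for illustration, and real data would also need handling of reverse strands and sequencing errors.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# The two toy reads and the assembly from the example above.
reads = ["Bioinform", "rmatics"]
assembly = "Bioinformatics"
k = 6  # assumed here; matches the length of "ormati"

# Pool the kmers of all reads, then subtract them from the assembly kmers.
read_kmers = set()
for read in reads:
    read_kmers |= kmers(read, k)

assembly_only = kmers(assembly, k) - read_kmers
print(sorted(assembly_only))  # ['format', 'nforma', 'ormati']
```

Every kmer that spans the junction between the two reads ends up in the assembly-only set, even though the assembly itself is correct.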

ADD COMMENT • link 5.3 years ago by WouterDeCoster 47k

score 1 · Answer 2 · 2019-09-20

1

5.3 years ago

Corentin ▴ 610

In addition to what WouterDeCoster mentioned, it is also possible that these kmers correspond to misassemblies. In general, I try to keep the number of kmers found in the assembly but not in the reads as low as possible.

Are you using different sets of reads for your assembly? If yes, these kmers could also correspond to a region assembled from other reads.

ADD COMMENT • link 5.3 years ago by Corentin ▴ 610

0

Well, I hope the reads used for the assembly correspond to the set. What I did was find a Biosample on NCBI, like this one https://www.ncbi.nlm.nih.gov/biosample/SAMEA1705947/, and download the linked assembly .fna file and the read files from the linked SRA entry. So these reads should have been the only ones used for the assembly, or am I wrong in this assumption?

ADD REPLY • link 5.3 years ago by beginner_problem ▴ 10

0

Not necessarily. Depending on the genome size, its complexity and the project's budget, more than one library can be used for the assembly (sometimes from different technologies as well, for example Illumina + PacBio).

You should have a Bioproject ID associated with your reads and assembly ("PRJEA31233" in your example), which should give you more information about the project.

You can still use only one library if it covers most of the genome (but it also depends on what you want to do and how accurate you need to be).
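To make the multi-library point concrete, here is a small sketch with invented toy sequences (none of them come from the project discussed here). The function name `missing_from_reads`, the two "libraries" and k = 5 are all hypothetical choices for illustration.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def missing_from_reads(assembly, reads, k):
    """Return the assembly kmers that occur in none of the given reads."""
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    return kmers(assembly, k) - read_kmers

# Toy data, invented for illustration only.
assembly = "ACGTTGCAAGGCT"
library_a = ["ACGTTGCA", "GTTGCAAG"]  # stands in for the run linked to the sample
library_b = ["GCAAGGCT"]              # stands in for a second library in the project
k = 5

only_a = missing_from_reads(assembly, library_a, k)
pooled = missing_from_reads(assembly, library_a + library_b, k)

print(sorted(only_a))  # ['AAGGC', 'AGGCT', 'CAAGG']
print(sorted(pooled))  # []
```

With only the first library, three assembly kmers look unsupported; once the second library is pooled in, none are. So before treating assembly-only kmers as errors, it is worth checking whether all libraries listed under the Bioproject were included.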

ADD REPLY • link 5.3 years ago by Corentin ▴ 610

0

Thank you for your reply. But if other read sets were used for the assembly, shouldn't they also be linked to the project? I checked all the links, but I usually find only one, or maybe two, read runs linked to a given sample.

I am very picky about this because I need to be as accurate as possible for my evaluation, so I want to limit the number of kmers that are missing from the reads used for the assembly.

ADD REPLY • link 5.2 years ago by beginner_problem ▴ 10

0

Yes, everything should be linked to the project.

Don't forget WouterDeCoster's answer: not all of these kmers are necessarily misassemblies.

ADD REPLY • link 5.2 years ago by Corentin ▴ 610