Build Graph lost k-mers
9.7 years ago
lardo • 0

I have a FASTA file containing unique k-mers, where the length of each read equals the k-mer size:

>kmer_1
ATGACAGCCTTTTTTAAA
>kmer_2
ATGACAGCCTTTTTTAAT

Then I used the API of gatb-core-1.0.6-Linux:

Graph::create ((char const *)"-in %s -kmer-size %d -abundance-min 1 -nb-cores %d -out %s", xxxx);

There should be 11,571,887 unique k-mers in my file, but the graph built from this file contains only 11,065,132 unique k-mers.

I think the program loses some useful k-mers while storing them.

GATB • 2.7k views
9.7 years ago
edrezen ▴ 730

Hello,

It is possible that you have N characters in your data. In that case, no valid k-mer can be built from a read containing one or more N characters, so that read won't be used.

Could you check whether this is the case for your input?
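
One way to run that check is sketched below in Python. It is only an illustration, not part of GATB: the helper names `read_fasta` and `records_with_non_acgt` are made up for this example, and it assumes a plain (uncompressed) FASTA file in which anything other than A, C, G, or T counts as an invalid character.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def records_with_non_acgt(lines: Iterable[str]) -> List[str]:
    """Return the headers of records containing any character outside ACGT."""
    valid = set("ACGT")
    return [h for h, seq in read_fasta(lines) if set(seq.upper()) - valid]


# Small in-memory example; for a real file, pass an open file handle instead,
# e.g. `with open("file.fasta") as fh: records_with_non_acgt(fh)`.
example = [">kmer_1", "ATGACAGCCTTTTTTAAA", ">kmer_x", "ATGACAGCCNTTTTTAAT"]
print(records_with_non_acgt(example))  # ['kmer_x']
```

An empty list means no read contains an N or any other non-ACGT character, which rules out this explanation.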


There are no 'N' characters in the reads. The lost k-mers can be found in my file, but they are not contained in the graph.

9.7 years ago

Perhaps the program is storing reverse-complements in a canonical fashion, so they are only represented once. That's fairly typical.

You can count kmers with BBTools and, using the 'rcomp' flag, enable or disable storing of kmers and their reverse-complements independently, to get the count each way:

kmercountexact.sh in=file.fasta k=18 rcomp=t
kmercountexact.sh in=file.fasta k=18 rcomp=f

Yes, GATB does collapse each k-mer and its reverse complement into a single canonical k-mer.
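
To see how this collapsing changes the count, here is a minimal Python sketch. It is an illustration under stated assumptions, not GATB's implementation: the function names are hypothetical, and the canonical form is taken to be the lexicographically smaller of a k-mer and its reverse complement. GATB may pick its representative differently, but the number of distinct canonical k-mers does not depend on which of the two is chosen. If collapsing is the whole explanation, the gap reported above (11,571,887 - 11,065,132 = 506,755) would correspond to k-mers whose reverse complement is also present in the file.

```python
from typing import Iterable, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Assumed convention: the lexicographically smaller of the two strands."""
    return min(kmer, reverse_complement(kmer))


def count_distinct(kmers: Iterable[str]) -> Tuple[int, int]:
    """Return (distinct k-mers as written, distinct canonical k-mers)."""
    kmers = [k.upper() for k in kmers]
    return len(set(kmers)), len({canonical(k) for k in kmers})


# The two k-mers from the question, plus the reverse complement of the first.
a = "ATGACAGCCTTTTTTAAA"
b = "ATGACAGCCTTTTTTAAT"
print(count_distinct([a, b, reverse_complement(a)]))  # (3, 2)
```

The three sequences are distinct as strings, but the first and third are the same canonical k-mer, so only two remain after collapsing.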


If this answer does not solve your problem, would you mind posting the dataset? (I'm assuming this is a ~200 MB file, possibly much less if gzipped)
