we should not distinguish between a k-mer and its reversed complement, and by the “canonical k-mer” we will mean the lexicographically smaller of the two.
Can anybody explain the statement in bold with an example? I know what a k-mer is and how k-mers are counted in a read data set, but I'm not able to understand the above statement. Thanks in advance.
For example, the 3-mer TAC is the reverse complement of the 3-mer GTA. As a sequence is scanned for 3-mers, occurrences of both a word and its reverse complement are counted; the two counts usually differ, except for reverse palindromes, which are their own reverse complement. To save space and for efficiency, each such pair is stored only once. Whether the pair is stored and printed in the forward or the reverse-complement form is determined by alphabetical order, so the reported words are a mixture of forward and reverse-complement forms. Here GTA comes before TAC alphabetically, so GTA is the "canonical 3-mer", and occurrences of TAC are stored under GTA as well.
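To make the GTA/TAC example concrete, here is a minimal Python sketch of the rule as described in the quoted statement. The function names `reverse_complement` and `canonical` are made up for illustration and are not taken from BFCounter, and the sketch assumes upper-case A/C/G/T input only.

```python
# Complement of each DNA base (upper-case A/C/G/T only; an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]

def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc

print(reverse_complement("GTA"))  # TAC
print(canonical("TAC"))           # GTA
print(canonical("GTA"))           # GTA
print(canonical("ACGT"))          # ACGT (a reverse palindrome is its own reverse complement)
```

Both TAC and GTA map to the same key, GTA, which is why only one of them shows up in a table of canonical k-mers.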
I tried to run BFCounter, a k-mer counting program, on the following FASTQ data set, which contains only 2 reads, each 49 bp long. (This is a toy data set I'm using to understand BFCounter.)
When I ran BFCounter on the data set, I chose k = 25. Since each read has 49 bp, the number of k-mers generated should be 25 (49 - 25 + 1), as there are 25 distinct k-mers, and I am getting 25 k-mers from BFCounter. The output of BFCounter is as follows.
The number of k-mers BFCounter produces is 25, which is correct. But when I looked at the k-mers themselves, they did not look correct to me. Can you tell me why there is this difference?
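The arithmetic behind the expected count, and one possible reason the reported k-mers may not look like substrings of the reads, can be sketched as follows. This assumes the counter reports canonical k-mers, as in the quoted statement; whether that explains this particular output is not confirmed in the thread. The read `TTTGA` and the helper names are made up for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return kmer if kmer <= rc else rc

def n_kmers(read_length, k):
    """Number of k-mer positions in a read: L - k + 1 (0 if the read is shorter than k)."""
    return max(0, read_length - k + 1)

def kmers(read, k):
    """All k-mers of a read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

print(n_kmers(49, 25))            # 25

read = "TTTGA"                    # hypothetical toy read, not from the data set above
forward = kmers(read, 3)
print(forward)                    # ['TTT', 'TTG', 'TGA']
print([canonical(x) for x in forward])  # ['AAA', 'CAA', 'TCA']
```

In this toy case none of the canonical 3-mers occurs literally in the read, even though each one represents a 3-mer that does.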
I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in the toolbar when you compose or edit a post.
They look suspicious. I'm not familiar with the software. The output will be a lot easier to understand if you use small k-mers (for example, k = 1 or 2).
Could you please explain what columns 2 and 3 mean? I assume they're k-mer counts...
EDIT: @wouter I've seen what you wrote and deleted :D For clarification: I mean columns 2 and 3 because I am referring to column 1 as the actual k-mers.
Actually, the output produced by BFCounter has only two columns: the 1st column is the k-mer and the 2nd column is the k-mer count. Here the 2nd column (the k-mer count) appears as a 3rd column on a few lines because of an alignment issue.
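If the misaligned columns make the output hard to read, it can be parsed instead of eyeballed. The sketch below assumes only what is described above: each line holds a k-mer and a count separated by some amount of whitespace. The sample lines and the function name `parse_counts` are invented for illustration and are not real BFCounter output.

```python
def parse_counts(lines):
    """Parse 'kmer <whitespace> count' lines into a dict, ignoring blank or malformed lines."""
    counts = {}
    for line in lines:
        fields = line.split()      # splits on any run of spaces or tabs
        if len(fields) != 2:
            continue               # skip anything that is not exactly two fields
        kmer, count = fields
        counts[kmer] = int(count)
    return counts

# Hypothetical lines with inconsistent spacing.
sample = ["AAC\t2", "ACG        4", ""]
print(parse_counts(sample))        # {'AAC': 2, 'ACG': 4}
```

Because `str.split()` with no argument collapses any run of whitespace, a count pushed further to the right still ends up as the second field.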
we should not distinguish between a k-mer and its reversed complement
When you count k-mers, you shouldn't count a k-mer and its reverse complement as two different k-mers (they might come from the same genomic region).
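A small sketch of what this means for counting, using Python's `collections.Counter`. The two reads are made up: the second is the reverse complement of the first, as if the same region had been sequenced from both strands. The function names are illustrative and not part of any k-mer counting tool.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return kmer if kmer <= rc else rc

def count_plain(reads, k):
    """Count k-mers exactly as they appear in the reads."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))

def count_canonical(reads, k):
    """Count k-mers after mapping each one to its canonical form."""
    return Counter(canonical(r[i:i + k]) for r in reads for i in range(len(r) - k + 1))

reads = ["ACGTT", "AACGT"]   # hypothetical reads; "AACGT" is the reverse complement of "ACGTT"

print(sorted(count_plain(reads, 3).items()))
# [('AAC', 1), ('ACG', 2), ('CGT', 2), ('GTT', 1)]
print(sorted(count_canonical(reads, 3).items()))
# [('AAC', 2), ('ACG', 4)]
```

The plain count reports four different 3-mers, while the canonical count merges each k-mer with its reverse complement and reports two.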
we will mean the lexicographically smaller of the two
This is a bit awkward to read, since both have the same word size. Waiting for others to chime in.
EDIT: On second thought, I think they're stating the criterion they implemented in their software to tell the computer which of the two reverse-complemented k-mers to keep in this situation: lexicographical ordering.
A note on your post: the title is three lines long. Could you edit it, keeping only "Question: Can anybody explain what this statement says?" It's informative enough.
Thanks for your reply.