Question

Extract origin location and name kmers from source sequence

0

2.7 years ago

vivek37373 • 0

Hello everyone,

Let's say I have the sequence "ATGCATATATTATATA". I would like to generate all possible 3-mers of this sequence and list each one's location and origin label, as shown below:

sequence  location  origin
ATG       1:3       5'
ATA       14:16     3'
TAT       6:8       i

Here, 5' k-mers are those that begin at position 1 of the sequence, 3' k-mers are those that end at the last position of the sequence, and i k-mers are those whose start and end positions are both internal to the sequence.

Any help would be much appreciated. Thanks in advance.

kmer • 1.2k views

ADD COMMENT • link 2.7 years ago by vivek37373 • 0

1

What's the step size? With a window size of 3 and a step size of 1, here is a possible solution. As for the 5', 3' and internal (i) labels, that information is redundant: the sequence is numbered from 5' to 3', so the label follows directly from the coordinates.

$ echo -e ">seq\nATGCATATATTATATA" | seqkit sliding -W 3 -s 1 | seqkit fx2tab | awk -v OFS="\t" -F ':|\t' '{print $3,$2}'

ATG 1-3
TGC 2-4
GCA 3-5
CAT 4-6
ATA 5-7
TAT 6-8
ATA 7-9
TAT 8-10
ATT 9-11
TTA 10-12
TAT 11-13
ATA 12-14
TAT 13-15
ATA 14-16

A slightly more advanced version, which groups identical k-mers and counts them:

$ echo -e ">seq\nATGCATATATTATATA" | seqkit sliding -W 3 -s 1 | seqkit fx2tab | awk -v OFS="\t" -F ':|\t' '{print $3,$2}' | datamash -sg1 collapse 2  count 2 | sort -Vk2,2

ATG 1-3 1
TGC 2-4 1
GCA 3-5 1
CAT 4-6 1
ATA 5-7,7-9,12-14,14-16 4
TAT 6-8,8-10,11-13,13-15 4
ATT 9-11 1
TTA 10-12 1

The last column is the number of times each k-mer occurs in the sequence.
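For the 5'/3'/i labels asked about in the question, here is a minimal Python sketch. It is not part of the seqkit pipeline above, and the function name `classify_kmers` and the output layout are hypothetical. It derives the label from the coordinates alone: a k-mer is 5' if it starts at position 1, 3' if it ends at the last position, and i otherwise. The step size is simply how far the window moves between consecutive k-mers; a step of 1 yields every overlapping k-mer.

```python
def classify_kmers(seq, k=3, step=1):
    """Return (kmer, start, end, origin) tuples with 1-based inclusive coordinates.

    Assumption: a k-mer spanning the whole sequence (k == len(seq)) is
    labelled 5', because the start test is applied first.
    """
    n = len(seq)
    records = []
    # Slide a window of width k along the sequence, moving `step` bases each time.
    for i in range(0, n - k + 1, step):
        start, end = i + 1, i + k
        if start == 1:
            origin = "5'"
        elif end == n:
            origin = "3'"
        else:
            origin = "i"
        records.append((seq[i:i + k], start, end, origin))
    return records


if __name__ == "__main__":
    for kmer, start, end, origin in classify_kmers("ATGCATATATTATATA"):
        print(f"{kmer}\t{start}:{end}\t{origin}")
```

With a step larger than 1, the last window may stop short of the end of the sequence, in which case no k-mer receives the 3' label.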

ADD REPLY • link 2.7 years ago by cpad0112 21k

0

Thank you so much, @cpad0112. This is useful, and I had almost reached this point using seqtk. However, I failed to classify the k-mers as 5', 3' or i, which is critical for my biological question and needed for further analysis. Also, I don't understand what a step size is, so I don't know the appropriate number to use here.

ADD REPLY • link 2.7 years ago by vivek37373 • 0

0

How do you want to encode repeated k-mers?

ADD REPLY • link 2.7 years ago by Michael 55k

0

I want to record all the k-mers, repeated or not, along with their locations, and later classify them as 5', 3' or i.

ADD REPLY • link 2.7 years ago by vivek37373 • 0

2

I see, but what is "i" supposed to mean? Also, extracting reverse (as opposed to reverse-complement) k-mers has no biological meaning, because DNA and RNA replication and transcription are directional processes, and sequences are conventionally written in the 5'-3' direction. Therefore I recommend handling directionality by running the same k-mer function on the forward sequence and on its reverse complement. How you need to treat strandedness depends on your application. The only k-mer application that comes to my mind where you need to store the coordinates of origin in addition to the count is repeat detection (as in RepeatScout). So it would help to know the bigger picture.

One more thing, and I am possibly being picky here: because k is constant, it is sufficient to store the start position. This may not matter for this toy example, but k-mer algorithms are often used at genome scale, and then it does matter.
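The two suggestions above (run the same function on both strands, and store only start positions) can be sketched in Python as follows. The helper names `reverse_complement` and `kmer_starts` are hypothetical, and the sketch assumes that reverse-strand positions are counted along the reverse-complemented sequence rather than mapped back to forward coordinates.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (assumes plain A/C/G/T input).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_starts(seq, k):
    """Map each k-mer to the list of its 1-based start positions.

    Because k is constant, the end is recoverable as start + k - 1,
    so only the start is stored.
    """
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i + 1)
    return dict(index)


if __name__ == "__main__":
    seq = "ATGCATATATTATATA"
    k = 3
    forward = kmer_starts(seq, k)
    # Same function, applied to the other strand.
    reverse = kmer_starts(reverse_complement(seq), k)
    for kmer, starts in forward.items():
        print("+", kmer, starts)
    for kmer, starts in reverse.items():
        print("-", kmer, starts)
```

Keeping one integer per occurrence instead of a start-end pair halves the coordinate storage, which is the point being made about genome-scale data.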

ADD REPLY • link 2.7 years ago by Michael 55k

0

I totally get your point, and it is relevant in this case too. I tried to use this one from @arnavm. It did give me the k-mers on the complementary strand, but I don't understand how to categorize them. In my case, the individual k-mers represent a class of molecule, and the sequences from which I want to extract them are individual genes.

By i k-mers, I mean those that neither start at position 1 nor end at the last nucleotide of the sequence (say, position 110).

ADD REPLY • link 2.7 years ago by vivek37373 • 0

1

If the little C++ I know does not betray me, the code already does everything you want and more. It also checks for palindromes, which in fact we hadn't thought about. However, I still fail to see how this is useful. I normally think of k-mer extraction as putting the k-mers into a data structure such as an associative array or a de Bruijn graph, organized by their sequence, so that they can be accessed efficiently and statistics can be calculated. A redundant output format with one line per position, repeated k-mers included, offers no real advantage for calculating statistics based on the k-mers.
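As an illustration of the associative-array view and of the palindrome check, here is a minimal Python sketch. It is not a port of the C++ code under discussion, which is not shown in this thread, and the names `count_kmers` and `is_rc_palindrome` are hypothetical. "Palindrome" is taken here to mean a k-mer that equals its own reverse complement.

```python
from collections import Counter

# Translation table for complementing DNA bases (assumes uppercase A/C/G/T).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_kmers(seq, k):
    """Associative array keyed by k-mer sequence, valued by occurrence count."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def is_rc_palindrome(kmer):
    """True if the k-mer reads the same on both strands, e.g. GAATTC."""
    return kmer == kmer.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    counts = count_kmers("ATGCATATATTATATA", 4)
    for kmer, n in counts.items():
        flag = "palindrome" if is_rc_palindrome(kmer) else ""
        print(kmer, n, flag)
```

For odd k, such as the 3-mers in the question, no DNA k-mer can be its own reverse complement, because the middle base would have to pair with itself. The check therefore only matters for even k.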

In addition, if you are looking at gene sequences, why consider the reverse complement at all?

ADD REPLY • link 2.7 years ago by Michael 55k

0

I think k-mers from the reverse complement could also be the molecules I'm looking for. These are exceptionally short sequences, and I would like to confirm them later using expression counts. Beyond obtaining all these k-mers, my main focus is to classify them. Unfortunately, I'm not strong in programming, but I'm trying my best to understand this.

ADD REPLY • link 2.7 years ago by vivek37373 • 0