Extract origin location and name kmers from source sequence
0
0
Entering edit mode
2.7 years ago
vivek37373 • 0

Hello everyone,

Let's say I have sequence "ATGCATATATTATATA". I would like to generate all possible 3-mers of this sequence and list their corresponding location and name as such below-

sequence location Origin
ATG 1:3 5'

ATA 13:15 3'

TAT 6:8 i

Here, 5' kmers are those that begin at position 1 of the sequence, 3' kmers are those that end at last position of the sequnece and i kmers are those whose start and end positions are internal to the sequence.

Any help in this regard would be a big help! Thanks in advance.

kmer • 1.2k views
ADD COMMENT
1
Entering edit mode

what's the step size? with window size of 3 and step size of 1, here is the possible solution. As for the 5',3' and int, its a redundant information as the sequence is numbered from 5'to3'.

$ echo -e ">seq\nATGCATATATTATATA" | seqkit sliding -W 3 -s 1 | seqkit fx2tab | awk -v OFS="\t" -F ':|\t' '{print $3,$2}'

ATG 1-3
TGC 2-4
GCA 3-5
CAT 4-6
ATA 5-7
TAT 6-8
ATA 7-9
TAT 8-10
ATT 9-11
TTA 10-12
TAT 11-13
ATA 12-14
TAT 13-15
ATA 14-16

A little bit advanced:

$ echo -e ">seq\nATGCATATATTATATA" | seqkit sliding -W 3 -s 1 | seqkit fx2tab | awk -v OFS="\t" -F ':|\t' '{print $3,$2}' | datamash -sg1 collapse 2  count 2 | sort -Vk2,2

ATG 1-3 1
TGC 2-4 1
GCA 3-5 1
CAT 4-6 1
ATA 5-7,7-9,12-14,14-16 4
TAT 6-8,8-10,11-13,13-15    4
ATT 9-11    1
TTA 10-12   1

last column is count of same kmer appearing in sequence.

ADD REPLY
0
Entering edit mode

Thank you so much, @cpad0112. This is useful and i was almost getting to this point using seqtk . However, i failed to classify them as 5', 3' or i which relevance to my target biological question is critical information needed for further analysis. And btw i don't understand, what a step size is. So, i don't know the appropriate number to indicate here.

ADD REPLY
0
Entering edit mode

How do you want to encode repeated k-mers?

ADD REPLY
0
Entering edit mode

I want to note down all the kmers whether repetitive or not, in terms of their location and later, classify it as 5', 3' or i.

ADD REPLY
2
Entering edit mode

I see, but what is i supposed to mean? Also, extracting reverse (unlike reverse complement) k-mers has no biological meaning because DNA or RNA replication and transcription are directional processes, and sequences are commonly understood as being noted in 5'-3' direction. Therefore I recommend treating directionality by running the same k-mer function on the forward sequence and its reverse complement. It depends on your application case how you need to treat strandedness. The only k-mer application that comes to my mind where you need to store the coordinates of origin in addition to the count is repeat detection (like in RepeatScout). So it might help to know the bigger picture.

One more thing, and I am possibly being picky here, because k is constant it is sufficient to store the start position. This may not matter for this toy example, but k-mer algorithms are often used at genome-scale, and then it does matter.

ADD REPLY
0
Entering edit mode

I totally get your point and it is relevant in this case too. I tried to use this one from @arnavm arnavm . It did get me the ones in complementary but i dont understand how to categorize it. In my case, these individual kmers represents a class of molecule and the sequence from which I want extract this are individual genes.

And by i kmers, it would be those which does not start (1) and/or end (lets say the last nucleotide of the sequnece, 110) at these positions.

ADD REPLY
1
Entering edit mode

If the little C++ I know does not betray me, the code does everything you want and more already. It also checks for palindromes, which in fact we hadn't thought about. However, I fail to see yet how this is useful. I normally think of k-mer extraction of putting the k-mers into a data structure like an associative array or a DeBruijn Graph, organized by their sequence, such that they can be efficiently accessed and statistics can be calculated. Using the redundant output format with one line per position and also for repeated k-mers, doesn't really have any advantage for calculating any statistics based on the k-mers.

In addition, if you are looking at gene sequences, why consider the reverse complement at all?

ADD REPLY
0
Entering edit mode

I see that kmers of the reverse complement could also possibly be a potential molecule I'm looking for. And these are exceptionally short sequences and I would like to confirm these later using expression count. Other than getting all these kmers, my major focus is to classify them. Unfortunately, I'm not sound in programming but I'm trying my best to understand this.

ADD REPLY

Login before adding your answer.

Traffic: 1595 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6