Extract origin location and name kmers from source sequence
0
0
Entering edit mode
3.4 years ago
vivek37373 • 0

Hello everyone,

Let's say I have sequence "ATGCATATATTATATA". I would like to generate all possible 3-mers of this sequence and list their corresponding location and name as such below-

sequence location Origin
ATG 1:3 5'

ATA 13:15 3'

TAT 6:8 i

Here, 5' kmers are those that begin at position 1 of the sequence, 3' kmers are those that end at last position of the sequnece and i kmers are those whose start and end positions are internal to the sequence.

Any help in this regard would be a big help! Thanks in advance.

kmer • 1.5k views
ADD COMMENT
1
Entering edit mode

what's the step size? with window size of 3 and step size of 1, here is the possible solution. As for the 5',3' and int, its a redundant information as the sequence is numbered from 5'to3'.

$ echo -e ">seq\nATGCATATATTATATA" | seqkit sliding -W 3 -s 1 | seqkit fx2tab | awk -v OFS="\t" -F ':|\t' '{print $3,$2}'

ATG 1-3
TGC 2-4
GCA 3-5
CAT 4-6
ATA 5-7
TAT 6-8
ATA 7-9
TAT 8-10
ATT 9-11
TTA 10-12
TAT 11-13
ATA 12-14
TAT 13-15
ATA 14-16

A little bit advanced:

$ echo -e ">seq\nATGCATATATTATATA" | seqkit sliding -W 3 -s 1 | seqkit fx2tab | awk -v OFS="\t" -F ':|\t' '{print $3,$2}' | datamash -sg1 collapse 2  count 2 | sort -Vk2,2

ATG 1-3 1
TGC 2-4 1
GCA 3-5 1
CAT 4-6 1
ATA 5-7,7-9,12-14,14-16 4
TAT 6-8,8-10,11-13,13-15    4
ATT 9-11    1
TTA 10-12   1

last column is count of same kmer appearing in sequence.

ADD REPLY
0
Entering edit mode

Thank you so much, @cpad0112. This is useful and i was almost getting to this point using seqtk . However, i failed to classify them as 5', 3' or i which relevance to my target biological question is critical information needed for further analysis. And btw i don't understand, what a step size is. So, i don't know the appropriate number to indicate here.

ADD REPLY
0
Entering edit mode

How do you want to encode repeated k-mers?

ADD REPLY
0
Entering edit mode

I want to note down all the kmers whether repetitive or not, in terms of their location and later, classify it as 5', 3' or i.

ADD REPLY
2
Entering edit mode

I see, but what is i supposed to mean? Also, extracting reverse (unlike reverse complement) k-mers has no biological meaning because DNA or RNA replication and transcription are directional processes, and sequences are commonly understood as being noted in 5'-3' direction. Therefore I recommend treating directionality by running the same k-mer function on the forward sequence and its reverse complement. It depends on your application case how you need to treat strandedness. The only k-mer application that comes to my mind where you need to store the coordinates of origin in addition to the count is repeat detection (like in RepeatScout). So it might help to know the bigger picture.

One more thing, and I am possibly being picky here, because k is constant it is sufficient to store the start position. This may not matter for this toy example, but k-mer algorithms are often used at genome-scale, and then it does matter.

ADD REPLY
0
Entering edit mode

I totally get your point and it is relevant in this case too. I tried to use this one from @arnavm arnavm

/*
Copyright 2019 Avinash Ramu & Arnav Moudgil
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
// Takes a kmer and fasta as input and lists all positions
// within the fasta for that kmer.
// Compilation: g++ kmer.cc -o kmer
// Example usage: ./kmer hg38.fa AT
#include <iostream>
#include <fstream>
#include <sstream>
#include <algorithm>
using namespace std;
// Function that returns reverse complement of a DNA sequence
string reverseComplement(string seq) {
string rcseq = seq.substr(); // First copy seq
reverse(rcseq.begin(), rcseq.end()); // Reverse seq
for (std::string::size_type i = 0; i < rcseq.size(); i++) {
switch (rcseq[i]) { // Complement bases
case 'A':
rcseq[i] = 'T';
break;
case 'C':
rcseq[i] = 'G';
break;
case 'G':
rcseq[i] = 'C';
break;
case 'T':
rcseq[i] = 'A';
break;
// case 'N' is not explicitly handled as N's would remain N
}
}
return rcseq;
}
void printKmerEntry(string contig, string kmer, unsigned long position, unsigned long count, char strand, char delim) {
cout << contig << delim // chrom
<< position << delim // start
<< position + kmer.size() << delim // end
<< contig << '.' << kmer << '.' << count << delim // kmer count
<< '.' << delim // value (not used)
<< strand << endl; // strand
}
int main(int argc, char* argv[]) {
if (argc != 3) {
cerr << "Example usage: kmer hg38.fa AT";
return 1;
}
string kmer = argv[2];
// Convert to upper-case
std::transform(kmer.begin(), kmer.end(), kmer.begin(), ::toupper);
// Check if kmer is a palindrome
bool palindrome;
if (kmer == reverseComplement(kmer)) palindrome = true;
else palindrome = false;
// Read the FASTA
ifstream fasta_fh(argv[1]);
std::string line, contig, prev_line, query;
unsigned long position; // character position in entry
char delim = '\t', strand; // delimiter
unsigned long count; // count of how many hits we have
while(std::getline(fasta_fh, line).good()) {
if (line[0] == '>') {
istringstream iss(line.substr(1));
iss >> contig; // Get the contig name
prev_line = "";
position = 0; // Enables 0-indexing
count = 0;
continue;
}
// Convert line to upper-case for case-insensitive matching
std::transform(line.begin(), line.end(), line.begin(), ::toupper);
// Concatenate trailing part of previous line
line = prev_line + line;
// Maximum index to scan along line
long maxIndex = line.size() - (kmer.size() - 1);
if (maxIndex > 0) { // Line is longer than kmer
for (std::string::size_type i = 0; i < maxIndex; i++) {
// Get a kmer-sized substring
query = line.substr(i, kmer.size());
// Check if query string matches kmer
if (query == kmer) {
strand = '+';
count++;
printKmerEntry(contig, kmer, position, count, strand, delim);
}
// Check reverse complement if kmer is not palindrome
if (!palindrome && reverseComplement(query) == kmer) {
strand = '-';
count++;
printKmerEntry(contig, kmer, position, count, strand, delim);
}
position++;
}
prev_line = line.substr(line.size() - (kmer.size() - 1), kmer.size() - 1);
}
else { // kmer is longer than line
prev_line = line.substr();
}
}
return 0;
}
view raw kmer.cc hosted with ❤ by GitHub
. It did get me the ones in complementary but i dont understand how to categorize it. In my case, these individual kmers represents a class of molecule and the sequence from which I want extract this are individual genes.

And by i kmers, it would be those which does not start (1) and/or end (lets say the last nucleotide of the sequnece, 110) at these positions.

ADD REPLY
1
Entering edit mode

If the little C++ I know does not betray me, the code does everything you want and more already. It also checks for palindromes, which in fact we hadn't thought about. However, I fail to see yet how this is useful. I normally think of k-mer extraction of putting the k-mers into a data structure like an associative array or a DeBruijn Graph, organized by their sequence, such that they can be efficiently accessed and statistics can be calculated. Using the redundant output format with one line per position and also for repeated k-mers, doesn't really have any advantage for calculating any statistics based on the k-mers.

In addition, if you are looking at gene sequences, why consider the reverse complement at all?

ADD REPLY
0
Entering edit mode

I see that kmers of the reverse complement could also possibly be a potential molecule I'm looking for. And these are exceptionally short sequences and I would like to confirm these later using expression count. Other than getting all these kmers, my major focus is to classify them. Unfortunately, I'm not sound in programming but I'm trying my best to understand this.

ADD REPLY

Login before adding your answer.

Traffic: 1933 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6