Extract origin location and name kmers from source sequence

Entering edit mode

3.4 years ago

vivek37373 • 0

Hello everyone,

Let's say I have sequence "ATGCATATATTATATA". I would like to generate all possible 3-mers of this sequence and list their corresponding location and name as such below-

sequence location Origin
ATG 1:3 5'

ATA 13:15 3'

TAT 6:8 i

Here, 5' kmers are those that begin at position 1 of the sequence, 3' kmers are those that end at last position of the sequnece and i kmers are those whose start and end positions are internal to the sequence.

Any help in this regard would be a big help! Thanks in advance.

kmer • 1.5k views

ADD COMMENT • link 3.4 years ago by vivek37373 • 0

Entering edit mode

what's the step size? with window size of 3 and step size of 1, here is the possible solution. As for the 5',3' and int, its a redundant information as the sequence is numbered from 5'to3'.

$ echo -e ">seq\nATGCATATATTATATA" | seqkit sliding -W 3 -s 1 | seqkit fx2tab | awk -v OFS="\t" -F ':|\t' '{print $3,$2}'

ATG 1-3
TGC 2-4
GCA 3-5
CAT 4-6
ATA 5-7
TAT 6-8
ATA 7-9
TAT 8-10
ATT 9-11
TTA 10-12
TAT 11-13
ATA 12-14
TAT 13-15
ATA 14-16

A little bit advanced:

$ echo -e ">seq\nATGCATATATTATATA" | seqkit sliding -W 3 -s 1 | seqkit fx2tab | awk -v OFS="\t" -F ':|\t' '{print $3,$2}' | datamash -sg1 collapse 2  count 2 | sort -Vk2,2

ATG 1-3 1
TGC 2-4 1
GCA 3-5 1
CAT 4-6 1
ATA 5-7,7-9,12-14,14-16 4
TAT 6-8,8-10,11-13,13-15    4
ATT 9-11    1
TTA 10-12   1

last column is count of same kmer appearing in sequence.

ADD REPLY • link 3.4 years ago by cpad0112 21k

Entering edit mode

Thank you so much, @cpad0112. This is useful and i was almost getting to this point using seqtk . However, i failed to classify them as 5', 3' or i which relevance to my target biological question is critical information needed for further analysis. And btw i don't understand, what a step size is. So, i don't know the appropriate number to indicate here.

ADD REPLY • link 3.4 years ago by vivek37373 • 0

Entering edit mode

How do you want to encode repeated k-mers?

ADD REPLY • link 3.4 years ago by Michael 56k

Entering edit mode

I want to note down all the kmers whether repetitive or not, in terms of their location and later, classify it as 5', 3' or i.

ADD REPLY • link 3.4 years ago by vivek37373 • 0

Entering edit mode

I see, but what is i supposed to mean? Also, extracting reverse (unlike reverse complement) k-mers has no biological meaning because DNA or RNA replication and transcription are directional processes, and sequences are commonly understood as being noted in 5'-3' direction. Therefore I recommend treating directionality by running the same k-mer function on the forward sequence and its reverse complement. It depends on your application case how you need to treat strandedness. The only k-mer application that comes to my mind where you need to store the coordinates of origin in addition to the count is repeat detection (like in RepeatScout). So it might help to know the bigger picture.

One more thing, and I am possibly being picky here, because k is constant it is sufficient to store the start position. This may not matter for this toy example, but k-mer algorithms are often used at genome-scale, and then it does matter.

ADD REPLY • link 3.4 years ago by Michael 56k

Entering edit mode

I totally get your point and it is relevant in this case too. I tried to use this one from @arnavm arnavm

	/*
	Copyright 2019 Avinash Ramu & Arnav Moudgil

	Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

	The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

	THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
	*/

	// Takes a kmer and fasta as input and lists all positions
	// within the fasta for that kmer.
	// Compilation: g++ kmer.cc -o kmer
	// Example usage: ./kmer hg38.fa AT
	#include <iostream>
	#include <fstream>
	#include <sstream>
	#include <algorithm>

	using namespace std;

	// Function that returns reverse complement of a DNA sequence
	string reverseComplement(string seq) {
	string rcseq = seq.substr(); // First copy seq
	reverse(rcseq.begin(), rcseq.end()); // Reverse seq
	for (std::string::size_type i = 0; i < rcseq.size(); i++) {
	switch (rcseq[i]) { // Complement bases
	case 'A':
	rcseq[i] = 'T';
	break;
	case 'C':
	rcseq[i] = 'G';
	break;
	case 'G':
	rcseq[i] = 'C';
	break;
	case 'T':
	rcseq[i] = 'A';
	break;
	// case 'N' is not explicitly handled as N's would remain N
	}
	}
	return rcseq;
	}

	void printKmerEntry(string contig, string kmer, unsigned long position, unsigned long count, char strand, char delim) {
	cout << contig << delim // chrom
	<< position << delim // start
	<< position + kmer.size() << delim // end
	<< contig << '.' << kmer << '.' << count << delim // kmer count
	<< '.' << delim // value (not used)
	<< strand << endl; // strand
	}

	int main(int argc, char* argv[]) {
	if (argc != 3) {
	cerr << "Example usage: kmer hg38.fa AT";
	return 1;
	}
	string kmer = argv[2];
	// Convert to upper-case
	std::transform(kmer.begin(), kmer.end(), kmer.begin(), ::toupper);

	// Check if kmer is a palindrome
	bool palindrome;
	if (kmer == reverseComplement(kmer)) palindrome = true;
	else palindrome = false;

	// Read the FASTA
	ifstream fasta_fh(argv[1]);
	std::string line, contig, prev_line, query;
	unsigned long position; // character position in entry
	char delim = '\t', strand; // delimiter
	unsigned long count; // count of how many hits we have

	while(std::getline(fasta_fh, line).good()) {
	if (line[0] == '>') {
	istringstream iss(line.substr(1));
	iss >> contig; // Get the contig name
	prev_line = "";
	position = 0; // Enables 0-indexing
	count = 0;
	continue;
	}
	// Convert line to upper-case for case-insensitive matching
	std::transform(line.begin(), line.end(), line.begin(), ::toupper);
	// Concatenate trailing part of previous line
	line = prev_line + line;
	// Maximum index to scan along line
	long maxIndex = line.size() - (kmer.size() - 1);

	if (maxIndex > 0) { // Line is longer than kmer
	for (std::string::size_type i = 0; i < maxIndex; i++) {
	// Get a kmer-sized substring
	query = line.substr(i, kmer.size());
	// Check if query string matches kmer
	if (query == kmer) {
	strand = '+';
	count++;
	printKmerEntry(contig, kmer, position, count, strand, delim);
	}
	// Check reverse complement if kmer is not palindrome
	if (!palindrome && reverseComplement(query) == kmer) {
	strand = '-';
	count++;
	printKmerEntry(contig, kmer, position, count, strand, delim);
	}
	position++;
	}
	prev_line = line.substr(line.size() - (kmer.size() - 1), kmer.size() - 1);
	}
	else { // kmer is longer than line
	prev_line = line.substr();
	}
	}

	return 0;
	}

view raw kmer.cc hosted with ❤ by GitHub

. It did get me the ones in complementary but i dont understand how to categorize it. In my case, these individual kmers represents a class of molecule and the sequence from which I want extract this are individual genes.

And by i kmers, it would be those which does not start (1) and/or end (lets say the last nucleotide of the sequnece, 110) at these positions.

ADD REPLY • link 3.4 years ago by vivek37373 • 0

Entering edit mode

If the little C++ I know does not betray me, the code does everything you want and more already. It also checks for palindromes, which in fact we hadn't thought about. However, I fail to see yet how this is useful. I normally think of k-mer extraction of putting the k-mers into a data structure like an associative array or a DeBruijn Graph, organized by their sequence, such that they can be efficiently accessed and statistics can be calculated. Using the redundant output format with one line per position and also for repeated k-mers, doesn't really have any advantage for calculating any statistics based on the k-mers.

In addition, if you are looking at gene sequences, why consider the reverse complement at all?

ADD REPLY • link 3.4 years ago by Michael 56k

Entering edit mode

I see that kmers of the reverse complement could also possibly be a potential molecule I'm looking for. And these are exceptionally short sequences and I would like to confirm these later using expression count. Other than getting all these kmers, my major focus is to classify them. Unfortunately, I'm not sound in programming but I'm trying my best to understand this.

ADD REPLY • link 3.4 years ago by vivek37373 • 0