I have a fastq file. I want to have k-mers and their abundance numbers. can you give me some tools to do this! thank you
I have a fastq file. I want to have k-mers and their abundance numbers. can you give me some tools to do this! thank you
If you're working with small values of k, I wrote a command-line k-mer counter called kmer-counter
that will output results in a form that a Python script can consume:
https://github.com/alexpreynolds/kmer-counter
You can grab, build and install it like so:
$ git clone https://github.com/alexpreynolds/kmer-counter.git
$ cd kmer-counter
$ make
$ cp kmer-counter /usr/local/bin
Once the binary is in your path, you might use it in Python like so:
k = 6
fastaFile = '/path/to/some/seqs.fa'
kmerCmd = 'kmer-counter --fasta --k=%d %s' % (k, fastaFile)
try:
output = subprocess.check_output(kmerCmd, shell=True)
result = {}
for line in output.splitlines():
(header, counts) = line.strip().split('\t')
header = header[1:]
kmers = dict((k,int(v)) for (k,v) in [d.split(':') for d in counts.split(' ')])
result[header] = kmers
sys.stdout.write("%s" % (str(result)))
except subprocess.CalledProcessError as error:
sys.stderr.write("%s" % (str(error)))
Given example FASTA like this:
>foo
TTAACG
>bar
GTGGAAGTTCTTAGGGCATGGCAAAGAGTCAGAATTTGAC
For k=6, you would get an iterable Python dictionary like this:
{'foo': {'TTAACG': 1, 'CGTTAA': 1}, 'bar': {'GTTCTT': 1, 'AGAACT': 1, 'GAGTCA': 1, 'ATGGCA': 1, 'GAACTT': 1, 'ATTCTG': 1, 'CTAAGA': 1, 'CTTCCA': 1, 'ATTTGA': 1, 'GGAAGT': 1, 'AGGGCA': 1, 'CCTAAG': 1, 'CTCTTT': 1, 'AATTTG': 1, 'TCTGAC': 1, 'TTTGCC': 1, 'CTTAGG': 1, 'TTTGAC': 1, 'GAAGTT': 1, 'CCCTAA': 1, 'AGAATT': 1, 'AGTCAG': 1, 'CTGACT': 1, 'TCTTAG': 1, 'CGTTAA': 1, 'GTGGAA': 1, 'TGCCAT': 1, 'ACTCTT': 1, 'GGGCAT': 1, 'TTAGGG': 1, 'CTTTGC': 1, 'TGGAAG': 1, 'GACTCT': 1, 'CATGCC': 1, 'GCAAAG': 1, 'AAATTC': 1, 'GTCAAA': 1, 'TGACTC': 1, 'TAGGGC': 1, 'AAGTTC': 1, 'ATGCCC': 1, 'TCAAAT': 1, 'CAAAGA': 1, 'AACTTC': 1, 'GTCAGA': 1, 'CAAATT': 1, 'TAAGAA': 1, 'CATGGC': 1, 'AAGAAC': 1, 'AAGAGT': 1, 'TCTTTG': 1, 'TTCCAC': 1, 'TGGCAA': 1, 'GGCAAA': 1, 'AGTTCT': 1, 'AGAGTC': 1, 'TCAGAA': 1, 'GAATTT': 1, 'AAAGAG': 1, 'TGCCCT': 1, 'CCATGC': 1, 'GGCATG': 1, 'TTGCCA': 1, 'CAGAAT': 1, 'AATTCT': 1, 'GCATGG': 1, 'ACTTCC': 1, 'TTCTTA': 1, 'GCCATG': 1, 'GCCCTA': 1, 'TTCTGA': 1}}
You can use standard Python calls to manipulate this dictionary object and get sums of counts per record, for sequence, etc. which seems to answer your question. Please feel free to clarify what you're looking for if this object representation is not clear.
For a fully-fleshed out demonstration, see: https://github.com/alexpreynolds/kmer-counter/blob/master/test/kmer-test.py
If you are working with large values of k, then you might look at other tools, like Jellyfish. They use data structures to give an approximate count for large values of k that are not tenable for strict word counts.
You can use Jellyfish, KMC3, kmerfreq, BFCounter, ntCard . . .
i would prefer this one: https://github.com/lh3/yak by Heng Li
A canonical 31-mer can be represented with a 64-bit unsigned integer. Larger word lengths could be used at the expense of more memory and time, so that's a common cutoff when using that data structure to store a kmer key for presence/absence or counts.
If you use a larger k, tools like Jellyfish use a data structure called a Bloom filter (https://en.wikipedia.org/wiki/Bloom_filter) to be able to manage performance and storage, with the trade-off of giving an inexact count.
Just point the binary at the file. See https://github.com/alexpreynolds/kmer-counter/blob/master/test/makefile#L19 for a basic example. If you have a lot of files, iterate through a bash array or similar scripted approach.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
What were some tools you came across when you googled how to do this?
i have not google it , i ll try