Entering edit mode
4.0 years ago
anasjamshed
▴
140
I want to write a program that
- calculates the number of all kmers of a given length across all DNA sequences
- displays just the ones that occur more than a given a number of times.
I have tried this script:
import os
import sys
import shutil
# convert command line arguments to variables
kmer_size = int(sys.argv[1])
print(kmer_size)
count_cutoff = int(sys.argv[2])
# define the function to split dna
def split_dna(dna, kmer_size):
kmers = []
for start in range(0,len(dna)-kmer_size-1,1):
kmer = dna[start:start+kmer_size]
kmers.append(kmer)
return kmers
# create an empty dictionary to hold the counts
kmer_counts = {}
# process each file with the right name
for file_name in os.listdir("."):
if file_name.endswith(".fastq"):
dna_file = open(file_name)
# process each DNA sequence in a file
for line in dna_file:
dna = line.rstrip("\n")
# increase the count for each k-mer that we find
for kmer in split_dna(dna, kmer_size):
current_count = kmer_counts.get(kmer, 0)
new_count = current_count + 1
kmer_counts[kmer] = new_count
# print k-mers whose counts are above the cutoff
for kmer, count in kmer_counts.items():
if count > count_cutoff:
print(kmer + " : " + str(count))
But it gives an error:
ValueError Traceback (most recent call last)
<ipython-input-42-02b791e42fca> in <module>()
4
5 # convert command line arguments to variables
----> 6 kmer_size = int(sys.argv[1])
7 print(kmer_size)
8 count_cutoff = int(sys.argv[2])
ValueError: invalid literal for int() with base 10: '-f'
I have been trying from last 3 months I don't know he can I execute it? I can't change the type of any variable
Kindly help me
It looks like you're using ipython notebook - correct me if that's not the case.
How are you executing the script?
yes I am using python notebook
Save the script as a
.py
file and try executing it from the command line. It looks like you're not passing in command line arguments properly through the ipython interface.how can I run through command line after creating .py file
Please google "run python file on command line"
What arguments are you passing to this script? As the error notes,
-f
is not an integer (which would be your kmer size).yes it's not integer I know but I also tried to do by taking string but it also gimme error
There are a number of threads on Biostars about enumerating kmers. You should be able to adapt one of those fairly trivially:
How to count and compare k-mer count vectors (and print the top ten highest contributions)?
Finding 16 mer not present in GRCh38
A: Randomly sample motifs from reference sequence
See also the linked gist