Hello,
I need to extract kmers and their counts per contig from an assembly file, and I was wondering what the most efficient way to do this would be.
For previous full-genome kmer counts I've used BBTools kmercountexact.sh, and I have considered ways to feed each scaffold into that program, but I have two issues with that potential solution. The first is the sheer number of output files that would result, although I guess I could just cat them all at the end. The second is that I am very unfamiliar with awk/bioawk. I know bioawk lets you extract sequences very easily, but I don't know how to set up a for loop with awk/bioawk to do this and then pipe the contigs into another program.
Would anyone be kind enough to help me with this or direct me to a more appropriate solution?
Thank you!
You mean split the multifasta file into individual contigs? See here: Splitting A Fasta File
Hi Asaf,
Not really. I want to pipe each contig into kmer counting software. I could split them into multiple files and feed them in individually, which would certainly achieve what I need, but I imagine that has a high I/O cost and isn't very efficient.
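One way to avoid both the per-contig files and the awk loop is to read the multi-FASTA once and count kmers contig by contig in memory. Below is a minimal Python sketch of that idea, not a replacement for kmercountexact.sh: the function names (`read_fasta`, `count_kmers`, `main`) are made up for illustration, it assumes plain uncompressed FASTA, it counts kmers on the forward strand only (no canonical/reverse-complement merging), and it skips any kmer containing N. It will be much slower than a dedicated counter on large assemblies.

```python
import sys
from collections import Counter


def read_fasta(handle):
    """Yield (contig_name, sequence) pairs from a FASTA stream, one contig at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header as the name.
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def count_kmers(seq, k):
    """Count forward-strand kmers of length k, skipping any that contain N."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            counts[kmer] += 1
    return counts


def main(path, k, out=sys.stdout):
    """Write one 'contig<TAB>kmer<TAB>count' line per distinct kmer per contig."""
    with open(path) as handle:
        for name, seq in read_fasta(handle):
            for kmer, n in sorted(count_kmers(seq, k).items()):
                out.write(f"{name}\t{kmer}\t{n}\n")


if __name__ == "__main__":
    # Hypothetical usage: python contig_kmers.py assembly.fasta 21 > counts.tsv
    if len(sys.argv) == 3:
        main(sys.argv[1], int(sys.argv[2]))
```

Everything ends up in a single tab-separated table keyed by contig name, so there is nothing to cat together afterwards, and only one contig's counts are held in memory at a time.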