Dear BioStars users,
I would like to extract from my paired-end FASTQ files how many times each read occurs in the file.
The output could look like this:
read (sequence) - how many times it was found in the FASTQ file:
CCGGCTCGC - 140x
CTTCGCGCC - 2x
I tried using awk to compare all reads to each other, but it does not work very well :-(
Is there any tool, or any idea, for comparing all reads to each other and extracting how many times each read occurs in the FASTQ file?
Thank you so much for any ideas and help! I hope my question is clear.
Paul.
Have you tried just using FastQC, which does k-mer enrichment as part of its normal workflow?
Thank you very much for the reply.
I tried FastQC and it works perfectly fine. Its overrepresented sequences module is great, but any reads over 75bp in length are truncated to 50bp for the purposes of this analysis. I would like to compare the reads over their full length (maybe allowing a few mismatches).
It is hard to say whether that is possible.
Ah, I hadn't realized that. Other options you might look into are this program or the QRQC package in Bioconductor, which is also supposed to do k-mer calculations.
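If none of the tools fit, counting full-length reads directly is not much code. Below is a minimal Python sketch, not a tested pipeline. It assumes a standard FASTQ with exactly four lines per record (no wrapped sequences) and that each mate file is processed separately. The function names `count_reads`, `hamming` and `merge_similar`, and the `max_mismatches` parameter, are made up for this illustration. The mismatch step is a simple greedy grouping of equal-length reads by Hamming distance; it compares every unique read to every group, so it will be slow on large files.

```python
from collections import Counter
import gzip


def count_reads(path):
    """Count exact occurrences of each read sequence in a 4-line-per-record FASTQ."""
    opener = gzip.open if path.endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the sequence is the second line of each record
                counts[line.strip()] += 1
    return counts


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def merge_similar(counts, max_mismatches=1):
    """Greedily fold each read into the most abundant read within max_mismatches.

    Assumption for this sketch: only reads of identical length are compared,
    and the more abundant read is kept as the representative.
    """
    merged = Counter()
    for seq, n in counts.most_common():  # most abundant reads first
        for rep in merged:
            if len(rep) == len(seq) and hamming(rep, seq) <= max_mismatches:
                merged[rep] += n
                break
        else:
            merged[seq] = n  # no close representative found, start a new group
    return merged
```

With a hypothetical file name, `count_reads("reads_R1.fastq").most_common(10)` would list the ten most frequent reads with their counts, and `merge_similar(...)` can be applied to that result if near-identical reads should be pooled.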