Question

K-Mer Counting From Single-End, Paired-End And Mate-Pair Reads

1

11.5 years ago

Alice ▴ 320

Hello, biostars!

I had one small issue that has grown into a bigger question.

My issue is as follows: I'm going to count k-mers and their frequencies (maybe depth) in Illumina mouse WGS data using Jellyfish. Is it correct to combine the two FASTQ files into one with the bash 'cut' command? Is it wise to set the '-C' flag in Jellyfish? Or do I need to filter reads by quality before counting k-mers?

If I do the same with SE reads, will I get similar or identical results? I understand that for the genome itself it does not matter whether we use PE or SE reads. Correct me if I'm wrong, but I think the number of unique k-mers will differ between SE, PE and MP reads because of the procedures specific to each NGS technology. If so, comparing unique k-mer counts between SE and PE data would not be valid.

ngs • 9.2k views

updated 11.5 years ago by JacobS ▴ 990 • written 11.5 years ago by Alice ▴ 320

0

I'm interested in this topic because I've been writing a Perl script for kmer counting recently. Can you please explain why you want to count kmer frequencies in NGS reads? I can share my script for easy kmer counting if you'd like, in case it would be helpful for your situation.

11.5 years ago by JacobS ▴ 990

0

First, why do you want to mix the two files? Keep them separate, check the k-mers in each file individually, and compare them after combining if needed. I think PE vs. SE doesn't matter much. Why not use the handier FastQC tool for the k-mer counting? I think that would be better.

11.5 years ago by ancient_learner ▴ 680

0

Good idea, maybe I will do that, thank you!

11.5 years ago by Alice ▴ 320

score 2 · Answer 1 · 2013-10-16

First, your simplest question: combine FASTQ files using cat instead of cut:

 cat sample1.fastq sample2.fastq sample3.fastq > total.fastq
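If you want to double-check the merged file, a quick sanity check is to compare read counts before and after concatenating. This is only a minimal sketch: `count_reads` is a made-up helper name, the file names are placeholders, and it assumes uncompressed FASTQ with exactly 4 lines per read.

```shell
# Hypothetical helper (not part of any tool): count reads in an
# uncompressed FASTQ file, assuming exactly 4 lines per record.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Example usage (file names are placeholders):
#   cat sample_R1.fastq sample_R2.fastq > total.fastq
#   count_reads sample_R1.fastq
#   count_reads sample_R2.fastq
#   count_reads total.fastq    # should equal the sum of the two counts above
```

If the count for the merged file is not the sum of the inputs, or is not a whole number, one of the files is probably truncated or not in 4-line format.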

Second, I haven't used Jellyfish or Trinity very much at all, so maybe someone else can answer your question about the -C option. The same goes for whether you'll get different frequencies when counting kmers for SE vs PE vs MP reads. I would expect the frequencies to be in the same ratio, but perhaps the best way to answer this question is empirically! Why don't you quickly count and review the kmer frequencies of R1/R2/SE/PE/MP and then discuss the results?

Here is a Perl script I wrote for looking at frequencies of specific size kmers to get an idea about common kmers and read sequence complexity: https://github.com/jtshreve/BioScripts/blob/master/kmer_counting_in_FASTQ.pl
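On the -C question, for what it's worth: Jellyfish's `-C` option counts canonical k-mers, meaning a k-mer and its reverse complement are counted as the same k-mer. Since WGS reads come from both strands (and R2 is read from the opposite strand to R1), that is usually what you want when comparing R1/R2/SE/PE/MP counts. The sketch below only illustrates the idea in awk; it is not how Jellyfish is implemented, `canonical_kmer_counts` is a made-up name, and it assumes uncompressed 4-line FASTQ with upper-case bases.

```shell
# Illustration only: count canonical k-mers, i.e. each k-mer is stored under
# whichever of {k-mer, reverse complement} sorts first alphabetically.
# $1 = k, $2 = FASTQ file; prints "count<TAB>kmer", most frequent first.
canonical_kmer_counts() {
    awk -v k="$1" '
        BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A" }
        function revcomp(s,    i, out) {
            out = ""
            for (i = length(s); i >= 1; i--) out = out comp[substr(s, i, 1)]
            return out
        }
        NR % 4 == 2 {                              # sequence line of each record
            for (i = 1; i + k - 1 <= length($0); i++) {
                fwd = substr($0, i, k)
                if (index(fwd, "N")) continue      # skip k-mers with ambiguous bases
                rc = revcomp(fwd)
                counts[(fwd < rc) ? fwd : rc]++
            }
        }
        END { for (m in counts) printf "%d\t%s\n", counts[m], m }
    ' "$2" | sort -k1,1nr -k2,2
}
```

For example, running `canonical_kmer_counts 2` on a single read `ACGT` gives AC twice (AC and its reverse complement GT) and CG once. With Jellyfish itself the equivalent would typically be something like `jellyfish count -C -m 21 -s 100M -t 4 -o mer_counts.jf total.fastq` followed by `jellyfish histo mer_counts.jf`; the k-mer size, hash size and thread count there are placeholder values, so check `jellyfish count --help` for your version.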

Give this a try and see how your files differ. If you need to check the metrics for all possible kmers instead of individual kmers, let me know and I can modify my script! Good luck.

Here is an example of my script's output using kmer size 7:

Used kmer_counting_in_FASTQ.pl to find kmer size 7 metrics
Reads file: 1M.fastq

Total number of reads: 1000000
Total number of kmers: 17618

Top 10 kmers combined:       7898965/45000000:  17.55% of all kmers
Top 100 kmers combined:      22211648/45000000:  49.36% of all kmers
Top 1000 kmers combined:     34938588/45000000:  77.64% of all kmers
Top 10000 kmers combined:    43949053/45000000:  97.66% of all kmers

Most common 5 kmers:
863948  TGGAATT
783133  TTCTCGG
729782  CGGGTGC
456897  AAGGAAC
429322  GATTCAG

Least common 5 kmers:
1       NCGCTGG
1       NTCGAAC
1       CAGTCNN
1       GAACNCN
1       NCCCGAT
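If you end up with a kmer count table from another tool, the "Top N kmers combined" lines above can be approximated with a few lines of shell. This is a rough sketch rather than what my Perl script does internally: `top_n_share` is a made-up name, and it assumes the input is "count<TAB>kmer" lines on stdin.

```shell
# Hypothetical helper: share of all k-mer occurrences covered by the
# N most frequent k-mers. $1 = N; reads "count<TAB>kmer" lines from stdin.
top_n_share() {
    sort -k1,1nr | awk -v n="$1" '
        { total += $1; if (NR <= n) top += $1 }    # input is sorted by count, descending
        END {
            if (total > 0)
                printf "%d/%d: %.2f%%\n", top, total, 100 * top / total
        }
    '
}

# Example usage (counts.tsv is a placeholder file name):
#   top_n_share 10 < counts.tsv
```

Running this on the R1, R2 and combined tables side by side is one quick way to see whether the files differ in complexity.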