K-Mer Counting From Single-End, Paired-End And Mate-Pair Reads
1
1
Entering edit mode
11.1 years ago
Alice ▴ 320

Hello, biostars!

I had one small issue, which grew up in a big question.

My issue is following: i'm going to count k-mers and it's frequency (may be depth) on mouse wgs by Illumina (in jellyfish software). Is it correct to combine 2 fastq files into one just by bash 'cut' command? And is it wise to state '-C' flag in jellyfish? Or i need to filter reads by quality before counting k-mers?

If i will do the same on SE reads, will i get similar or identical results? I understand, that for the genome it does not matter PE or SE we use. Correct me if i'm wrong, but as i think, between SE, PE and MP reads there will be difference in unique k-mers, because of ngs technology procedures? Thus, for k-mer counting the comparison of uniqueness value from SE and PE technologies is not correct.

ngs • 8.9k views
ADD COMMENT
0
Entering edit mode

I'm interested in this topic because I've been writing perl script for kmer counting recently. Can you please explain why you are wanting to count kmer frequency in NGS reads? I can share my script for easy kmer counting if you'd like, and if it would be helpful for your situation.

ADD REPLY
0
Entering edit mode

Firstly why do you want to mix two files ? Keep them single and check the k-mers individually and compare it after combining if needed. I think PE or SE doesn't matter much why can't you use more handy FASTQC tool for the K-mer counting i think it would be better if you use that

ADD REPLY
0
Entering edit mode

Good idea, may be i will do that, thank you!

ADD REPLY
2
Entering edit mode
11.1 years ago
JacobS ▴ 990

First, your simplest question: combine FASTQ files using cat instead of cut

 cat sample1.fastq sample2.fastq sample3.fastq > total.fastq

Second, I haven't used jellyfish or trinity very much at all, so maybe someone else can answer your question about the -C option. Same goes to whether you'll get different frequencies when counting kmers for SE vs PE vs MP reads. I would expect that the frequencies will be in the same ratio, but perhaps the best way to answer this question is empirically! Why don't you quickly count and review the kmer frequencies of R1/R2/SE/PE/MP and then discuss the results?

Here is a perl script I wrote for looking at frequencies of specific size kmers to get an idea about common kmers and read sequence complexity: https://github.com/jtshreve/BioScripts/blob/master/kmer_counting_in_FASTQ.pl

Give this a try and see how your files differ. If you need to check the metrics for all possible kmers instead of individual kmers, let me know and I can modify my script! Good luck.

Here is an example of my script's output using kmer size 7:

Used kmer_counting_in_FASTQ.pl to find kmer size 7 metrics
Reads file: 1M.fastq

Total number of reads: 1000000
Total number of kmers: 17618

Top 10 kmers combined:       7898965/45000000:  17.55% of all kmers
Top 100 kmers combined:      22211648/45000000:  49.36% of all kmers
Top 1000 kmers combined:     34938588/45000000:  77.64% of all kmers
Top 10000 kmers combined:    43949053/45000000:  97.66% of all kmers

Most common 5 kmers:
863948  TGGAATT
783133  TTCTCGG
729782  CGGGTGC
456897  AAGGAAC
429322  GATTCAG

Least common 5 kmers:
1       NCGCTGG
1       NTCGAAC
1       CAGTCNN
1       GAACNCN
1       NCCCGAT
ADD COMMENT
0
Entering edit mode

Thank you! Very useful script. I'm not familiar with perl (i prefer python usually), but i will try it!

ADD REPLY
0
Entering edit mode

Great, hopefully you'll like it. If you aren't too familiar with perl, no problem... download the linked file into your UNIX environment using something like wget https://raw.github.com/jtshreve/BioScripts/master/kmer_counting_in_FASTQ.pl, and then run the script like this: perl kmer_counting_in_FASTQ.pl my_sample.fastq 7 (this checks for kmers of size 7)

ADD REPLY
0
Entering edit mode

Jacob, I counted kmers by your script and get interesting results http://pastebin.com/HsJ3y2iH

First, many N in HiSeq reads, I guess, for kmer counting I should delete them. Second: 951272 ACACACA 865399 CACACAC vs. 893841 ACACACA 877559 CTGCTTG Two different files and on the second place 2 different words. In my opinion it is the sign that i need wisely concatenate two files, not just by "cat".

ADD REPLY

Login before adding your answer.

Traffic: 2995 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6