Question

How can I cluster reads within a FASTQ file? Is there any tool?

0

7.8 years ago

bioinforesearchquestions ▴ 370

Hi folks,

I am planning to cluster similar reads from a fastq file.

For example:

read1 (length-30): AGTCGATCGATCGAGTCTGCGTCGATCGGG(30 bases)

read2 (length-28): AGTCGATCGATCGAGTCTGCGTCGATCG (28 bases are matching)

read3 (length-25): - - - CGATCGATCGAGTCTGCGTCGAT - - (25 bases are matching)

read4 (length-30): CGAGTCTGCGTCTCGAGTCTTCGAGTCTGA (30 bases)

read5 (length-27): CGAGTCTGCGTCTCGAGTCTTCGAGTC (27 bases are matching)

read6 (length-23): - - -GTCTGCGTCTCGAGTCTTCGA - - - - (23 bases are matching)

readN: ATCGATCGAGTCTGCGTGCGTCTCGAGTCTT (30 bases)

Now I need to cluster read1, read2 and read3 together, and similarly cluster read4, read5 and read6 together.

Expected Output:

Sequence                           Reads that fall within this sequence    Frequency
AGTCGATCGATCGAGTCTGCGTCGATCGGG     Read 1, Read 2, Read 3                  3
CGAGTCTGCGTCTCGAGTCTTCGAGTCTGA     Read 4, Read 5, Read 6                  3
ATCGATCGAGTCTGCGTGCGTCTCGAGTCTT    Read a, Read b, Read c, Read d          4

alignment Assembly clustering fastq reads • 3.8k views

ADD COMMENT • link updated 7.8 years ago by Alex Reynolds 36k • written 7.8 years ago by bioinforesearchquestions ▴ 370

score 1 · Answer 1 · 2017-03-21

1

7.8 years ago

GenoMax 148k

Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.

I don't think there is any ready-to-use tool that will give you that exact expected output.
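For the small example in the question, the expected output can be approximated with a few lines of Python. This is only a minimal sketch under a strong assumption: a read joins a cluster when it is an exact substring of a longer read, so mismatches and partial overlaps are not handled. The function name `cluster_by_containment` and the read IDs are made up for illustration, and the pairwise scan would be slow on millions of reads.

```python
def cluster_by_containment(reads):
    """Group reads under the longest read that contains them as an exact substring.

    `reads` maps read ID -> sequence.
    Returns {representative sequence: [read IDs in that cluster]}.
    """
    # Visit the longest reads first so each cluster is seeded by its longest member.
    ordered = sorted(reads.items(), key=lambda item: len(item[1]), reverse=True)
    clusters = {}
    for read_id, seq in ordered:
        for representative in clusters:
            if seq in representative:  # exact containment only (assumption)
                clusters[representative].append(read_id)
                break
        else:
            # No existing representative contains this read: start a new cluster.
            clusters[seq] = [read_id]
    return clusters


# Reads copied from the question, with alignment dashes removed.
example_reads = {
    "read1": "AGTCGATCGATCGAGTCTGCGTCGATCGGG",
    "read2": "AGTCGATCGATCGAGTCTGCGTCGATCG",
    "read3": "CGATCGATCGAGTCTGCGTCGAT",
    "read4": "CGAGTCTGCGTCTCGAGTCTTCGAGTCTGA",
    "read5": "CGAGTCTGCGTCTCGAGTCTTCGAGTC",
    "read6": "GTCTGCGTCTCGAGTCTTCGA",
}

clusters = cluster_by_containment(example_reads)
for representative, members in clusters.items():
    print(representative, ", ".join(members), len(members), sep="\t")
```

Each printed row holds the representative sequence, the reads grouped under it, and the frequency, which mirrors the three columns of the expected output.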

ADD COMMENT • link 7.8 years ago by GenoMax 148k

0

Thanks a lot, Genomax2. I will read the post and play around with the tool. The reason I would like to cluster the reads is that I am working on miRNAs, which have conserved regions.

ADD REPLY • link 7.8 years ago by bioinforesearchquestions ▴ 370

0

The clustering part would work without any problem with clumpify.sh. I had asked Brian Bushnell to add a feature for the counts, but I don't think that has been implemented yet.

ADD REPLY • link 7.8 years ago by GenoMax 148k

0

Thanks, Genomax2. I will write a Perl script to calculate the frequency.
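A frequency count of this kind is short in any scripting language. The sketch below uses Python rather than Perl and assumes the common four-line FASTQ record layout (header, sequence, `+`, qualities) with no wrapped sequence lines. The function name `count_sequences` and the tiny inline records are hypothetical.

```python
from collections import Counter
from itertools import islice


def count_sequences(fastq_lines):
    """Count how often each distinct sequence occurs in FASTQ text.

    `fastq_lines` is any iterable of lines, such as an open file handle.
    """
    # The sequence is the 2nd line of each 4-line record: indices 1, 5, 9, ...
    return Counter(line.strip() for line in islice(fastq_lines, 1, None, 4))


# Made-up records for illustration; a real run would pass an open file instead.
example = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGT", "+", "IIII",
    "@r3", "TTGA", "+", "IIII",
]
counts = count_sequences(example)
for sequence, frequency in counts.most_common():
    print(sequence, frequency, sep="\t")
```

With a real file, the call would look like `with open(path) as handle: counts = count_sequences(handle)`, where `path` points to an uncompressed FASTQ file.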

First, I trimmed the 3' adapter from our reads using Trimmomatic. I then aligned the trimmed reads against all_mature_miRNA_sequence.fa from the miRBase database using Bowtie2. My sample has 2,148,364 reads, but the alignment rate was only 2%.

Bowtie2 Stats: Aligned 0 times: 2,104,932 - Aligned 1 time: 2,770 - Aligned > 1 times: 40,662
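As a quick check of these figures, the three categories should add up to the total read count, and the aligned fraction should come out near 2%. The variable names below are chosen for illustration only.

```python
# Counts copied from the Bowtie2 summary above.
aligned_0_times = 2_104_932
aligned_1_time = 2_770
aligned_multiple = 40_662

total_reads = aligned_0_times + aligned_1_time + aligned_multiple
alignment_rate = (aligned_1_time + aligned_multiple) / total_reads

print(f"{total_reads:,} reads, {alignment_rate:.2%} aligned")
# 2,148,364 reads, 2.02% aligned
```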

If possible, can you look into the following posts too?

Question: mirBase mature microRNA sequences have base U instead of base T. Should I change or not?

Question: Do I need to download any specific adapters for Illumina small RNA sequencing kit?

ADD REPLY • link 7.8 years ago by bioinforesearchquestions ▴ 370

0

You may want to cluster first with clumpify and then trim with bbduk. Both are part of BBMap, and its resources directory contains an adapters.fa file with all commonly used adapter sequences. @Brian has suggestions for aligning miRNA data. Let me see if I can find that thread, or you can search for it yourself.

You may also want to use Bowtie v1, since these are small RNAs and you don't expect gapped alignments.

ADD REPLY • link 7.8 years ago by GenoMax 148k

0

Oh, I used Bowtie2. I will try Bowtie1 now. I will also search for Brian Bushnell's post on miRNA data.

ADD REPLY • link 7.8 years ago by bioinforesearchquestions ▴ 370

score 1 · Answer 2 · 2017-03-21

1

7.8 years ago

Alex Reynolds 36k

You can use R to do hierarchical clustering of sequence strings, using edit (Levenshtein) distance as your distance metric.

SO: Text clustering with Levenshtein distances

With some R chops, you should be able to annotate the tree leaf labels (sequences) with the read IDs they come from, assuming they have a one-to-one relationship (i.e., sequences are unique in your FASTQ file).

Further, here's a Biostars post that shows an example of using R cutree() to generate subclusters and write them out individually to text format, which would get you close to your desired output.
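The same idea can be sketched outside R. The Python below is a rough illustration, not a port of the linked posts: it computes Levenshtein distance with the standard dynamic-programming recurrence and then groups sequences by single linkage at a fixed distance cutoff, which corresponds to cutting a single-linkage tree at that height. R's `hclust()` defaults to complete linkage, so results may differ. The function names and the toy sequences are hypothetical, and the all-pairs comparison is only practical for a modest number of unique sequences. Note also that edit distance penalizes length differences, so truncated reads like those in the question score as further apart than their overlap suggests.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def single_linkage_clusters(seqs, max_distance):
    """Group sequences so that any pair within max_distance shares a cluster."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if levenshtein(seqs[i], seqs[j]) <= max_distance:
                parent[find(j)] = find(i)  # merge the two clusters

    groups = {}
    for index, seq in enumerate(seqs):
        groups.setdefault(find(index), []).append(seq)
    return list(groups.values())


# Toy sequences for illustration only.
toy = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]
toy_clusters = single_linkage_clusters(toy, max_distance=1)
for members in toy_clusters:
    print(len(members), members)
```

The cutoff plays the role of the height passed to `cutree()`: a larger `max_distance` merges more sequences into each cluster.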