Hi folks,
I am planning to cluster similar reads from a fastq file.
For example:
read1 (length-30): AGTCGATCGATCGAGTCTGCGTCGATCGGG(30 bases)
read2 (length-28): AGTCGATCGATCGAGTCTGCGTCGATCG (28 bases are matching)
read3 (length-25): - - - CGATCGATCGAGTCTGCGTCGAT - - (25 bases are matching)
read4 (length-30): CGAGTCTGCGTCTCGAGTCTTCGAGTCTGA (30 bases)
read5 (length-27): CGAGTCTGCGTCTCGAGTCTTCGAGTC (27 bases are matching)
read6 (length-23): - - -GTCTGCGTCTCGAGTCTTCGA - - - - (23 bases are matching)
readN: ATCGATCGAGTCTGCGTGCGTCTCGAGTCTT (30 bases)
Now I need to cluster read1, read2, and read3 together, and similarly cluster read4, read5, and read6 together.
Expected Output:
Sequence                           Reads that fall within this sequence   Frequency
AGTCGATCGATCGAGTCTGCGTCGATCGGG     Read 1, Read 2, Read 3                 3
CGAGTCTGCGTCTCGAGTCTTCGAGTCTGA     Read 4, Read 5, Read 6                 3
ATCGATCGAGTCTGCGTGCGTCTCGAGTCTT    Read a, Read b, Read c, Read d         4
Thanks a lot, Genomax2. I will read the post and play around with the tool. The reason I would like to cluster the reads is that I am working on miRNAs, which have conserved regions in them.
The clustering part would work without any problem with clumpify.sh. I had asked Brian Bushnell to put in a feature for the counts, but I don't think that has been implemented yet.
Thanks, Genomax2. I will write a Perl script to calculate the frequency.
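For the frequency count, here is a minimal sketch in Python rather than Perl. It is not what clumpify.sh does internally. It assumes that "similar" means a shorter read is an exact substring of a longer one, as in the example reads from the question, so it does not tolerate mismatches. The function name `cluster_by_containment` and the read names are made up for illustration.

```python
def cluster_by_containment(reads):
    """Group reads whose sequence is an exact substring of a longer read.

    reads: dict mapping read name -> sequence.
    Returns a dict mapping representative (longest) sequence -> list of read names.
    """
    clusters = {}
    # Process longest reads first so each cluster is seeded by its longest member.
    # sorted() is stable, so reads of equal length keep their input order.
    for name, seq in sorted(reads.items(), key=lambda kv: -len(kv[1])):
        for rep in clusters:
            if seq in rep:  # exact substring match only (assumption)
                clusters[rep].append(name)
                break
        else:
            # No existing representative contains this read: start a new cluster.
            clusters[seq] = [name]
    return clusters


# Example reads copied from the question (alignment dashes removed).
example_reads = {
    "read1": "AGTCGATCGATCGAGTCTGCGTCGATCGGG",
    "read2": "AGTCGATCGATCGAGTCTGCGTCGATCG",
    "read3": "CGATCGATCGAGTCTGCGTCGAT",
    "read4": "CGAGTCTGCGTCTCGAGTCTTCGAGTCTGA",
    "read5": "CGAGTCTGCGTCTCGAGTCTTCGAGTC",
    "read6": "GTCTGCGTCTCGAGTCTTCGA",
}

clusters = cluster_by_containment(example_reads)
for rep, names in clusters.items():
    # One row per cluster: sequence, member reads, frequency.
    print(rep, ", ".join(names), len(names), sep="\t")
```

Each read is compared against every cluster representative, so this is only practical for small inputs or for reads that have already been collapsed. For a full FASTQ file with millions of reads, a dedicated tool is the safer choice.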
Firstly, I trimmed off the 3' adapter from our reads using Trimmomatic. Then I aligned the trimmed reads against all_mature_miRNA_sequence.fa from the miRBase database using Bowtie2. I have 2,148,364 reads in my sample, but the overall alignment rate was just 2%.
Bowtie2 Stats: Aligned 0 times: 2,104,932 - Aligned 1 time: 2,770 - Aligned > 1 times: 40,662
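As a quick sanity check, the 2% figure can be recomputed from the three counts above. This is plain arithmetic on the reported numbers; the variable names are only for illustration.

```python
# Counts copied from the Bowtie2 summary above.
total_reads = 2_148_364
aligned_0 = 2_104_932      # aligned 0 times
aligned_1 = 2_770          # aligned exactly 1 time
aligned_multi = 40_662     # aligned more than 1 time

# The three categories should account for every read.
assert aligned_0 + aligned_1 + aligned_multi == total_reads

aligned = aligned_1 + aligned_multi
rate = 100 * aligned / total_reads
print(f"{aligned} of {total_reads} reads aligned ({rate:.2f}%)")
```

Most of the aligned reads fall in the "more than 1 time" category, which is worth keeping in mind when counting reads per miRNA.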
If possible, can you look into the following posts too?
Question: mirBase mature microRNA sequences have base U instead of base T. Should I change or not?
Question: Do I need to download any specific adapters for Illumina small RNA sequencing kit?
You may want to cluster first with clumpify and then do trimming with bbduk (both from bbmap; there is an adapters.fa file in the resources directory that contains all commonly used adapter sequences, which you can use). @Brian has suggestions for aligning miRNA data. Let me see if I can find that thread, or you can search for it yourself.
You may also want to use bowtie v.1 since these are small RNAs and you don't expect gapped alignments.
Oh, I used Bowtie2. I will try Bowtie1 now. I will also search for Brian Bushnell's post on miRNA data.