I am analyzing miRNA sequencing data generated with a typical Illumina small RNA protocol. The problem is that, although I have trimmed the adapters from the reads, many reads are still longer than 26 bp.
What are these long reads? Should I also trim the reads for hairpin sequence? What has actually been sequenced? Otherwise, what steps are needed before aligning the reads to the reference sequence?
Which species? I have found a lot of tRNA, which may or may not be functional. If you want to know what the longer reads are, just map them to ncrnadb and see what they hit. If not, just follow Devon's advice. Just make sure the same thing happens in all of your samples.
There's more than just miRNAs in a typical small RNA-seq experiment. You'll also have piRNAs, snRNAs, snoRNAs, and so on. If all you care about are the miRNAs, then just focus on them and ignore the rest.
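A read-length histogram of the trimmed FASTQ is a quick way to see how much of the library falls outside the miRNA size range. In animals, mature miRNAs cluster around 21-23 nt while piRNAs are typically longer, so a second peak above 26 nt would not be surprising. This is only a minimal sketch, assuming a standard four-line-per-record FASTQ; the helper name `read_length_hist` and the file name in the usage comment are illustrative, not part of any tool.

```shell
# Print "length<TAB>count" for every read length found in a FASTQ file.
# Assumes exactly four lines per record (no wrapped sequence lines).
# read_length_hist is a hypothetical helper name.
read_length_hist() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (l in n) print l "\t" n[l] }' "$1" | sort -n
}

# Usage (file name is an assumption):
# read_length_hist your_trimmed_file.fastq
```

If most of the long reads pile up at one or two lengths, that is a hint they come from a specific RNA class rather than from failed adapter trimming.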
######## First step: trim your reads carefully. I do it with Cutadapt.
######## Replace MIN_LENGTH, MAX_LENGTH and ADAPTER_SEQUENCE with your own values.
/usr/local/bin/cutadapt --discard-untrimmed --minimum-length=MIN_LENGTH --maximum-length=MAX_LENGTH -a ADAPTER_SEQUENCE In_seq.fastq > your_trimmed_file.fastq
###### Then also remove the reads that map to ribosomal and tRNA sequences.
###### --un writes the reads that did NOT align to output_file.fastq; the alignments themselves are discarded.
bowtie --seedlen=23 --un output_file.fastq /path_to/bowtieindex/r_tRNA your_trimmed_file.fastq > /dev/null
Your output_file.fastq should now be much cleaner to align.
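To check that the filtering behaves the same way in every sample, it can help to count the reads before and after the rRNA/tRNA step. A rough sketch, again assuming four-line FASTQ records; `count_reads` and `report_kept` are hypothetical helper names, and the file names are the ones used in the commands above.

```shell
# Count the records in a four-line-per-record FASTQ file.
# count_reads is a hypothetical helper name.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Print how many reads survived a filtering step.
# $1 = FASTQ before filtering, $2 = FASTQ after filtering.
report_kept() {
    before=$(count_reads "$1")
    after=$(count_reads "$2")
    echo "$after of $before reads kept"
}

# Usage:
# report_kept your_trimmed_file.fastq output_file.fastq
```

A sample where the kept fraction differs sharply from the others is worth a closer look before alignment.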