Question

Identifying and classifying the most abundant sequences in small RNAseq data

2.6 years ago

Mauro ▴ 20

Hi, I have small RNA-seq results for a human sample. I mostly care about finding the most abundant sequences and classifying them (tRNA, rRNA, mRNA, lncRNA, etc.).

I'm running into an issue when mapping directly to the human genome with HISAT2: most reads multi-map to highly conserved regions, so the per-gene/region counts from htseq-count are wildly off compared with the top 100 overrepresented sequences I get from the FastQC report.

I need to keep the original sequence information at hand, so after mapping to a genome or an RNA database I need to look up what each sequence in my "top 100" matched. Is there a way to run these queries against my BAM file? Or is there a better way of doing this?

Thanks!

hisat2 srna bam rna-seq • 524 views

updated 2.6 years ago by Matthias Zepper 5.0k • written 2.6 years ago by Mauro ▴ 20

score 0 · Answer 1 · 2022-04-13

How about running the nf-core smrnaseq pipeline as a starting point?

An nf-core pipeline is of course an opinionated way of running an analysis, but these pipelines usually include most of the standard tools for a specific type of analysis. You can always dig deeper into the intermediate results with a custom approach later, but there is typically no point in reinventing the wheel from the start.
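If you do go the custom route later, one way to keep the original sequences at hand is to collapse identical reads before mapping. The following is only a minimal sketch, not part of the nf-core pipeline: it assumes adapter-trimmed reads in standard 4-line FASTQ records, and the function names `top_sequences` and `to_fasta` are made up for illustration. Each distinct sequence is written as a FASTA record whose header carries its rank and read count, so whatever you align that FASTA against (genome or RNA database), every alignment record keeps that label as its read name.

```python
from collections import Counter
from itertools import islice


def top_sequences(fastq_lines, n=100):
    """Count identical read sequences and return the n most common.

    fastq_lines: any iterable of text lines in 4-line FASTQ layout
    (an open file handle works). Returns a list of (sequence, count).
    """
    # The sequence is the 2nd line of each 4-line record: indices 1, 5, 9, ...
    seqs = (line.strip() for line in islice(fastq_lines, 1, None, 4))
    counts = Counter(s for s in seqs if s)
    return counts.most_common(n)


def to_fasta(top, prefix="seq"):
    """Format (sequence, count) pairs as FASTA text.

    The header encodes rank and count (hypothetical naming scheme),
    e.g. '>seq1_x2' for the most abundant sequence seen 2 times.
    """
    return "".join(
        f">{prefix}{rank}_x{count}\n{seq}\n"
        for rank, (seq, count) in enumerate(top, start=1)
    )
```

For a gzipped file you could pass a handle from `gzip.open(path, "rt")` to `top_sequences` and write the result of `to_fasta` to disk. Because the input is then at most 100 short sequences, checking where each one maps (including all multi-mapping locations) becomes a small lookup rather than a query over the whole BAM file.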