Question

Removing rRNA, tRNA, snoRNA, etc. from miRNA sequence data

7.1 years ago

tofukaj ▴ 20

Dear all great helpers,

I'm very new to the miRNA-seq analysis field. With my limited knowledge, I understand that I need to remove rRNA, tRNA, snoRNA, mitochondrial RNA, etc. from the adapter-trimmed miRNA-seq FASTQ file before aligning to the miRNA database. So I need a GTF file to perform an alignment that filters out such contaminating sequences (please correct me if I have misunderstood something here).

I'm planning to use the gene GTF file from Ensembl. As far as I can tell, it contains everything except tRNA data. My questions are as follows:

Is it valid to simply remove the miRNA records from my gene GTF file (using: grep -wv miRNA) and then append tRNA data converted from BED to GTF format (using: cat tRNA.gtf Gene.gtf > new.gtf)? In this case, I will have 'new.gtf', which contains all known sequences except known and unknown miRNAs.
Has anyone used STAR for filtering out contaminating sequences, and how should the parameters be set for such a job?

I beg your pardon in advance if I have made any mistakes here.

Best Regards,

Kaj

RNA-Seq alignment • 4.0k views

updated 7.1 years ago by glihm ▴ 660 • written 7.1 years ago by tofukaj ▴ 20

score 2 · Answer 1 · 2017-10-20

Hello tofukaj,

When you want to filter out reads from specific features (rRNA, tRNA, etc.), you have two basic approaches:

You can map your reads against a database containing only the features you don't want (so, a database of rRNA, tRNA, snoRNA, etc.). The unmapped reads are your reads of interest (the filtered reads). You can then map these reads against your miRNA database.
If you are working with alignment files (BAM/SAM, for instance), you can filter them by removing the alignments that match features you don't want, using a GTF/GFF file.
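To make the second approach concrete, here is a minimal pure-Python sketch of the overlap test it relies on. The function names, the toy GTF line, and the alignment tuples are all hypothetical, and alignments are assumed to be already reduced to (name, chromosome, start, end) in the same 1-based inclusive coordinates as the GTF. On real BAM files a dedicated interval tool would be the better choice; this only shows the logic.

```python
from collections import defaultdict


def parse_gtf_intervals(gtf_text):
    """Collect (start, end) intervals per chromosome from GTF text (1-based, inclusive)."""
    features = defaultdict(list)
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        features[cols[0]].append((int(cols[3]), int(cols[4])))
    return features


def drop_overlapping(alignments, features):
    """Keep alignments that overlap none of the unwanted features.

    `alignments` is a list of (read_name, chrom, start, end) tuples using the
    same 1-based inclusive coordinates as the GTF.
    """
    kept = []
    for name, chrom, start, end in alignments:
        hit = any(start <= f_end and f_start <= end
                  for f_start, f_end in features.get(chrom, []))
        if not hit:
            kept.append((name, chrom, start, end))
    return kept


# Hypothetical toy data: one tRNA feature on chr1 and three alignments.
contaminant_gtf = 'chr1\ttoy\tgene\t100\t172\t.\t+\t.\tgene_id "tRNA1"; gene_biotype "tRNA";\n'
alignments = [
    ("read1", "chr1", 110, 131),  # inside the tRNA
    ("read2", "chr1", 500, 521),  # elsewhere on chr1
    ("read3", "chr2", 110, 131),  # same coordinates, different chromosome
]
kept = drop_overlapping(alignments, parse_gtf_intervals(contaminant_gtf))
print([name for name, *_ in kept])
```

Only read1 is dropped here, because it is the only alignment that shares both chromosome and coordinates with the unwanted feature.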

Has anyone used STAR for filtering out contaminating sequences, and how should the parameters be set for such a job?

You must choose the aligner depending on your data. If you are dealing with compact genomes (prokaryotes, yeast), you don't need a "junction-aware" aligner (such as TopHat2), and Bowtie1/2 work well. Otherwise, STAR or TopHat2 are well suited for this job. Set the options so that you get ALL possible hits against your database of unwanted features (tRNA, rRNA, etc.). Allowing a read to match at several positions ensures that you clean your data as thoroughly as possible. Depending on your read length, allowing up to 2 mismatches is a good starting point.
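As a toy illustration of what "report all hits, allow up to 2 mismatches" means, the sketch below scans every position of every contaminant sequence and counts substitutions. This is not how a real aligner works internally, and it is far too slow for real data. The sequences and function names are made up for the example, and indels are ignored.

```python
def all_hits(read, reference, max_mismatches=2):
    """Return every (position, mismatches) where `read` matches `reference`
    with at most `max_mismatches` substitutions (no indels)."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


def split_reads(reads, contaminants, max_mismatches=2):
    """Partition reads into (contaminated, clean) by searching every contaminant."""
    contaminated, clean = {}, []
    for read in reads:
        found = [(name, pos, mm)
                 for name, seq in contaminants.items()
                 for pos, mm in all_hits(read, seq, max_mismatches)]
        if found:
            contaminated[read] = found
        else:
            clean.append(read)
    return contaminated, clean


# Hypothetical contaminant database and reads (not real tRNA or miRNA sequences).
contaminants = {"toy_tRNA": "GCGTTGGTGGTTCAGTGG"}
reads = [
    "GTGGTTCAGT",  # exact match inside toy_tRNA
    "GTGCTTCAAT",  # same locus with 2 substitutions
    "ACACACACAC",  # unrelated read, should survive the filter
]
contaminated, clean = split_reads(reads, contaminants)
print(sorted(contaminated))
print(clean)
```

The read carrying two substitutions is still flagged as a contaminant, which is the behaviour you want from the mismatch setting: the filter leans towards removing too much rather than too little.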

Is it valid to simply remove the miRNA records from my gene GTF file (using: grep -wv miRNA) and then append tRNA data converted from BED to GTF format (using: cat tRNA.gtf Gene.gtf > new.gtf)? In this case, I will have 'new.gtf', which contains all known sequences except known and unknown miRNAs.

Just to be clear, the alignment is done against a FASTA file, so you can do whatever you want with your GTF file depending on what you are planning to do. In your case, I recommend creating the GTF file you described (containing only "contaminants"), then using it together with the FASTA file of your genome to extract the sequences and build a "contaminant" database. From there, you can follow the approach above: align, collect the unmapped reads, and align those against your miRNA database.
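The GTF-to-FASTA step can be sketched in plain Python as follows. It assumes a tab-separated GTF with the strand in column 7 and a genome small enough to hold in memory. The function names and toy records are hypothetical. On a real genome, a dedicated tool such as bedtools getfasta is the usual choice, and you would probably restrict the GTF to one record type (for example only gene or only exon lines) to avoid duplicate sequences.

```python
import re

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def build_contaminant_gtf(gene_gtf, trna_gtf):
    """Drop lines mentioning miRNA as a whole word (like `grep -wv miRNA`),
    skip blank and comment lines, and put the tRNA records first
    (like `cat tRNA.gtf Gene.gtf`)."""
    kept = [line for line in gene_gtf.splitlines()
            if line and not line.startswith("#")
            and not re.search(r"\bmiRNA\b", line)]
    return trna_gtf.splitlines() + kept


def parse_fasta(fasta_text):
    """Read FASTA text into {sequence_name: sequence}."""
    chunks, name = {}, None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line.strip())
    return {key: "".join(parts) for key, parts in chunks.items()}


def extract_sequences(gtf_lines, genome):
    """Yield (header, sequence) per GTF record; GTF is 1-based and inclusive."""
    for line in gtf_lines:
        chrom, _source, _feature, start, end, _score, strand = line.split("\t")[:7]
        seq = genome[chrom][int(start) - 1:int(end)]
        if strand == "-":
            seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
        yield f"{chrom}:{start}-{end}({strand})", seq


# Hypothetical toy inputs standing in for Gene.gtf, tRNA.gtf and the genome FASTA.
gene_gtf = (
    '#!toy header\n'
    'chr1\ttoy\tgene\t1\t4\t.\t+\t.\tgene_id "g1"; gene_biotype "rRNA";\n'
    'chr1\ttoy\tgene\t5\t8\t.\t+\t.\tgene_id "g2"; gene_biotype "miRNA";\n'
)
trna_gtf = 'chr1\ttoy\tgene\t15\t18\t.\t-\t.\tgene_id "t1"; gene_biotype "tRNA";\n'
genome = parse_fasta(">chr1 toy chromosome\nAACCGGTTAACC\nGGTTACGT\n")

records = list(extract_sequences(build_contaminant_gtf(gene_gtf, trna_gtf), genome))
contaminant_fasta = "".join(f">{header}\n{seq}\n" for header, seq in records)
print(contaminant_fasta, end="")
```

The resulting FASTA is what you would index with your aligner as the "contaminant" database. The miRNA record is absent, and the minus-strand tRNA is written as its reverse complement.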