3.9 years ago
kb_93
Hello!
My dataset contains a number of small RNAs. After aligning with bowtie to the non-coding region of the human genome (the most recent version of hg38), only 51% of reads aligned. I was wondering whether I should first grep out, for example, miRNAs and piRNAs and align them to miRBase and piRBase respectively, then align the remaining reads to the non-coding region of the genome. Or is there another way I could improve the alignment rate?
Any suggestions or help is greatly appreciated!
Katie
Why only the non-coding part of the genome? How are you defining the non-coding part of the genome? I'd guess that quite a lot of these reads will come from parts of the genome marked as coding in some sense - they will either be degradation products of coding sequences, or small RNAs that are embedded in longer sequences (many small RNAs come from introns, although you don't say whether you are including introns in your definition of non-coding).
You can align to a small RNA transcriptome if you like - miRBase and piRBase if those are what you are interested in, or RNAcentral if you are interested in a broader range of small RNA types. There is no need to grep them out of the FASTQ first.
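As a rough illustration of preparing such a small RNA reference, the sketch below filters a miRBase-style FASTA to human records and converts the RNA alphabet (U) to DNA (T) so the sequences match sequencing reads. It assumes record IDs carry a species prefix such as `hsa-`, as miRBase names do; the function names and the toy records in the usage note are hypothetical, and piRBase files may be formatted differently, so check the headers before reusing this.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def to_dna_reference(fasta_text, species_prefix="hsa-"):
    """Keep records whose ID starts with species_prefix; convert U to T.

    The species prefix convention is an assumption about the input file.
    """
    records = []
    for header, seq in parse_fasta(fasta_text):
        record_id = header.split()[0]
        if record_id.startswith(species_prefix):
            records.append((record_id, seq.upper().replace("U", "T")))
    return records


def write_fasta(records):
    """Serialise (name, sequence) pairs back to FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

For example, calling `write_fasta(to_dna_reference(text))` on a file's contents gives FASTA text that could then be indexed (for bowtie, with `bowtie-build`) and used as the alignment target in place of the genome.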
I had previously been advised to use the non-coding region of the genome, so I used the Ensembl ncRNA FASTA sequence. Would you recommend aligning to the whole genome instead?
I have 8 different small RNAs I'm looking at with this data so I wasn't sure what the best approach was for alignment and then annotation.
The Ensembl ncRNA FASTA represents the non-coding transcriptome, rather than the non-coding parts of the genome. There are many regions of the genome that are annotated as neither coding nor non-coding RNA.
There are different schools of thought about whether you should align small RNA-seq reads to the (small RNA) transcriptome or to the genome. I would expect piRNAs and miRNAs to be the best-represented single sequence types in a small RNA-seq experiment, but I wouldn't necessarily expect them to account for all, or perhaps even most, of the reads. If the 8 sequences you are interested in are present within the Ensembl ncRNA set, then you basically have two choices - continue to align to that and accept the low mapping rate, or map to the genome and extract the reads that map to your RNAs of interest.
The advantage of mapping to the genome is that you will map more reads (although many of those might map to locations with no, or no useful, annotation). The disadvantages are that you have to convert the locations into RNAs, there might be complications in counting (e.g. what to do about reads that overlap the small RNA gene in question but aren't entirely compatible with it), and many ncRNAs appear in multiple locations in the genome.
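The counting complication can be made concrete with a small sketch that converts genomic alignments into per-RNA counts, separating reads that lie entirely within an annotated small RNA from reads that only partly overlap it. This is one possible convention, not a standard: the coordinates are assumed to be half-open, strand is ignored, and the function names and tuple layouts are hypothetical.

```python
from collections import Counter


def classify_read(read_start, read_end, gene_start, gene_end):
    """Classify a read against one annotated interval (half-open coordinates).

    Returns 'contained' if the read lies fully inside the interval,
    'partial' if it overlaps but extends beyond it, and 'none' otherwise.
    """
    if read_end <= gene_start or read_start >= gene_end:
        return "none"
    if read_start >= gene_start and read_end <= gene_end:
        return "contained"
    return "partial"


def count_reads(alignments, annotations):
    """Tally contained and partial reads per annotated RNA.

    alignments:  iterable of (chrom, start, end)
    annotations: list of (chrom, start, end, name)
    """
    counts = {}
    for chrom, r_start, r_end in alignments:
        for a_chrom, g_start, g_end, name in annotations:
            if chrom != a_chrom:
                continue
            kind = classify_read(r_start, r_end, g_start, g_end)
            if kind != "none":
                counts.setdefault(name, Counter())[kind] += 1
    return counts
```

Keeping the two categories separate leaves the decision about partial overlaps (count, discard, or report separately) to the analysis step. The nested loop is fine for a handful of RNAs of interest; a genome-wide annotation would need an interval index or an existing counting tool.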
In the end either way is probably fine, as long as you are aware of possible complications of each approach.
Ok thank you, I didn't realise that. I will try aligning to the whole genome (using GRCh38.p13 from GENCODE) and see if that improves my alignment.
One of my problems was that neither Ensembl nor GENCODE list piRNA within their biotypes. Will using the whole genome for alignment mark these piRNA reads as aligned?
GENCODE deals solely with the transcriptome, not the genome. If you want to align to the whole genome, you'll want to get it from Ensembl, NCBI or UCSC. If you align to the whole genome, reads from piRNAs will definitely align, but you might not be able to find them afterwards if the piRNAs are not annotated, as you won't know which of the aligned reads come from piRNAs.
Also, I wonder whether piRNAs are particularly repetitive, which would make them hard both to align to the genome and to find on it.
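One way to check how much repetitiveness affects a given run is to tally how many reads have one alignment versus several. The sketch below does this from SAM text, assuming single-end reads and that the aligner was run so that it reports more than one alignment per read where they exist (with bowtie that depends on options such as `-k` or `-a`); the function name is hypothetical.

```python
from collections import Counter


def summarise_mapping(sam_text):
    """Count unmapped, uniquely mapped and multi-mapped reads in SAM text.

    Assumes single-end reads, one SAM record per reported alignment.
    """
    hits = Counter()   # read name -> number of reported alignments
    seen = set()       # every read name encountered
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        name, flag = fields[0], int(fields[1])
        seen.add(name)
        if not flag & 4:  # SAM flag bit 0x4 marks an unmapped read
            hits[name] += 1
    unique = sum(1 for n in hits.values() if n == 1)
    multi = sum(1 for n in hits.values() if n > 1)
    return {"unmapped": len(seen) - len(hits), "unique": unique, "multi": multi}
```

If the aligner reported only one alignment per read, every mapped read will look unique here, so a low multi-mapped count is only informative when multiple reporting was switched on.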
If you are working on piRNAs, I might be tempted to go with your idea of aligning to piRBase. You will only align a small fraction of your reads, but that's because only a small fraction of the RNA in a cell is from piRNAs.
The GENCODE website states that the FASTA file I used is a genome sequence. Is this incorrect?
Thank you for your help!
My bad, sorry. There is a link to the genome sequence on the GENCODE website. I assume it's just a copy of the Ensembl genome assembly.