Hi all,
I am currently preparing the reference transcriptome used by RSEM in RNA-seq experiments. For this, I run rsem-prepare-reference with the GTF and FASTA files downloaded from Ensembl (latest release, v.80).
However, I have a question about the masking level of the genome, which is available unmasked as well as soft-masked or hard-masked for repetitive sequences. Does the masking level matter when I build the transcript reference? For example, if I use a hard-masked genome instead of the unmasked one, will that have a large impact on my final transcript set, given that I will be using the same GTF coordinates in both scenarios?
I ask because I saw that the human transcriptome may contain some repetitive sequences, and I don't know whether these sequences are completely lost in the hard-masked genome.
Does anyone have any insight on this?
Thanks!
True! I just checked my transcripts.fa file and there are some small sequences (~10-20) full of Ns...
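For anyone who wants to run the same check, here is a minimal sketch in Python that lists FASTA records consisting mostly of Ns. The function names and the 0.9 threshold are my own choices for illustration, not anything RSEM provides, and the in-memory records are made-up toy data.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def n_fraction(seq):
    """Fraction of bases that are N (hard-masked); case-insensitive."""
    return seq.upper().count("N") / len(seq) if seq else 0.0

def find_masked(handle, threshold=0.9):
    """Return (name, length, N fraction) for records at or above threshold.

    The threshold of 0.9 is an arbitrary assumption, not an RSEM setting.
    """
    hits = []
    for name, seq in read_fasta(handle):
        frac = n_fraction(seq)
        if frac >= threshold:
            hits.append((name, len(seq), frac))
    return hits

# Toy example with hypothetical records, standing in for a real file.
demo = io.StringIO(">tx1\nACGTACGT\n>tx2\nNNNNNNNNNN\nNNNNN\n>tx3\nacgtnnnn\n")
for name, length, frac in find_masked(demo):
    print(f"{name}\t{length}\t{frac:.2f}")
```

To use it on a real reference, open the transcripts.fa file with `open(...)` and pass the handle to `find_masked` instead of the toy `demo` object.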
Thank you very very much!