Hello Experts,
I'm interested in conducting gene prediction for a plant genome using RNA sequencing data. To achieve this, I intend to align the RNA-seq reads to the genome and use BRAKER2 for gene prediction. Before running BRAKER2, I'm considering soft-masking the genome. I would like to clarify whether I should perform the masking before creating the BAM alignment file, or whether I can align the RNA-seq reads to the unmasked genome. I thought it should not matter for RNA-seq data, because the reads come from exons, which I assume contain no long-range repeats.
I would appreciate your suggestions/guidance.
Thank you in advance.
Soft-masking the genome should not make any difference for STAR (unless something has changed). Depending on how your RNA-Seq was done (Illumina? paired-end?), you may improve the mapping by checking for overlaps between the forward and reverse reads in each pair. See "Merging and mapping of overlapping paired-end reads" in the STAR manual.
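To make the overlap suggestion concrete, here is a minimal sketch that assembles a STAR command line with mate-overlap merging switched on. The index directory, file names, thread count, and the values given to --peOverlapNbasesMin and --peOverlapMMp are placeholders and assumptions, not recommendations from this thread; please check them against the STAR manual section mentioned above. The helper build_star_command is hypothetical and only builds the argument list, it does not run STAR.

```python
def build_star_command(genome_dir, read1, read2, out_prefix,
                       threads=8, min_overlap=10, overlap_mismatch=0.01):
    """Return a STAR argument list for paired-end reads (hypothetical helper).

    min_overlap and overlap_mismatch are illustrative values only.
    Input FASTQ files are assumed to be gzip-compressed.
    """
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", read1, read2,
        "--readFilesCommand", "zcat",            # assumes .fastq.gz input
        "--outSAMtype", "BAM", "SortedByCoordinate",
        # Merge and realign mates that overlap by at least this many bases
        "--peOverlapNbasesMin", str(min_overlap),
        # Maximum proportion of mismatched bases allowed in the overlap
        "--peOverlapMMp", str(overlap_mismatch),
        "--outFileNamePrefix", out_prefix,
    ]


# Placeholder paths; replace with your own index and read files.
cmd = build_star_command("star_index", "reads_R1.fastq.gz",
                         "reads_R2.fastq.gz", "sample1_")
print(" ".join(cmd))
```

The printed line can be pasted into a shell or passed to subprocess.run(cmd, check=True) once the paths point to real files. The genome index here would be built from the unmasked or soft-masked FASTA, in line with the advice above.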
Thank you for the reply. I have Illumina paired-end reads. Would it be fine to use a hard-masked genome for STAR and a soft-masked genome for BRAKER2? I will take a look at the section you mentioned.
I am not sure what you would gain by using a hard-masked genome for RNA-Seq mapping. There will be reads covering some introns. But if you mask repeats, you risk truncating some exons or introducing an NNN gap into them because of, say, a 20 bp CA repeat.
Also, at least in some species, the genome may contain a large number of pseudogenes/gene fragments derived from supposedly real genes. I have no idea how you obtained your repeat set for that species, but there is a chance that such tricky genes will be classified as repeats, and without any RNA-Seq mappings to a subset of these you will miss them.
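The exon-gap risk from the 20 bp CA repeat example can be shown with a toy sequence. This is only a sketch: the exon below is made up, and it assumes the usual convention that soft-masked bases are lowercase and hard-masked bases are replaced by N. The functions hard_mask and unmask are hypothetical helpers written for this illustration.

```python
import re


def hard_mask(soft_masked_seq):
    """Replace every lowercase (soft-masked) base with N."""
    return re.sub(r"[a-z]", "N", soft_masked_seq)


def unmask(soft_masked_seq):
    """Uppercase the sequence; soft-masking loses no sequence information."""
    return soft_masked_seq.upper()


# Made-up exon with a soft-masked 20 bp CA repeat in the middle
exon = "ATGGCCTTGA" + "ca" * 10 + "GGTACCTGAA"

hard = hard_mask(exon)
restored = unmask(exon)

print(exon)      # soft-masked: repeat in lowercase, bases still present
print(hard)      # hard-masked: the repeat becomes a run of N inside the exon
print(restored)  # original bases recovered from the soft-masked version
```

With soft-masking the bases are still there for any tool that ignores case, whereas hard-masking leaves a 20 bp run of N that a read spanning this exon cannot match.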