Question

Masking before RNA-Seq Alignment and Gene Prediction in Plant Genomes

21 months ago

Ayish • 0

Hello Experts,

I'm interested in conducting gene prediction for a plant genome using RNA sequencing data. To achieve this, I intend to align the RNA-seq reads to the genome and use BRAKER2 for gene prediction. Before running BRAKER2, I'm considering soft-masking the genome. Should I mask the genome before creating the BAM alignment file, or can I align the RNA-seq reads to the unmasked genome? I thought it should not matter for RNA-seq data, because the reads come from exons, which I assume contain no long-range repeats.

I would appreciate your suggestions and guidance.

Thank you in advance.

RNA-evidence genome prediction masking • 1.3k views

ADD COMMENT • link updated 21 months ago by Darked89 4.7k • written 21 months ago by Ayish • 0

Soft-masking the genome does not (unless something has changed) make any difference for STAR. Depending on how your RNA-Seq was done (Illumina? paired-end?), you may improve the mapping by checking for overlaps between the forward and reverse reads of each pair. See "Merging and mapping of overlapping paired-end reads" in the STAR manual.
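
As a rough sketch of what that manual section leads to, the snippet below assembles a STAR command line with mate-overlap merging switched on via `--peOverlapNbasesMin`. The index directory, FASTQ file names, thread count, and the minimum overlap of 10 bases are placeholders chosen for illustration, not recommendations; check the STAR manual for values suited to your library.

```python
import shlex


def star_command(genome_dir, read1, read2, min_overlap=10, threads=8):
    """Build a STAR argument list that enables merging of overlapping mates.

    All argument values are illustrative assumptions; only the flag names
    come from STAR itself.
    """
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,              # index built from the (soft-masked or unmasked) genome
        "--readFilesIn", read1, read2,          # paired-end FASTQ files
        "--readFilesCommand", "zcat",           # reads are assumed to be gzipped
        "--peOverlapNbasesMin", str(min_overlap),  # values > 0 turn on the mate-overlap merging step
        "--outSAMtype", "BAM", "SortedByCoordinate",
    ]


# Hypothetical paths, for illustration only.
cmd = star_command("star_index", "reads_R1.fastq.gz", "reads_R2.fastq.gz")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)`.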

ADD REPLY • link 21 months ago by Darked89 4.7k

Thank you for the reply. I have Illumina paired-end reads. Would it be fine to use a hard-masked genome for STAR and a soft-masked genome for BRAKER2? I will take a look at the section you mentioned.

ADD REPLY • link 21 months ago by Ayish • 0

I am not sure what you gain by using a hard-masked genome for RNA-Seq mapping. There will be reads covering some introns anyway. But if you hard-mask repeats, you risk some exons being truncated or having an NNN gap introduced because of, say, a 20 bp CA repeat.
Also, at least in some species, the genome may contain a large number of pseudogenes/gene fragments derived from supposedly real genes. I have no idea how you got your repeat set for that species, but there is a chance that such tricky genes will be classified as repeats, and without any RNA-Seq reads mapped to a subset of them you will miss them.
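
To make the 20 bp CA repeat case concrete, here is a small Python sketch on a made-up exon sequence. It assumes the usual conventions (soft-masked bases are lowercase, hard-masked bases are replaced by `N`); the sequence and the helper names `hard_mask` and `unmask` are invented for illustration.

```python
def hard_mask(seq):
    """Replace every soft-masked (lowercase) base with N."""
    return "".join("N" if base.islower() else base for base in seq)


def unmask(seq):
    """Recover the original bases from a soft-masked sequence."""
    return seq.upper()


# Hypothetical exon with a 20 bp CA repeat that a repeat masker soft-masked.
soft_masked_exon = "ATGGCCAAGT" + "ca" * 10 + "GGTACCTGAA"

hard_masked_exon = hard_mask(soft_masked_exon)

print(soft_masked_exon)          # repeat still present, only in lowercase
print(hard_masked_exon)          # repeat replaced by a run of N
print(unmask(soft_masked_exon))  # full sequence is still recoverable
```

With soft masking the bases are still there, so reads spanning the repeat still have real sequence to match. With hard masking the same reads hit a run of `N`, which is how an exon can end up truncated or split.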

ADD REPLY • link 21 months ago by Darked89 4.7k