While searching for which reference to use for alignment, I found that most posts recommend a soft-masked reference over a hard-masked one, on the assumption that aligners know how to handle repetitive regions masked in lowercase. However, in this discussion post Alex specifically mentions that STAR treats uppercase and lowercase letters the same way. So the question is: given that the aligner doesn't treat repetitive regions differently, is it still better to use a soft-masked reference than a hard-masked one?
I think it depends on what you want to have in the output.
A soft-masked reference still lets you see which bases are there and still lets reads map to those regions, while also letting you distinguish what was masked from what was not.
A hard-masked version of the genome does not offer this, but it does prevent alignments to masked regions, which may not be of interest anyway.
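To make the difference concrete, here is a minimal sketch that turns a soft-masked FASTA into a hard-masked one by replacing lowercase bases with N. The function name hard_mask_fasta is made up for illustration, and it assumes masked bases are written as lowercase a/c/g/t/n; header lines are left untouched.

```shell
# Hypothetical helper: read a soft-masked FASTA on stdin and write a
# hard-masked FASTA on stdout. Assumes masking is encoded as lowercase
# a/c/g/t/n; other lowercase IUPAC codes are not handled here.
hard_mask_fasta() {
  awk '
    /^>/ { print; next }               # keep header lines unchanged
    { gsub(/[acgtn]/, "N"); print }    # replace every masked base with N
  '
}

# Example (file names are placeholders):
#   hard_mask_fasta < genome.softmasked.fa > genome.hardmasked.fa
```

Going the other way is not possible: once the bases are N, the original sequence is gone, which is why the soft-masked file is the more flexible starting point.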
An annotation GFF file with the coordinates of the masked positions might be a solution: you let reads map everywhere on a soft-masked reference, to avoid biasing the scores towards unmasked regions, and then you filter out those that fall in regions that were soft-masked.
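If you do not already have such an annotation, the masked coordinates can be recovered from the soft-masked FASTA itself. The sketch below writes them as BED (0-based, half-open) rather than GFF, simply because BED is easier to produce; the function name soft_masked_to_bed is hypothetical, and it assumes masked bases are lowercase a/c/g/t/n. It walks the sequence one character at a time, so expect it to be slow on a large genome.

```shell
# Hypothetical helper: read a soft-masked FASTA on stdin and print one BED
# line (name, 0-based start, exclusive end) per run of lowercase bases.
soft_masked_to_bed() {
  awk '
    # Print the currently open run, if any, ending just before position pos.
    function flush_run() {
      if (in_run) { print name "\t" run_start "\t" pos; in_run = 0 }
    }
    /^>/ {
      flush_run()               # close a run that reaches the sequence end
      name = substr($1, 2)      # sequence ID = header up to first whitespace
      pos = 0                   # 0-based position within this sequence
      next
    }
    {
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        if (c ~ /[acgtn]/) {
          if (!in_run) { run_start = pos; in_run = 1 }
        } else {
          flush_run()
        }
        pos++
      }
    }
    END { flush_run() }
  '
}

# Example (file names are placeholders):
#   soft_masked_to_bed < genome.softmasked.fa > masked.bed
```

With the intervals in hand, something like bedtools intersect -v -a aligned.bam -b masked.bed should keep only the alignments that do not overlap a masked interval, though whether you want to drop any overlap or only reads mostly inside a masked region is a choice you have to make for your data.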
You do not want to use a hard masked reference. What you do not want is reads forced to align to the wrong place because you took away the part of the genome they were supposed to map to. Find a way to filter the reads after mapping them correctly.
I was just browsing some threads to see what others did, and thought I'd post a method that might be of use. If you have a soft-masked version of the genome, you can take the final GTF file you aligned against and extract its transcript sequences with gffread (a Cufflinks tool). From there, you can convert the resulting FASTA to tabular format with the FASTX-Toolkit, then use some bash scripting to count the number and proportion of soft-masked bases per transcript and filter on that. Here's the code I used, with merged.tab being the tabular output from fasta_formatter (FASTX-Toolkit) and the sequences in the 3rd field.
#count number of soft-masked (lowercase a/c/g/t/n) bases in 3rd field per line from merged.tab
awk '{ s = $3; print gsub(/[acgtn]/, "", s) }' merged.tab > rep_chars.txt
#count total number of bases per line in merged.tab (despite its name, nonrep_chars.txt holds total lengths, not unmasked counts)
awk '{ print length($3); }' merged.tab > nonrep_chars.txt
#check that number of lines is the same with:
wc -l *rep_chars.txt
#then calculate the soft-masked portion of each sequence for filtering (columns: name, masked, total, fraction)
awk '{print $1}' merged.tab | paste - rep_chars.txt nonrep_chars.txt | awk '{print $1, $2, $3, ($3 > 0 ? $2/$3 : 0)}' > rep_nonrep_chars.txt
#filter in terminal or in program of your choice, in terminal modify command as needed.
awk '$4 > 0.25' rep_nonrep_chars.txt > soft_mask_filter.txt # or optional ' | wc -l' instead of output to get counts for exploration of your data
#the result will be those you want to remove; change '>' to '<=' to get the sequence names you want to keep
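The steps above can also be folded into a single awk pass, which avoids the intermediate files and the risk of them getting out of step when pasted together. This is only a sketch: the function name masked_fraction_table is made up, it counts lowercase a/c/g/t/n as soft-masked, and it takes the sequence column as an optional second argument (defaulting to 3, as in merged.tab above).

```shell
# Hypothetical helper: print name, masked count, total length and masked
# fraction (tab-separated) for every line of a tabular sequence file.
#   $1: tabular file, e.g. merged.tab
#   $2: column holding the sequence (optional, default 3)
masked_fraction_table() {
  awk -v col="${2:-3}" '
    {
      seq = $col
      total = length(seq)
      masked = gsub(/[acgtn]/, "", seq)           # gsub returns the number of matches
      frac = (total > 0) ? masked / total : 0     # avoid dividing by zero
      printf "%s\t%d\t%d\t%.4f\n", $1, masked, total, frac
    }
  ' "$1"
}

# Example: names of sequences that are more than 25% soft-masked
#   masked_fraction_table merged.tab | awk '$4 > 0.25 {print $1}'
```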
That explains it. Thank you.