Up to now, I've been aligning my RNAseq PE samples to a reference whole genome build from Ensembl/gencode. My work involves calling somatic variants at specific regions on chr1, 8, 11, 12, 17, 22.
I have a BED file of my genes of interest, but is it preferred to build a custom reference genome based on the BED file or to concatenate chr1, 8, 11, 12, 17, 22? If using BED is preferred, would I use fastafrombed
function in bedtools?
I understand that my custom reference genome will not have unplaced contigs included but I'm curious how bad this will be if I'm only concerned with calling variants on known regions in the first place.
Curious to hear your thoughts and input.
Why are you doing this? What advantage are you going to get (besides saving some time on alignments)?
But to answer your question the way you have described would be one way of doing it.
It's mostly time. I use STAR aligner which is very fast compared to TopHat, but it's still taking about 20 minutes to map one paired-end sample. I have currently 200+ RNAseq files and any amount of time saving will help.
Would you say that it's safer to use a custom ref genome of entire chr 1, 8, 11, 12, 17, 22 then one based on highly specific BED file of my genes of interest? I'm particularly concerned that some reads are bound to misalign slightly which may not get pulled if I used a very specific custom ref genome (i.e. built from BED file).
Some reads are likely going to mis-align as soon as you stop including the entire genome. Before you dive in all the way you may want to compare the results for a couple of samples (alignment against the whole genome and the reduced representation as chromosomes/BED regions) to check on the differences.
^This. Aligning against partial references is a bad idea, unless you're damn sure that all your input sequence comes from a a restricted region.
If time constraints is the issue you can also align everything in parallel modes at the expense of memory utilization to speed up the tasks. In this thread I see someone mentioning STAR to align 60 million reads at 4 mins. You can also take a look at fast methods transcripts assembly like Salmon if you are not restricted only to alignment based methods. In any case for SNP calling from RNA-Seq which is at times a bit tricky it would be interesting to use the region of interest bed file as used in GATK for tagret interest intervals for exome data analysis. If this can be modeled for your RNA-Seq in a way to find the read read intensities across your target regions and then make SNP calls only to those regions. I do not know if anyone did it but worth taking a try if that is possible.