So I was given aligned reads of liver cancer samples (without matching normals). After viewing the header I noticed that it had chr1-22, chrX, chrY, chrM (note LN: 16571, not LN: 16569). No random or unlocalized contigs.
My worry is that these contigs were aligned elsewhere. Another issue is, I don't know if I can sort these with a generic ucsc.hg19.fasta (that includes random and unlocalized contigs) so that I can call (all) variants using samtools, bcftools.
Should I build my own reference, removing all the random/unlocalized contigs? Also I am only interested in calling variants in very specific regions (tp53, piwil1-4, setdb1). Is there a way of reducing the time to sort and is there a way I can use a 'shortened' reference assembly specific to my regions of interest? Thank you for your time and help.
EDIT: I should note that although I am working with cancer samples, I am not trying to call somatic variants. I just need a pileup of all (high quality) variants.
If your going to call variants - arguably the computationally hardest thing to do in sequencing genomics - you really don't want to leave anything to chance. Ideally, in my opinion, you would convert the BAM to uBAM and follow the GATK pipeline for re-mapping the reads to the ideal reference genome.
I mean, there is nothing stopping you from doing it the way you suggested - it's just... meh. Calling reliable SNPs is hard enough when you do everything perfectly.
Thank you for your reply. Ideally I would like to do this, but I have limited time. Most likely I can revisit these samples to follow GATK workflow. For sorting and mpileup, can I use a reference that's reduced to specific regions of interest or do I need the entire assembly?
Honestly, I don't know because I have never done that :( If anything you want to use the full reference but only have the region of your BAM that covers the areas you are interested in. Maybe just try it and see if you're short on time?
Actually i'm a little confused what you mean by sorting too. To sort a BAM file you dont need anything other than the BAM - and chances are it's already sorted. Perhaps its something specific to bcftools that I know nothing about. Regardless, your best option is to try it and see. You will either get an error instantly, or it will just skip over the reference bits it doesn't need.
The more important question is if the genome you are using to call SNPs is using the same co-ordinate system for all chrXYZ regions as what is present in the BAM file. Same major version with all the same patches.
Honestly, I would invest 1 week mapping this properly and having all the variables under your control than jumping into it now and finding out in a month that everything you have done was built on sand.