I have an MAF file with about a couple million SNVs in it from various samples, and I'd like to verify that the file has the right reference bases. Since I'm not really looking for a 'sequence' it seems like querying Ensembl for each one wouldn't be vey efficient. Is there any straightforward way to just submit a table of chromosomes and positions and get the hg19 reference bases at those loci?
A somewhat related question (as it may be why I'm getting 'wrong reference base' errors for a tool I'm trying to use, motivating this post): when an MAF file has a 'start' and 'end' position for an SNV that differ by 1, which one is the actual position of the SNV? Thanks.
duplicate: R: How to loop to extract all sequences within specific positions? ; Extract User Defined Region From An Fasta File ; ...
Have no script ready now but the concept is actually quite simply:
Get the variants in BED format and then use
bedtools getfasta
to get the nucleotide for each position. I would simply make a two-column table with $1 being the nucleotide from the MAF and $2 the one fromgetfasta
. A simple one-liner, e.g.awk 'OFS="\t" {if ($1 != $2) print NR}' file.txt
would tell you then if there were any mismatches and in which row of the file so that you know where in the MAF file you have to clean up.