Hi, I need some help automating and speeding up my data analysis.
Currently, I call variants from BAM files using samtools mpileup and obtain a CSV (converted from VCF using bcftools) with a structure like this:
CHROM:POS,ORIGINALBP/VARIANTBP
So it looks like this
5:112043282,C/G
What I want to do is add two more fields: one for CONTEXT (the BP immediately before and after the original chromosome position, together with the original BP) and one for SNP (whether or not the original position is a Common SNP).
So it would look like this
CHROM:POS,ORIGINALBP/VARIANTBP,CONTEXT,SNP
5:112043282,C/G,TCG,YES
My problem is that for now, I have to look up each position manually using the UCSC Genome Browser (https://genome.ucsc.edu), zoom out 3x, and then manually copy the leading and trailing BP and check whether the position registers as a Common SNP. This gets me the Context and tells me whether the original chromosome position was a SNP or not. I want this part automated, but I don't know what the best way to go about this would be. I have my reference sequence (GRCh37-lite.fa) but I'm not sure how to go about extracting individual BP from it.
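To make the goal concrete, here is a minimal Python sketch of the two lookups, run on toy data rather than the real files. The function names (read_fasta, context, annotate) and the common_snps set are hypothetical and only for illustration. It assumes the sequence names in the FASTA match the CHROM values in the CSV, and that a set of (chrom, pos) pairs for Common SNPs has already been built from some source such as a dbSNP download. Reading all of GRCh37-lite.fa into memory this way would be heavy, so for the real reference an indexed lookup (for example samtools faidx with a region such as 5:112043281-112043283) is probably the better choice.

```python
import io

def read_fasta(handle):
    """Parse FASTA text into a dict of {sequence_name: sequence}."""
    seqs, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name = line[1:].split()[0]  # keep only the ID before any whitespace
            chunks = []
        elif line:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs

def context(seqs, chrom, pos, flank=1):
    """Return the bases from pos-flank to pos+flank (pos is 1-based, as in VCF)."""
    start = pos - 1 - flank  # convert to a 0-based slice start
    if start < 0:
        raise ValueError("position is too close to the start of the sequence")
    return seqs[chrom][start:pos + flank].upper()

def annotate(row, seqs, common_snps):
    """Turn 'CHROM:POS,REF/ALT' into 'CHROM:POS,REF/ALT,CONTEXT,SNP'."""
    locus, alleles = row.strip().split(",")
    chrom, pos = locus.split(":")
    pos = int(pos)
    # common_snps is an assumed, pre-built set of (chrom, pos) tuples
    snp = "YES" if (chrom, pos) in common_snps else "NO"
    return f"{locus},{alleles},{context(seqs, chrom, pos)},{snp}"

# Toy reference: sequence "5" is AATCGGT, so position 4 is C
toy_reference = io.StringIO(">5 toy example\nAATCG\nGT\n")
seqs = read_fasta(toy_reference)
print(annotate("5:4,C/G", seqs, {("5", 4)}))  # prints 5:4,C/G,TCG,YES
```

With the real data, the same annotate step would be applied to every line of the CSV, with the context lookup swapped for an indexed reader of the reference.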
Wow, this is an awesome response and a bit over my head in terms of new tools.
Do each step on its own and then run head to look at the first few lines of the resulting file. Then you can see what these tools are doing. Once you're a bit more familiar with the tools, you could glue these steps together into a script to automate (other than the dbSNP download step, which you could just do once).