I ran the GATK tool ASEReadCounter to measure allele-specific expression (ASE). 100 BAM input files were processed by ASEReadCounter. The output for each input file is a table (ASE file) with read counts for the reference allele and the alternative allele at each SNP. However, these files contain no annotations, which are required for the downstream statistical analysis (e.g. calculating ASE per gene).
My approach was to extract the SNP ids from all 100 ASE files (subset to a minimum depth of 10, which reduced the file sizes significantly), concatenate them, and deduplicate them; I ended up with 4 million distinct SNP ids. A sketch of this extraction step is below.
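For concreteness, here is a minimal Python sketch of that extraction step. The file paths are hypothetical, and the column names variantID and totalCount assume the default ASEReadCounter header; adjust them if your GATK version differs.

```python
import glob

MIN_DEPTH = 10
snp_ids = set()

# Pull the variant id out of every ASE table, keeping only sites whose
# total read depth passes the cutoff.
for path in glob.glob("ase_tables/*.table"):      # hypothetical location
    with open(path) as fh:
        header = fh.readline().rstrip("\n").split("\t")
        id_col = header.index("variantID")        # assumed column name
        depth_col = header.index("totalCount")    # assumed column name
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if int(fields[depth_col]) >= MIN_DEPTH:
                snp_ids.add(fields[id_col])

# One deduplicated id per line, ready for the annotation lookup.
with open("unique_snp_ids.txt", "w") as out:
    out.write("\n".join(sorted(snp_ids)) + "\n")
```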
Using a small list of SNPs, I worked out how to retrieve, for a given SNP id, the annotations contained in the reference VCF file. However, looking up 10 distinct SNP ids this way took 15 minutes (scaling that up to 4 million ids gives roughly 11 years!). The VCF file contains 88 million lines.
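That extrapolation suggests each lookup rescans the VCF from the top. For comparison, loading the 4 million ids into a set and streaming the VCF once touches each of the 88 million lines exactly once instead of once per id. A minimal Python sketch, assuming the SNP ids match the VCF's ID column and using hypothetical file names:

```python
import gzip

# Load all ~4 million distinct SNP ids into a set: membership tests are O(1).
with open("unique_snp_ids.txt") as fh:
    wanted = {line.strip() for line in fh}

# Stream the 88-million-line reference VCF once, writing out only records
# whose ID field (third column) is in the set.
with gzip.open("reference.vcf.gz", "rt") as vcf, \
        open("snp_annotations.vcf", "w") as out:
    for line in vcf:
        if line.startswith("#"):
            out.write(line)                       # keep the VCF header
            continue
        snp_id = line.split("\t", 3)[2]           # columns: CHROM POS ID ...
        if snp_id in wanted:
            out.write(line)
```

A set of 4 million short strings fits comfortably in memory, so the run is bounded by a single read of the VCF rather than 4 million of them.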
I also split the unique SNP ids file into 4 chunks (1 million ids per chunk) and submitted each chunk to Ensembl's Variant Effect Predictor. The job has been running for 10 hours; I do not have high expectations for this one.
There must be a better way to do this that I am not aware of, so I would greatly appreciate your input on how to proceed.
I am also considering re-running the "count reads per allele" analysis with another tool that includes annotation (ASEReadCounter does not); I would welcome suggestions on how to do this as well.
Thank you, Emily. Besides the long time to complete the job, are there other downsides to running this many variants through the online tool?
There is minimal security on data submitted to the VEP, so if your data is of a sensitive nature you may also prefer to keep it in-house. Other than that, not really.