Hey all, I have a few samples of RNA seq data (Illumina paired-end 161 bp), for which I need to do allele-specific expression testing.
Now, I have done some standard RNA seq work before (DGE, GO enrichment etc.) but am completely clueless about the current assignment. I have been suggested to use ASEReadCounter from GATK for the job and while going over the documentation I saw it needs both a BAM file ( which I have , after aligning my reads to STAR) and a vcf file containing the sites to be processed. Since I do not have any genotype or DNA seq data from my samples, I tried downloading the 1000 genomes vcf sites file (called on hg38). Heres the link ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20181203_biallelic_SNV/ and heres the file from the directory ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20181203_biallelic_SNV/ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf.gz
I tried running GATK ASEReadCounter with this and reference as primary_genome fasta file (hg38) downloaded from GENCODE release 36. The output gives the following warning
“Ignoring site: variant is not het at position: chr1:(numbers)”.
for all positions in the vcf file. This is likely due to the fact that this vcf file does not contain any genotype info (no GT field).
I would be thankful if someone could answer these questions
Is it possible to do ASE with only RNA seq data, like I have and go DNA data? If so, it would be really helpful if you could direct me to a link/workflow.
Is it feasible if I just add a GT field to the sites vcf and populate it with 0/1 for all the sites. This will trick the program into thinking that all sites are heterozygous and so the program will maybe work as intended. Is this logic okay? Or do I need genotype info for my samples for ASE to work?
Thanks a lot,
If an organism is heterozygous for a coding variant, the two alleles or copies of a gene will show unequal levels of expression which is defined as ASE. You will get to know if a site is heterozygous only through variant calling or if you have a VCF file which contains genotype information. You cannot put dummy values (0/1) for heterozygous genotype and get correct ASE results. You need to specifically have high confidence heterozygous sites in order to do an ASE study.
Furthermore, ASEReadCounter from GATK will not tell you if a site shows ASE or not, it will just give you the allele counts. You need to do a statistical test (such as binomial test) to test if the expression of the allele in the site significantly deviates from 50%.
If you do not have DNAseq, you can try calling variants using RNAseq data itself. It is not always recommended to do so because of unequal coverage of RNAseq reads across the genome, but it will surely help your case.
Hey, thanks for the reply. So, I saw that ASEReadCounter only gives me a count table with allele counts for ref and alt alleles. I was actually hoping to use the tool ASEP https://github.com/Jiaxin-Fan/ASEP for doing the statistical test and getting the genes with differential allele expression but it requires the sort of output ASEReadCounter.
Also, will making all the sites heterozygous never work? If I put in dummy values (0/1) why would GATK care? Is it not just looking at the ref and alt alleles at each position ( say ref = A , alt = C) and counting the number of reads in the BAM file which align at that position and assigning them to alt or ref based on the read's base at that position? Why does it need genotype information? Only to see if the site is heterozygous?
Also, it would be super helpful if you could point me to some tools/pipelines which will allow me to call variants from my RNA seq BAM files. Further, will it be appropriate to keep separate vcf files for each samples ( i have 37 samples) and run GATK with each BAM vcf pair or should I merge the vcfs to and then run GATK using the merged vcf and my BAM files?
Thanks again!!
GATK has this: https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels-
Also, an RNAseq variant calling pipeline is by concept same as DNAseq variant calling pipeline with only certain differences such as duplicate marking is generally not advised in RNAseq for ASE studies (a duplicate marking tool cannot distinguish between an if a read is an actual PCR duplicate or is present due to high expression). You always mark duplicates for DNAseq variant calling (variable opinions exist for duplicate marking in RNAseq reads).
Furthermore, which sites would you consider for ASE studies is also important. You would not consider sites that have a genotype quality phred score lower than 20 or 30. Also, if you make a homozygous alternate site as heterozygous, you will get fake monoallellic expression results. How will you distinguish it from a true monoallelic expression? Rather than going for a short cut, you should try to get high quality genotypes from RNAseq reads. ASEReadcounter by default only analyses heterozygous sites, so, yes, the tool uses 0/1 information to find if a site is heterozygous.
Also, ASE is individual specific. A person can be homozygous and another person can be heterozygous for the same variant site. Your dummy values may create issues in downstream analysis and interpretation.
This paper may be helpful for understanding best practices in an ASE study (especially Quality control of genotype data): https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0762-6
Finally, if you don't want to use GATK pipeline, you can also go for bcftools mpileup for detecting variants and genotypes. It has been oberved that if an accurate RNAseq aligner (such as HISAT2 or even STAR) is used, both GATK and bcftools mpileup give similar results. Merging samples and calling variants is something that is recommended as it helps for joint genotyping. You should also remember that there is no clear cut consensus for an ASE analysis. Different tools exist for detecting ASE. You can go for sophisticated ASE detection algorithms such as MBASED, GeneiASE, ASEP, etc. Or, you can limit yourself to just binomial or a Fishers exact or chi square test. Also, try to mitigate reference bias (an important concept in ASE studies).