Question

Alternate allele bias in RNA-seq data?!?!?!


8.0 years ago

mdgallagher71 ▴ 20

Hi Everyone,

I performed targeted RNA-sequencing, in which we used a probe pool originally designed for exome capture to capture a transcript of interest in RNA-seq libraries. We got 1,000-fold enrichment of the transcript, and did the experiment on 3 cell lines and 8 brain samples. All 11 samples are heterozygous for a set of ~100 completely linked variants (haplotype) at our locus of interest, 11 of which overlap exons, and thus were captured by our probes. We excluded probes that overlapped these SNPs, as well as other moderately linked SNPs, to reduce capture bias.

We did standard QC, aligned reads with STAR, reduced mapping bias with WASP, and used HaplotypeCaller and ASEReadCounter to generate allele counts for the reference and alternate alleles of all 11 exonic SNPs. We ran the analysis both with all duplicates removed and with only optical duplicates removed, and the results were the same:

1) Our positive control cell lines (n=2 of the same cell type) show ~20% increased reads coming from the major (reference) haplotype, as determined by averaging the allelic ratio across 9/11 SNPs (two SNPs were removed because they lie close to a 5bp indel and showed extreme bias).

2) Our third cell line (n=1 of a different cell type) shows no notable allele-specific expression differences using the 9 SNPs.

3) Our 8 brains show varied effects with the 9 SNPs, with some showing ~5% bias in the reference allele direction, and others showing bias in the alternate allele direction.
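For anyone wanting to reproduce the per-sample summary in 1) to 3), here is a minimal sketch of averaging the allelic ratio across SNPs. It is an illustration, not the exact pipeline used above: the counts are made-up placeholders, and `ref_fraction` and `mean_ref_fraction` are hypothetical helper names. It assumes each SNP contributes one (reference count, alternate count) pair, such as the refCount and altCount columns of an ASEReadCounter table.

```python
from statistics import mean


def ref_fraction(ref_count, alt_count):
    """Fraction of reads carrying the reference allele at one SNP."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no reads at this SNP")
    return ref_count / total


def mean_ref_fraction(counts):
    """Unweighted mean of per-SNP reference fractions for one sample."""
    return mean(ref_fraction(ref, alt) for ref, alt in counts)


# Placeholder (ref, alt) counts for three SNPs in one sample; not real data.
example_counts = [(600, 400), (550, 450), (500, 500)]
sample_mean = mean_ref_fraction(example_counts)
```

The unweighted mean gives every SNP equal weight. Pooling the raw counts instead would let the deepest SNPs dominate the estimate, which may or may not be what you want when depth differs a lot between exons.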

What we don't understand is that we are seeing huge variation in allelic ratios WITHIN samples but BETWEEN SNPs, even though these SNPs are in complete LD and thus almost certainly phased correctly (we confirmed correct phasing between some of the SNPs by looking at alignment or alignment pairs containing more than one SNP), and they are all contained within the predominant transcript variant, and in constitutive exons.

Most troubling is the fact that in the brain samples that show no clear pattern of allele-specific expression, many SNPs show ALTERNATE allele biases, which, to my knowledge, no source of technical bias should produce. We know from other studies that the alternate allele should be expressed at equal or lower levels, so the only type of bias we should be seeing is reference allele bias. In addition, the SNPs within a sample don't consistently agree on the direction or magnitude of allelic bias.

Finally, it is important to point out that we typically have thousands of reads per SNP per sample, so low coverage shouldn't be an issue, and we sequenced all 11 libraries on one lane of a HiSeq 2500 with 125bp paired-end reads (~300 million total read pairs).
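To put a number on the depth argument, here is a rough sketch of how much pure sampling noise to expect at a given depth. It assumes reads are independent binomial draws, which is optimistic for real allele-specific expression data (counts are usually overdispersed, so a beta-binomial model would be more realistic). The function names and the example counts are hypothetical.

```python
import math


def binomial_ci(ref_count, alt_count, z=1.96):
    """Normal-approximation interval for the reference fraction at one SNP.

    Returns (fraction, lower, upper). z=1.96 gives roughly a 95% interval.
    """
    n = ref_count + alt_count
    if n == 0:
        raise ValueError("no reads at this SNP")
    p = ref_count / n
    se = math.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se


def z_vs_balanced(ref_count, alt_count):
    """z-score of the observed reference fraction against a 50/50 expectation."""
    n = ref_count + alt_count
    if n == 0:
        raise ValueError("no reads at this SNP")
    p = ref_count / n
    return (p - 0.5) / math.sqrt(0.25 / n)


# Placeholder counts: 4,000 reads with a 55/45 split; not real data.
fraction, lower, upper = binomial_ci(2200, 1800)
z = z_vs_balanced(2200, 1800)
```

At a few thousand reads the binomial standard error is under one percentage point, so SNP-to-SNP differences of 10-20% within a sample are far larger than sampling noise alone would explain under this simplified model.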

Thanks,

Mike

RNA-Seq SNP Allele-specific expression eQTL • 2.9k views

ADD COMMENT • link 8.0 years ago by mdgallagher71 ▴ 20


"huge variation in allelic ratios WITHIN samples but BETWEEN SNPs". This could happen if the SNPs do not all have equal coverage, since a single read does not cover all of the SNPs. Because different exonic regions are sampled unevenly during sequencing, read depth can vary from exon to exon. One way to check is to see whether those allelic inconsistencies are reproducible in replicates. If they are not, then they are likely technical noise.
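One way to run that replicate check is to correlate the per-SNP reference fractions between two samples. The sketch below uses a plain Pearson correlation written out by hand; the function name and the example values are hypothetical, and with only 9 SNPs the estimate will be rough.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant list")
    return sxy / math.sqrt(sxx * syy)


# Placeholder per-SNP reference fractions, same SNP order in both replicates.
replicate_a = [0.62, 0.48, 0.55, 0.41]
replicate_b = [0.60, 0.50, 0.57, 0.43]
r = pearson(replicate_a, replicate_b)
```

A high correlation would mean the same SNPs are skewed the same way in both replicates, which points to a systematic, SNP-specific effect rather than random noise.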

Why do you think the "alternate allele should be expressed either at equal or lower levels"? I don't think the alternate allele should always be expressed at lower levels than the reference allele.

ADD REPLY • link 8.0 years ago by GouthamAtla 12k


Thanks,

Yes, the allelic patterns we're seeing for each SNP are very reproducible across samples. In other words, not much sample to sample variation, but a lot of SNP-SNP variation within a sample. But we have hundreds to tens of thousands of reads per SNP, per sample, so I would think that wouldn't produce the kind of noise we're seeing.

You are right, however, that read depth affects allelic patterns. We see that SNPs with 5,000 or more reads tend to have roughly equal 50/50 ratios of reference/alternate allele, or even 10-20% more reads containing the alternate allele. SNPs with fewer than 5,000 reads, and typically fewer than 1,000 or so, show very strong increases in the ratio of reference/alternate allele reads. A collaborator mentioned that this may be due to deduplication: because we have so much sequencing depth in this experiment, some of our "duplicates" may be real reads that happen to match other read coordinates exactly. If you remove these "duplicates", and there are more from one allele than the other, this could cause bias. However, keeping all the non-optical duplicates did not change the results.
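The depth pattern described above can be checked with a simple split at the 5,000-read mark. This is only a sketch: the threshold comes from the description above, while the function name and the counts are hypothetical placeholders.

```python
from statistics import mean


def split_by_depth(counts, threshold=5000):
    """Mean reference fraction for SNPs below vs. at-or-above a depth threshold.

    counts is a list of (ref_count, alt_count) pairs.
    Returns (low_depth_mean, high_depth_mean); a side with no SNPs is None.
    """
    low, high = [], []
    for ref, alt in counts:
        total = ref + alt
        if total == 0:
            continue  # skip SNPs with no coverage
        bucket = high if total >= threshold else low
        bucket.append(ref / total)
    return (mean(low) if low else None, mean(high) if high else None)


# Placeholder (ref, alt) counts; not real data.
example_counts = [(700, 300), (3000, 3000), (2700, 3300)]
low_mean, high_mean = split_by_depth(example_counts)
```

If the low-depth group is consistently skewed toward the reference allele while the high-depth group is not, that suggests the skew tracks coverage (for example, capture or mapping efficiency per exon) rather than true allelic expression.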

This locus has a well-established eQTL effect in cell lines and brain, and the reference haplotype is associated with increased expression. So, combined with the fact that technical bias is usually in the reference allele direction (due to probe capture, alignment, etc.), I can't think of any reason why we should see alternate allele biases in SNPs that are covered by thousands of reads.

Mike

ADD REPLY • link 8.0 years ago by mdgallagher71 ▴ 20