Question

How to establish haplotype-specific gene expression levels

2

Entering edit mode

7 months ago

javanokendo ▴ 60

I have RNAseq data from four different strains and I wanted to establish haplotype-specific gene expression levels. How can this be accomplished?

RNA-seq Haplotype • 598 views

ADD COMMENT • link updated 6 months ago by dsull ★ 6.9k • written 7 months ago by javanokendo ▴ 60

0

Entering edit mode

Do you have a tetraploid organism? If so, create a tetraploid genome reference (from VCF files) and map against that.

I pinged the author of STAR on GitHub about whether STAR can do something like this.

If using pseudoalignment-based approaches like kallisto or salmon, you might be able to use g2gtools to create your "tetraploid" reference and then map against that reference.

ADD REPLY • link 7 months ago by dsull ★ 6.9k

0

Entering edit mode

The haplotypes I work with are from the diploid organism.

ADD REPLY • link 7 months ago by javanokendo ▴ 60

score 0 · Answer 1 · 2024-04-22

Circling back to this post: Although I haven't received a response on GitHub, I've played around with a few things.

First, understand that there is a single "reference" genome and everything else (i.e. what you find in VCF files) are variants with respect to the reference.

Option 1: Transcriptome-mapping - If you want to use a pseudoalignment-based approach, just create diploid transcriptomes. Use something like g2gtools with your VCF files -- which basically takes reference transcripts and inserts the variants to create a new transcriptome for each strain. So, you'll have your one reference transcriptome as well as your variant transcriptomes. You can dump any pair of transcriptomes (whether reference or variants) into your, say, kallisto index, and map your reads to this diploid index.

Option 2: Genome-mapping: If you want to use a genome-alignment (e.g. STAR), you can create a standard STAR index from your reference genome like usual. Then use the WASP option in STAR when doing the alignment, which takes in a VCF file. Your VCF file can contain just one variant with respect to the reference, or can be a VCF with two variants. You can look at the SAM tags to determine which allele (i.e. strain) a read came from. If your VCF contains two variants, you can write a script that constructs a new genotype for each VCF record; a genotype (GT) is represented as X/Y (where X = organism 1 and Y = organism 2). Let's say you look in your VCF file and a SNP for organism 1 is ./. and the same SNP is 1/1 for organism 2, and let's say the REF allele is A and the ALT allele is G. That means that, at that SNP position, there isn't actually a variant (or evidence for a variant) for organism 1 (so we'll just assume that that position is the REF allele, "A", for organism 1) but for organism 2, there is the "G" variant. In your new VCF, have the GT for that SNP be 0/1. When you do STAR WASP mapping, the vA SAM tag will contain "1" when a base in a read aligns to organism 1 (i.e. X in genotype X/Y), and will contain "2" when a base in a read aligns to organism 2 (i.e. Y in genotype X/Y), and will contain "3" when a base in that position could not be matched to either the organism 1 allele or the organism 2 allele. You can separate those out by using samtools and grep for the vA SAM tag (also worth checking out the vW tag, which tells you the results of the WASP filtering; you want vW:i:1).

Be sure to understand the structure of the VCF fie: For example, if your REF allele is C and your ALT allele is G,T,A, the genotype alleles would be: 0=C, 1=G, 2=T, 3=A, so represent that accordingly when making your combined VCF file (e.g. if the SNP is 1/1 for organism 1 and 3/3 for organism 2, you'd want the new GT to be 1/3).

Option 3: Post-genome mapping. Another option, similar to the above, is to look at the mapped reads, and see if a mapped read contains SNP positions (remember, the VCF file provides chromosome coordinates of each variant and the BAM file tells you the chromosome coordinates of the mapped read). If so, see if the SNP position in the mapped read is the organism 1 allele nucleotide or is the organism 2 allele nucleotide.