I have RNAseq data from four different strains and I wanted to establish haplotype-specific gene expression levels. How can this be accomplished?
I have RNAseq data from four different strains and I wanted to establish haplotype-specific gene expression levels. How can this be accomplished?
Circling back to this post: Although I haven't received a response on GitHub, I've played around with a few things.
First, understand that there is a single "reference" genome and everything else (i.e. what you find in VCF files) are variants with respect to the reference.
Option 1: Transcriptome-mapping - If you want to use a pseudoalignment-based approach, just create diploid transcriptomes. Use something like g2gtools with your VCF files -- which basically takes reference transcripts and inserts the variants to create a new transcriptome for each strain. So, you'll have your one reference transcriptome as well as your variant transcriptomes. You can dump any pair of transcriptomes (whether reference or variants) into your, say, kallisto index, and map your reads to this diploid index.
Option 2: Genome-mapping: If you want to use a genome-alignment (e.g. STAR), you can create a standard STAR index from your reference genome like usual. Then use the WASP option in STAR when doing the alignment, which takes in a VCF file. Your VCF file can contain just one variant with respect to the reference, or can be a VCF with two variants. You can look at the SAM tags to determine which allele (i.e. strain) a read came from. If your VCF contains two variants, you can write a script that constructs a new genotype for each VCF record; a genotype (GT) is represented as X/Y (where X = organism 1 and Y = organism 2). Let's say you look in your VCF file and a SNP for organism 1 is ./. and the same SNP is 1/1 for organism 2, and let's say the REF allele is A and the ALT allele is G. That means that, at that SNP position, there isn't actually a variant (or evidence for a variant) for organism 1 (so we'll just assume that that position is the REF allele, "A", for organism 1) but for organism 2, there is the "G" variant. In your new VCF, have the GT for that SNP be 0/1. When you do STAR WASP mapping, the vA SAM tag will contain "1" when a base in a read aligns to organism 1 (i.e. X in genotype X/Y), and will contain "2" when a base in a read aligns to organism 2 (i.e. Y in genotype X/Y), and will contain "3" when a base in that position could not be matched to either the organism 1 allele or the organism 2 allele. You can separate those out by using samtools and grep for the vA SAM tag (also worth checking out the vW tag, which tells you the results of the WASP filtering; you want vW:i:1).
Option 3: Post-genome mapping. Another option, similar to the above, is to look at the mapped reads, and see if a mapped read contains SNP positions (remember, the VCF file provides chromosome coordinates of each variant and the BAM file tells you the chromosome coordinates of the mapped read). If so, see if the SNP position in the mapped read is the organism 1 allele nucleotide or is the organism 2 allele nucleotide.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Do you have a tetraploid organism? If so, create a tetraploid genome reference (from VCF files) and map against that.
I pinged the author of STAR on GitHub about whether STAR can do something like this.
If using pseudoalignment-based approaches like kallisto or salmon, you might be able to use g2gtools to create your "tetraploid" reference and then map against that reference.
The haplotypes I work with are from the diploid organism.