I'll start with my end goal: I want to add RNAseq expression data to a whole exome sequencing dataset (VCF file). Both of which are aligned to the GRCm38.p6 genome.
I have annotated my WES VCF using the ensembl vep annotator (release version 102):
./vep --input_file sample1.vcf --output_file sample1.vep.vcf --format vcf --vcf --terms SO --symbol --tsl --fasta mm10.fa --offline --cache --species mus_musculus --merged --cache_version 102 --dir_plugins ~/VEP_plugins --plugin Wildtype --plugin Frameshift --pick --transcript_version
After running, my variant reads in the WES vcf have ensembl IDs (ENS...).
My next step is quantify the expression of my RNAseq reads. I used Salmon to do this.
First I built a decoy aware transcriptome file using SalmonTools Salmon Tools takes three files, a genome fasta, transcriptome fasta, and a gtf. I used the same genome fasta and gtf that I aligned my reads with (GRCm38.p6). The transcriptome fasta was retrieved from the ensembl ftp mus_musculus_c57bl6nj cdna fasta. SalmonTools generates a file called gentrome.fa, which is then used by Salmon to create an index. Then I can run salmon quant with my RNAseq sorted bam files, giving me my expression data.
Here are the first couple lines of the output (sample1_quant.sf):
Name Length EffectiveLength TPM NumReads
MGP_C57BL6NJ_T0002636.1 9 10 0 0
MGP_C57BL6NJ_T0002637.1 11 12 0 0
MGP_C57BL6NJ_T0002640.1 14 15 0 0
Next, using vcf-expression-annotator (part of the vatools package) I tried adding my expression data to the WES vcf file:
vcf-expression-annotator sample1.vep.vcf sample1_quant.sf custom transcript -i Name -e TPM -s TUMOR -o sample1.exp
After checking the output, I have blank expression annotations.
My thought is that it must be tied to the different IDs being used in the vcf (ENS...) and the expression file (MGP...). Working backwards, I was able to deduce that the transcriptome file used in the SalmonTools step is the source of the MGP IDs. I have tried looking for other cdna fasta files that use ENS IDs, but haven't found any yet.
Has anyone run into a similar problem? Or maybe someone who has a better background in VEP or Salmon could lend some advice?
Well, I have no experience with Salmon or SalmonTools, but it indeed seems that this can be fixed by just altering the IDs.
The normal Ensembl IDs for mouse are
ENSMUSG
(ENSemblMUSmusculusGene) andENSMUST
(ENSemblMUSmusculusTranscript), such that a simpleperl -i -pe 's/MGP_C57BL6NJ_/ENSMUS/g' *.sf
should fix the IDs in your.sf
files. (Make sure you have a backup, because the command will alter the files directly in place).I don't think this is correct: http://www.ensembl.org/Multi/Search/Results?q=ENSMUST0002636.1;site=ensembl
True, I didn't check if the IDs really exist. I also overlooked that Jensen used the cDNA sequences from the C57BL/6NJ strain and not the wild-type C57BL/6J cDNA sequences. So yes, the data integration there is a bit more tricky.
Moving your answer to a comment. Feel free to delete if you feel it is no longer applicable.
You can annotate vcf file by a python script using a python script. But in your case, the length of the regions seems too small.
Is there a reason you're going around adding "use python" comments on various posts?