Question

Ensembl IDs

0

Entering edit mode

2.7 years ago

jensen.416 • 0

I'll start with my end goal: I want to add RNAseq expression data to a whole exome sequencing dataset (VCF file). Both of which are aligned to the GRCm38.p6 genome.

I have annotated my WES VCF using the ensembl vep annotator (release version 102):

./vep --input_file sample1.vcf --output_file sample1.vep.vcf --format vcf --vcf --terms SO --symbol --tsl --fasta mm10.fa --offline --cache --species mus_musculus --merged --cache_version 102 --dir_plugins ~/VEP_plugins --plugin Wildtype --plugin Frameshift --pick --transcript_version

After running, my variant reads in the WES vcf have ensembl IDs (ENS...).

My next step is quantify the expression of my RNAseq reads. I used Salmon to do this.

First I built a decoy aware transcriptome file using SalmonTools Salmon Tools takes three files, a genome fasta, transcriptome fasta, and a gtf. I used the same genome fasta and gtf that I aligned my reads with (GRCm38.p6). The transcriptome fasta was retrieved from the ensembl ftp mus_musculus_c57bl6nj cdna fasta. SalmonTools generates a file called gentrome.fa, which is then used by Salmon to create an index. Then I can run salmon quant with my RNAseq sorted bam files, giving me my expression data.

Here are the first couple lines of the output (sample1_quant.sf):

Name    Length  EffectiveLength TPM NumReads
MGP_C57BL6NJ_T0002636.1 9   10  0   0
MGP_C57BL6NJ_T0002637.1 11  12  0   0
MGP_C57BL6NJ_T0002640.1 14  15  0   0

Next, using vcf-expression-annotator (part of the vatools package) I tried adding my expression data to the WES vcf file:

vcf-expression-annotator sample1.vep.vcf sample1_quant.sf custom transcript -i Name -e TPM -s TUMOR -o sample1.exp

After checking the output, I have blank expression annotations.

My thought is that it must be tied to the different IDs being used in the vcf (ENS...) and the expression file (MGP...). Working backwards, I was able to deduce that the transcriptome file used in the SalmonTools step is the source of the MGP IDs. I have tried looking for other cdna fasta files that use ENS IDs, but haven't found any yet.

Has anyone run into a similar problem? Or maybe someone who has a better background in VEP or Salmon could lend some advice?

sequencing salmon RNAseq ensembl • 1.3k views

ADD COMMENT • link updated 2.7 years ago by GenoMax 148k • written 2.7 years ago by jensen.416 • 0

0

Entering edit mode

Well, I have no experience with Salmon or SalmonTools, but it indeed seems that this can be fixed by just altering the IDs.

The normal Ensembl IDs for mouse are ENSMUSG (ENSemblMUSmusculusGene) and ENSMUST (ENSemblMUSmusculusTranscript), such that a simple perl -i -pe 's/MGP_C57BL6NJ_/ENSMUS/g' *.sf should fix the IDs in your .sf files. (Make sure you have a backup, because the command will alter the files directly in place).

ADD REPLY • link 2.7 years ago by Matthias Zepper 5.0k

2

Entering edit mode

I don't think this is correct: http://www.ensembl.org/Multi/Search/Results?q=ENSMUST0002636.1;site=ensembl

ADD REPLY • link 2.7 years ago by GenoMax 148k

0

Entering edit mode

True, I didn't check if the IDs really exist. I also overlooked that Jensen used the cDNA sequences from the C57BL/6NJ strain and not the wild-type C57BL/6J cDNA sequences. So yes, the data integration there is a bit more tricky.

ADD REPLY • link 2.7 years ago by Matthias Zepper 5.0k

0

Entering edit mode

Moving your answer to a comment. Feel free to delete if you feel it is no longer applicable.

ADD REPLY • link 2.7 years ago by GenoMax 148k

0

Entering edit mode

You can annotate vcf file by a python script using a python script. But in your case, the length of the regions seems too small.

ADD REPLY • link 2.7 years ago by Pappu ★ 2.1k

0

Entering edit mode

Is there a reason you're going around adding "use python" comments on various posts?

ADD REPLY • link 2.7 years ago by Ram 44k