Question

Why does assigning genes with biomart give me different values than using a transcripts_to_genes.txt file?

0

Entering edit mode

4 months ago

bioinfo ▴ 150

Hello,

I run kallisto on my data and I am in the process of assigning gene names to my data. I tried to do this in 2 different ways but I get different results. The first way I tried is shown below using the t2g.py from https://github.com/pachterlab/kallisto-transcriptome-indices/releases:

#Create the transcripts_to_genes file
python t2g.py --use_version <Homo_sapiens.GRCh38.111.gtf> transcripts_to_genesv111.txt

#Assign gene names:

  t2g <- read.delim("/data/transcripts_to_genesv111", sep="", header=FALSE)
  head(t2g)
  txi.kallisto <- tximport(tsv_files, type = "kallisto", tx2gene = t2g[,c(1,3)], ignoreTxVersion = FALSE)

The second way I was using biomart:

mart <- biomaRt::useMart("ensembl", "hsapiens_gene_ensembl", host= "https://jan2024.archive.ensembl.org")
t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "external_gene_name", "ensembl_gene_id" ), mart = mart)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id, ext_gene = external_gene_name)

I noticed that the t2g file I created using biomart gives me different results.

For example for gene ZSCAN2 the t2g file created the first way seems to be associated with gene id ENSG00000176371.14. In the t2g file created with biomart is is associated with ENSG00000176371.14 and ENSG00000291625.1 . ENSG00000291625.1 is not in the gtf file from that version but it is in Homo_sapiens.GRCh38.cdna.all.fa.gz as shown below:

ENST00000708196.1 cdna scaffold:GRCh38:HG2280_PATCH:1092832:1112495:1 gene:ENSG00000291625.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:ZSCAN2 description:zinc finger and SCAN domain containing 2 [Source:HGNC Symbol;Acc:HGNC:20994]

What is the best way to assign the gene names? Also is is better to assign "external_gene_name" or "ensembl_gene_id" for the output?

Thank you

biomart RNAseq kallisto • 474 views

ADD COMMENT • link 4 hours ago by bioinfo ▴ 150

2

Entering edit mode

According to HUGO there is only one ZSCAN2 gene and it points to the first Ensembl gene ID (ENSG00000176371). Ensembl is annotating another copy on a scaffold patch so there seems to be some new information here that would need to be resolved over time.

ADD REPLY • link 4 months ago by GenoMax 145k

0

Entering edit mode

Thank you! I ended up using the transcripts_to_genes instead of biomart.

ADD REPLY • link 4 months ago by bioinfo ▴ 150

0

Entering edit mode

Do you think that I should be including the information from the scaffold patch for the quantification? Initially I used the transcript_to_genes file which doesn't use it. However, now I am wondering if I should be using biomart. When I use biomart the counts for that genes are double (780) compared to when I use transcript_to_genes (390).

ADD REPLY • link 4 hours ago by bioinfo ▴ 150