Hello,
I run kallisto on my data and I am in the process of assigning gene names to my data. I tried to do this in 2 different ways but I get different results. The first way I tried is shown below using the t2g.py from https://github.com/pachterlab/kallisto-transcriptome-indices/releases:
#Create the transcripts_to_genes file
python t2g.py --use_version <Homo_sapiens.GRCh38.111.gtf> transcripts_to_genesv111.txt
#Assign gene names:
t2g <- read.delim("/data/transcripts_to_genesv111", sep="", header=FALSE)
head(t2g)
txi.kallisto <- tximport(tsv_files, type = "kallisto", tx2gene = t2g[,c(1,3)], ignoreTxVersion = FALSE)
The second way I was using biomart:
mart <- biomaRt::useMart("ensembl", "hsapiens_gene_ensembl", host= "https://jan2024.archive.ensembl.org")
t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "external_gene_name", "ensembl_gene_id" ), mart = mart)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id, ext_gene = external_gene_name)
I noticed that the t2g file I created using biomart gives me different results.
For example for gene ZSCAN2 the t2g file created the first way seems to be associated with gene id ENSG00000176371.14. In the t2g file created with biomart is is associated with ENSG00000176371.14 and ENSG00000291625.1 . ENSG00000291625.1 is not in the gtf file from that version but it is in Homo_sapiens.GRCh38.cdna.all.fa.gz as shown below:
ENST00000708196.1 cdna scaffold:GRCh38:HG2280_PATCH:1092832:1112495:1 gene:ENSG00000291625.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:ZSCAN2 description:zinc finger and SCAN domain containing 2 [Source:HGNC Symbol;Acc:HGNC:20994]
What is the best way to assign the gene names? Also is is better to assign "external_gene_name" or "ensembl_gene_id" for the output?
Thank you
According to HUGO there is only one ZSCAN2 gene and it points to the first Ensembl gene ID (ENSG00000176371). Ensembl is annotating another copy on a scaffold patch so there seems to be some new information here that would need to be resolved over time.
Thank you! I ended up using the transcripts_to_genes instead of biomart.
Do you think that I should be including the information from the scaffold patch for the quantification? Initially I used the transcript_to_genes file which doesn't use it. However, now I am wondering if I should be using biomart. When I use biomart the counts for that genes are double (780) compared to when I use transcript_to_genes (390).