Question

How to use NCBI annotation instead of Ensembl in analysing Transcriptome from RNAseq data

0

Entering edit mode

9 months ago

Vojtěch ▴ 10

Hey guys, I am new here, so I hope this is how it works. I am working on my master's thesis and for that, I need to analyse some bulk RNAseq data. I start with SRA files and I am doing the basic downstream analysis just fine. I end with .bam files which I need to annotate. I am using txdb objects. However, the protein of interest has different transcriptional variants on NCBI and on Ensembl and txdb uses only Ensembl, if I understand it correctly. Therefore I am asking you for help – is there a way to annotate my .bam files with NCBI data? I should add that I am interested in transcripts rather than whole genes (in Ensembl, I am using ESNT, not ENSG). How could I go about it?

# Making txdb object 
txdb <- makeTxDbFromGFF("/genome/Homo_sapiens.GRCh38.111.chr.gtf")
transcripts(txdb)

# Then I am choosing to work with 10th chromosome only
txdb2 <- keepSeqlevels(txdb, seqlevels_in_bam)
ebg <- exonsBy(txdb2, by="tx", use.names = TRUE)
seqlevels(ebg)
ebg

# Counting
names(BamFileList(bamfiles))
any(duplicated(names(BamFileList(bamfiles))))
se <- GenomicAlignments::summarizeOverlaps(features=ebg, reads=bamfiles,
                                       mode="Union",
                                       ignore.strand=TRUE)
se
head(assay(se))
write.csv(assay(se), file=paste(name,".csv", sep = ""))

Now I have list of ENSTs present in my reads. How do I do something similar but with NCBI annotation so I have NM/NP?

Thanks guys!

txdb Ensembl NCBI RNAseq • 2.0k views

ADD COMMENT • link 9 months ago by Vojtěch ▴ 10

0

Entering edit mode

Curious as to why you want to use NM/NP ID's. People generally use gene names.

ADD REPLY • link 9 months ago by GenoMax 150k

0

Entering edit mode

Maybe I am not familiar enough with gene names, but I need transcripts rather than genes. Therefore now I am using ENSTs, to which NMs should be equivalent?

ADD REPLY • link 9 months ago by Vojtěch ▴ 10

0

Entering edit mode

but I need transcripts rather than genes

Are you sure about that? Unless people want to do differential transcript analysis, counts are normally summarized at the gene level.

ADD REPLY • link 9 months ago by GenoMax 150k

0

Entering edit mode

Yes, I am very positive on this, because my thesis is about transcriptional variants of certain protein in human brain, therefore I need transcripts. And that complicates it, because I can't find many useful pipelines to use.

ADD REPLY • link 9 months ago by Vojtěch ▴ 10

score 1 · Answer 1 · 2024-07-12

1

Entering edit mode

9 months ago

BioinfGuru ★ 2.1k

What was the reference genome used for mapping? If it was an ensembl reference genome used for mapping, then it should be an ensembl file (of the same ensembl version) used for annotations. They work hand in hand with the same coordinates. After annotation, it is then a matter of conversion from ensembl ID to NCBI ID.

I use biomart for annotations (within R or web browser). Look out for the following attributes from MANE

RefSeqmatch transcript (MANE Select)
RefSeq match transcript (MANE Plus Clinical)

biomart screenshot

ADD COMMENT • link 9 months ago by BioinfGuru ★ 2.1k

0

Entering edit mode

Thanks for the reply, I will look into it. For the bowtie index and mapping, I use "Homo_sapiens.GRCh38.dna.chromosome.10.fa". And for the annotation, I use "Homo_sapiens.GRCh38.111.chr.gtf". I don't care that much about IDs, but NCBI and Ensembl have different annotation. For example, in my protein of interest, cholineacetyltransferase (ChAT), NCBI shows different transcriptional variants and isoforms than Ensembl. That is why I would like to use both and compare them. I will look into the material you provided, thank you!

ADD REPLY • link 9 months ago by Vojtěch ▴ 10

0

Entering edit mode

The most simple way to compare the difference between Ensembl and NCBI is to load their reference genomes into IGV, focus on a gene of interest, and show all transcripts.

If you want to see how the difference affects your data set then create 2 datasets:

mapped + annotated with ensembl
mapped + annotated with ncbi

Then create bigwig files for each (to show coverage), view on IGV, focus on a gene of interest, show all transcripts.

ADD REPLY • link 9 months ago by BioinfGuru ★ 2.1k

0

Entering edit mode

This is useful, but I already know the difference since I am looking only for one protein, I'm not interested in general transcriptional differential analysis. Is there a way to annotate my data with NCBI rather than Ensembl altogether?

ADD REPLY • link 9 months ago by Vojtěch ▴ 10

0

Entering edit mode

Use an ncbi reference genome for mapping, then an ncbi annotation file for annotation.

ADD REPLY • link 9 months ago by BioinfGuru ★ 2.1k

0

Entering edit mode

I figured that one out, however, I fail to find these files. I am sorry for bothering you, but I am fairly new to this and I have no one in my department who would teach me know-hows. I tried CHAT-GPT and Google, but I wasn't successful. I was searching for RefSeq, but I don't know which of these files are the right equivalents to those from Ensembl. I appreciate your time and responses, thank you.

ADD REPLY • link 9 months ago by Vojtěch ▴ 10

0

Entering edit mode

Go to: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/

Click Download button --> Select RefSeq --> Select file type Genome Sequence FASTA and Annotations GTF.

After you download the files you will need to re-do indexing, alignments and counting. Do not use only chr10.

Use the matched files above so identifiers will match. Never mix and match data from providers (NCBI/Ensembl) or genome builds.

ADD REPLY • link 9 months ago by GenoMax 150k

0

Entering edit mode

Ok, I will follow your instructions. Thank you sir, have a nice day.

ADD REPLY • link 9 months ago by Vojtěch ▴ 10

0

Entering edit mode

Homo_sapiens.GRCh38.dna.chromosome.10.fa

Where did you get this file? It may contain only chromosome 10 of human genome and not the entire genome? A complete human genome sequence file should be ~ 3 Gb.

ADD REPLY • link 9 months ago by GenoMax 150k

0

Entering edit mode

That is correct. My apologies, I should have specified that my protein of interest in located on the 10th chromosome, therefore I am saving some space.

ADD REPLY • link 9 months ago by Vojtěch ▴ 10

0

Entering edit mode

therefore I am saving some space.

As your transcriptome data is from entire genome, this is not appropriate. Aligners will make their best effort to align, so you may end up with spurious alignments, if you use a reduced representation of the reference for alignment.