Hey guys, I am new here, so I hope this is how it works. I am working on my master's thesis and for that, I need to analyse some bulk RNAseq data. I start with SRA files and I am doing the basic downstream analysis just fine. I end with .bam files which I need to annotate. I am using txdb objects. However, the protein of interest has different transcriptional variants on NCBI and on Ensembl and txdb uses only Ensembl, if I understand it correctly. Therefore I am asking you for help – is there a way to annotate my .bam files with NCBI data? I should add that I am interested in transcripts rather than whole genes (in Ensembl, I am using ESNT, not ENSG). How could I go about it?
# Making txdb object
txdb <- makeTxDbFromGFF("/genome/Homo_sapiens.GRCh38.111.chr.gtf")
transcripts(txdb)
# Then I am choosing to work with 10th chromosome only
txdb2 <- keepSeqlevels(txdb, seqlevels_in_bam)
ebg <- exonsBy(txdb2, by="tx", use.names = TRUE)
seqlevels(ebg)
ebg
# Counting
names(BamFileList(bamfiles))
any(duplicated(names(BamFileList(bamfiles))))
se <- GenomicAlignments::summarizeOverlaps(features=ebg, reads=bamfiles,
mode="Union",
ignore.strand=TRUE)
se
head(assay(se))
write.csv(assay(se), file=paste(name,".csv", sep = ""))
Now I have list of ENSTs present in my reads. How do I do something similar but with NCBI annotation so I have NM/NP?
Thanks guys!
Curious as to why you want to use
NM/NP
ID's. People generally usegene names
.Maybe I am not familiar enough with gene names, but I need transcripts rather than genes. Therefore now I am using ENSTs, to which NMs should be equivalent?
Are you sure about that? Unless people want to do differential transcript analysis, counts are normally summarized at the gene level.
Yes, I am very positive on this, because my thesis is about transcriptional variants of certain protein in human brain, therefore I need transcripts. And that complicates it, because I can't find many useful pipelines to use.