Question

Extract Reads From A Bam file according to gene components `5'UTR, 1st exon, 1st intron, ..., 3'UTR, intergenic, ...`

1

Entering edit mode

19 months ago

Dan ▴ 180

Hello:

I mapped the reads in fastq file of bulk RNA seq data to reference genome and get the bam file. My samples have a gene mutation that can induce transcript prematurely terminate in introns. How can I subset a Bam file and quantify the reads according to gene components "5'UTR, 1st exon, 1st intron, ..., 3'UTR, intergenic..."? Thanks

BAM • 2.9k views

ADD COMMENT • link 18 months ago by Dan ▴ 180

1

Entering edit mode

What type of sequencing data is this, what was done to make the BAM file, and what question(s) are you trying to answer with this?

ADD REPLY • link 19 months ago by rpolicastro 13k

1

Entering edit mode

I am sorry I edited the question to include these messages. Thanks

ADD REPLY • link 19 months ago by Dan ▴ 180

1

Entering edit mode

What sorts of tools are you familiar with? Linux command line? R? etc. The BAM file represents coordinates per read, your features (gene component descriptions) are in some other kind of file, probably a GTF file or a BED file, and they also have coordinates. There are many ways to go from a description of features to counting or extracting the reads which fall within the feature coordinates. Depending on your familiarity, and which tool sets you choose, you could answer the question for 1 intron, or all introns. Is there a publication that does something similar to what you are thinking? Do you need to compare this data to other data sets? (i.e. the wt, which presumably does not terminate prematurely?)

ADD REPLY • link 19 months ago by seidel 11k

0

Entering edit mode

I am familiar with both R and Linux command line. I have a GTF file downloaded from ensembl (https://useast.ensembl.org/info/data/ftp/index.html), but the GTF file only has gene start and end site, I did not find the exon, intron and UTR coordinates in it. I have mutated and wt samples. I run qualimap (http://qualimap.conesalab.org/doc_html/command_line.html#rna-seq-qc) before, and there is indeed an increase of intron sequences in the mutated samples compared with the wt samples, I don't know whether qualimap can separate and quantify every exon and intron. Thanks

ADD REPLY • link 19 months ago by Dan ▴ 180

1

Entering edit mode

The GTF file should absolutely have exon information -- entries where column 3 is exon. Cross reference the exon by matching itstrascript_id to a transcript (again, column 3) entry, and transcript to genes by matching its gene_id to a gene entry. Exons are annotated with exon_number. Regions prior to exon 1 (be careful for negative strand genes) and after the last exons are your 3' and 5' UTRs*; and regions between exons are your introns.

* Gencode/ensembl contain CAGE-suported UTR definitions for some genes, these are tagged in column three via UTR

ADD REPLY • link 19 months ago by LChart 4.5k

0

Entering edit mode

I checked the GTF file using cat hg19_Ensemble.gtf | cut -f3 | sort | uniq -c

 697776 CDS
1069721 exon
  73470 start_codon
  68400 stop_codon

There is no information about UTR, how to define the UTR region and the intergenic region? How to know the whole gene length? Thanks

ADD REPLY • link 19 months ago by Dan ▴ 180

1

Entering edit mode

If you get the GTF file from GENCODE - GRCh38 or GRCh37 then that has UTRs.

$ zgrep UTR gencode.v43lift37.annotation.gtf.gz | wc -l
648762

ADD REPLY • link 19 months ago by GenoMax 147k

0

Entering edit mode

Yes. Thanks

ADD REPLY • link 19 months ago by Dan ▴ 180

score 2 · Accepted Answer · 2023-05-13

The Rsubread package can be used to quantify the reads according to the gene regions in a gtf file (https://bioconductor.org/packages/release/bioc/html/Rsubread.html)

library(Rsubread)
library(DESeq2)

sample_dir <- "exon_Counts/"
bamsToCount <- dir("bams/", full.names = TRUE, pattern = "*.\\.bam$")


fcResults <- featureCounts(bamsToCount,  annot.ext = "gtf/Homo_sapiens.GRCh37.75.gtf", 
isGTFAnnotationFile = T, GTF.attrType = c("gene_name","exon_number"), isPairedEnd = T, 
nthreads = 16L,  strandSpecific = 1)

myCounts <- fcResults$counts
colnames(myCounts) <- c("117", "174",
"322","328",
"146", "145",
"150","152")

write.table(myCounts, paste0(sample_dir, "exon_Counts.csv"), sep = ",", 
quote = FALSE, row.names = T, col.names = T)