Entering edit mode
19 months ago
Dan
▴
180
Hello:
I mapped the reads in fastq
file of bulk RNA seq data to reference genome and get the bam
file. My samples have a gene mutation that can induce transcript prematurely terminate in introns. How can I subset a Bam
file and quantify the reads according to gene components "5'UTR, 1st exon, 1st intron, ..., 3'UTR, intergenic..."? Thanks
What type of sequencing data is this, what was done to make the BAM file, and what question(s) are you trying to answer with this?
I am sorry I edited the question to include these messages. Thanks
What sorts of tools are you familiar with? Linux command line? R? etc. The BAM file represents coordinates per read, your features (gene component descriptions) are in some other kind of file, probably a GTF file or a BED file, and they also have coordinates. There are many ways to go from a description of features to counting or extracting the reads which fall within the feature coordinates. Depending on your familiarity, and which tool sets you choose, you could answer the question for 1 intron, or all introns. Is there a publication that does something similar to what you are thinking? Do you need to compare this data to other data sets? (i.e. the wt, which presumably does not terminate prematurely?)
I am familiar with both R and Linux command line. I have a GTF file downloaded from
ensembl
(https://useast.ensembl.org/info/data/ftp/index.html), but the GTF file only has gene start and end site, I did not find the exon, intron and UTR coordinates in it. I have mutated and wt samples. I runqualimap
(http://qualimap.conesalab.org/doc_html/command_line.html#rna-seq-qc) before, and there is indeed an increase of intron sequences in the mutated samples compared with the wt samples, I don't know whetherqualimap
can separate and quantify every exon and intron. ThanksThe GTF file should absolutely have exon information -- entries where column 3 is
exon
. Cross reference theexon
by matching itstrascript_id
to atranscript
(again, column 3) entry, and transcript to genes by matching itsgene_id
to agene
entry. Exons are annotated withexon_number
. Regions prior to exon 1 (be careful for negative strand genes) and after the last exons are your 3' and 5' UTRs*; and regions between exons are your introns.* Gencode/ensembl contain CAGE-suported UTR definitions for some genes, these are tagged in column three via
UTR
I checked the
GTF
file usingcat hg19_Ensemble.gtf | cut -f3 | sort | uniq -c
There is no information about
UTR
, how to define theUTR
region and the intergenic region? How to know the whole gene length? ThanksIf you get the GTF file from GENCODE - GRCh38 or GRCh37 then that has UTRs.
Yes. Thanks