How to plot a profile showing read coverage around the 5 prime splice site?
0
0
Entering edit mode
6.2 years ago
solo7773 ▴ 90

I'm reading an article in which there is a figure showing how many reads there are around the 5 prime splice site. I want to mimic this plot in my research. I am a newbie in analyzing sequencing data. I Googled but failed to find out how to define the 5 prime splice sites and then write them into a bed file, or where to download this sort of data from a public available data base.

Now what I know is that computeMatrix from deepTools can do this visualization once got the 5 splice sites region information. Can anybody help me out? Many thanks.

5ss profile

RNA-Seq • 2.3k views
ADD COMMENT
1
Entering edit mode

One way to do this is, get all the transcripts (CDS) co-ordinates from the gff file. Once you have CDS co-ordinates, align them by 5' region and define flanking bp (eg +/- 500 bp) and write them into bed file.

ADD REPLY
0
Entering edit mode

Thanks Chirag. I noticed gtf files from Genecode, UCSC, RefSeq are different to each other. Currently I stick to the Genecode gtf file. For some gene, it has several transcripts, and each transcript may not include all parts (eg UTR, CDS) that a mature mRNA should have.Do I need to select one isoform from the transcripts of that gene, and also to make sure the selected isoform is a mature mRNA?

ADD REPLY
0
Entering edit mode

I would suggest to take all the isoforms for a given gene. The reason for that is anyway you want to look at overall profile between the conditions. So, If there are any differences, it will be shown up in profile. However, if you want to do the same analysis at gene level, then you can think of considering which isoform to take.

ADD REPLY
0
Entering edit mode

Thanks! Sounds reasonable. I'll take your advice.

ADD REPLY
0
Entering edit mode

Hello, May I please have little bit more details on the tools which can extract the coordinates from gff file and how to align then afterwards

ADD REPLY
1
Entering edit mode

Hi, I use R package GenomicFeatures to deal with gff. See example below to get co-ordinates from gff using GenomicFeatures. I used Candida glabrata gff file for example purpose.

gff_file <- "C_glabrata_CBS138_version_s02-m07-r06_features.gff"
gff <- GenomicFeatures::makeTxDbFromGFF(gff_file, metadata = T)
genes <- GenomicFeatures::genes(gff)
genes

GRanges object with 5615 ranges and 1 metadata column:
                             seqnames      ranges strand |      gene_id
                                <Rle>   <IRanges>  <Rle> |  <character>
  CAGL0A00105g ChrA_C_glabrata_CBS138   1608-2636      - | CAGL0A00105g
  CAGL0A00110g ChrA_C_glabrata_CBS138   1608-4809      - | CAGL0A00110g
  CAGL0A00116g ChrA_C_glabrata_CBS138   2671-4809      - | CAGL0A00116g
  CAGL0A00132g ChrA_C_glabrata_CBS138 11697-13042      + | CAGL0A00132g
  CAGL0A00154g ChrA_C_glabrata_CBS138 14977-15886      + | CAGL0A00154g
           ...                    ...         ...    ... .          ...
     CaglfMt33 mito_C_glabrata_CBS138 17915-17987      + |    CaglfMt33
     CaglfMt34 mito_C_glabrata_CBS138 18004-18075      + |    CaglfMt34
     CaglfMt35 mito_C_glabrata_CBS138 18085-18156      + |    CaglfMt35
     CaglfMt36 mito_C_glabrata_CBS138 18180-18262      + |    CaglfMt36
     CaglfMt37 mito_C_glabrata_CBS138 18281-18367      + |    CaglfMt37
  -------
  seqinfo: 14 sequences from an unspecified genome; no seqlengths

genes is an object of class GRanges. You can easily convert it into R tibble or data.frame containing gene co-ordinates in separate columns.

genes %>% tibble::as_tibble()

# A tibble: 5,615 x 6
   seqnames               start   end width strand gene_id     
   <fct>                  <int> <int> <int> <fct>  <chr>       
 1 ChrA_C_glabrata_CBS138  1608  2636  1029 -      CAGL0A00105g
 2 ChrA_C_glabrata_CBS138  1608  4809  3202 -      CAGL0A00110g
 3 ChrA_C_glabrata_CBS138  2671  4809  2139 -      CAGL0A00116g
 4 ChrA_C_glabrata_CBS138 11697 13042  1346 +      CAGL0A00132g
 5 ChrA_C_glabrata_CBS138 14977 15886   910 +      CAGL0A00154g
 6 ChrA_C_glabrata_CBS138 17913 19017  1105 -      CAGL0A00165g
 7 ChrA_C_glabrata_CBS138 19120 21857  2738 -      CAGL0A00187g
 8 ChrA_C_glabrata_CBS138 22057 24493  2437 +      CAGL0A00209g
 9 ChrA_C_glabrata_CBS138 24957 27649  2693 +      CAGL0A00231g
10 ChrA_C_glabrata_CBS138 27686 28779  1094 -      CAGL0A00253g

If you want transcription start site along with flanking, use promoter function from GenomicFeatures library. For example, I want to get 500 bp upstream and 500 bp downstream from the transcription start site.

tss_flank <- GenomicFeatures::promoters(x = genes , upstream = 500, downstream = 500)
tss_flank

GRanges object with 5615 ranges and 1 metadata column:
                             seqnames      ranges strand |      gene_id
                                <Rle>   <IRanges>  <Rle> |  <character>
  CAGL0A00105g ChrA_C_glabrata_CBS138   2137-3136      - | CAGL0A00105g
  CAGL0A00110g ChrA_C_glabrata_CBS138   4310-5309      - | CAGL0A00110g
  CAGL0A00116g ChrA_C_glabrata_CBS138   4310-5309      - | CAGL0A00116g
  CAGL0A00132g ChrA_C_glabrata_CBS138 11197-12196      + | CAGL0A00132g
  CAGL0A00154g ChrA_C_glabrata_CBS138 14477-15476      + | CAGL0A00154g
           ...                    ...         ...    ... .          ...
     CaglfMt33 mito_C_glabrata_CBS138 17415-18414      + |    CaglfMt33
     CaglfMt34 mito_C_glabrata_CBS138 17504-18503      + |    CaglfMt34
     CaglfMt35 mito_C_glabrata_CBS138 17585-18584      + |    CaglfMt35
     CaglfMt36 mito_C_glabrata_CBS138 17680-18679      + |    CaglfMt36
     CaglfMt37 mito_C_glabrata_CBS138 17781-18780      + |    CaglfMt37
  -------
  seqinfo: 14 sequences from an unspecified genome; no seqlengths
ADD REPLY
0
Entering edit mode

Hello, Thanks for your reply and for sharing the codes, that's very clear. I just want to ask how can I get coordinates around 3' splice site instead TSS so what can I use instead of the promoters for the GenomicFeatures.

Thanks

ADD REPLY
0
Entering edit mode

You can use function flank. See the example below. I am using data from the same example given in previous reply.

tes_flank <- flank(genes, start = FALSE ,width = 5, both = TRUE) ## setting start = FALSE gives flank co-ordinated form the end (3' region if you give genes for input). Setting both = TRUE gives flank region from either side. 

GRanges object with 5615 ranges and 1 metadata column:
                             seqnames      ranges strand |      gene_id
                                <Rle>   <IRanges>  <Rle> |  <character>
  CAGL0A00105g ChrA_C_glabrata_CBS138   1603-1612      - | CAGL0A00105g
  CAGL0A00110g ChrA_C_glabrata_CBS138   1603-1612      - | CAGL0A00110g
  CAGL0A00116g ChrA_C_glabrata_CBS138   2666-2675      - | CAGL0A00116g
  CAGL0A00132g ChrA_C_glabrata_CBS138 13038-13047      + | CAGL0A00132g
  CAGL0A00154g ChrA_C_glabrata_CBS138 15882-15891      + | CAGL0A00154g
           ...                    ...         ...    ... .          ...
     CaglfMt33 mito_C_glabrata_CBS138 17983-17992      + |    CaglfMt33
     CaglfMt34 mito_C_glabrata_CBS138 18071-18080      + |    CaglfMt34
     CaglfMt35 mito_C_glabrata_CBS138 18152-18161      + |    CaglfMt35
     CaglfMt36 mito_C_glabrata_CBS138 18258-18267      + |    CaglfMt36
     CaglfMt37 mito_C_glabrata_CBS138 18363-18372      + |    CaglfMt37
ADD REPLY

Login before adding your answer.

Traffic: 3520 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6