How to locate and extract data from a df in R
2.5 years ago
margo ▴ 40

I am looking at termination data in a GTF file and want to write a function in R that does the following: 1) get the coordinate 200 nt downstream of the stop codon; 2) count the termination site; 3) if the input files meet the read ratio of -1 (termination sequence site) to +1.2 (1 nt downstream of termination sequence site), mark it as A.

I have put my data into a data frame containing the count of the start and stop codon, and I want to apply this function to my files containing count data. I am struggling to write a function that locates the coordinate 200 nt downstream of the stop codon. The if/else statement also needs to ensure that the coordinate does not pass the start codon and is never zero.

Any help would be massively appreciated.

gtf sequencing R igv

Don't write this from scratch. Make use of the GenomicRanges package which can do all this, check its documentation over at Bioconductor.
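
A minimal sketch of what this could look like with GenomicRanges, assuming the features that end in a stop codon are already in a GRanges object. The toy coordinates, the sequence name "chr", its length and the 200 nt window are made up for illustration; `flank()`, `resize()` and `trim()` are strand-aware GenomicRanges functions.

```r
suppressPackageStartupMessages(library(GenomicRanges))

# Toy CDS annotation (hypothetical coordinates). seqlengths are supplied so
# that trim() knows where the sequence begins and ends.
cds <- GRanges(
  seqnames   = "chr",
  ranges     = IRanges(start = c(1000, 100), end = c(1900, 400)),
  strand     = c("+", "-"),
  seqlengths = c(chr = 5000)
)

# 200 nt immediately downstream of each feature, respecting strand.
# On the minus strand flank() can run past position 1 (it warns about
# out-of-bound ranges), so trim() clips the result back onto the sequence.
downstream <- trim(suppressWarnings(flank(cds, width = 200, start = FALSE)))

# The single position at the far (3') end of each downstream window
tes <- resize(downstream, width = 1, fix = "end")
```

Because the strand handling lives inside the package, no `ifelse()` on the strand column is needed here.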

Thank you. Would it be possible to write this all as one function?

Probably, but it is hard to tell without example data and an example of the desired output.

Hi margo, all code in R can be wrapped into a function. However, the description of your approach is unclear and confusing. For example, first it's 250nt, then 200nt. What do you mean by "counts of genes following 1)"? Which counts? How, and from what, do you compute a read ratio? Do yourself a favor and define your approach in plain English first (maybe make a drawing too) before you start implementing it. This will help to clarify things for us and, most importantly, for yourself. Otherwise, you are unlikely to get useful results.

I have now updated my question. Thank you for pointing this out.

What is the read ratio? Do you have sequencing reads of some sort?

I have a reference genome that contains start and end counts and I have multiple count reads for positive and negative strands for different sequences in bedgraph file format. The desired output would be to produce a file that can be viewed on IGV.

"The desired output would be to produce a file that can be viewed on IGV."

The files you have, if they are GTF and BED, could already be opened in IGV.

"start and end counts"

Do you mean coordinates?

Yes. I have managed to write this code to get the coordinates 200 nt downstream; however, it does not use the GenomicRanges package, which is said to be easier.

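# Coordinate 200 nt downstream of the stop codon, taking strand into account
df$TES_end <- ifelse(df$strand == "-", df$start - 200, df$end + 200)
# Clamp to 1 so that minus-strand coordinates never fall below the first base
df$TES_end <- pmax(df$TES_end, 1)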
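
One way to wrap this in a function, and also to stop the coordinate from running into the neighbouring gene, is sketched below. It assumes a single replicon, a data frame with `start`, `end` and `strand` columns as in the snippet above, and that "the next gene" means the nearest annotated feature in the downstream direction regardless of its own strand. The function name `downstream_coord` and the toy table are hypothetical.

```r
# Strand-aware coordinate `dist` nt downstream of each feature, clipped so
# that it never drops below 1 and never enters the neighbouring feature.
downstream_coord <- function(df, dist = 200) {
  minus  <- df$strand == "-"
  target <- ifelse(minus, df$start - dist, df$end + dist)

  # Nearest boundary in the direction of travel:
  #   plus strand  -> one base before the smallest start beyond this end
  #   minus strand -> one base after the largest end before this start
  limit <- vapply(seq_len(nrow(df)), function(i) {
    if (minus[i]) {
      prev <- df$end[df$end < df$start[i]]
      if (length(prev) > 0) max(prev) + 1 else 1
    } else {
      nxt <- df$start[df$start > df$end[i]]
      if (length(nxt) > 0) min(nxt) - 1 else Inf
    }
  }, numeric(1))

  ifelse(minus, pmax(target, limit), pmin(target, limit))
}

# Toy annotation (made-up coordinates)
genes <- data.frame(
  start  = c(100, 1000, 2000),
  end    = c(400, 1900, 2600),
  strand = c("-", "+", "+")
)
genes$TES_end <- downstream_coord(genes)
```

In this toy table the first gene is clamped to position 1, the second stops at 1999 because the next gene starts at 2000, and the third gets the full 200 nt. The sketch does not cap plus-strand coordinates at the end of the replicon, since the sequence length is not part of the data frame.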

Yes, that could work to give you some coordinates. However: 1) now we are at 200nt again; 2) what are you going to do with these coordinates? Get the coverage from the bed files at these locations? 3) What is the point of looking at exactly 250nt downstream of the stop codon? The UTR could stretch out for much longer.

4) Is this bacterial data? Only then might it make some sense at all.

I am trying to identify termination sites. And yes, it is bacterial data.

I think you should consider two things:

  1. Your transcriptional termination site, if it exists, could be at any distance from the stop codon.
  2. Your arbitrarily chosen distance of 250 nt could overlap with a neighboring gene.

You should rather focus on the whole intergenic/inter-CDS regions, then search for a drop in coverage.
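
A rough base-R sketch of that idea, under several assumptions: per-base coverage for one strand is already in a numeric vector `cov` (for example expanded from a bedGraph), the genes sit on the plus strand of a single replicon, and a "drop" is simply the largest decrease between adjacent positions. The function name `find_coverage_drop`, the toy data and that criterion are illustrative only, not an established termination-site caller.

```r
# For each gap between consecutive plus-strand genes, report the position
# after which coverage falls the most.
find_coverage_drop <- function(genes, cov) {
  genes     <- genes[order(genes$start), ]
  gap_start <- head(genes$end, -1) + 1    # first base after each gene
  gap_end   <- tail(genes$start, -1) - 1  # last base before the next gene
  keep      <- gap_end > gap_start        # need at least two bases in the gap

  rows <- lapply(which(keep), function(i) {
    pos  <- gap_start[i]:gap_end[i]
    step <- diff(cov[pos])                # change from each base to the next
    j    <- which.min(step)               # most negative step = biggest drop
    data.frame(
      gap_start = gap_start[i],
      gap_end   = gap_end[i],
      drop_pos  = pos[j],                 # last base before the drop
      before    = cov[pos[j]],
      after     = cov[pos[j + 1]]
    )
  })
  do.call(rbind, rows)
}

# Toy example: coverage of 50 up to position 1950, then 5 afterwards
cov   <- c(rep(50, 1950), rep(5, 1050))
genes <- data.frame(start = c(1000, 2000), end = c(1900, 2600))
drops <- find_coverage_drop(genes, cov)
```

A gap with flat coverage still returns its first position, so in practice the `before` and `after` columns would need to pass some minimum ratio or difference before a site is called. Minus-strand genes would need the same scan run in the opposite direction.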

Why are you deleting your posts?

2.5 years ago

To be honest, I never really got the hang of dealing with BED files/genomic ranges in R. I personally find command-line tools such as BEDtools or bedops more user-friendly for these types of operations. This interactive tutorial may give you a sense of what you could do with these tools.
