I am looking at termination data in a gtf file and I am wanting to write a function that allows me to extract the following in R: 1) get coordinate 200 nt downstream of the stop codon. 2) count the termination site. 3) if the files inputted meet the read ratio of -1 (termination sequence site) to +1.2 (1 nt downstream of termination sequence site), mark it as A.
I have put my data into a dataframe containing the count of the start and stop codon. I am wanting to apply this function to my files containing count data. I am struggling to be able to write the function which enables me to locate the coordinate within 200 nt downstream of the stop codon and take into account that the if else
statement should address that the coordinate cannot pass the start codon/should not be zero.
Any help would be massively appreciated.
Don't write this from scratch. Make use of the GenomicRanges package which can do all this, check its documentation over at Bioconductor.
Thank you. Would it be able to write this all as one function?
Probably, hard to tell without example data and an example of desired output.
Hi margo, all code in R can be wrapped into a function. However, the description of your approach is unclear and confusing, e.g. first it's 250nt then 200nt, what do you mean by "counts of genes following 1)", which counts, how and from what do you compute a read ratio? Do yourself a favor and write down and define your approach in plain English first, (maybe make a drawing too) before you start implementing it, this will help to clarify things for us and most importantly for yourself. Otherwise, you are possibly not going to get useful results.
I have now updated my question. Thank you for pointing this out.
What is the read ratio? Do you have sequencing reads of some sort?
I have a reference genome that contains start and end counts and I have multiple count reads for positive and negative strands for different sequences in bedgraph file format. The desired output would be to produce a file that can be viewed on IGV.
The files you have, if they are GTF and BED, could already be opened in IGV.
Do you mean coordinates?
Yes. So I have managed to write this code to get the coordinates 200 nt downstream however it is not using GenomicRanges package which is claimed to be easier.
Yes, that could work to give you some coordinates, however, 1) now we are at 200nt again. 2) what are you going to do with these coordinates, getting the coverage from the bed files at these locations? 3) what is the point of looking at exactly 250nt downstream of the stop codon? The UTR could stretch out for much longer.
4) Is this bacterial data, only then it may make some sense at all?
I am trying to identify termination sites. And yes, it is bacterial data.
I think you should consider two things:
You should rather focus on the whole intergenic/inter-CDS regions, then make a search for a drop of coverage.
Why are you deleting your posts?