Question

How to locate and extract data from a df in R

0

Entering edit mode

2.7 years ago

margo ▴ 40

I am looking at termination data in a gtf file and I am wanting to write a function that allows me to extract the following in R: 1) get coordinate 200 nt downstream of the stop codon. 2) count the termination site. 3) if the files inputted meet the read ratio of -1 (termination sequence site) to +1.2 (1 nt downstream of termination sequence site), mark it as A.

I have put my data into a dataframe containing the count of the start and stop codon. I am wanting to apply this function to my files containing count data. I am struggling to be able to write the function which enables me to locate the coordinate within 200 nt downstream of the stop codon and take into account that the if else statement should address that the coordinate cannot pass the start codon/should not be zero.

Any help would be massively appreciated.

gtf sequencing R igv • 2.2k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 2.7 years ago by margo ▴ 40

1

Entering edit mode

Don't write this from scratch. Make use of the GenomicRanges package which can do all this, check its documentation over at Bioconductor.

ADD REPLY • link 2.7 years ago by ATpoint 87k

0

Entering edit mode

Thank you. Would it be able to write this all as one function?

ADD REPLY • link 2.7 years ago by margo ▴ 40

0

Entering edit mode

Probably, hard to tell without example data and an example of desired output.

ADD REPLY • link 2.7 years ago by ATpoint 87k

0

Entering edit mode

Hi margo, all code in R can be wrapped into a function. However, the description of your approach is unclear and confusing, e.g. first it's 250nt then 200nt, what do you mean by "counts of genes following 1)", which counts, how and from what do you compute a read ratio? Do yourself a favor and write down and define your approach in plain English first, (maybe make a drawing too) before you start implementing it, this will help to clarify things for us and most importantly for yourself. Otherwise, you are possibly not going to get useful results.

ADD REPLY • link 2.7 years ago by Michael 55k

0

Entering edit mode

I have now updated my question. Thank you for pointing this out.

ADD REPLY • link 2.7 years ago by margo ▴ 40

0

Entering edit mode

What is the read ratio? Do you have sequencing reads of some sort?

ADD REPLY • link 2.7 years ago by Michael 55k

0

Entering edit mode

I have a reference genome that contains start and end counts and I have multiple count reads for positive and negative strands for different sequences in bedgraph file format. The desired output would be to produce a file that can be viewed on IGV.

ADD REPLY • link 2.7 years ago by margo ▴ 40

1

Entering edit mode

The desired output would be to produce a file that can be viewed on IGV.

The files you have, if they are GTF and BED, could already be opened in IGV.

ADD REPLY • link 2.7 years ago by Michael 55k

0

Entering edit mode

start and end counts

Do you mean coordinates?

ADD REPLY • link 2.7 years ago by Michael 55k

0

Entering edit mode

Yes. So I have managed to write this code to get the coordinates 200 nt downstream however it is not using GenomicRanges package which is claimed to be easier.

df$TES_end = ifelse(df$strand=="-", df$start-200, df$end+200)
df$TES_end = ifelse(df$TES_end<1, 1, df$TES_end)

ADD REPLY • link 2.7 years ago by margo ▴ 40

0

Entering edit mode

Yes, that could work to give you some coordinates, however, 1) now we are at 200nt again. 2) what are you going to do with these coordinates, getting the coverage from the bed files at these locations? 3) what is the point of looking at exactly 250nt downstream of the stop codon? The UTR could stretch out for much longer.

4) Is this bacterial data, only then it may make some sense at all?

ADD REPLY • link 2.7 years ago by Michael 55k

0

Entering edit mode

I am trying to identify termination sites. And yes, it is bacterial data.

ADD REPLY • link 2.7 years ago by margo ▴ 40

0

Entering edit mode

I think you should consider two things:

Your transcriptional termination site could be at any distance to the stop codon, if it exists
Your arbitrarily chosen distance of 250 nt could overlap with a neighboring gene.

You should rather focus on the whole intergenic/inter-CDS regions, then make a search for a drop of coverage.

ADD REPLY • link 2.7 years ago by Michael 55k

0

Entering edit mode

Why are you deleting your posts?

ADD REPLY • link 2.6 years ago by Ram 44k

score 1 · Answer 1 · 2022-06-21

1

Entering edit mode

2.7 years ago

Friederike 9.0k

To be honest, I never really got the hang of dealing with BED files/genomic ranges in R. I personally find command line based tools such as BEDtools or bedops more user-friendly for these types of operations. This interactive tutorial may give you a sense of what you could do with these tools.

ADD COMMENT • link 2.7 years ago by Friederike 9.0k