Create snakefile based on R script
1
0
Entering edit mode
6.0 years ago
r.tor ▴ 50

I have extracted the Transcription Start Sites (TSS) from hsapiens gene ensembl database and TSS flnaking regions through biomaRt and GenomicRanges. Then all targeted data has been stored in a bed format file. The code looks like this:

library(biomaRt)
library(regioneR)
library(BSgenome.Hsapiens.UCSC.hg38)

mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
att <- listAttributes(mart)

transcripts <- getBM(attributes=c("chromosome_name", "transcript_start", "transcript_end","strand",        "ensembl_gene_id","gene_biotype", "ensembl_transcript_id"),
  filters ="biotype",
  values  =c("protein_coding"),
  mart    =mart)

transcripts[transcripts$strand == 1,]$strand <- "+"
transcripts[transcripts$strand == -1,]$strand <- "-"

trans.range <- makeGRangesFromDataFrame(transcripts, keep.extra.columns = TRUE, ignore.strand =  FALSE, start.field = "transcript_start", end.field = "transcript_end", strand.field = "strand")

 #flank regions
 flank.regions <- promoters(trans.range, upstream=1500, downstream=500, use.names=TRUE)

 #to just keeping the interval ranges (IR) of standard chromosomes:
 stand.flank <-  flank.regions[seqnames(flank.regions) ==     c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22"),]

#change the levels to the only autosomal chr!
seq(1:22)
seqlevels(stand.flank) <- as.character(seq(1:22))

#split granges based on metadata elements (ensembl_gene_id)

gen.splitted <- split(stand.flank, stand.flank$ensembl_gene_id)

#reduce gen.splitted to get reduced intervals regions for each ensembl_gene_id in one list seperately
gen.reduced <- reduce(gen.splitted)

#to create bed.format file;
gen.bed <- unlist(gen.reduced)

#sort GRanges based on seqnames
gen.bed <- sortSeqlevels(gen.bed)
gen.bed <- sort(gen.bed)

#i want the "ensembl_gene_id" values to be stored in the "names" column of the BED file
names=elementMetadata(gen.bed)$ensembl_gene_id

 #To write data in a Granges object to a bed File.
 gen.bedfile <- data.frame(seqnames=seqnames(gen.bed),
 starts=start(gen.bed)-1,
 ends=end(gen.bed),
 names=names(gen.bed),
 scores=c(rep(".", length(gen.bed))), 

 #To put a "." instead of score value in each score column!
 strands=strand(gen.bed))
 write.table(df, file="foo.bed", quote=F, sep="\t", row.names=F, col.names=F)

#split bed file based on seqnames(chromosome number)
gen.bed.splitted <- split(gen.bedfile, gen.bedfile$seqnames)

#Remove empty zero length rows in a list
gen.bedfile <- gen.bed.splitted[sapply(gen.bed.splitted, nrow)>0]

Now, I'd like to create a snakefile following snakemake mannual to create a small pipeline. Since, I am new in snakemake, I'd appreciate if someone could guide me how would be the input of the first rule where I think it needs to call the hsapiens gene ensembl database. And what about the shell section! because I heard it is not a good idea to transform R script into Snakemake directly, and the Python code would be a better choice. Thanks a lot.

R snakemake python snakefile • 5.0k views
ADD COMMENT
0
Entering edit mode

your script didn't take any input paramater (correct ?). So I don't understand the point of doing a snakemake file of that ? Is this R script one step in your pipeline ? FYI a simple call Rscript your_script.R is ok to call the script

ADD REPLY
0
Entering edit mode

Thank you for the reply. I want to make a workflow as Snakefile looking at my R scripts and yes it would be a begining part of workflow. I have just extracted the target regions considering the applied attributes and filters by getBM function in Biomart package. The source of all, is "hsapiens gene ensembl database". I actually don't know what could be supposed as input in the Snakefile's first rule in this case!! Also, I have tried to put my R scripts directly into the "shell" section of the rule using:

rule tss_ranges:
output:
    "transcripts.txt"
run:
robjects.r(("""

    mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
    att <- listAttributes(mart)

    {output} = getBM(attributes=c("chromosome_name", "transcript_start",
    "transcript_end","strand", "ensembl_gene_id","gene_biotype",
    "ensembl_transcript_id"),
    filters ="biotype",
    values  =c("protein_coding"),
    mart    =mart)

    """))

but it does not work. The corresponding error is about "Unexpected keyword Rscript in rule definition" and even "SyntaxError".

ADD REPLY
2
Entering edit mode

What about something like

rule tss_ranges:
    output:
        "transcripts.txt"
    shell:
        "Rscript your_script.R"
ADD REPLY
2
Entering edit mode
6.0 years ago

The easiest would probably be to just save your code as an Rscript (mycode.R), and in the shell part run Rscript mycode.R, optionally with arguments.

ADD COMMENT
0
Entering edit mode

Hi, you mean is to put all R scripts in one rule? and what about the input section of the rule?

ADD REPLY
1
Entering edit mode

I think you serisouly need to read & understand snakemake tutorial

ADD REPLY
1
Entering edit mode

From https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html:

the syntax follows that of S4 classes with attributes that are R lists, e.g. we can access the first input file with snakemake@input[[1]] (note that the first file does not have index 0 here, because R starts counting from 1). Named input and output files can be accessed in the same way, by just providing the name instead of an index, e.g. snakemake@input[["myfile"]].

Input would be the file you would supply to R for loading and output is expected output.

ADD REPLY
0
Entering edit mode

Which input does that code take?

ADD REPLY
0
Entering edit mode

just downloading the ordered regions from database!

ADD REPLY

Login before adding your answer.

Traffic: 1791 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6