I have extracted the Transcription Start Sites (TSS) from hsapiens gene ensembl database and TSS flnaking regions through biomaRt and GenomicRanges. Then all targeted data has been stored in a bed format file. The code looks like this:
library(biomaRt)
library(regioneR)
library(BSgenome.Hsapiens.UCSC.hg38)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
att <- listAttributes(mart)
transcripts <- getBM(attributes=c("chromosome_name", "transcript_start", "transcript_end","strand", "ensembl_gene_id","gene_biotype", "ensembl_transcript_id"),
filters ="biotype",
values =c("protein_coding"),
mart =mart)
transcripts[transcripts$strand == 1,]$strand <- "+"
transcripts[transcripts$strand == -1,]$strand <- "-"
trans.range <- makeGRangesFromDataFrame(transcripts, keep.extra.columns = TRUE, ignore.strand = FALSE, start.field = "transcript_start", end.field = "transcript_end", strand.field = "strand")
#flank regions
flank.regions <- promoters(trans.range, upstream=1500, downstream=500, use.names=TRUE)
#to just keeping the interval ranges (IR) of standard chromosomes:
stand.flank <- flank.regions[seqnames(flank.regions) == c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22"),]
#change the levels to the only autosomal chr!
seq(1:22)
seqlevels(stand.flank) <- as.character(seq(1:22))
#split granges based on metadata elements (ensembl_gene_id)
gen.splitted <- split(stand.flank, stand.flank$ensembl_gene_id)
#reduce gen.splitted to get reduced intervals regions for each ensembl_gene_id in one list seperately
gen.reduced <- reduce(gen.splitted)
#to create bed.format file;
gen.bed <- unlist(gen.reduced)
#sort GRanges based on seqnames
gen.bed <- sortSeqlevels(gen.bed)
gen.bed <- sort(gen.bed)
#i want the "ensembl_gene_id" values to be stored in the "names" column of the BED file
names=elementMetadata(gen.bed)$ensembl_gene_id
#To write data in a Granges object to a bed File.
gen.bedfile <- data.frame(seqnames=seqnames(gen.bed),
starts=start(gen.bed)-1,
ends=end(gen.bed),
names=names(gen.bed),
scores=c(rep(".", length(gen.bed))),
#To put a "." instead of score value in each score column!
strands=strand(gen.bed))
write.table(df, file="foo.bed", quote=F, sep="\t", row.names=F, col.names=F)
#split bed file based on seqnames(chromosome number)
gen.bed.splitted <- split(gen.bedfile, gen.bedfile$seqnames)
#Remove empty zero length rows in a list
gen.bedfile <- gen.bed.splitted[sapply(gen.bed.splitted, nrow)>0]
Now, I'd like to create a snakefile following snakemake mannual to create a small pipeline. Since, I am new in snakemake, I'd appreciate if someone could guide me how would be the input of the first rule where I think it needs to call the hsapiens gene ensembl database. And what about the shell section! because I heard it is not a good idea to transform R script into Snakemake directly, and the Python code would be a better choice. Thanks a lot.
your script didn't take any input paramater (correct ?). So I don't understand the point of doing a snakemake file of that ? Is this R script one step in your pipeline ? FYI a simple call
Rscript your_script.R
is ok to call the scriptThank you for the reply. I want to make a workflow as Snakefile looking at my R scripts and yes it would be a begining part of workflow. I have just extracted the target regions considering the applied attributes and filters by getBM function in Biomart package. The source of all, is "hsapiens gene ensembl database". I actually don't know what could be supposed as input in the Snakefile's first rule in this case!! Also, I have tried to put my R scripts directly into the "shell" section of the rule using:
but it does not work. The corresponding error is about "Unexpected keyword Rscript in rule definition" and even "SyntaxError".
What about something like