Hello,
I am a beginner in bioinformatics in general, so please excuse my question if it seems easy to you.
I have identified differentially expressed genes using Deseq2. Next, I want to create a bed file containing these genes and around 2kb region of the promoter region, 1kb upstream and 1 kb downstream of the transcription start site using another gene reference file like this.
If the gene is in (-) position, the start site is column3 and the end site is column2, while if a gene is in (+) position, the start site is column2 and the end site is column 3.
Could you help me with the bioinformatics command/script using perl? Thank you very much in advance.
Chr Column2 Column3 Gene
chr1 134199214 134235457 Adora1(-)
chr1 134199214 134235457 Adora1(-)
chr1 134199214 134235457 Adora1(-)
chr1 134199214 134235457 Adora1(-)
chr1 134199218 134235052 Adora1(-)
chr1 8359738 8679095 Sntg1(-)
chr1 8359738 9250479 Sntg1(-)
chr1 8359738 9250481 Sntg1(-)
chr1 8359738 9299877 Sntg1(-)
chr1 8359738 9299877 Sntg1(-)
chr1 8359738 9299877 Sntg1(-)
chr1 8359738 9299877 Sntg1(-)
chr1 8359738 9301062 Sntg1(-)
chr1 8359738 9301063 Sntg1(-)
chr1 8359738 9301064 Sntg1(-)
chr1 8359738 9301064 Sntg1(-)
chr1 25067475 25829707 Adgrb3(-)
chr1 25067549 25830205 Adgrb3(-)
chr1 25067675 25531950 Adgrb3(-)
chr1 25067675 25559840 Adgrb3(-)
chr1 33453807 33669794 Prim2(-)
chr1 33540151 33669789 Prim2(-)
chr1 33540152 33669789 Prim2(-)
chr1 33540152 33669789 Prim2(-)
chr1 58711306 58758884 Cflar(+)
chr1 58711306 58758884 Cflar(+)
chr1 58711490 58759209 Cflar(+)
chr1 58711490 58759209 Cflar(+)
chr1 58711598 58758884 Cflar(+)
chr1 58712172 58758884 Cflar(+)
chr1 58712445 58758884 Cflar(+)
chr1 58713181 58758884 Cflar(+)
chr1 58713183 58758884 Cflar(+)
chr1 58713285 58733227 Cflar(+)
chr1 58713285 58759209 Cflar(+)
chr1 75485824 75506452 Obsl1(-)
chr1 75485824 75506658 Obsl1(-)
chr1 75485824 75506658 Obsl1(-)
chr1 75486776 75506658 Obsl1(-)
chr1 75486865 75506658 Obsl1(-)
chr1 75492631 75506658 Obsl1(-)
chr1 75496336 75506452 Obsl1(-)
chr1 125676995 125873862 Gpr39(+)
chr1 167688899 167848741 Lmx1a(+)
chr1 167688905 167848741 Lmx1a(+)
chr1 167689557 167848733 Lmx1a(+)
chr1 175962300 176214433 Pld5(-)
chr1 175962300 176276850 Pld5(-)
chr1 175962300 176276850 Pld5(-)
chr1 175962300 176276850 Pld5(-)
chr1 175962305 176213942 Pld5(-)
chr1 175962305 176275312 Pld5(-)
chr1 175963071 176274979 Pld5(-)
chr1 175963091 176213932 Pld5(-)
chr1 184527840 184557691 1700112H15Rik(-)
chr1 192885673 193035442 Syt14(-)
chr1 192885673 193035616 Syt14(-)
chr1 192885673 193035616 Syt14(-)
chr1 192891232 192980428 Syt14(-)
chr1 192891232 192986803 Syt14(-)
chr1 192891232 192986803 Syt14(-)
Can you give a bit more info as to your final goal here? This will help us to help you more effectively.
Also, when you say you want the "whole gene" do you mean the boundaries of the whole gene region on the chromosome (including introns, alternative transcripts, UTRs)? This is going to be tens of kilobases for most genes.
Hello Sir,
Thank yo for taking time in reading my question. Let's say I have identified Adora1(-) and Cflar(+) as DEGs. Next, I want to make a promoter bed file of these genes with 2kb length, 1kb upstream and 1kb downstream from TSS. Thus, I want to have a script which shows subtraction from chr start and chr end that is 1 1kb upstream from TSS and 1kb downstream from TSS. Since the genes have different signs, then Adora(-) chr start site is column3 and end site at column2, and the list show show 5 values for Adora(-) since it has 5 promoter regions:chr1 134199214 134235457 Adora1(-) chr1 134199214 134235457 Adora1(-) chr1 134199214 134235457 Adora1(-) chr1 134199214 134235457 Adora1(-) chr1 134199218 134235052 Adora1(-) For Cplar, the start site should be column2 while end site should be column1, since it has a positive sign, and 11 values since it has 11 promoter regions. chr1 58711306 58758884 Cflar(+) chr1 58711306 58758884 Cflar(+) chr1 58711490 58759209 Cflar(+) chr1 58711490 58759209 Cflar(+) chr1 58711598 58758884 Cflar(+) chr1 58712172 58758884 Cflar(+) chr1 58712445 58758884 Cflar(+) chr1 58713181 58758884 Cflar(+) chr1 58713183 58758884 Cflar(+) chr1 58713285 58733227 Cflar(+) chr1 58713285 58759209 Cflar(+) I hope this clarifies better. Thank you in advance
I'm not sure if you were trying to format your lines in some specific way, but the website didn't insert any line breaks so it's hard to understand.
I know how to make a bedfile like you're describing using python, but I don't know Perl very well.
Both python and perl have bio-libraries that will come in handy for this kind of thing, especially if you eventually want fasta files from them.
In either language, you should be able to use a simple if - else statement like below (i haven't checked this code, it's just a suggested starting point).
Good day Sir,
Thank you again for your reply. I understood what you're saying and tried to write it in a perl script: module load perl
open (IN, "mm10_NCBI-RefSeq.bed"); open (OUT, ">DEGs_promoters.bed");
while (<in>) { @data = split(/\t/,$_); @strand = split(/(|)/,$data[-1]); if ($strand[1] eq "+") {$start = $data[1];} else {$start = $data[2];}
}
close (IN);
open (CSV, "genes.csv"); @gene = split(/\t/,$_); $filter_genes = $gene[0]; if $filter_gene = $data[-1]; print OUT "$data[0]\t$s\t$e\t$data[-1]";
close (CSV); close (OUT);
My next problem is I have to make sure that I only get an output bed file of promoters only for my genes of interest, contained in column 1 or data[0] of my csv file. I cannot figure out how to program it. Thank you again.
You lost code formatting for the last few lines. Try writing out your code in a text editor, pasting it into the text box, highlighting the whole thing and then pushing the "code" button in the formatting bar (looks like 1010101).
For your specific question, you just need to make a variable that contains only your genes of interest. If you have a text file with the genes you want, you can read that in and add those values to a set or dictionary (this is very easy to do in python, you'll have to figure out how to do it in perl).
Once you've got the genes of interest in there, you can just add an if statement. Here's how I'd do it in python: