Hi I have the following code:
grep -w -F -i -f gene_list.txt gencode.vM18.promoters.bed > gene_list_promoters.bed
head gene_list.txt
Rn45s
Malat1
Sptbn1
head gencode.vM18.promoters.bed
chr1 3071252 3075252 ENSMUSG00000102693.1 . + HAVANA gene . ID=ENSMUSG00000102693.1;gene_id=ENSMUSG00000102693.1;gene_type=TEC;gene_name=RP23-271O17.1;level=2;havana_gene=OTTMUSG00000049935.1
chr1 3100015 3104015 ENSMUSG00000064842.1 . + ENSEMBL gene . ID=ENSMUSG00000064842.1;gene_id=ENSMUSG00000064842.1;gene_type=snRNA;gene_name=Gm26206;level=3
chr1 3669498 3673498 ENSMUSG00000051951.5 . - HAVANA gene . ID=ENSMUSG00000051951.5;gene_id=ENSMUSG00000051951.5;gene_type=protein_coding;gene_name=Xkr4;level=2;havana_gene=OTTMUSG00000026353.2
It retrieves the promoters for the genes in the .txt list. However, this is an ordered gene list, and the promoters that are retrieved come out ordered by chromosome. What can I add to the code to order the promoters in the same order as the gene list? Thanks.
You should add the output of head for each of the two input files.
In all probability, you'll either need to do the sorting in a separate step or use process substitution.
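For what it's worth, one way the "sorting in a separate step" idea could look is sketched below. This is only a sketch on made-up toy data: the file name promoters.bed, the gene names GeneA/GeneB and the IDs are all hypothetical, and it assumes every bed line carries a gene_name=...; attribute as in the head output above. awk tags each matching promoter with the gene's position in the list, sort orders by that tag, and cut removes it again.

```shell
#!/usr/bin/env bash
# Toy inputs (hypothetical names and coordinates, for illustration only).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' GeneB GeneA > gene_list.txt
{
  printf 'chr1\t100\t200\tidA\t.\t+\tsrc\tgene\t.\tID=idA;gene_name=GeneA;level=2\n'
  printf 'chr1\t300\t400\tidAS\t.\t-\tsrc\tgene\t.\tID=idAS;gene_name=GeneA-AS1;level=2\n'
  printf 'chr2\t500\t600\tidB\t.\t+\tsrc\tgene\t.\tID=idB;gene_name=GeneB;level=2\n'
} > promoters.bed

# First file: remember each gene's position in the list.
# Second file: pull out the exact gene_name value and, if it is in the list,
# prefix the line with that position. Then sort on the prefix and drop it.
awk 'NR == FNR { rank[$1] = NR; next }
     match($0, /gene_name=[^;]+/) {
       name = substr($0, RSTART + 10, RLENGTH - 10)   # 10 = length of "gene_name="
       if (name in rank) print rank[name] "\t" $0
     }' gene_list.txt promoters.bed |
  sort -k1,1n |
  cut -f2- > ordered_by_list.bed

cat ordered_by_list.bed
```

Because the gene name is compared as a whole string, GeneA-AS1 is not picked up when only GeneA is in the list. The match is case-sensitive, unlike the original grep -i.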
Just added the head output.
Would this work?
Update: instead of grep ${line}, please use grep "gene_name=${line};" or something similar.
Also, if you are looking to extract based on gene_list.txt, I think your code has a bug: I tested only the first three genes against the first 3 lines of gencode.vM18.promoters.bed, and even though these genes are not found in the toy bed file, your code still outputs something.
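A minimal sketch of a loop along these lines, on made-up toy data (promoters.bed, GeneA/GeneB and the IDs are hypothetical, and this is not necessarily the exact loop posted in this thread). It reads the gene list line by line and greps for the anchored pattern, so the output follows the order of the list. It assumes gene_name is always followed by a semicolon, as in the head output above.

```shell
#!/usr/bin/env bash
# Toy inputs (hypothetical names and coordinates, for illustration only).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' GeneB GeneA > gene_list.txt
{
  printf 'chr1\t100\t200\tidA\t.\t+\tsrc\tgene\t.\tID=idA;gene_name=GeneA;level=2\n'
  printf 'chr1\t300\t400\tidAS\t.\t-\tsrc\tgene\t.\tID=idAS;gene_name=GeneA-AS1;level=2\n'
  printf 'chr2\t500\t600\tidB\t.\t+\tsrc\tgene\t.\tID=idB;gene_name=GeneB;level=2\n'
} > promoters.bed

# One grep per gene, in list order. The redirect sits on the whole loop,
# so the output file is rewritten on every run instead of growing.
while IFS= read -r line; do
  [ -n "$line" ] || continue                       # skip blank lines
  grep -F -- "gene_name=${line};" promoters.bed || true   # no match is not an error
done < gene_list.txt > gene_list_promoters.bed

cat gene_list_promoters.bed
```

With 14K genes this runs grep 14K times over the bed file, which is slow but simple. The -F flag keeps characters such as . in gene names from being read as regex syntax.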
Toy Gene list
Toy bed file
Your code:
Your grep will pick the entry for A1BG-AS1 when grepping for A1BG.
Thanks for pointing this out.
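To make the A1BG / A1BG-AS1 problem concrete, here is a small check on two made-up bed lines (coordinates and IDs are hypothetical). It counts matching lines for a bare pattern, for the same pattern with -w, and for the anchored gene_name=...; pattern. Note that -w does not help here, because the hyphen counts as a word boundary.

```shell
#!/usr/bin/env bash
# Two toy lines: one for A1BG, one for its antisense A1BG-AS1 (made-up coordinates).
bed=$(mktemp)
printf 'chr19\t100\t200\tid1\t.\t-\tsrc\tgene\t.\tID=id1;gene_name=A1BG;level=2\n' > "$bed"
printf 'chr19\t300\t400\tid2\t.\t+\tsrc\tgene\t.\tID=id2;gene_name=A1BG-AS1;level=2\n' >> "$bed"

loose=$(grep -c -F -- "A1BG" "$bed")               # substring match
word=$(grep -c -w -F -- "A1BG" "$bed")             # whole-word match
strict=$(grep -c -F -- "gene_name=A1BG;" "$bed")   # anchored on the attribute

echo "$loose $word $strict"   # prints: 2 2 1
```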
I think grep "gene_name=${line};" should fix this.
This does seem to work in terms of ordering the sites. However, the gencode list has 54K rows and my gene list has 14K, but I get a final tally of 140K rows. Is that due to alternate promoters?
>> appends, so each time you run the command it adds to the same file. You might want to delete the result file (gene_list_promoters.bed) and run again, because the result of the first run is not correct.
Also, please make sure your gene_list.txt doesn't have duplicate names. Use
To make sure it's equal to the number of lines in gene_list.txt
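One possible way to run that duplicate check, sketched on a made-up list (this is an assumption about what is useful here, not necessarily the command meant above; gene_list.dedup.txt is a hypothetical output name). It compares the total and unique line counts, lists any repeated names, and writes a de-duplicated copy that keeps the original order.

```shell
#!/usr/bin/env bash
# Toy gene list with one repeated name (hypothetical data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' GeneB GeneA GeneB > gene_list.txt

total=$(wc -l < gene_list.txt | tr -d ' ')            # all lines
unique=$(sort -u gene_list.txt | wc -l | tr -d ' ')   # distinct names
dups=$(sort gene_list.txt | uniq -d)                  # names listed more than once

echo "total=$total unique=$unique"
echo "duplicated: $dups"

# Keep only the first occurrence of each name, preserving list order.
awk '!seen[$0]++' gene_list.txt > gene_list.dedup.txt
```

If total and unique differ, each repeated gene contributes its promoters more than once to the output.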