grep sorting question
1
0
Entering edit mode
4.8 years ago
rbronste ▴ 420

Hi I have the following code:

grep -w -F -i -f  gene_list.txt gencode.vM18.promoters.bed >  gene_list_promoters.bed

head gene_list.txt

    Rn45s
    Malat1
    Sptbn1


head gencode.vM18.promoters.bed

chr1    3071252 3075252 ENSMUSG00000102693.1    .   +   HAVANA  gene    .   ID=ENSMUSG00000102693.1;gene_id=ENSMUSG00000102693.1;gene_type=TEC;gene_name=RP23-271O17.1;level=2;havana_gene=OTTMUSG00000049935.1
chr1    3100015 3104015 ENSMUSG00000064842.1    .   +   ENSEMBL gene    .   ID=ENSMUSG00000064842.1;gene_id=ENSMUSG00000064842.1;gene_type=snRNA;gene_name=Gm26206;level=3
chr1    3669498 3673498 ENSMUSG00000051951.5    .   -   HAVANA  gene    .   ID=ENSMUSG00000051951.5;gene_id=ENSMUSG00000051951.5;gene_type=protein_coding;gene_name=Xkr4;level=2;havana_gene=OTTMUSG00000026353.2

It retrieves the promoters specific to the .txt list of genes, however this is an ordered gene list and the promoters that are retrieved are ordered by chromosome, wondering what I can add to the code the order the promoters in the same order as gene list? Thanks.

grep awk sed bash unix • 1.7k views
ADD COMMENT
1
Entering edit mode

You should add the output of head for each of the two input files.

In all probability, you'll either need to do the sorting in a separate step or use process substitution.

ADD REPLY
0
Entering edit mode

Just added the head output

ADD REPLY
1
Entering edit mode

Would this work?

 cat gene_list.txt | while read line ; do grep ${line} gencode.vM18.promoters.bed >>  gene_list_promoters.bed ; done

Update:

instead of grep ${line} please use grep "gene_name=${line};" or something similar.

cat gene_list.txt | while read line ; do grep "gene_name=${line};" gencode.vM18.promoters.bed >>  gene_list_promoters.bed ; done

Also If you are looking to extract based on gene_list.txt I think your code has a bug, because I'm only testing on three first genes and first 3 lines of gencode.vM18.promoters.bed and even though these genes are not found in the toy bed file your code outputs something.

Toy Gene list

cat gene_list.txt 
Rn45s
Malat1
Sptbn1

Toy bed file

cat gencode.vM18.promoters.bed
chr1    3071252 3075252 ENSMUSG00000102693.1    .   +   HAVANA  gene    .   ID=ENSMUSG00000102693.1;gene_id=ENSMUSG00000102693.1;gene_type=TEC;gene_name=RP23-271O17.1;level=2;havana_gene=OTTMUSG00000049935.1
chr1    3100015 3104015 ENSMUSG00000064842.1    .   +   ENSEMBL gene    .   ID=ENSMUSG00000064842.1;gene_id=ENSMUSG00000064842.1;gene_type=snRNA;gene_name=Gm26206;level=3
chr1    3669498 3673498 ENSMUSG00000051951.5    .   -   HAVANA  gene    .   ID=ENSMUSG00000051951.5;gene_id=ENSMUSG00000051951.5;gene_type=protein_coding;gene_name=Xkr4;level=2;havana_gene=OTTMUSG00000026353.2

Your code:

grep -w -F -i -f  gene_list.txt gencode.vM18.promoters.bed 
chr1    3071252 3075252 ENSMUSG00000102693.1    .   +   HAVANA  gene    .   ID=ENSMUSG00000102693.1;gene_id=ENSMUSG00000102693.1;gene_type=TEC;gene_name=RP23-271O17.1;level=2;havana_gene=OTTMUSG00000049935.1
chr1    3100015 3104015 ENSMUSG00000064842.1    .   +   ENSEMBL gene    .   ID=ENSMUSG00000064842.1;gene_id=ENSMUSG00000064842.1;gene_type=snRNA;gene_name=Gm26206;level=3
chr1    3669498 3673498 ENSMUSG00000051951.5    .   -   HAVANA  gene    .   ID=ENSMUSG00000051951.5;gene_id=ENSMUSG00000051951.5;gene_type=protein_coding;gene_name=Xkr4;level=2;havana_gene=OTTMUSG00000026353.2
ADD REPLY
0
Entering edit mode

Your grep will pick the entry for A1BG-AS1 when grepping for A1BG.

ADD REPLY
0
Entering edit mode

Thanks for pointing this out.

I think grep "gene_name=${line};" should fix this.

cat gene_list.txt | while read line ; do grep "gene_name=${line};" gencode.vM18.promoters.bed >> gene_list_promoters.bed ; done
ADD REPLY
0
Entering edit mode

This does seem to work in terms of ordering the sites however the gencode list has 54K rows and my gene list has 14k but I get a final tally of 140K rows. Is that due to alternate promoters?

ADD REPLY
0
Entering edit mode

>> appends, so each time you run the command it adds to the same file. You might want to delete the resultfile (gene_list_promoters.bed ) and run again because the result of the first run is not correct.

Also please make sure your gene_list.txt doesn't have duplicate names. Use

cat gene_list.txt | sort | uniq | wc - l

To make sure it's equal the number of lines in gene_list.txt

ADD REPLY
1
Entering edit mode
4.8 years ago
Ram 44k

What is the purpose of this exercise?

I think the best way to do this would be to use R, especially if the number of lines in the two files are not the same. If you're willing to abuse bash and re-create the file, you can loop through gene_list.txt line by line and run the grep for each line, >>-ing to the output file. But like I said, that would be abusing the shell.

ADD COMMENT
0
Entering edit mode

The purpose is simple. The genes are in an order of descending normalized signal strength in the txt file, I would like the same in the TSS interval file for those specific genes.

ADD REPLY
0
Entering edit mode

Why does the order of entries in a BED file matter? Is there a tool that works only with your custom order?

ADD REPLY
0
Entering edit mode

Yes thats right, Im going to use the resulting BED file for a heatmap where the descending order is important.

ADD REPLY

Login before adding your answer.

Traffic: 1639 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6