Question

Picking out the first occurrence of a gene

7.1 years ago

vinayjrao ▴ 260

I have a file:

gene_name chr start end
FAM138A chr1 34553 36081
FAM138A chr1 35244 36073
OR4F5 chr1 69090 70008
RP11-34P13.7 chr1 89294 120932
RP11-34P13.8 chr1 89550 91105
RP11-34P13.7 chr1 92229 129217

I want to pick out the first occurrence of each gene as it would give me the longest transcript. Any help on doing the same would be appreciated.

Thank you.

grep awk • 1.5k views

ADD COMMENT • link updated 7.1 years ago by cpad0112 21k • written 7.1 years ago by vinayjrao ▴ 260

What have you tried? It's good practice to show the effort you took to solve this issue, rather than just asking us to solve it completely.

For example, if you show a bit of Python code I could fix it for you, or show your awk code and you'll automatically summon Pierre Lindenbaum.

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

I have been trying grep. I don't understand awk very well, so I am keeping that as an option, and I have no understanding of Python. I tried grep --max-count=1 "FAM138A" filename and got the desired result, but I want to know how to automate this for each gene.

Thanks again.

ADD REPLY • link 7.1 years ago by vinayjrao ▴ 260

Is this thread helpful? https://unix.stackexchange.com/questions/160009/remove-entire-row-in-a-file-if-first-column-is-repeated I found it by googling "only keep unique rows based on column unix".
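
A common awk idiom for this kind of de-duplication can be sketched as follows. This is a minimal sketch, not necessarily the exact command from the linked thread. It assumes whitespace-separated columns with the gene name in column 1. The file name genes.txt and the function name first_per_gene are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample file reproducing the table from the question.
cat > genes.txt <<'EOF'
gene_name chr start end
FAM138A chr1 34553 36081
FAM138A chr1 35244 36073
OR4F5 chr1 69090 70008
RP11-34P13.7 chr1 89294 120932
RP11-34P13.8 chr1 89550 91105
RP11-34P13.7 chr1 92229 129217
EOF

# Print a line only the first time its column-1 value is seen.
# seen[$1]++ evaluates to 0 (false) on the first encounter, so the
# negation is true and awk applies its default action: print the line.
first_per_gene() {
    awk '!seen[$1]++' "$1"
}

first_per_gene genes.txt
```

The header line survives because its first field (gene_name) is also seen only once, and the remaining lines keep their original order.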

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

That worked perfectly. Thank you very much.

ADD REPLY • link 7.1 years ago by vinayjrao ▴ 260

score 0 · Answer 1 · 2017-12-22

Hello,

you can do something like this:

cut -f 1 filename|tail -n+2|sort|uniq|parallel grep --max-count=1 {} filename

cut -f 1 filename gives us the first column with the gene names.

With tail -n+2 we get rid of the first line containing the header.

We then sort the list of gene names, because uniq only detects duplicates on adjacent lines.

So we end up with a list of all unique gene names. Using parallel, we pass each name in this list to grep, which returns the first line in the file that matches that gene name.

fin swimmer
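
The pipeline above can also be sketched without GNU parallel, using a plain shell loop. This variant is an assumption-laden sketch rather than the answer's own method. It swaps grep for an exact string comparison on column 1, because grep treats the gene name as a regular expression (the "." in RP11-34P13.7 matches any character) and matches it anywhere on the line. The file name genes.tsv and the function name first_hit_per_gene are hypothetical, and the input is assumed to be tab-separated, which is what cut -f 1 expects by default.

```shell
#!/bin/sh
# Hypothetical tab-separated sample file; printf reuses the format
# string for each group of four arguments.
printf '%s\t%s\t%s\t%s\n' \
    gene_name chr start end \
    FAM138A chr1 34553 36081 \
    FAM138A chr1 35244 36073 \
    OR4F5 chr1 69090 70008 \
    RP11-34P13.7 chr1 89294 120932 \
    RP11-34P13.8 chr1 89550 91105 \
    RP11-34P13.7 chr1 92229 129217 > genes.tsv

first_hit_per_gene() {
    file=$1
    # Column 1 without the header line, reduced to unique gene names.
    cut -f 1 "$file" | tail -n +2 | sort -u |
    while IFS= read -r gene; do
        # Appending "" forces a string (not numeric) comparison on
        # column 1; awk stops reading after the first matching line.
        awk -F '\t' -v g="$gene" '($1 "") == (g "") { print; exit }' "$file"
    done
}

first_hit_per_gene genes.tsv
```

As with the original pipeline, the header is dropped and the result is ordered by sorted gene name rather than by position in the file. The loop rereads the file once per gene, so it is likely to be slow on large inputs.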

score 0 · Answer 2 · 2017-12-22

7.1 years ago

cpad0112 21k

Try this. Output:

$ datamash -sH  -g1,2 first 3 first 4  < test.txt 
GroupBy(gene_name)  GroupBy(chr)    first(start)    first(end)
FAM138A chr1    34553   36081
OR4F5   chr1    69090   70008
RP11-34P13.7    chr1    89294   120932
RP11-34P13.8    chr1    89550   91105

Input:

$ cat test.txt 
gene_name   chr start   end
FAM138A chr1    34553   36081
FAM138A chr1    35244   36073
OR4F5   chr1    69090   70008
RP11-34P13.7    chr1    89294   120932
RP11-34P13.8    chr1    89550   91105
RP11-34P13.7    chr1    92229   129217

ADD COMMENT • link 7.1 years ago by cpad0112 21k

To format the output header:

$ datamash -sH  -g1,2 first 3 first 4  < test.txt | sed '1 s/\w\+\W\(\w\+\)\W/\1/g' 
gene_name   chr start   end
FAM138A chr1    34553   36081
OR4F5   chr1    69090   70008
RP11-34P13.7    chr1    89294   120932
RP11-34P13.8    chr1    89550   91105

ADD REPLY • link 7.1 years ago by cpad0112 21k