Question

Sum the values based on the variant names and give the result in stats format

22 months ago

smrutimayipanda ▴ 20

I have a text file in which the contents are the following:

      3 synonymous_variant
      1 missense_variant
      1 EFFECT
      1 downstream_gene_variant
      6 missense_variant
      2 upstream_gene_variant
      2 synonymous_variant
      1 EFFECT
      1 downstream_gene_variant
      4 missense_variant
      3 synonymous_variant
      1 upstream_gene_variant
      1 EFFECT
      1 downstream_gene_variant
      3 synonymous_variant
      3 missense_variant
      1 EFFECT
      4 synonymous_variant
      3 missense_variant
      1 EFFECT
      1 downstream_gene_variant
      6 missense_variant
      1 synonymous_variant
      1 EFFECT
      1 downstream_gene_variant
      3 missense_variant
      1 EFFECT
      1 downstream_gene_variant
      4 synonymous_variant
      4 missense_variant
      1 EFFECT
      2 missense_variant
      1 upstream_gene_variant

From this, I need the following result:

missense_variant  its total
downstream variant  its total
upstream variant  its total
....etc

I tried, but did not get the correct result. Can anyone please tell me how to do it in Python, shell, or any other language? Thanks in advance!

coding • 1.4k views

ADD COMMENT • link updated 22 months ago by Ram 44k • written 22 months ago by smrutimayipanda ▴ 20

What have you tried? This should be straightforward in awk. With R, this should be even simpler.

ADD REPLY • link 22 months ago by Ram 44k

I tried with Python, but it gave me the total of all variants. Can you please tell me how to do it using awk?

ADD REPLY • link 22 months ago by smrutimayipanda ▴ 20

What did you try with Python? Did you make a dict keyed on column 2 and then sum column 1 for each unique key?
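
A minimal sketch of that dict-and-sum approach, assuming every line holds a count followed by a variant name, separated by whitespace. The function name `sum_by_name` and the use of `collections.defaultdict` are illustrative choices, not something posted in this thread:

```python
from collections import defaultdict

def sum_by_name(lines):
    """Sum column 1 (count) for each unique column 2 value (name)."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        count, name = fields
        totals[name] += int(count)
    return dict(totals)

# A few lines in the same layout as the question's input
sample = [
    "      3 synonymous_variant",
    "      1 missense_variant",
    "      6 missense_variant",
    "      2 upstream_gene_variant",
]
print(sum_by_name(sample))
# {'synonymous_variant': 3, 'missense_variant': 7, 'upstream_gene_variant': 2}
```

Passing an open file object instead of `sample` should work the same way, since iterating over a file yields its lines.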

ADD REPLY • link 22 months ago by Ram 44k

I did this:

data = {}

with open('sorted_effect_distribution.txt', 'r') as f:
    for line in f:
        name, value = line.strip().split()
        if name in data:
            data[name] += int(value[0])
        else:
            data[name] = int(value[0])

for name, value in data.items():
    print(f"{name}: {value}")

ADD REPLY • link updated 22 months ago by Ram 44k • written 22 months ago by smrutimayipanda ▴ 20

Please give the command in awk. It would be really helpful.

ADD REPLY • link 22 months ago by smrutimayipanda ▴ 20

No. It's a good exercise for you. Search online for how to use awk associative arrays (awk's equivalent of dictionaries).

ADD REPLY • link 22 months ago by Ram 44k

Please let others comment on this. Thanks for your time.

ADD REPLY • link 22 months ago by smrutimayipanda ▴ 20

I'm not stopping anyone from commenting. Most people are ignoring the post; I'm simply taking the time to tell you that you're better off following a certain path.

ADD REPLY • link 22 months ago by Ram 44k

You have the columns inverted - shouldn't you be doing value, name = line.strip().split()?
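
For reference, here is a sketch of the posted script with that swap applied. It also assumes a second change that nobody in the thread mentioned: `int(value[0])` converts only the first character of the count, so a count such as 12 would be read as 1, and `int(value)` is probably what is wanted. The wrapper function `sum_counts` and the temporary demo file are hypothetical, added only so the example runs on its own:

```python
import os
import tempfile

def sum_counts(path):
    """Return {variant_name: total_count} for a 'count name' file."""
    data = {}
    with open(path, "r") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # ignore blank or malformed lines
            value, name = parts  # count comes first, then the name
            # Convert the whole count, not just its first character
            data[name] = data.get(name, 0) + int(value)
    return data

# Demo on a small temporary file (stand-in for the real input file)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("      3 synonymous_variant\n")
    tmp.write("     12 missense_variant\n")
    tmp.write("      1 missense_variant\n")
demo_path = tmp.name

result = sum_counts(demo_path)
os.remove(demo_path)
for name, value in result.items():
    print(f"{name}: {value}")
# synonymous_variant: 3
# missense_variant: 13
```

To run it on the real data, call `sum_counts('sorted_effect_distribution.txt')` instead of using the demo file.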

ADD REPLY • link 22 months ago by Ram 44k