Question

Convert two-column txt to multiple column csv

3

Entering edit mode

6.1 years ago

misterie ▴ 110

Hi,

I need some advice. I have output file (count file) from VCF format. It looks like this:

Chr 10
protein_coding 447164
pseudogene 87457
Chr 11
protein_coding 368825
pseudogene 78131
Chr 12
protein_coding 357596
pseudogene 68176

and there are more chromosomes. I have two others files with another column names (but could differ with 1 or more fields between chromosomes). How can I convert that file to CSV or another file format. I mean, I want to create file like this:

Chr,protein_coding,pseudogene
10,447164,87457
11,368825,78131
12,357596,68176

Assuming that if some chromosome does not has for example pseudogene, than script will put empty field, e.g. for 15 chromosome:

15,132598,

Thank you in advance

bash python r conversion • 1.8k views

ADD COMMENT • link updated 6.1 years ago by cpad0112 21k • written 6.1 years ago by misterie ▴ 110

0

Entering edit mode

Thank you for explaining the problem at hand so well.

What have you tried by yourself to solve this problem? How far did you get and what specific challenges are you facing?

ADD REPLY • link 6.1 years ago by Ram 44k

0

Entering edit mode

I have no idea, how can I do that...

ADD REPLY • link 6.1 years ago by misterie ▴ 110

1

Entering edit mode

You can use awk with RS=Chr. That will use Chr to create records, so each record would contain all data between consecutive Chrs. You can then replace each new line by a space and use $1, $2 etc to get to your result.

ADD REPLY • link 6.1 years ago by Ram 44k

0

Entering edit mode

Also, please see the following oddities:

protein_coding becomes protein_codin
There's a blank space preceding pseudogene. Why?
Your example for "15 chromosome" has entry for Chr = 16
Your current dataset has nothing separating Chr (the key) from the chromosome number (the value). Is this true?
The comma-separated format you've shown is actually semi-colon separated. You do mention "or other", so that's fine I guess, but pick a separator.

These will make a difference in the final script you develop.

ADD REPLY • link 6.1 years ago by Ram 44k

0

Entering edit mode

sorry, my bad. I have corrected everything.

ADD REPLY • link 6.1 years ago by misterie ▴ 110

score 4 · Answer 1 · 2018-11-30

Using R dplyr package:

library(dplyr)
library(tidyr) # spread: reshape long-to-wide

# example data (using @RamRS's data)
df1 <- read.table(text = "Chr 10
protein_coding 447164
pseudogene 87457
Chr 11
protein_coding 368825
pseudogene 78131
Chr 12
protein_coding 357596
pseudogene 68176
Chr 13
protein_coding 447164
Chr 14
pseudogene 87457
Chr 15", header = FALSE, stringsAsFactors = FALSE)

write.csv(
  df1 %>% 
    mutate(grp = cumsum(V1 == "Chr")) %>% 
    spread(key = "V1", value = "V2") %>% 
    select(-grp),
  file = "myOutput.csv", na = "", row.names = FALSE, quote = FALSE)

# myOutput.csv
#
# Chr,protein_coding,pseudogene
# 10,447164,87457
# 11,368825,78131
# 12,357596,68176
# 13,447164,
# 14,,87457
# 15,,

score 1 · Answer 2 · 2018-11-30

Here's a simple bash-awk script I wrote:

$ cat in.txt

Chr 10
protein_coding 447164
pseudogene 87457
Chr 11
protein_coding 368825
pseudogene 78131
Chr 12
protein_coding 357596
pseudogene 68176
Chr 13
protein_coding 447164
Chr 14
pseudogene 87457
Chr 15

$ echo "Chr,protein_coding,pseudogene" && awk 'BEGIN{RS="Chr "} {gsub(/\n/," ",$0); print $0}' in.txt | awk -F" " 'NF==5 {print $1 "," $3 "," $5} NF==3 && $2=="protein_coding" {print $1 "," $3 ","} NF==3 && $2=="pseudogene" {print $1 ",," $3} NF==1 {print $1 ",,"}'

Chr,protein_coding,pseudogene
10,447164,87457
11,368825,78131
12,357596,68176
13,447164,
14,,87457
15,,

The code makes a lot of assumptions and is as unscalable and data-dependent as code gets, but it works. It assumes missing entries would not have corresponding key (first column) entries. It assumes Chr is mandatory. It assumes no blank spaces exist in values, and that the three listed keys are the only allowed keys. I hate this code, but it works.

score 1 · Answer 3 · 2018-11-30

I think this python script should do the trick, but I haven't tested it

import sys

data = dict(Chr= 'Chr', protein_coding= 'protein_coding', pseudogene = 'pseudogene')
for line in open(sys.argv[1]):
    if line.startswith('Chr'):
        print("{},{},{}".format(*data.values()))
        data = dict(Chr= '', protein_coding= '', pseudogene = '')
    data[line.split('\t')[0]] = line.strip().split('\t')[1]