Question

How to split text into columns using R

1

Entering edit mode

3.7 years ago

shubhamkumbhar420 ▴ 40

I have a text file in which the 14th column looks like these. This file is created using the Variant effect predictor(VEP).

IMPACT=MODIFIER;DISTANCE=1246;STRAND=-1;BIOTYPE=transcribed_pseudogene;REFSEQ_MATCH=rseq_mrna_match;GIVEN_REF=T;USED_REF=T;HGVSg=chr1:g.13116T>G;AF=0.0971;AFR_AF=0.0295;AMR_AF=0.121;EAS_AF=0.0248;EUR_AF=0.1869;SAS_AF=0.1534;MAX_AF=0.1869;MAX_AF_POPS=EUR
IMPACT=MODIFIER;STRAND=1;BIOTYPE=transcribed_pseudogene;REFSEQ_MATCH=rseq_mrna_match;GIVEN_REF=T;USED_REF=T;HGVSc=NR_046018.2:n.464-105T>G;HGVSg=chr1:g.13116T>G;AF=0.0971;AFR_AF=0.0295;AMR_AF=0.121;EAS_AF=0.0248;EUR_AF=0.1869;SAS_AF=0.1534;MAX_AF=0.1869;MAX_AF_POPS=EUR
IMPACT=MODIFIER;DISTANCE=4253;STRAND=-1;BIOTYPE=miRNA;REFSEQ_MATCH=rseq_mrna_match;GIVEN_REF=T;USED_REF=T;HGVSg=chr1:g.13116T>G;AF=0.0971;AFR_AF=0.0295;AMR_AF=0.121;EAS_AF=0.0248;EUR_AF=0.1869;SAS_AF=0.1534;MAX_AF=0.1869;MAX_AF_POPS=EUR

Now I want to put those values in each column. I have tried the text_to_column function from Excel, but it didn't work properly.

Thanks in advance!

vcf snp vep R • 2.7k views

ADD COMMENT • link updated 2.8 years ago by jena ▴ 320 • written 3.7 years ago by shubhamkumbhar420 ▴ 40

1

Entering edit mode

You can use --tab flag in VEP when annotating your files.

But, for your txt files you can use (not tested!)

sed 's/;/\t/g' your_file > your_tab_file

ADD REPLY • link 2.8 years ago by brunobsouzaa ▴ 840

0

Entering edit mode

I've posted and awk answer below, but of course you could do exactly the same with just cut and sed - maybe I'm getting too much into awk :facepalm:

ADD REPLY • link 2.8 years ago by jena ▴ 320

score 1 · Answer 1 · 2022-02-17

VEP can output a VCF file, using bcftools extract CSQ from INFO, then read into R with "|" as a delimiter:

# using bcftools get CSQ
bcftools query -f '%INFO/CSQ\n' myfile.vep.vcf > myfile.vep.vcf.csq

# then use R to read with a delimiter "|"
myCSQ <- read.table("myfile.vep.vcf.csq", sep = "|")

score 0 · Answer 2 · 2021-03-26

0

Entering edit mode

3.7 years ago

cpad0112 21k

VEP can output in text. You can use that. If you have VCF this way, try vcf2tsv function (https://github.com/vcflib/vcflib).

ADD COMMENT • link 3.7 years ago by cpad0112 21k

0

Entering edit mode

I have output in txt format!

ADD REPLY • link 3.7 years ago by shubhamkumbhar420 ▴ 40

0

Entering edit mode

Each row has different KV pairs. Probably you may have to use VEP output specific extraction tools or you have to write a script to extract all keys and fill them with NA wherever values are not available for each record (row). See if you can use VEP filter tool to filter the values you want.

ADD REPLY • link 3.7 years ago by cpad0112 21k

0

Entering edit mode

vcf2tsv did not work

ADD REPLY • link 3.7 years ago by shubhamkumbhar420 ▴ 40

0

Entering edit mode

if output is in text format, vcf2tsv doesn't work.

ADD REPLY • link 3.7 years ago by cpad0112 21k

0

Entering edit mode

see if this is what you want:

$ mlr --d2p  --ifs ";"  unsparsify --fill-with "NA" file.txt                                                                                                                         

IMPACT   DISTANCE STRAND BIOTYPE                REFSEQ_MATCH    GIVEN_REF USED_REF HGVSg           AF     AFR_AF AMR_AF EAS_AF EUR_AF SAS_AF MAX_AF MAX_AF_POPS HGVSc
MODIFIER 1246     -1     transcribed_pseudogene rseq_mrna_match T         T        chr1:g.13116T>G 0.0971 0.0295 0.121  0.0248 0.1869 0.1534 0.1869 EUR         NA
MODIFIER NA       1      transcribed_pseudogene rseq_mrna_match T         T        chr1:g.13116T>G 0.0971 0.0295 0.121  0.0248 0.1869 0.1534 0.1869 EUR         NR_046018.2:n.464-105T>G
MODIFIER 4253     -1     miRNA                  rseq_mrna_match T         T        chr1:g.13116T>G 0.0971 0.0295 0.121  0.0248 0.1869 0.1534 0.1869 EUR         NA

Miller is in ubuntu repos.

ADD REPLY • link 3.7 years ago by cpad0112 21k

0

Entering edit mode

Thank you for your response but it didn't work. I have a file much bigger than these eg. That's why I need a script for R

ADD REPLY • link 3.7 years ago by shubhamkumbhar420 ▴ 40

1

Entering edit mode

I can only work with the data furnished in OP. Sorry that solution didn't work out.

ADD REPLY • link 3.7 years ago by cpad0112 21k

score 0 · Answer 3 · 2022-02-17

Why not awk? Something like this could get you what you need:

awk '{print $14}' vep_data.txt | awk -F";" -v OFS="\t" '$1=$1' > vep_data_col14.tsv

The first part takes 14th column, the second part then splits it by a new field separator ";" and gives you columns separated by tab ("\t"). You can change that to use different column separator if you want. The $1=$1 is an awk idiom to force recalculation of fields, so the OFS gets applied (see here for more).

Edit: If you need those terms before "=" (such as IMPACT etc) as the column headers and the values split in the correct columns, that could probably also be done, but I would have to think some more about that.