How to split text into columns using R
3
1
Entering edit mode
3.7 years ago

I have a text file in which the 14th column looks like these. This file is created using the Variant effect predictor(VEP).

IMPACT=MODIFIER;DISTANCE=1246;STRAND=-1;BIOTYPE=transcribed_pseudogene;REFSEQ_MATCH=rseq_mrna_match;GIVEN_REF=T;USED_REF=T;HGVSg=chr1:g.13116T>G;AF=0.0971;AFR_AF=0.0295;AMR_AF=0.121;EAS_AF=0.0248;EUR_AF=0.1869;SAS_AF=0.1534;MAX_AF=0.1869;MAX_AF_POPS=EUR
IMPACT=MODIFIER;STRAND=1;BIOTYPE=transcribed_pseudogene;REFSEQ_MATCH=rseq_mrna_match;GIVEN_REF=T;USED_REF=T;HGVSc=NR_046018.2:n.464-105T>G;HGVSg=chr1:g.13116T>G;AF=0.0971;AFR_AF=0.0295;AMR_AF=0.121;EAS_AF=0.0248;EUR_AF=0.1869;SAS_AF=0.1534;MAX_AF=0.1869;MAX_AF_POPS=EUR
IMPACT=MODIFIER;DISTANCE=4253;STRAND=-1;BIOTYPE=miRNA;REFSEQ_MATCH=rseq_mrna_match;GIVEN_REF=T;USED_REF=T;HGVSg=chr1:g.13116T>G;AF=0.0971;AFR_AF=0.0295;AMR_AF=0.121;EAS_AF=0.0248;EUR_AF=0.1869;SAS_AF=0.1534;MAX_AF=0.1869;MAX_AF_POPS=EUR

Now I want to put those values in each column. I have tried the text_to_column function from Excel, but it didn't work properly.

Thanks in advance!

vcf snp vep R • 2.6k views
ADD COMMENT
1
Entering edit mode

You can use --tab flag in VEP when annotating your files.

But, for your txt files you can use (not tested!)

sed 's/;/\t/g' your_file > your_tab_file 
ADD REPLY
0
Entering edit mode

I've posted and awk answer below, but of course you could do exactly the same with just cut and sed - maybe I'm getting too much into awk :facepalm:

ADD REPLY
1
Entering edit mode
2.8 years ago
zx8754 12k

VEP can output a VCF file, using bcftools extract CSQ from INFO, then read into R with "|" as a delimiter:

# using bcftools get CSQ
bcftools query -f '%INFO/CSQ\n' myfile.vep.vcf > myfile.vep.vcf.csq

# then use R to read with a delimiter "|"
myCSQ <- read.table("myfile.vep.vcf.csq", sep = "|")
ADD COMMENT
0
Entering edit mode
3.7 years ago

VEP can output in text. You can use that. If you have VCF this way, try vcf2tsv function (https://github.com/vcflib/vcflib).

ADD COMMENT
0
Entering edit mode

I have output in txt format!

ADD REPLY
0
Entering edit mode

Each row has different KV pairs. Probably you may have to use VEP output specific extraction tools or you have to write a script to extract all keys and fill them with NA wherever values are not available for each record (row). See if you can use VEP filter tool to filter the values you want.

ADD REPLY
0
Entering edit mode

vcf2tsv did not work

ADD REPLY
0
Entering edit mode

if output is in text format, vcf2tsv doesn't work.

ADD REPLY
0
Entering edit mode

see if this is what you want:

$ mlr --d2p  --ifs ";"  unsparsify --fill-with "NA" file.txt                                                                                                                         

IMPACT   DISTANCE STRAND BIOTYPE                REFSEQ_MATCH    GIVEN_REF USED_REF HGVSg           AF     AFR_AF AMR_AF EAS_AF EUR_AF SAS_AF MAX_AF MAX_AF_POPS HGVSc
MODIFIER 1246     -1     transcribed_pseudogene rseq_mrna_match T         T        chr1:g.13116T>G 0.0971 0.0295 0.121  0.0248 0.1869 0.1534 0.1869 EUR         NA
MODIFIER NA       1      transcribed_pseudogene rseq_mrna_match T         T        chr1:g.13116T>G 0.0971 0.0295 0.121  0.0248 0.1869 0.1534 0.1869 EUR         NR_046018.2:n.464-105T>G
MODIFIER 4253     -1     miRNA                  rseq_mrna_match T         T        chr1:g.13116T>G 0.0971 0.0295 0.121  0.0248 0.1869 0.1534 0.1869 EUR         NA

Miller is in ubuntu repos.

ADD REPLY
0
Entering edit mode

Thank you for your response but it didn't work. I have a file much bigger than these eg. That's why I need a script for R

ADD REPLY
1
Entering edit mode

I can only work with the data furnished in OP. Sorry that solution didn't work out.

ADD REPLY
0
Entering edit mode
2.8 years ago
jena ▴ 320

Why not awk? Something like this could get you what you need:

awk '{print $14}' vep_data.txt | awk -F";" -v OFS="\t" '$1=$1' > vep_data_col14.tsv

The first part takes 14th column, the second part then splits it by a new field separator ";" and gives you columns separated by tab ("\t"). You can change that to use different column separator if you want. The $1=$1 is an awk idiom to force recalculation of fields, so the OFS gets applied (see here for more).

Edit: If you need those terms before "=" (such as IMPACT etc) as the column headers and the values split in the correct columns, that could probably also be done, but I would have to think some more about that.

ADD COMMENT

Login before adding your answer.

Traffic: 1936 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6