Replace <*> in a column with the nucleotide from another column in Python, R, or shell
4
1
6.7 years ago
Kritika ▴ 270

I have a file with 55,000,000 rows. I want to replace <*> in the 4th column with the nucleotide from the preceding (3rd) column.

The data format is:

12  1109770 C   <*>
12  1109771 T   <*>
12  1109772 T   <*>
12  1109773 T   <*>
12  1109774 C   <*>
12  1109775 C   <*>
12  1109776 C   A,C,C,<*>

Desired output:

12  1109770 C   C
12  1109771 T   T
12  1109772 T   T
12  1109773 T   T
12  1109774 C   C
12  1109775 C   C
12  1109776 C   A,C,C,C
Nucleotide Replace Python R Shell • 2.8k views
ADD COMMENT
1

Can you post something you have already tried?

ADD REPLY
0

Is your example correct for the last line?

12 1109776 C A,C,C,<*>

This should turn into:

12 1109776 C A,C,C,C

not

12 1109776 C C,A,C,C

Am I correct?

ADD REPLY
0

Yes... It should turn to A,C,C,C

ADD REPLY
0

I have updated the question with the actual required output.

ADD REPLY
0

Always add some detail on the effort you put into solving your problem.

ADD REPLY
3
6.7 years ago
venu 7.1k

Assuming your data has exactly the format you posted here, the following one-liner should work:

Input.txt

2       1109770 C       <*>
12      1109771 T       <*>
12      1109772 T       <*>
12      1109773 T       <*>
12      1109774 C       <*>
12      1109775 C       <*>
12      1109776 C       <*>,A,C,C

Oneliner

cat input.txt | sed '/^$/d' | sed -e 's/<//' -e 's/>//' -e 's/\*/X/' -e 's/,/\t/' | awk '{print $1 "\t" $2 "\t" $3 "\t" $3 ","$5}' | sed 's/,$//'

Output

2       1109770 C       C
12      1109771 T       T
12      1109772 T       T
12      1109773 T       T
12      1109774 C       C
12      1109775 C       C
12      1109776 C       C,A,C,C

P.S: This might not work if <*> is not at the beginning of the 4th column (tab separated).
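
To see why the position of <*> matters, here is a minimal Python sketch. It is a simplified model of any "strip the marker, then prepend the reference base" approach, not an exact port of the one-liner above. The function names `prefix_approach` and `in_place` are made up for illustration.

```python
def prefix_approach(ref, alt):
    # Hypothetical model: delete the marker, then put the reference base in front.
    return ref + alt.replace("<*>", "")


def in_place(ref, alt):
    # Substitute the marker exactly where it occurs in the 4th column.
    return alt.replace("<*>", ref)


for alt in ("<*>", "<*>,A,C,C", "A,C,C,<*>"):
    print(alt, prefix_approach("C", alt), in_place("C", alt), sep="\t")
```

Both functions agree while <*> is the first item, but once it sits at the end the prefix version leaves a stray trailing comma and puts the base in the wrong place.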

ADD COMMENT
3
6.7 years ago
awk 'BEGIN{OFS=FS="\t"}{gsub(/<\*>/,$3); print $0}' input.txt > output.txt
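
Since the question title also asks for Python, here is a rough equivalent as a sketch. It assumes tab-separated input with at least four columns. Unlike the awk command, which substitutes anywhere on the line, this version only touches the 4th column. The name `replace_marker` is made up for this example. It reads one line at a time, so memory use should stay flat even for tens of millions of rows.

```python
import io

MARKER = "<*>"


def replace_marker(src, dst, marker=MARKER):
    """Copy tab-separated lines from src to dst, replacing marker
    in column 4 with the value of column 3."""
    for line in src:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            fields[3] = fields[3].replace(marker, fields[2])
        dst.write("\t".join(fields) + "\n")


# Small in-memory demonstration; real use would pass open file handles.
sample = "12\t1109775\tC\t<*>\n12\t1109776\tC\tA,C,C,<*>\n"
out = io.StringIO()
replace_marker(io.StringIO(sample), out)
print(out.getvalue(), end="")
```

For a real file you would call it along the lines of `with open("input.txt") as src, open("output.txt", "w") as dst: replace_marker(src, dst)`, where the file names are placeholders.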
ADD COMMENT
2

Your solution looks like something I would have written a few years ago. The cat is almost always useless, also we don't need sed here :)

awk 'BEGIN{OFS=FS="\t"}{$4=$3$4; gsub("[<*>]",""); print $0}' input > output
ADD REPLY
0
$ awk 'BEGIN{FS=OFS="\t"} {gsub("[<*>]","");$4= $3$4}1' test.txt
12  1109770 C   C
12  1109771 T   T
12  1109772 T   T
12  1109773 T   T
12  1109774 C   C
12  1109775 C   C
12  1109776 C   C,A,C,C

in bash:

$ paste  <(cut -f1-3 test.txt) <(paste -d "" <(cut -f3 test.txt) <(cut -f4 test.txt | cut --complement -c -3))
12  1109770 C   C
12  1109771 T   T
12  1109772 T   T
12  1109773 T   T
12  1109774 C   C
12  1109775 C   C
12  1109776 C   C,A,C,C
ADD REPLY
0

Maybe because I am a few years younger. Thanks for the tips and the gsub function :)

ADD REPLY
0

This gives me the following output for the last line: 12 1109776 C CACC,

ADD REPLY
0

For one of the lines:

12 975013 C T,A,<*>

the output is:

12 975013 C CT,

ADD REPLY
2

Try this:

awk 'BEGIN{OFS=FS="\t"}{gsub(/<\*>/,$3); print $0}' input.txt > output.txt
ADD REPLY
0

Yes, it works now. Thank you so much!

ADD REPLY
0

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can (and should) test all answers posted here, and accept more than one if they work.


ADD REPLY
0

Notice how your example data did not include any lines with such a format.

ADD REPLY
0

Sorry, I have updated the post.

ADD REPLY
0

in awk:

$ awk 'BEGIN{FS="\t"} {$4=$3","$4; gsub(/,<\*>/,"")}1' test.txt

in bash:

$ paste  <(cut -f1-3 test.txt) <( paste -d "," <(cut -f3 test.txt) <(cut -f4 test.txt) |  rev| cut -c 1-4 --complement | rev)

output:

12 1109770 C C
12 1109771 T T
12 1109772 T T
12 1109773 T T
12 1109774 C C
12 1109775 C C
12 1109776 C C,A,C,C

input

$ cat test.txt 
12  1109770 C   <*>
12  1109771 T   <*>
12  1109772 T   <*>
12  1109773 T   <*>
12  1109774 C   <*>
12  1109775 C   <*>
12  1109776 C   A,C,C,<*>
ADD REPLY
1
6.7 years ago
zx8754 12k

Using R, data.table package for fast read and write:

library(data.table)

# fast read using data.table package
dt1 <- fread("input.txt")

# dt1
#      V1      V2 V3        V4
#   1:  2 1109770  C       <*>
#   2: 12 1109771  T       <*>
#   3: 12 1109772  T       <*>
#   4: 12 1109773  T       <*>
#   5: 12 1109774  C       <*>
#   6: 12 1109775  C       <*>
#   7: 12 1109776  C <*>,A,C,C

# update V4, remove "<*>", prefix with V3
dt1[ , V4 := paste0(V3, gsub("<*>", "", V4, fixed = TRUE)) ]

# dt1
#    V1      V2 V3      V4
# 1:  2 1109770  C       C
# 2: 12 1109771  T       T
# 3: 12 1109772  T       T
# 4: 12 1109773  T       T
# 5: 12 1109774  C       C
# 6: 12 1109775  C       C
# 7: 12 1109776  C C,A,C,C

# fast write, without names, quotes
fwrite(dt1, file = "output.txt", sep = "\t",
       col.names = FALSE, row.names = FALSE, quote = FALSE)
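
A side note on why `fixed = TRUE` matters above: read as a regular expression, <*> means "zero or more `<` followed by `>`", not the literal three characters. The same pitfall exists in Python's `re` module, which the following sketch uses to illustrate it. The sample values are taken from the question.

```python
import re

alt = "A,C,C,<*>"
ref = "C"

# Unescaped, the pattern is "zero or more '<' followed by '>'",
# so only the closing '>' is matched and replaced.
as_regex = re.sub("<*>", ref, alt)

# Escaping the pattern, or using plain str.replace, matches the literal text.
escaped = re.sub(re.escape("<*>"), ref, alt)
literal = alt.replace("<*>", ref)

print(as_regex)
print(escaped)
print(literal)
```

Only the last two give the intended A,C,C,C. The unescaped pattern leaves `<*` behind.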
ADD COMMENT
0

I can't do it in R; the file is very large.

ADD REPLY
0

Using data.table package it should work. Also, your question title mentions R.

ADD REPLY
0

With data frame and stringr:

library(stringr)
df=read.csv("test.txt", stringsAsFactors = F, sep = "\t", header = F)
df$V4=str_replace_all(df$V4,"<\\*>", df$V3)

df

output in R:

   > df
      V1      V2 V3      V4
    1 12 1109770  C       C
    2 12 1109771  T       T
    3 12 1109772  T       T
    4 12 1109773  T       T
    5 12 1109774  C       C
    6 12 1109775  C       C
    7 12 1109776  C A,C,C,C

input in R:

   V1      V2 V3        V4
1 12 1109770  C       <*>
2 12 1109771  T       <*>
3 12 1109772  T       <*>
4 12 1109773  T       <*>
5 12 1109774  C       <*>
6 12 1109775  C       <*>
7 12 1109776  C A,C,C,<*>
ADD REPLY
