Question

Replace <*> in a column with the nucleotide from another column, in Python, R, or shell

1

6.8 years ago

Kritika ▴ 270

I have data with 55,000,000 rows, and I want to replace <*> in the 4th column with the nucleotide from the preceding (3rd) column.

The data format is:

12  1109770 C   <*>
12  1109771 T   <*>
12  1109772 T   <*>
12  1109773 T   <*>
12  1109774 C   <*>
12  1109775 C   <*>
12  1109776 C   A,C,C,<*>

Output

12  1109770 C   C
12  1109771 T   T
12  1109772 T   T
12  1109773 T   T
12  1109774 C   C
12  1109775 C   C
12  1109776 C   A,C,C,C

Nucleotide Replace Python R Shell • 2.9k views

ADD COMMENT • link updated 6.8 years ago by Bastien Hervé 6.0k • written 6.8 years ago by Kritika ▴ 270

1

Can you post something you have already tried?

ADD REPLY • link 6.8 years ago by Sej Modha 5.3k

0

Is your example correct for the last line?

12 1109776 C A,C,C,<*>

This should turn into:

12 1109776 C A,C,C,C

not

12 1109776 C C,A,C,C

Am I correct?

ADD REPLY • link 6.8 years ago by Bastien Hervé 6.0k

0

Yes, it should turn into A,C,C,C.

ADD REPLY • link 6.8 years ago by Kritika ▴ 270

0

I have updated the post again with the actual required output.

ADD REPLY • link 6.8 years ago by Kritika ▴ 270

0

Always add some detail on the effort you put in to solving your problem.

ADD REPLY • link 6.8 years ago by Ram 44k

score 3 · Answer 1 · 2018-03-20

Assuming your data has exactly the format you posted here, the following one-liner should work:

Input.txt

2       1109770 C       <*>
12      1109771 T       <*>
12      1109772 T       <*>
12      1109773 T       <*>
12      1109774 C       <*>
12      1109775 C       <*>
12      1109776 C       <*>,A,C,C

Oneliner

cat input.txt | sed '/^$/d' | sed -e 's/<//' -e 's/>//' -e 's/\*/X/' -e 's/,/\t/' | awk '{print $1 "\t" $2 "\t" $3 "\t" $3 ","$5}' | sed 's/,$//'

Output

2       1109770 C       C
12      1109771 T       T
12      1109772 T       T
12      1109773 T       T
12      1109774 C       C
12      1109775 C       C
12      1109776 C       C,A,C,C

P.S: This might not work if <*> is not at the beginning of the 4th column (tab separated).
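The caveat in the P.S. can be illustrated with a small Python sketch. The function `oneliner_col4` is a hypothetical, rough model of what the pipeline above does to the 4th column, assuming the first `<`, `>`, `*`, and `,` on each line all occur in that column (true for the sample data). It is compared with a plain in-place replacement of `<*>`.

```python
def oneliner_col4(ref: str, alt: str) -> str:
    """Rough model of the sed/awk pipeline's effect on column 4 (an assumption)."""
    # sed: drop the first '<' and '>', turn the first '*' into 'X'
    alt = alt.replace("<", "", 1).replace(">", "", 1).replace("*", "X", 1)
    # sed turns the first ',' into a tab, so awk sees the remainder as field 5
    _head, _sep, tail = alt.partition(",")
    # awk prints column 3, a comma, then field 5
    result = ref + "," + tail
    # the final sed removes a single trailing comma
    return result[:-1] if result.endswith(",") else result


def replace_in_place(ref: str, alt: str) -> str:
    """Replace <*> wherever it occurs, keeping the allele order."""
    return alt.replace("<*>", ref)


for alt in ("<*>", "<*>,A,C,C", "A,C,C,<*>"):
    print(alt, oneliner_col4("C", alt), replace_in_place("C", alt), sep="\t")
```

With `<*>` at the end of the column, the modelled pipeline no longer produces the requested `A,C,C,C`, whereas the in-place replacement does not depend on where `<*>` appears.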

score 3 · Answer 2 · 2018-03-20

3

6.8 years ago

Bastien Hervé 6.0k

awk 'BEGIN{OFS=FS="\t"}{gsub(/<\*>/,$3); print $0}' input.txt > output.txt
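Since the question also asks about Python, here is a minimal sketch of the same idea. It assumes a tab-separated file with at least four columns; the names `fill_placeholder` and `convert` are illustrative and not from the thread. Unlike the awk command, which substitutes across the whole line, this sketch only touches the 4th column. It reads one line at a time, so memory use should stay flat even for tens of millions of rows.

```python
import os
import tempfile

PLACEHOLDER = "<*>"


def fill_placeholder(line: str) -> str:
    """Replace <*> in column 4 with the nucleotide from column 3."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        return line  # pass short or empty lines through unchanged
    fields[3] = fields[3].replace(PLACEHOLDER, fields[2])
    return "\t".join(fields) + "\n"


def convert(src_path: str, dst_path: str) -> None:
    """Stream src_path to dst_path line by line, never loading the whole file."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(fill_placeholder(line))


# Small demonstration on a temporary file (file names are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    dst = os.path.join(tmp, "output.txt")
    with open(src, "w") as handle:
        handle.write("12\t1109775\tC\t<*>\n12\t1109776\tC\tA,C,C,<*>\n")
    convert(src, dst)
    with open(dst) as handle:
        print(handle.read(), end="")
```

For a real run, `convert("input.txt", "output.txt")` would be the only call needed. The awk one-liner is likely to be faster on a file of this size.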

ADD COMMENT • link 6.8 years ago by Bastien Hervé 6.0k

2

Your solution looks like something I would have written a few years ago. The cat is almost always useless, also we don't need sed here :)

awk 'BEGIN{OFS=FS="\t"}{$4=$3$4; gsub("[<*>]",""); print $0}' input > output

ADD REPLY • link 6.8 years ago by 5heikki 11k

0

$ awk 'BEGIN{FS=OFS="\t"} {gsub("[<*>]","");$4= $3$4}1' test.txt
12  1109770 C   C
12  1109771 T   T
12  1109772 T   T
12  1109773 T   T
12  1109774 C   C
12  1109775 C   C
12  1109776 C   C,A,C,C

in bash:

$ paste  <(cut -f1-3 test.txt) <(paste -d "" <(cut -f3 test.txt) <(cut -f4 test.txt | cut --complement -c -3))
12  1109770 C   C
12  1109771 T   T
12  1109772 T   T
12  1109773 T   T
12  1109774 C   C
12  1109775 C   C
12  1109776 C   C,A,C,C

ADD REPLY • link 6.8 years ago by cpad0112 21k

0

Maybe because I am a few years younger. Thanks for the tips and the gsub function :)

ADD REPLY • link 6.8 years ago by Bastien Hervé 6.0k

0

This gives me the following output for the last line: 12 1109776 C CACC,

ADD REPLY • link 6.8 years ago by Kritika ▴ 270

0

For one of the lines:

12 975013 C T,A,<*>

output

12 975013 C CT,

ADD REPLY • link 6.8 years ago by Kritika ▴ 270

2

Try this:

awk 'BEGIN{OFS=FS="\t"}{gsub(/<\*>/,$3); print $0}' input.txt > output.txt

ADD REPLY • link 6.8 years ago by Bastien Hervé 6.0k

0

Yes, it works now. Thank you so much!

ADD REPLY • link 6.8 years ago by Kritika ▴ 270

0

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can (and should) test all answers posted here and accept more than one if they work.

ADD REPLY • link 6.8 years ago by GenoMax 148k

0

Notice how your example data did not include any lines with such a format.

ADD REPLY • link 6.8 years ago by 5heikki 11k

0

Sorry, I have updated the post.

ADD REPLY • link 6.8 years ago by Kritika ▴ 270

0

in awk:

$ awk 'BEGIN{FS=OFS="\t"} {$4=$3","$4; gsub(/,<\*>/,"")}1' test.txt

in bash:

$ paste  <(cut -f1-3 test.txt) <( paste -d "," <(cut -f3 test.txt) <(cut -f4 test.txt) |  rev| cut -c 1-4 --complement | rev)

output:

12  1109770 C   C
12  1109771 T   T
12  1109772 T   T
12  1109773 T   T
12  1109774 C   C
12  1109775 C   C
12  1109776 C   C,A,C,C

input

$ cat test.txt 
12  1109770 C   <*>
12  1109771 T   <*>
12  1109772 T   <*>
12  1109773 T   <*>
12  1109774 C   <*>
12  1109775 C   <*>
12  1109776 C   A,C,C,<*>

ADD REPLY • link 6.8 years ago by cpad0112 21k

score 1 · Answer 3 · 2018-03-20

1

6.8 years ago

zx8754 12k

Using R, data.table package for fast read and write:

library(data.table)

# fast read using data.table package
dt1 <- fread("input.txt")

# dt1
#      V1      V2 V3        V4
#   1:  2 1109770  C       <*>
#   2: 12 1109771  T       <*>
#   3: 12 1109772  T       <*>
#   4: 12 1109773  T       <*>
#   5: 12 1109774  C       <*>
#   6: 12 1109775  C       <*>
#   7: 12 1109776  C <*>,A,C,C

# update V4, remove "<*>", prefix with V3
dt1[ , V4 := paste0(V3, gsub("<*>", "", V4, fixed = TRUE)) ]

# dt1
#    V1      V2 V3      V4
# 1:  2 1109770  C       C
# 2: 12 1109771  T       T
# 3: 12 1109772  T       T
# 4: 12 1109773  T       T
# 5: 12 1109774  C       C
# 6: 12 1109775  C       C
# 7: 12 1109776  C C,A,C,C

# fast write, without names, quotes
fwrite(dt1, file = "output.txt", sep = "\t",
       col.names = FALSE, row.names = FALSE, quote = FALSE)

ADD COMMENT • link 6.8 years ago by zx8754 12k

0

I can't do it in R; the file is very large.

ADD REPLY • link 6.8 years ago by Kritika ▴ 270

0

Using data.table package it should work. Also, your question title mentions R.

ADD REPLY • link 6.8 years ago by zx8754 12k

0

With data frame and stringr:

library(stringr)
df=read.csv("test.txt", stringsAsFactors = F, sep = "\t", header = F)
df$V4=str_replace_all(df$V4,"<\\*>", df$V3)

df

output in R:

   > df
      V1      V2 V3      V4
    1 12 1109770  C       C
    2 12 1109771  T       T
    3 12 1109772  T       T
    4 12 1109773  T       T
    5 12 1109774  C       C
    6 12 1109775  C       C
    7 12 1109776  C A,C,C,C

input in R:

   V1      V2 V3        V4
1 12 1109770  C       <*>
2 12 1109771  T       <*>
3 12 1109772  T       <*>
4 12 1109773  T       <*>
5 12 1109774  C       <*>
6 12 1109775  C       <*>
7 12 1109776  C A,C,C,<*>

ADD REPLY • link 6.8 years ago by cpad0112 21k