Question

Remove the second colon and everything after it from values in a data frame using R/Unix

1

3.8 years ago

salman_96 ▴ 70

Hi

I have a file with coordinates like this:

1:834573:A:AT
1:834830:G:A
1:835092:T:G
1:842388:T:TCCGCAGGA

I want to remove the second colon and everything after it, so that, for example, 1:834573:A:AT becomes 1:834573.

I have tried sed, but the values after the second colon vary in length.

Kindly suggest something

coordinates SNP Unix R • 1.4k views

ADD COMMENT • link updated 3.8 years ago by zx8754 12k • written 3.8 years ago by salman_96 ▴ 70

score 3 · Answer 1 · 2021-04-28

3.8 years ago

4galaxy77 2.9k

There's no need to use R for this; it's exactly what the Unix cut command was designed for.

❯ cut -d':' -f1-2 file.txt
1:834573
1:834830
1:835092
1:842388
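Since the question mentions that sed struggled with the varying lengths, here is a minimal sketch of a sed version that matches on the colons rather than on character positions. The file name demo_coords.txt and the function name strip_alleles are made up for illustration, and -E (extended regex) is assumed to be available, as it is in GNU and BSD sed.

```shell
# Hypothetical demo file built from two of the coordinates in the question.
printf '%s\n' '1:834573:A:AT' '1:842388:T:TCCGCAGGA' > demo_coords.txt

# Keep everything before the second colon.
# [^:]* matches a field of any length, so the allele width does not matter.
strip_alleles() {
  sed -E 's/^([^:]*:[^:]*):.*$/\1/' "$1"
}

strip_alleles demo_coords.txt
```

Lines without a second colon do not match the pattern, so sed passes them through unchanged.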

ADD COMMENT • link 3.8 years ago by 4galaxy77 2.9k

score 1 · Answer 2 · 2021-04-28

3.8 years ago

cpad0112 21k

awk -v OFS=":" -F ":" '{print $1,$2}' test.txt
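In the awk one-liner, OFS is the string awk places between the printed fields, so it decides whether the result stays colon-joined or becomes a two-column table. A small sketch of both, assuming a made-up file demo_awk.txt and a hypothetical helper first_two_fields:

```shell
# Hypothetical helper: $1 is the output separator, $2 is the input file.
first_two_fields() {
  awk -v OFS="$1" -F ':' '{print $1, $2}' "$2"
}

# Made-up demo file using two coordinates from the question.
printf '%s\n' '1:834573:A:AT' '1:834830:G:A' > demo_awk.txt

first_two_fields ':' demo_awk.txt    # fields joined by a colon
first_two_fields '\t' demo_awk.txt   # fields joined by a tab
```

Note that for an empty input line this prints the separator on its own, because both fields are empty.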

ADD COMMENT • link 3.8 years ago by cpad0112 21k

score 1 · Answer 3 · 2021-04-28

3.8 years ago

benformatics 4.1k

This R code assumes your coordinates are in the file test.txt. I also kept your empty spacer lines.

gsub("(\\d+\\:\\d+)\\:[AGCT]+\\:[AGCT]+","\\1",readLines('test.txt'))
[1] "1:834573" ""         "1:834830" ""         "1:835092" ""         "1:842388"

If you want to save the result to a new file, wrap it in writeLines() and set con to the output file name:

writeLines(gsub("(\\d+\\:\\d+)\\:[AGCT]+\\:[AGCT]+","\\1",readLines('test.txt')),con='output_test.txt')

ADD COMMENT • link 3.8 years ago by benformatics 4.1k

0

Actually, you could even simplify it to:

gsub("(\\d+\\:\\d+).*","\\1",readLines('test.txt'))

ADD REPLY • link 3.8 years ago by benformatics 4.1k

0

We can use ":" as the delimiter and avoid regex:

write.table(
  read.table(text = "
             1:834573:A:AT
             1:834830:G:A
             1:835092:T:G
             1:842388:T:TCCGCAGGA", sep = ":")[, 1:2],
  file = "out.txt", col.names = FALSE, row.names = FALSE, 
  quote = FALSE, sep = ":")

ADD REPLY • link 3.8 years ago by zx8754 12k