Hi,
I have parsed my VCF files containing SNPs as below:
CHROMOSOME POSITION REF ALT SAMPLE
1 782112 G A LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1 1026918 C T LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1 1133283 C T LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1 1431511 G A LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1 1742395 C T LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1 1864994 G A LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1 1914766 C T LP6005334-DNA_H01_vs_LP6005333-DNA_H01
But I have duplicate mutations; for example, in this sample:
~$ grep 152280536 file.txt
1 152280536 T C LP6008334-DNA_C02_vs_LP6008335-DNA_G01
1 152280536 T C LP6008334-DNA_C02_vs_LP6008335-DNA_G01
I am not sure at which step of data processing, or how, I could remove the duplicated mutations.
Any help, please?
There are a million ways to do this that are a simple google away:
"unix remove duplicated lines"
e.g.
https://unix.stackexchange.com/questions/30173/how-to-remove-duplicate-lines-inside-a-text-file
Use bash uniq:
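A minimal sketch of this approach, assuming a whitespace-delimited file with a single header line as in the question. The file names demo_snps.txt and demo_snps.dedup.txt are hypothetical. uniq only collapses adjacent identical lines, so the data rows are sorted first; the header is copied separately so that it is not sorted into the data, and the result goes to a new file so the input is left untouched.

```shell
# Build a small demo file (hypothetical name) containing one duplicated row.
workdir=$(mktemp -d)
in="$workdir/demo_snps.txt"
out="$workdir/demo_snps.dedup.txt"
printf '%s\n' \
  'CHROMOSOME POSITION REF ALT SAMPLE' \
  '1 782112 G A LP6005334-DNA_H01_vs_LP6005333-DNA_H01' \
  '1 152280536 T C LP6008334-DNA_C02_vs_LP6008335-DNA_G01' \
  '1 152280536 T C LP6008334-DNA_C02_vs_LP6008335-DNA_G01' > "$in"

# Copy the header line unchanged, then sort the data rows and drop
# adjacent identical lines. Output goes to a separate file.
head -n 1 "$in" > "$out"
tail -n +2 "$in" | sort | uniq >> "$out"

cat "$out"
```

Comparing `wc -l` on the input and output files is a quick way to check that only the expected number of rows was removed.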
sort | uniq is not really necessary when sort -u is an option, I think.
Sorry, but some of these solutions destroyed my file
For example
by
R says that
or the command flags non duplicates too
Invest some time in understanding what the options you’ve been given actually do. Don’t just blindly copy from the web.
sort has a lot of capabilities when used well. You might wish to set sorting keys (columns to sort with) using the -k M,N option. You can sort by columns that will be the same among identical rows and then use the -u option to pick only unique lines.
If you read in the file as a data.frame or data.table into R (!), you can use unique(my_dataframe), no need to sort, if I recall correctly (but it may take some time).
Which function of R returns the error message? It's somewhat difficult to believe it's related to the sorting and uniq'ing though.
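The sort -k and -u options mentioned above could be combined roughly as follows. This is only a sketch under assumptions: the file is whitespace-delimited with one header line, the five columns from the question are all used as keys, and the file names are hypothetical. Note that when -u is used together with -k, rows that compare equal on the listed keys count as duplicates even if other columns differ, so the keys should cover every column that defines a distinct mutation.

```shell
# Demo input (hypothetical name): one exact duplicate, plus the same
# position reported in two different samples, which should both be kept.
workdir=$(mktemp -d)
in="$workdir/demo_snps.txt"
out="$workdir/demo_snps.by_key.txt"
printf '%s\n' \
  'CHROMOSOME POSITION REF ALT SAMPLE' \
  '1 152280536 T C LP6008334-DNA_C02_vs_LP6008335-DNA_G01' \
  '1 782112 G A LP6005334-DNA_H01_vs_LP6005333-DNA_H01' \
  '1 152280536 T C LP6008334-DNA_C02_vs_LP6008335-DNA_G01' \
  '1 782112 G A LP6008334-DNA_C02_vs_LP6008335-DNA_G01' > "$in"

# Keep the header out of the sort, then keep one row per unique
# (chromosome, position, ref, alt, sample) combination.
# -k2,2n compares the position column numerically.
head -n 1 "$in" > "$out"
tail -n +2 "$in" | sort -u -k1,1 -k2,2n -k3,3 -k4,4 -k5,5 >> "$out"

cat "$out"
```

Dropping -k5,5 from the command would instead collapse the same mutation across samples, which may or may not be what a downstream tool expects.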
You are right.
I was using the dndscv R package on data that had been through unique or any other suggested command for removing duplicates, and that returned an error.
I have also tried this supposedly corrected data with other Python packages such as OncodriveCLUSTL and OncodriveFML, and they returned errors like
I do not see the usage of unique in your example.
Does that mean the error you're reporting ("X mutations have a wrong reference base") is independent of any of the unique commands? Did you check the help of dndscv()?
Everything happened when I tried to remove duplicates; I ran these packages with the original file successfully.