Remove rows based on duplicates in one column
4
0
8.8 years ago
thomasterTW ▴ 20

I'm doing RNA-seq analysis to determine differential expression patterns. After running a custom R script to determine the cutoff between expressed and unexpressed genes (kernel density plot), I get a list with 4 columns.

Ensembl ID          Short gene name     Locus                       10Log FPKM
ENSG00000201699     RNU1-59P            chr1:144534037-144534199    1.587315
ENSG00000215861     WI2-1896O14.1       chr1:144676873-144679969    0.975001
ENSG00000254539     RP4-791M13.3        chr1:144833167-144835867    -0.38691
ENSG00000225241     RP11-640M9.2        chr1:144593362-144621656    0.982576
ENSG00000203843     PFN1P2              chr1:144612265-144612683    -0.55001
ENSG00000235398     LINC00623           chr1:144275917-144311653    -0.44573 <-- dup
ENSG00000235398     LINC00623           chr1:144299757-144341756    0.524043 <-- dup
ENSG00000207106     RNVU1-4             chr1:144311212-144311376    0.949058
ENSG00000231360     AL592284.1          chr1:144339737-144521058    -0.37044
ENSG00000236943     RP11-640M9.1        chr1:144456137-144521970    0.434007

These lists contain around 25k genes each. As you can see, there is a duplicate in this list (LINC00623). What I want to do is write a script (preferably Perl or R, or maybe something like awk on the command line) to find these duplicates based on the Ensembl ID and remove the line with the lowest FPKM (since it is the same gene at roughly the same locus, but apparently Cufflinks decides that it is expressed twice). I can write a script that handles my example, but the problem is that there can be other genes in between the duplicates, so I need to find duplicates across the entire column. I haven't been able to figure this out, so I really hope someone can help me.
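For illustration, a minimal awk sketch of the intended behaviour (my assumptions, not from the thread: tab-separated columns, Ensembl ID in column 1, FPKM in column 4, one header line; file names are placeholders):

# keep, for every Ensembl ID, only the line with the highest FPKM
awk -F'\t' '
    NR == 1 { print; next }                  # pass the header through
    !($1 in best) || $4 + 0 > top[$1] { best[$1] = $0; top[$1] = $4 + 0 }
    END { for (id in best) print best[id] }  # note: output order is arbitrary
' file.txt > deduped.txt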

Thanks in advance!

RNA-Seq duplicates • 2.8k views
ADD COMMENT
1

I would first try to identify why Cufflinks decided that the gene is expressed twice...

If you run this in the terminal (uniq -d only reports adjacent duplicates, hence the sort):

cut -f 2 file.txt | sort | uniq -d > duplicates

then you know which "short gene names" are duplicated and how many there are, and you may be able to spot a pattern; see the counting variant below.
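To also count how often each duplicated name occurs (same hypothetical file name as above):

# -c prefixes each name with its occurrence count, -d keeps only duplicated ones
cut -f 2 file.txt | sort | uniq -cd | sort -k1,1nr > duplicate_counts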

ADD REPLY
0

I'd agree with this; I'd guess it's something to do with the XLOC codes Cufflinks produces and how your script handles them. I'd also highly suggest trying the htseq-count -> DESeq2 route for differential gene expression, if only to compare results.
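Not part of the original reply, but for reference a hypothetical htseq-count invocation (file names are placeholders; pick -s to match your library prep):

# count reads per gene from a position-sorted BAM, unstranded
htseq-count -f bam -r pos -s no sample.bam annotation.gtf > counts.txt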

ADD REPLY
2
8.8 years ago
5heikki 11k

In shell:

export LC_ALL=C; sort -k1,1 -k4,4gr file | sort -mu -k1,1 > out

or, if you have proper GNU sort, run the first pass in parallel (more than ~8 threads rarely helps):

export LC_ALL=C; sort --parallel=$NUMCORES -k1,1 -k4,4gr file | sort -mu -k1,1 > out
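For readers unfamiliar with the flags, the same pipeline annotated (annotation mine, not from the original answer):

# -k1,1 groups lines by Ensembl ID; -k4,4gr then puts the highest FPKM first
# within each ID (g = general numeric, so values like -0.38691 sort correctly;
# r = reverse). The second sort merges the already-sorted stream (-m) and
# -u keeps only the first line per -k1,1 key, i.e. the top-FPKM line per gene.
export LC_ALL=C
sort -k1,1 -k4,4gr file | sort -mu -k1,1 > out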
ADD COMMENT
0

Thanks! This works very well.

ADD REPLY
0
8.8 years ago
george.ry ★ 1.2k

Sort the table by the FPKM column, descending:

df <- df[order(-df$Log10_FPKM), ]

Then remove duplicates based on the first column; duplicated() marks every occurrence after the first, so only the highest-FPKM row per Ensembl ID survives (the column names here assume the headers have been renamed to valid R names):

df <- df[!duplicated(df$Ensembl_ID), ]
ADD COMMENT
0
8.8 years ago
Charles Plessy ★ 2.9k

Not in R, but have a look at bedtools groupby.
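A hedged sketch of how that might look (from memory, so double-check the flags against bedtools groupby -h; groupby expects input pre-sorted on the grouping column, and -full should make it print the whole first line of each group rather than only the grouped columns):

# drop the header, sort so the highest FPKM comes first within each Ensembl ID,
# then keep one full line per ID
tail -n +2 file.txt \
    | sort -k1,1 -k4,4gr \
    | bedtools groupby -g 1 -c 4 -o max -full > out.txt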

ADD COMMENT
0
8.8 years ago
Veera ▴ 90

In R you can do it by splitting the data frame on the Ensembl ID column, extracting the row with the maximum FPKM from each piece, and combining them back into one data frame:

dfm <- read.table(..)  # needs check.names = FALSE to keep a column name starting with a digit, like "10LogFPKM"

dfm.lst <- split(dfm, dfm$EnsemblID)
# which.max() picks a single row per gene even when the maximum is tied
dfm.lst.dupRemoved <- lapply(dfm.lst, function(x) x[which.max(x$`10LogFPKM`), ])
dfm.dupRemoved <- do.call("rbind", dfm.lst.dupRemoved)

write.table(dfm.dupRemoved, "youNameIt", row.names = FALSE, quote = FALSE, sep = "\t")
ADD COMMENT
