I'm doing RNAseq analysis to determine differential expression patterns. After running a custom R script to determine the cut off between expressed and unexpressed genes (kernel density plot), I get a list with 4 columns.
Ensembl ID Short gene name Locus 10Log FPKM
ENSG00000201699 RNU1-59P chr1:144534037-144534199 1.587315
ENSG00000215861 WI2-1896O14.1 chr1:144676873-144679969 0.975001
ENSG00000254539 RP4-791M13.3 chr1:144833167-144835867 -0.38691
ENSG00000225241 RP11-640M9.2 chr1:144593362-144621656 0.982576
ENSG00000203843 PFN1P2 chr1:144612265-144612683 -0.55001
ENSG00000235398 LINC00623 chr1:144275917-144311653 -0.44573 <-- dup
ENSG00000235398 LINC00623 chr1:144299757-144341756 0.524043 <-- dup
ENSG00000207106 RNVU1-4 chr1:144311212-144311376 0.949058
ENSG00000231360 AL592284.1 chr1:144339737-144521058 -0.37044
ENSG00000236943 RP11-640M9.1 chr1:144456137-144521970 0.434007
These lists averagely contain 25k genes. As you can see, there's a duplicate in this list (LINC00623
). What I want to do is write a script (preferably perl or R, or maybe something like awk in the command line) to find these duplicates based on the Ensembl ID, and remove the line with the lowest FPKM (since it is the same gene averagely on the same locus but apparently cufflinks decides that it is expressed twice). I can write a script to determine this for my example, but the problem is that there can be genes in between the duplicates, so I need to find duplicates in the entire column. I haven't been able to figure this out so I really hope someone can help me.
Thanks in advance!
I would first identify why cufflinks decided that it is expressed twice...
If you do in the terminal:
Then you know which "short gene names" are duplicates and you could identify a pattern or something and the number of them.
I'd agree with this, I guess it's something to do with the XLOC codes produced and how your script handles those. I'd also highly suggest that you try the htseq_count -> DESeq2 method for differential gene expression, if anything just to compare results.