I have an RNA-Seq counts table (CSV file) with a column of gene names, which looks like this:
> head(counts_df[1:4])
C-A C-B CD-A CD-B
0610007P14Rik 1095.858 670.193 1583.706 1925.579
0610009B22Rik 360.416 321.309 413.633 343.594
0610009E02Rik 4.870 9.591 16.999 11.786
0610009L18Rik 6.494 0.000 13.221 28.104
0610009O20Rik 923.768 1246.872 783.826 834.055
0610010F05Rik 1558.554 895.589 898.094 706.227
I want to filter out all the genes that are on the X or Y chromosome. The problem is that the table does not list the chromosome for each gene.
My original solution was to download this file:
hgdownload.cse.ucsc.edu/goldenPath/mm10/database/refGene.txt.gz
which looks like this:
NM_001271498 chr2 11705292 11733985 Il15ra
NM_001285857 chr7 142434976 142440396 Syt8
NM_011811 chr1 78424744 78488897 Farsb
NM_011706 chr11 62574485 62600305 Trpv2
And then in R, filter out entries on alternate chromosomes (chrUn_JH584304, chr4_GL456350_random, etc.), then filter out duplicated entries, and merge it back to the counts table to filter out genes on chrX and chrY.
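For concreteness, the first two of those steps look roughly like the minimal sketch below. The data frame is a small hypothetical stand-in for the chromosome and gene columns of refGene.txt (the gene names ExampleA and ExampleB are made up for illustration), and the regular expression assumes that the primary mm10 chromosomes are named chr1 to chr19, chrX, chrY and chrM.

```r
# Hypothetical stand-in for the chrom/gene columns extracted from refGene.txt;
# ExampleA and ExampleB are invented names used only to show the filtering.
chrom_ref_df <- data.frame(
  chrom = c("chr2", "chr7", "chr1", "chr11",
            "chrUn_JH584304", "chr4_GL456350_random", "chr2"),
  gene  = c("Il15ra", "Syt8", "Farsb", "Trpv2",
            "ExampleA", "ExampleB", "Il15ra"),
  stringsAsFactors = FALSE
)

# Keep only entries on primary chromosomes (assumed naming: chr<number>, chrX, chrY, chrM);
# this drops names such as chrUn_JH584304 and chr4_GL456350_random.
is_primary <- grepl("^chr([0-9]+|X|Y|M)$", chrom_ref_df$chrom)
chrom_ref_df <- chrom_ref_df[is_primary, ]

# Drop rows that are exact duplicates of an earlier (chrom, gene) pair.
chrom_ref_df <- unique(chrom_ref_df)

print(chrom_ref_df)
```

Note that unique() only removes rows where both the chromosome and the gene name repeat, so a gene name annotated on several different chromosomes still appears more than once.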
However, it's turned into a bit of a mess, since I am now finding that duplicate genes are still being listed:
> chrom_ref_df[which(duplicated(chrom_ref_df[["gene"]])), ]
chrom gene
18637 chr4 BC002163
29481 chrX Bc1
16895 chr17 Btg3
14431 chr4 Eno1b
35707 chrY G530011O06Rik
40 chr11 Gm1821
12870 chr19 Gm5512
29305 chrX Gm5643
19424 chr4 Gm5801
20054 chr12 Mir1906-1
29187 chr12 Mir1906-2
25432 chr6 Mir1957a
34207 chrX Mir3472
23305 chrX Mir3473a
13558 chr11 Mir5098
34197 chr2 Mir5098
34198 chr5 Mir5098
34200 chr8 Mir5098
13956 chr2 Mir684-1
13957 chr5 Mir684-1
25593 chr16 Mir684-1
28767 chr10 Mir684-1
34070 chr7 Mir684-1
34071 chrX Mir684-1
This eventually ends up causing problems when trying to merge the chrom data back to the counts table for other downstream filtering, and creates duplicated entries.
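A minimal sketch of how the duplication arises, using toy data: the chromosomes for Mir5098 are taken from the listing above, while the count values and the layout of the counts table (a "gene" column plus one sample column) are made up for illustration.

```r
# Toy counts table; the count values are invented for illustration.
counts_df <- data.frame(
  gene  = c("Mir5098", "Farsb"),
  `C-A` = c(10, 20),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Toy chromosome reference: Mir5098 is annotated on four chromosomes,
# Farsb on one.
chrom_ref_df <- data.frame(
  chrom = c("chr11", "chr2", "chr5", "chr8", "chr1"),
  gene  = c("Mir5098", "Mir5098", "Mir5098", "Mir5098", "Farsb"),
  stringsAsFactors = FALSE
)

# merge() returns one row per matching (counts row, reference row) pair,
# so a gene with several reference rows is repeated in the result.
merged_df <- merge(counts_df, chrom_ref_df, by = "gene")

nrow(counts_df)  # 2 genes going in
nrow(merged_df)  # 5 rows coming out
```

The single Mir5098 row in the counts table comes back four times, once per chromosome it is annotated on.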
I am not sure if it's worth trying to sort this out further; is there an easier way to accomplish the original task of filtering the genes in the counts table based on their chromosome? This is using the mm10 reference genome.