Question

Easy way to filter gene list based on chromosome of origin?


8.0 years ago

steve ★ 3.5k

I have an RNA-Seq counts table (CSV file) with a column of gene names, which looks like this:

> head(counts_df[1:4])
                C-A       C-B        CD-A       CD-B
0610007P14Rik 1095.858  670.193    1583.706    1925.579
0610009B22Rik  360.416  321.309     413.633     343.594
0610009E02Rik    4.870    9.591      16.999      11.786
0610009L18Rik    6.494    0.000      13.221      28.104
0610009O20Rik  923.768 1246.872     783.826     834.055
0610010F05Rik 1558.554  895.589     898.094     706.227

I want to filter out all the genes that are on the X or Y chromosome. Problem is, the table does not list the chromosome for each gene.

My original solution was to download this file:

hgdownload.cse.ucsc.edu/goldenPath/mm10/database/refGene.txt.gz

which looks like this:

NM_001271498    chr2    11705292    11733985    Il15ra
NM_001285857    chr7    142434976   142440396   Syt8
NM_011811   chr1    78424744    78488897    Farsb
NM_011706   chr11   62574485    62600305    Trpv2

And then in R, filter out entries on alternate chromosomes (chrUn_JH584304, chr4_GL456350_random, etc.), then filter out duplicated entries, and merge it back to the counts table to filter out genes on chrX and chrY.
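The filtering steps described above might look roughly like the following sketch. It uses a small hand-made table in place of the real refGene file; the `ref_df` object and the names `GeneA`, `GeneB` and `GeneC` are made up for illustration, and the regular expression assumes mouse primary chromosomes are named `chr1` to `chr19`, `chrX`, `chrY` and `chrM`.

```r
# Toy stand-in for the chrom/gene columns of refGene.txt.gz.
# GeneA/GeneB/GeneC are hypothetical names; replace ref_df with the real table.
ref_df <- data.frame(
  chrom = c("chr2", "chr7", "chr7", "chrUn_JH584304",
            "chr4_GL456350_random", "chrX"),
  gene  = c("Il15ra", "Syt8", "Syt8", "GeneA", "GeneB", "GeneC"),
  stringsAsFactors = FALSE
)

# Keep only primary chromosomes; this drops chrUn_* and *_random scaffolds
primary <- grepl("^chr([0-9]+|X|Y|M)$", ref_df$chrom)
chrom_ref_df <- ref_df[primary, ]

# Collapse repeated transcripts of the same gene on the same chromosome
chrom_ref_df <- unique(chrom_ref_df[, c("chrom", "gene")])
```

Note that `unique()` on the chrom/gene pair only removes exact repeats; a gene name that appears on two different chromosomes will still have two rows.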

However, it has turned into a bit of a mess, since I am now finding that duplicate genes are still being listed:

> chrom_ref_df[which(duplicated(chrom_ref_df[["gene"]])), ]
      chrom          gene
18637  chr4      BC002163
29481  chrX           Bc1
16895 chr17          Btg3
14431  chr4         Eno1b
35707  chrY G530011O06Rik
40    chr11        Gm1821
12870 chr19        Gm5512
29305  chrX        Gm5643
19424  chr4        Gm5801
20054 chr12     Mir1906-1
29187 chr12     Mir1906-2
25432  chr6      Mir1957a
34207  chrX       Mir3472
23305  chrX      Mir3473a
13558 chr11       Mir5098
34197  chr2       Mir5098
34198  chr5       Mir5098
34200  chr8       Mir5098
13956  chr2      Mir684-1
13957  chr5      Mir684-1
25593 chr16      Mir684-1
28767 chr10      Mir684-1
34070  chr7      Mir684-1
34071  chrX      Mir684-1

This eventually ends up causing problems when trying to merge the chrom data back to the counts table for other downstream filtering, and creates duplicated entries.

I am not sure if it's worth trying to sort this out further. Is there an easier way to accomplish the original task of filtering the genes in the counts table based on their chromosome? This is using the mm10 reference genome.

RNA-Seq • 2.5k views


score 0 · Answer 1 · 2016-12-02


8.0 years ago

Ron ★ 1.2k

I would suggest removing the duplicate entries from the data, if there are any. Then you can remove the rows on chrX and chrY using grepl:

 new_df <- df[!grepl("^chr[XY]", df$chrom), ]
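One way to apply this to the counts table without a merge (and so without creating duplicated rows) is sketched below. It treats a gene as sex-linked if any of its copies is on chrX or chrY, which is an assumption about what is wanted for multi-copy genes such as Mir684-1. The tables here are toy data with made-up values, and `sex_genes` and `autosomal_counts` are illustrative names.

```r
# Toy gene-to-chromosome table, including a gene listed on several chromosomes
chrom_ref_df <- data.frame(
  chrom = c("chr2", "chr5", "chrX", "chr17", "chrY"),
  gene  = c("Mir684-1", "Mir684-1", "Mir684-1", "Btg3", "G530011O06Rik"),
  stringsAsFactors = FALSE
)

# Toy counts table with gene names as row names (values are made up)
counts_df <- data.frame(
  `C-A` = c(10, 20, 30, 40),
  `C-B` = c(11, 21, 31, 41),
  row.names = c("Mir684-1", "Btg3", "G530011O06Rik", "0610007P14Rik"),
  check.names = FALSE
)

# Genes with at least one copy on chrX or chrY
on_sex_chrom <- grepl("^chr[XY]", chrom_ref_df$chrom)
sex_genes <- unique(chrom_ref_df$gene[on_sex_chrom])

# Filter by row name instead of merging, so each gene stays on one row
autosomal_counts <- counts_df[!rownames(counts_df) %in% sex_genes, , drop = FALSE]
```

Genes that are missing from the reference table (here 0610007P14Rik) are kept, since nothing says they are on X or Y.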


score 0 · Answer 2 · 2016-12-02


8.0 years ago

WouterDeCoster 47k

I did something similar, but I used BioMart from Ensembl (the website, although an R package is also available) to download all genes on the X and Y chromosomes and simply removed those from the count data.
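With the R package, this could look something like the sketch below. It is untested against a live server here: it needs the Bioconductor package biomaRt and a network connection, and it assumes that Ensembl release 102 is an archive built on GRCm38 (mm10), which should be checked before relying on it. `fetch_sex_genes` and `drop_genes` are hypothetical helper names, not part of biomaRt.

```r
# Fetch MGI symbols of all genes on chromosomes X and Y from Ensembl.
# Assumption: release 102 corresponds to GRCm38 (mm10); verify this first.
fetch_sex_genes <- function(version = 102) {
  mart <- biomaRt::useEnsembl(biomart = "genes",
                              dataset = "mmusculus_gene_ensembl",
                              version = version)
  res <- biomaRt::getBM(attributes = c("mgi_symbol", "chromosome_name"),
                        filters    = "chromosome_name",
                        values     = c("X", "Y"),
                        mart       = mart)
  # Drop genes without a symbol and collapse repeats
  unique(res$mgi_symbol[res$mgi_symbol != ""])
}

# Hypothetical helper: drop rows of a counts table whose row name is in `genes`
drop_genes <- function(counts, genes) {
  counts[!rownames(counts) %in% genes, , drop = FALSE]
}

# Usage (not run here because it queries Ensembl):
# filtered_counts <- drop_genes(counts_df, fetch_sex_genes())
```

Matching on gene symbol assumes the counts table uses the same symbols as Ensembl; any symbol that differs between annotations will silently be kept.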
