Question

How to remove Mitochondrial genes from Human annotation file (.GTF)?

10.6 years ago

M K ▴ 660

Hi All,

I want to remove mitochondrial genes from a human annotation file (.GTF).

next-gen-sequencing RNA-Seq R • 7.2k views

updated 3.4 years ago by Ram 45k • written 10.6 years ago by M K ▴ 660

grep -v '^ChrM' <your.gtf>

updated 3.4 years ago by Ram 45k • written 10.6 years ago by Manvendra Singh ★ 2.2k

'^ChrM'

updated 3.4 years ago by Ram 45k • written 10.6 years ago by GouthamAtla 12k

Edited, thanks. I was just trying to indicate that grep -v would work in this case.

updated 3.4 years ago by Ram 45k • written 10.6 years ago by Manvendra Singh ★ 2.2k

Ram · Answer 1 · 2014-12-09

10.6 years ago

GouthamAtla 12k

If the GTF is from Ensembl:

grep -v "^MT" genes.gtf > genes_noMT.gtf
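A stricter variant, as a rough sketch: instead of matching a pattern against the whole line, compare the first tab-separated column (the sequence name) exactly. The file names and records below are made up for illustration, and `MT` assumes Ensembl-style naming.

```shell
# Build a tiny, made-up GTF-like file (tab-separated) for illustration only.
tmpdir=$(mktemp -d)
printf '#!genome-build GRCh37\n' > "$tmpdir/demo.gtf"
printf '1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "G1";\n' >> "$tmpdir/demo.gtf"
printf 'MT\tsrc\tgene\t10\t50\t.\t+\t.\tgene_id "G2";\n' >> "$tmpdir/demo.gtf"
printf 'X\tsrc\tgene\t5\t90\t.\t-\t.\tgene_id "G3";\n' >> "$tmpdir/demo.gtf"

# Keep every line whose first column is not exactly "MT".
# Header lines starting with "#" have a different first field, so they are kept.
awk -F '\t' '$1 != "MT"' "$tmpdir/demo.gtf" > "$tmpdir/demo_noMT.gtf"

cat "$tmpdir/demo_noMT.gtf"
```

For the Ensembl file discussed here this should behave like the grep command; the exact comparison mainly helps when one sequence name is a prefix of another (for example `1` and `10`).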

updated 3.4 years ago by Ram 45k • written 10.6 years ago by GouthamAtla 12k

I don't recommend doing this, as it would remove lines containing a havana_transcript ID (e.g., OTTHUMT00000002421.3). Also, what about genes containing MT in gene_name?

As you posted before, grep -v '^chrM' should work.

updated 3.4 years ago by Ram 45k • written 10.6 years ago by PoGibas 5.1k

Ensembl GTFs do not have a 'chr' prefix. The sequences are simply named 1, 2, 3 ... MT, X, Y. The symbol '^' anchors the pattern to the beginning of the line, so "^MT" will not match a havana_transcript ID, which appears later in the line.
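To illustrate the anchoring, here is a small sketch on two made-up, tab-separated lines. The file name, the `src` source column and the coordinates are placeholders; the gene names are used only as examples of attribute text that contains "MT".

```shell
# Two made-up GTF-like lines: one on chromosome 1 whose attributes contain "MT",
# and one on the mitochondrial chromosome (Ensembl name "MT").
tmpdir=$(mktemp -d)
printf '1\tsrc\ttranscript\t100\t200\t.\t+\t.\tgene_name "MTOR"; havana_transcript "OTTHUMT00000002421.3";\n' > "$tmpdir/anchor.gtf"
printf 'MT\tsrc\tgene\t10\t50\t.\t+\t.\tgene_name "MT-ND1";\n' >> "$tmpdir/anchor.gtf"

# Unanchored: "MT" matches anywhere in the line, so grep -v "MT" would drop both lines.
unanchored=$(grep -c "MT" "$tmpdir/anchor.gtf")

# Anchored: "^MT" matches only at the start of the line, so grep -v "^MT" drops one line.
anchored=$(grep -c "^MT" "$tmpdir/anchor.gtf")

echo "lines matched without anchor: $unanchored"   # 2
echo "lines matched with ^ anchor:  $anchored"     # 1
```

Only the line whose first column is MT matches the anchored pattern; the chromosome 1 line survives even though its attributes contain "MT" twice.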

10.6 years ago by GouthamAtla 12k

I tried both grep commands on the Ensembl annotation file, release 37.75 (Homo_sapiens.GRCh37.75). Counting lines with wc -l, the original GTF file has 2828317 lines:

2828317 Homo_sapiens.GRCh37.75.gtf

With grep -v "^MT", the number of lines decreases, as shown below:

grep -v "^MT" Homo_sapiens.GRCh37.75.gtf > Homo_sapiens.GRCh37.75_noMT.gtf
wc -l Homo_sapiens.GRCh37.75_noMT.gtf
2828173  Homo_sapiens.GRCh37.75_noMT.gtf

With grep -v "^ChrM", the number of lines is the same as in the original file:

grep -v "^ChrM" Homo_sapiens.GRCh37.75.gtf > Homo_sapiens.GRCh37.75_noMT_M.gtf

2828317 Homo_sapiens.GRCh37.75_noMT_M.gtf

Could anyone explain this?

updated 3.4 years ago by Ram 45k • written 10.6 years ago by M K ▴ 660

If you have an Ensembl GTF, use grep -v "^MT". grep matches a pattern, and with -v it removes the lines that contain that pattern, so the result depends on how the mitochondrial chromosome is named in the file. Ensembl names it MT, while other sources such as UCSC name it chrM (lowercase "chr"; grep is case-sensitive). Since you are using an Ensembl file, the pattern '^ChrM' produces no matches, hence the number of lines remains the same.
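One way to check which naming a given file uses, sketched on a made-up file: list the distinct values of the first column, then count how many lines each pattern would remove. The file name and records are placeholders, and awk is used for counting here only so that a count of zero is printed without a non-zero exit status.

```shell
# Tiny, made-up Ensembl-style file: one line on chromosome 1, two on MT.
tmpdir=$(mktemp -d)
gtf="$tmpdir/names.gtf"
printf '1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "G1";\n' > "$gtf"
printf 'MT\tsrc\tgene\t10\t50\t.\t+\t.\tgene_id "G2";\n' >> "$gtf"
printf 'MT\tsrc\texon\t10\t50\t.\t+\t.\tgene_id "G2";\n' >> "$gtf"

# Distinct sequence names in column 1 (comment lines starting with "#" are skipped).
seqnames=$(awk -F '\t' '!/^#/ {print $1}' "$gtf" | sort -u)
echo "$seqnames"

# Count the lines each pattern would remove; n+0 prints 0 when nothing matches.
total=$(awk 'END {print NR}' "$gtf")
n_mt=$(awk '/^MT/ {n++} END {print n+0}' "$gtf")
n_chrm=$(awk '/^ChrM/ {n++} END {print n+0}' "$gtf")

echo "total=$total  ^MT=$n_mt  ^ChrM=$n_chrm"
echo "lines left after grep -v '^MT':   $((total - n_mt))"
echo "lines left after grep -v '^ChrM': $((total - n_chrm))"
```

Applied to the counts reported above, 2828317 - 2828173 = 144 lines start with MT, while no line starts with ChrM, so that output has the same length as the input.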

Read a tutorial to understand grep. Here is one: http://rous.mit.edu/index.php/Unix_commands_applied_to_bioinformatics#grep

updated 3.4 years ago by Ram 45k • written 10.6 years ago by GouthamAtla 12k

Thanks a lot Geek and Manu.

10.6 years ago by M K ▴ 660