How do I remove genes in a GFF file so that I only have genes related to glucose metabolism?
3.8 years ago
mxm189 • 0

Hi everyone,

Basically, the question is in the title. I have a file in Galaxy that contains all mouse genes, but I need to remove genes so that only those related to glucose metabolism are left. Does anyone have a good script or any idea how I should go about this? I thought about building a different file from scratch using GO annotation (AmiGO 2 search), but I don't even know which extension or program I should use to save the downloaded data, and then convert it into GAF or something. Clearly I'm new at this, so if someone could help me out that would be awesome.

-Koen

rna-seq sequencing alignment • 2.1k views

The hardest part is not the filtering, it's getting the genes related to "metabolism". Different people would probably give you different definitions of "genes related to metabolism".


This is definitely a valid point. I decided to narrow it down to genes related to glucose metabolism only. I suppose this would narrow things quite a bit?


For what purpose do you need the extracted data? Unless you need something specific, wouldn't using something like grep -f with a gene list you manually curate from existing GO annotations (e.g. here) be enough?


grep -f? I'm a complete newbie, as in I normally only do wet-lab work, sorry about that. I also found that list, but how do I use grep -f to filter on GO annotation if the file I'm working with, a GTF (GFF), doesn't contain any GO annotations? Is manually searching for the genes the only way?


Download a file export from the page linked by @newbio17 above (use the text file button at top left) and save the file locally. Alternatively, this link may work for that purpose.

We are going to get the gene names from this file. There should be 326 genes.

awk -F "\t" '{print $2}' GO_term_summary_20210131_204312.txt | tail -n +2 > gene_names
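As a quick check on this step, the sketch below wraps the same extraction in a small function that also removes duplicate names. It assumes, as in the command above, that the export is tab-separated with a header row and the gene symbol in column 2. The function name extract_gene_names is made up for illustration, and the carriage-return stripping only matters if the file was saved with Windows line endings.

```shell
# Hypothetical helper: print the unique values of column 2, skipping the header row.
extract_gene_names() {
    awk -F '\t' 'NR > 1 && $2 != "" { sub(/\r$/, "", $2); print $2 }' "$1" | sort -u
}

# Example calls (file name taken from the command above):
#   extract_gene_names GO_term_summary_20210131_204312.txt > gene_names
#   wc -l < gene_names
```

The count reported by wc -l may be lower than the number quoted above if the export lists a gene on more than one row.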

Download the GTF file from GENCODE (if you don't have one already). gunzip the GTF file to uncompress it.

We will extract only the lines that are for "glucose metabolic" genes present in gene_names file using the following command. Please note that there are multiple transcripts for each gene.

grep -w -F -f gene_names gencode.vM25.primary_assembly.annotation.gtf > genes_of_interest.gtf

The -f option reads the search terms from the gene_names file, -F treats them as plain text, and -w matches whole words only, so a short name such as Gck does not also pull in Gckr. genes_of_interest.gtf will contain the genes of interest.
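Plain text matching can still hit a name wherever it appears in a line. A stricter variant, sketched here, compares each name only against the gene_name attribute of every GTF record. It assumes GENCODE-style attributes of the form gene_name "Gck"; the function name filter_gtf_by_gene_name is made up for illustration.

```shell
# Hypothetical helper: keep GTF lines whose gene_name attribute is in a name list.
# Usage: filter_gtf_by_gene_name gene_names annotation.gtf > genes_of_interest.gtf
filter_gtf_by_gene_name() {
    awk '
        # First file: remember every non-empty gene name (dropping a trailing CR).
        NR == FNR { sub(/\r$/, ""); if ($0 != "") wanted[$0] = 1; next }
        # Second file: pull the value out of gene_name "..." and test it exactly.
        match($0, /gene_name "[^"]+"/) {
            name = substr($0, RSTART + 11, RLENGTH - 12)
            if (name in wanted) print
        }
    ' "$1" "$2"
}
```

Because only lines carrying a gene_name attribute are tested, the "##" header lines of the GTF are not copied to the output. The file is read once, so this should also run quickly on a full annotation.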


Thank you, but I ran into some trouble... almost immediately. I tried running the command in Ubuntu, twice. But it says: awk: Fatal: Cannot open file 'GO_term_summary_210131_211109.txt' (mine is called that) for reading (No such file or directory). I have the file on my desktop and Ubuntu also is running on my desktop I believe: username@desktop-3UMHGAT.


Make sure you are in the correct directory. If you have the file on your Desktop then you will need to cd ~/Desktop first.


Same, no such file or directory :(. I have Ubuntu installed for Windows. I don't know why, but maybe it's because root = desktop?


It may be best if you spend some time learning the basics of the Unix command line. I recommend this guide for new users.

Once you figure out where the file is, you should be able to do the steps I outlined above.
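For the specific situation in this thread (Ubuntu running on Windows), one likely cause, offered as an assumption rather than a diagnosis, is that the Windows desktop is not the Linux ~/Desktop. Under the Windows Subsystem for Linux the Windows drives are normally mounted under /mnt, so the desktop tends to sit at a path like /mnt/c/Users/YOUR_WINDOWS_USER/Desktop. The helper below, with the made-up name find_download, simply searches a directory tree for a file name and prints every match.

```shell
# Hypothetical helper: search a directory tree (default: home) for a file by name.
find_download() {
    find "${2:-$HOME}" -type f -name "$1" 2>/dev/null
}

# Example calls (paths are assumptions; adjust to your own setup):
#   find_download 'GO_term_summary_*.txt'
#   find_download 'GO_term_summary_*.txt' /mnt/c/Users
```

Once a path is printed, cd into its directory (quote the path if it contains spaces) and rerun the awk command from there.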


Forgot to give an update on the situation. As soon as the guide mentioned directories, I looked up some tutorials on YouTube for Ubuntu on Windows specifically, and then used the commands you mentioned, Genomax, which worked ;) Thanks. Now all I need to do is make some heatmaps... time to start watching some tutorials again.


Here are the processes related to "glucose metabolic process" at AmiGO. You could filter them by the organism you are interested in, e.g. mouse, to get the gene names.


Thank you. How do you get such an extensive list? If I search for glucose metabolism I end up with only around 30 genes.
