How to match gene names and source (lincRNA, antisense, protein_coding, ...) with a specific list of gene IDs?
10.0 years ago
M K ▴ 660

I have a list of gene IDs and I want to match them with the gene names and source (lincRNA, antisense, protein_coding, ...) from an Ensembl GTF file. For example, here is a small part of the list:

Gene id             strand
ENSG00000242959     1
ENSG00000160396     -1
ENSG00000229494     1
ENSG00000230262     -1
ENSG00000229240     -1
ENSG00000223569     1

I previously got help using an awk command to match gene IDs with gene names. How can we include the source as well?

next-gen RNA-Seq R • 3.5k views
10.0 years ago

Take only column 1 (the gene IDs).

awk '{ print $1 }' input_list | sort | uniq > gene_names

Now take the gene IDs and grep each one against the GTF file.

while read -r line; do grep "$line" genes.gtf; done < gene_names > gene_names.gtf

This will be a bit slow, but it does the job. If you want something much faster, you may need to wait for a Perl/Python script.
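A single awk pass can stand in for the grep loop: it loads the IDs into an array, then reads the GTF once and keeps the lines whose gene_id matches exactly. This is only a sketch. The demo_* file names and the sample GTF rows (including ENSG99999999999 and the GENE_* names) are made up for illustration, and it assumes the ID list has a header line.

```shell
# Made-up sample inputs so the sketch runs on its own.
printf 'Gene id\tstrand\nENSG00000242959\t1\nENSG00000160396\t-1\n' > demo_ids.txt

{
  printf '1\tlincRNA\texon\t100\t200\t.\t+\t.\tgene_id "ENSG00000242959"; gene_name "GENE_A";\n'
  printf '2\tantisense\texon\t300\t400\t.\t-\t.\tgene_id "ENSG99999999999"; gene_name "GENE_X";\n'
  printf '9\tprotein_coding\texon\t500\t600\t.\t-\t.\tgene_id "ENSG00000160396"; gene_name "GENE_B";\n'
} > demo_genes.gtf

awk '
    # First file: remember every gene ID, skipping the header line.
    NR == FNR { if (FNR > 1) wanted[$1] = 1; next }
    # Second file: pull out the gene_id value and keep the line if it is wanted.
    {
        if (match($0, /gene_id "[^"]+"/)) {
            id = substr($0, RSTART + 9, RLENGTH - 10)   # strip: gene_id " ... "
            if (id in wanted) print
        }
    }
' demo_ids.txt demo_genes.gtf > demo_subset.gtf
```

Because the comparison is on the whole gene_id value, an ID that happens to be a substring of another one will not produce extra hits, which the plain grep loop cannot guarantee.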


Hi Geek,

Komal helped me with the following awk command:

awk '{
    for (i = 1; i <= NF; i++) {
        if ($i ~ /gene_id|gene_name/) {
            printf "%s ", $(i+1)
        }
    }
    print ""
}' Homo_sapiens.GRCh37.70.gtf | sed -e 's/"//g' -e 's/;//g' -e 's/ /\t/' | sort -k1,1 | uniq > Homo_sapiens.GRCh37.70.txt

It works very well, and I merged the result file with my file using R. So I wonder if we can add the source column to this command.
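One possible way to add the source, sketched under some assumptions: in Ensembl GTFs of this era the second tab-separated column (source) is believed to hold the biotype (lincRNA, protein_coding, ...), and the attributes column usually also carries a gene_biotype key. Please check both against your own file. The demo.gtf rows, the GENE_* names and the biotypes below are invented just so the sketch runs on its own; attr() is a hypothetical helper, not part of awk.

```shell
# Made-up sample GTF rows (two exons of one gene, one exon of another).
{
  printf '1\tlincRNA\texon\t100\t200\t.\t+\t.\tgene_id "ENSG00000242959"; gene_name "GENE_A"; gene_biotype "lincRNA";\n'
  printf '1\tlincRNA\texon\t300\t400\t.\t+\t.\tgene_id "ENSG00000242959"; gene_name "GENE_A"; gene_biotype "lincRNA";\n'
  printf '9\tprotein_coding\texon\t500\t600\t.\t-\t.\tgene_id "ENSG00000160396"; gene_name "GENE_B"; gene_biotype "protein_coding";\n'
} > demo.gtf

awk -F '\t' -v OFS='\t' '
    # Return the quoted value that follows key in column 9, or "NA" if absent.
    function attr(key,    re) {
        re = key " \"[^\"]+\""
        if (match($9, re))
            return substr($9, RSTART + length(key) + 2, RLENGTH - length(key) - 3)
        return "NA"
    }
    /^#/ { next }   # skip header comments, if any
    { print attr("gene_id"), attr("gene_name"), $2, attr("gene_biotype") }
' demo.gtf | sort -u > demo_annotation.txt
```

The output has four tab-separated columns: gene ID, gene name, source (column 2) and gene_biotype. If column 2 describes the transcript rather than the gene, one gene can end up with several rows that differ only in the source column, so you may prefer to merge on gene_biotype in R.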


M K I have replied to you on the previous question. Also, do not duplicate your posts.

2.7 years ago
D. Puthier ▴ 350

If you are looking for something more readable, you may alternatively use gtftk, the CLI of pygtftk. The file gene_list.txt contains one column with the identifiers of interest for the target key ("transcript_id" in your case).

 gtftk select_by_key -k transcript_id -f gene_list.txt -i genes.gtf -V 1   > gene_list.gtf

Best

Disclaimer: I'm the pygtftk developer.
