Extract random number of transcript and exons from gencode annotation
2
0
Entering edit mode
6.9 years ago
samuel ▴ 260

Hi, I want to extract say 1000 transcripts and related exons from my gencode.gtf. Does anyone know how to do this? I'm a beginner and have used grep to extract 'transcripts' and 'exons' only but as there are not a set number of exons connected to the transcripts I'm not sure how to count to extract 1000 transcripts??

At the moment I have:

 grep 'protein_coding' gencode.v27.gtf | awk '{if($3=="transcript" || $3=="exon")print$0}'

Many thanks!

genome sequencing • 2.1k views
ADD COMMENT
1
Entering edit mode
6.9 years ago
samuel ▴ 260

Okay kind of worked out a dirty way of doing it so any more elegant solutions welcome. I found the id of the 1000th transcript and then used sed to print out everything before this id:

> grep 'protein_coding' gencode.v27.gtf | awk '{if($3=="transcript")print$0}'  > test.gtf
## to get the id of the 1000th transcript
 > cat test.gtf | sed -n -e '1000p'
### print everything from the first line to the 1000th transcript id
> sed -n '1,/<TRANSCRIPT_ID FROM THE PREVIOUS LINE OF CODE>/ p' test.gtf
ADD COMMENT
0
Entering edit mode
6.9 years ago
michael.ante ★ 3.9k

Hi Zoegward,

I think you need to extract 1000 transcript IDs from the gtf. Something like

awk '$3=="transcript" && match ($0, /protein_coding/){for(i=1;i<NF;i++){if(match($i,/transcript_id/)){print $(i+1)}}}' gencode.v27.gtf > transcript-list.txt

To extract all Transcript IDs from protein coding genes. In order to get 1000 IDs just uses head -n 1000, tail -n 1000, and if you want to shuffle it shuf it before using tail /head.

The resulting list is used for a grep search:

grep -f 1000-transcript-ids.txt gencode.v27.gtf > out.gtf

Cheers,

Michael

ADD COMMENT

Login before adding your answer.

Traffic: 1478 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6