Perhaps the following might get you started.
For exploration purposes, I exported the refFlat table from the UCSC Genome Browser as a GTF file and saved it somewhere I can find it:
$ wget https://dl.dropboxusercontent.com/u/31495717/refFlat.hg19.gtf.gz
I then extracted exons, converted the result to a BED file with BEDOPS gtf2bed ‡, and passed it to an awk script that uses associative arrays (hash tables) to store results keyed by gene name:
$ gzcat refFlat.hg19.gtf.gz \
| grep exon \
| gtf2bed - \
| awk '{ \
    name[$4]++; \
    if (name[$4] == 1) { \
        chr[$4] = $1; \
        start[$4] = $2; \
        stop[$4] = $3; \
        remainder[$4] = substr($0, index($0, $5)); \
    } \
    else { \
        stop[$4] = $3; \
    } \
} \
END { \
    for (id in name) { \
        printf("%s\t%s\t%s\t%s\t%s\n", chr[id], start[id], stop[id], id, remainder[id]); \
    } \
}' - \
> mergedRefFlatExons.hg19.bed
At the first instance of an exon for a gene name, we assign values to elements of the associative arrays, keyed by that gene name. Where we find two or more exons, the else condition of the if-else block changes the stop position for that gene to the stop position of the most recently read exon.
If this works for you, then perhaps you can extend it to meet the other condition (reference transcript with the longest exon) by keeping track of the longest exon seen so far for each gene, which you might mark in the END block (perhaps with a custom GTF attribute printed at the end of the line).
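A minimal sketch of that extension follows. It is not tested against the full refFlat file, and it makes a few assumptions: the function name merge_exons_with_longest is hypothetical, the input is tab-delimited BED as written by gtf2bed, and each line still carries a transcript_id "..." attribute somewhere in it. It also takes the minimum start and maximum stop per gene rather than relying on sort order, and prints the longest exon length and its transcript as two extra columns.

```shell
# Hypothetical helper: reads BED exons on stdin (gene name in column 4) and
# prints one line per gene: chr, start, stop, gene, longest exon length,
# and the transcript_id of that longest exon.
# Possible use: gzcat refFlat.hg19.gtf.gz | grep exon | gtf2bed - | merge_exons_with_longest
merge_exons_with_longest() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    gene = $4
    s = $2 + 0; e = $3 + 0      # force numeric comparison
    len = e - s                 # BED is 0-based, half-open
    tx = ""
    # Assumption: the transcript name appears as transcript_id "..." in the line
    if (match($0, /transcript_id "[^"]+"/))
      tx = substr($0, RSTART + 15, RLENGTH - 16)
    if (!(gene in chr)) {
      # First exon seen for this gene
      chr[gene] = $1; start[gene] = s; stop[gene] = e
      best_len[gene] = len; best_tx[gene] = tx
    } else {
      if (s < start[gene]) start[gene] = s
      if (e > stop[gene]) stop[gene] = e
      if (len > best_len[gene]) { best_len[gene] = len; best_tx[gene] = tx }
    }
  }
  END {
    for (gene in chr)
      print chr[gene], start[gene], stop[gene], gene, best_len[gene], best_tx[gene]
  }'
}
```

Note that awk does not guarantee the order of a for-in loop, so you may want to pipe the output through sort-bed (or sort) afterwards. Keying on gene name alone also lumps together any gene name that occurs on more than one chromosome.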
‡ : Conversion to BED is not necessary. I did this because I am more familiar with handling BED data than GTF. If you are more familiar with GTF files and the order in which attributes are stored, you can change the field assignments to elements of each associative array accordingly.
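For what it is worth, here is a rough sketch of how the same merge might look when working on the GTF directly. The function name merge_gtf_exons is made up for illustration. I am assuming the usual tab-delimited nine-column GTF layout (feature type in column 3, start and end in columns 4 and 5, strand in column 7, attributes in column 9) and a gene_id "..." attribute on every exon line. Coordinates are left as they are in the GTF, that is, 1-based and inclusive.

```shell
# Hypothetical helper: reads GTF on stdin and prints one line per gene:
# chr, minimum exon start, maximum exon end, gene_id, strand.
# Possible use: gzcat refFlat.hg19.gtf.gz | merge_gtf_exons
merge_gtf_exons() {
  awk 'BEGIN { FS = OFS = "\t" }
  # Only exon records that carry a gene_id attribute are considered
  $3 == "exon" && match($9, /gene_id "[^"]+"/) {
    gene = substr($9, RSTART + 9, RLENGTH - 10)
    s = $4 + 0; e = $5 + 0      # force numeric comparison
    if (!(gene in chr)) {
      chr[gene] = $1; strand[gene] = $7
      start[gene] = s; stop[gene] = e
    } else {
      if (s < start[gene]) start[gene] = s
      if (e > stop[gene]) stop[gene] = e
    }
  }
  END {
    for (gene in chr)
      print chr[gene], start[gene], stop[gene], gene, strand[gene]
  }'
}
```

Testing the feature column with $3 == "exon" is a little stricter than grep exon, which would also match any line that merely contains the string "exon" elsewhere.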
Can you provide a snippet of input, or what you expect to feed into a script? Just a few lines would help.
I just want to handle the refflat.gtf file: chrX hg19_refFlat exon 120073834 120073989 0.000000 - . gene_id "CT47A3"; transcript_id "CT47A3_dup2";
Hi camelbbs, were you able to solve this problem? I also want to have overlapping exons merged for a gene from a GTF file.
Hi Camelbbs. I'm adding this comment to all your questions: Please take some time, before you ask a question, to think more about your problems and most likely sources of answers (manuals, FAQs, Google!, etc.). When you ask a question, include some context, tell us why you ask that question, what result you need, etc. Most of your questions are vague, impossible to answer or you changed them following an answer because it became evident that it was not clear. Cheers.
Are you familiar with this area? I don't know why you say this.