How to determine partial or complete gene overlap.
3
0
Entering edit mode
8.5 years ago
hellbio ▴ 520

Dear all,

I have the gene annotations in gtf format as shown below:

chr1    stdin   exon    247829  252562  .       -       .       gene_id "ENPP1"; transcript_id "ENPP1"; exon_number "1";exon_id "ENPP1.1";
chr1    stdin   CDS     247832  252562  .       -       0       gene_id "ENPP1"; transcript_id "ENPP1"; exon_number "1"; exon_id "ENPP1.1";
chr1    stdin   exon    254466  254628  .       -       .       gene_id "ENPP1"; transcript_id "ENPP1"; exon_number "2"; exon_id "ENPP1.2";
chr1    stdin   CDS     254466  254628  .       -       1       gene_id "ENPP1"; transcript_id "ENPP1"; exon_number "2"; exon_id "ENPP1.2";
chr1    stdin   exon    247829  252562  .       -       .       gene_id "ENPP1_2"; transcript_id "ENPP1_2"; exon_number "1"; exon_id "ENPP1_2.1";
chr1    stdin   CDS     247829  252562  .       -       1       gene_id "ENPP1_2"; transcript_id "ENPP1_2"; exon_number "1"; exon_id "ENPP1_2.1";
chr1    stdin   exon    254466  254628  .       -       .       gene_id "ENPP1_2"; transcript_id "ENPP1_2"; exon_number "2"; exon_id "ENPP1_2.2";
chr1    stdin   CDS     254466  254628  .       -       2       gene_id "ENPP1_2"; transcript_id "ENPP1_2"; exon_number "2"; exon_id "ENPP1_2.2";

gene_id is duplicated for each transcript with a suffix "_1" , "_2".... Are there any tools to determine whether there is a partial or complete gene overlap with few Kb segments using the gene file in the above format?

next-gen • 2.5k views
ADD COMMENT
1
Entering edit mode
8.5 years ago
EagleEye 7.6k
echo `date`;
dir=(`dirname $1`)

base=(`basename $1`)

echo "1/7 Preparing files";


cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16 | sed 's/";//' | awk '!x[$5]++' | awk '{print $5"\t"$0;}' > $dir/ensg_annotation.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16 | sed 's/";//' | awk '{print $5"\t"$2;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/ensg_start.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16 | sed 's/";//' | awk '{print $5"\t"$3;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/ensg_end.tmp;

echo "2/7 sorting and joining files - Transcripts (Not applicable)";


echo "3/7 sorting and joining files - Genes";
### ENSG

sort -k1,1 ${dir}/ensg_start.tmp > ${dir}/ensg_start_sort.tmp;
sort -k1,1  ${dir}/ensg_end.tmp > ${dir}/ensg_end_sort.tmp;

join -j1 -t $'\t' ${dir}/ensg_start_sort.tmp ${dir}/ensg_end_sort.tmp > ${dir}/ensg_location.tmp


sort -k1,1 ${dir}/ensg_annotation.tmp > ${dir}/ensg_annotation_sort.tmp;
sort -k1,1 ${dir}/ensg_location.tmp > ${dir}/ensg_location_sort.tmp;

join -j1 -t $'\t' ${dir}/ensg_annotation_sort.tmp ${dir}/ensg_location_sort.tmp > ${dir}/ensg_annotation.txt;


echo "4/7 Distance measure - Transcripts (Not applicable)";

echo "5/7 Distance measure - Genes";

cat $dir/ensg_annotation.txt | awk 'function max(x){i=0;for(val in x){if(i<=x[val]){i=x[val];}}return i;}function min(x){i=max(x);for(val in x){if(i>x[val] && x[val]>1){i=x[val];}}return i;}{split($9,a,",") ; split($10,b,","); print $0"\t"min(a)"\t"max(b);}' | awk 'BEGIN{FS="\t"}{print $2"\t"$11"\t"$12"\t"$1"\t0\t"$5}' >  ${dir}/${base}_ensg_annotation.txt;

echo "6/7 Cleaning temporary files";

rm $dir/ensg_annotation.tmp $dir/ensg_start.tmp $dir/ensg_end.tmp $dir/ensg_annotation.txt $dir/ensg_location.tmp ${dir}/ensg_start_sort.tmp ${dir}/ensg_end_sort.tmp ${dir}/ensg_location_sort.tmp ${dir}/ensg_annotation_sort.tmp;

echo "7/7 Done";
echo `date`;
ADD COMMENT
0
Entering edit mode

Just have to tweak a little for non standard Ensembl. The above code will work for your test.genes. it outputs the bed file as below

chr1    247829  273988  ENPP1   0   -
chr1    247829  313574  ENPP1_2 0   -
chr1    247829  270611  ENPP1_3 0   -
ADD REPLY
0
Entering edit mode
8.5 years ago
EagleEye 7.6k

This might be useful,

https://github.com/bedops/bedops#Dependencies

https://bedops.readthedocs.io/en/latest/

Or convert gff to bed (gene level) then use bed tools

There is a simple script I wrote to convert GTFtoBED, use similar method.

A: Converting gtf format to bed format

I have noticed that your gff file is almost similar to ensembl format (above code only works for gencode), use this,

Script:

Sample GTF:

Sample output where you can easily get the bed columns:

ADD COMMENT
0
Entering edit mode

The script usage message shows: usage: dirname path usage: basename string [suffix] basename [-a] [-s suffix] string [...]

What does dirname mean? could it be the gtf file

ADD REPLY
0
Entering edit mode

Use,

./script.sh path_to_gtf/INPUT.GTF

Tips: make the script executable first using

chmod a+x script.sh

ADD REPLY
0
Entering edit mode

Please see below post

ADD REPLY
0
Entering edit mode
8.5 years ago
hellbio ▴ 520

I did used the script as ./EnsemblGTFtoAnnotation.sh genes.gtf . However, the output looks weird and looks like below

3BHS_CANFA  chr17           +   3BHS_CANFA.1 56552945,56552945,56553120,
41_CANFA    chr2            -   41_CANFA.1  71472762,71472762,71476569,71476569,71479564,
41_CANFA_10 chr2            -   41_CANFA_10.1    71472762,71472765,71476569,71476569,71479564,71479564
41_CANFA_11 chr2            -   41_CANFA_11.1    71472762,71472762,71476569,71476569,71479564,71479564,71480800,
41_CANFA_12 chr2            -   41_CANFA_12.1   71472762,71472762,71476569,71476569,71479564,71479564,
ADD COMMENT
0
Entering edit mode

Please edit and delete 18,22 from cut command in these lines

echo "1/7 Preparing files"; cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '!x[$6]++' | awk '{print $6"\t"$0;}' > $dir/enst_annotation.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '!x[$5]++' | awk '{print $5"\t"$0;}' > $dir/ensg_annotation.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '{print $6"\t"$2;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/enst_start.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '{print $5"\t"$2;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/ensg_start.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '{print $6"\t"$3;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/enst_end.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '{print $5"\t"$3;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/ensg_end.tmp;

ADD REPLY
0
Entering edit mode

Unfortunately,still the same!! Uploaded a sample file for one gene with few lines (http://www.fileconvoy.com/dfl.php?id=g9e89b42792cd42a1999835989774c2d90ccf95918) if it helps!!

ADD REPLY

Login before adding your answer.

Traffic: 2580 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6