Question

How to determine partial or complete gene overlap.

0

Entering edit mode

8.5 years ago

hellbio ▴ 520

Dear all,

I have the gene annotations in gtf format as shown below:

chr1    stdin   exon    247829  252562  .       -       .       gene_id "ENPP1"; transcript_id "ENPP1"; exon_number "1";exon_id "ENPP1.1";
chr1    stdin   CDS     247832  252562  .       -       0       gene_id "ENPP1"; transcript_id "ENPP1"; exon_number "1"; exon_id "ENPP1.1";
chr1    stdin   exon    254466  254628  .       -       .       gene_id "ENPP1"; transcript_id "ENPP1"; exon_number "2"; exon_id "ENPP1.2";
chr1    stdin   CDS     254466  254628  .       -       1       gene_id "ENPP1"; transcript_id "ENPP1"; exon_number "2"; exon_id "ENPP1.2";
chr1    stdin   exon    247829  252562  .       -       .       gene_id "ENPP1_2"; transcript_id "ENPP1_2"; exon_number "1"; exon_id "ENPP1_2.1";
chr1    stdin   CDS     247829  252562  .       -       1       gene_id "ENPP1_2"; transcript_id "ENPP1_2"; exon_number "1"; exon_id "ENPP1_2.1";
chr1    stdin   exon    254466  254628  .       -       .       gene_id "ENPP1_2"; transcript_id "ENPP1_2"; exon_number "2"; exon_id "ENPP1_2.2";
chr1    stdin   CDS     254466  254628  .       -       2       gene_id "ENPP1_2"; transcript_id "ENPP1_2"; exon_number "2"; exon_id "ENPP1_2.2";

gene_id is duplicated for each transcript with a suffix "_1" , "_2".... Are there any tools to determine whether there is a partial or complete gene overlap with few Kb segments using the gene file in the above format?

next-gen • 2.5k views

ADD COMMENT • link updated 8.5 years ago by EagleEye 7.6k • written 8.5 years ago by hellbio ▴ 520

0

Entering edit mode

8.5 years ago

EagleEye 7.6k

This might be useful,

https://github.com/bedops/bedops#Dependencies

https://bedops.readthedocs.io/en/latest/

Or convert gff to bed (gene level) then use bed tools

There is a simple script I wrote to convert GTFtoBED, use similar method.

A: Converting gtf format to bed format

I have noticed that your gff file is almost similar to ensembl format (above code only works for gencode), use this,

Script:

Sample GTF:

Sample output where you can easily get the bed columns:

ADD COMMENT • link 8.5 years ago by EagleEye 7.6k

0

Entering edit mode

The script usage message shows: usage: dirname path usage: basename string [suffix] basename [-a] [-s suffix] string [...]

What does dirname mean? could it be the gtf file

ADD REPLY • link 8.5 years ago by hellbio ▴ 520

0

Entering edit mode

Use,

./script.sh path_to_gtf/INPUT.GTF

Tips: make the script executable first using

chmod a+x script.sh

ADD REPLY • link 8.5 years ago by EagleEye 7.6k

0

Entering edit mode

Please see below post

ADD REPLY • link 8.5 years ago by hellbio ▴ 520

0

Entering edit mode

8.5 years ago

hellbio ▴ 520

I did used the script as ./EnsemblGTFtoAnnotation.sh genes.gtf . However, the output looks weird and looks like below

3BHS_CANFA  chr17           +   3BHS_CANFA.1 56552945,56552945,56553120,
41_CANFA    chr2            -   41_CANFA.1  71472762,71472762,71476569,71476569,71479564,
41_CANFA_10 chr2            -   41_CANFA_10.1    71472762,71472765,71476569,71476569,71479564,71479564
41_CANFA_11 chr2            -   41_CANFA_11.1    71472762,71472762,71476569,71476569,71479564,71479564,71480800,
41_CANFA_12 chr2            -   41_CANFA_12.1   71472762,71472762,71476569,71476569,71479564,71479564,

ADD COMMENT • link 8.5 years ago by hellbio ▴ 520

0

Entering edit mode

Please edit and delete 18,22 from cut command in these lines

echo "1/7 Preparing files"; cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '!x[$6]++' | awk '{print $6"\t"$0;}' > $dir/enst_annotation.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '!x[$5]++' | awk '{print $5"\t"$0;}' > $dir/ensg_annotation.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '{print $6"\t"$2;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/enst_start.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '{print $5"\t"$2;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/ensg_start.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '{print $6"\t"$3;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/enst_end.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '{print $5"\t"$3;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/ensg_end.tmp;

ADD REPLY • link 8.5 years ago by EagleEye 7.6k

0

Entering edit mode

Unfortunately,still the same!! Uploaded a sample file for one gene with few lines (http://www.fileconvoy.com/dfl.php?id=g9e89b42792cd42a1999835989774c2d90ccf95918) if it helps!!

ADD REPLY • link 8.5 years ago by hellbio ▴ 520

score 1 · Accepted Answer · 2016-06-03

echo `date`;
dir=(`dirname $1`)

base=(`basename $1`)

echo "1/7 Preparing files";


cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16 | sed 's/";//' | awk '!x[$5]++' | awk '{print $5"\t"$0;}' > $dir/ensg_annotation.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16 | sed 's/";//' | awk '{print $5"\t"$2;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/ensg_start.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16 | sed 's/";//' | awk '{print $5"\t"$3;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/ensg_end.tmp;

echo "2/7 sorting and joining files - Transcripts (Not applicable)";


echo "3/7 sorting and joining files - Genes";
### ENSG

sort -k1,1 ${dir}/ensg_start.tmp > ${dir}/ensg_start_sort.tmp;
sort -k1,1  ${dir}/ensg_end.tmp > ${dir}/ensg_end_sort.tmp;

join -j1 -t $'\t' ${dir}/ensg_start_sort.tmp ${dir}/ensg_end_sort.tmp > ${dir}/ensg_location.tmp


sort -k1,1 ${dir}/ensg_annotation.tmp > ${dir}/ensg_annotation_sort.tmp;
sort -k1,1 ${dir}/ensg_location.tmp > ${dir}/ensg_location_sort.tmp;

join -j1 -t $'\t' ${dir}/ensg_annotation_sort.tmp ${dir}/ensg_location_sort.tmp > ${dir}/ensg_annotation.txt;


echo "4/7 Distance measure - Transcripts (Not applicable)";

echo "5/7 Distance measure - Genes";

cat $dir/ensg_annotation.txt | awk 'function max(x){i=0;for(val in x){if(i<=x[val]){i=x[val];}}return i;}function min(x){i=max(x);for(val in x){if(i>x[val] && x[val]>1){i=x[val];}}return i;}{split($9,a,",") ; split($10,b,","); print $0"\t"min(a)"\t"max(b);}' | awk 'BEGIN{FS="\t"}{print $2"\t"$11"\t"$12"\t"$1"\t0\t"$5}' >  ${dir}/${base}_ensg_annotation.txt;

echo "6/7 Cleaning temporary files";

rm $dir/ensg_annotation.tmp $dir/ensg_start.tmp $dir/ensg_end.tmp $dir/ensg_annotation.txt $dir/ensg_location.tmp ${dir}/ensg_start_sort.tmp ${dir}/ensg_end_sort.tmp ${dir}/ensg_location_sort.tmp ${dir}/ensg_annotation_sort.tmp;

echo "7/7 Done";
echo `date`;