4.7 years ago
newbie
In one of my analyses I found some novel lncRNAs that are not annotated in GENCODE. They are in a GTF file that looks like this:
My gtf [example]:
chr17 StringTie transcript 49187581 49191235 1000 + . gene_id "MSTRG.100038"; transcript_id "MSTRG.100038.1"; class_code "u"; transcript_length "1188"; lncRNA_type "LincRNA";
chr17 StringTie exon 49187581 49187711 1000 + . gene_id "MSTRG.100038"; transcript_id "MSTRG.100038.1"; exon_number "1"; class_code "u"; transcript_length "1188"; lncRNA_type "LincRNA";
chr17 StringTie exon 49190179 49191235 1000 + . gene_id "MSTRG.100038"; transcript_id "MSTRG.100038.1"; exon_number "2"; class_code "u"; transcript_length "1188"; lncRNA_type "LincRNA";
chr17 StringTie transcript 49479713 49480376 1000 - . gene_id "MSTRG.100058"; transcript_id "MSTRG.100058.1"; class_code "u"; transcript_length "664"; lncRNA_type "LincRNA";
chr17 StringTie exon 49479713 49480376 1000 - . gene_id "MSTRG.100058"; transcript_id "MSTRG.100058.1"; exon_number "1"; class_code "u"; transcript_length "664"; lncRNA_type "LincRNA";
chr17 StringTie transcript 47869876 47875390 1000 - . gene_id "MSTRG.100064"; transcript_id "MSTRG.100064.9"; class_code "u"; transcript_length "5364"; lncRNA_type "LincRNA";
chr17 StringTie exon 47869876 47873933 1000 - . gene_id "MSTRG.100064"; transcript_id "MSTRG.100064.9"; exon_number "1"; class_code "u"; transcript_length "5364"; lncRNA_type "LincRNA";
I also downloaded mitranscriptome.gtf from here (Mitranscriptome). Below are some example lines from that GTF:
chr1 mitranscriptome transcript 11017 15297 1000.0 - . tcat "pseudogene"; gene_id "G000001"; tss_id "TSS000001"; uce "FALSE"; transcript_id "T000001"; tstatus "annotated"; tgenic "NA"; func_name_final "NA";
chr1 mitranscriptome transcript 11017 29382 1000.0 - . tcat "pseudogene"; gene_id "G000001"; tss_id "TSS000002"; uce "FALSE"; transcript_id "T000002"; tstatus "annotated"; tgenic "NA"; func_name_final "NA";
chr1 mitranscriptome exon 11017 11526 1000.0 - . exon_number "0"; tcat "pseudogene"; gene_id "G000001"; tss_id "TSS000001"; uce "FALSE"; transcript_id "T000001"; tstatus "annotated"; tgenic "NA"; func_name_final "NA";
chr1 mitranscriptome exon 11017 11526 1000.0 - . exon_number "0"; tcat "pseudogene"; gene_id "G000001"; tss_id "TSS000002"; uce "FALSE"; transcript_id "T000002"; tstatus "annotated"; tgenic "NA"; func_name_final "NA";
chr1 mitranscriptome transcript 11993 13957 1000.0 + . tcat "pseudogene"; gene_id "G000002"; tss_id "TSS000003"; uce "FALSE"; transcript_id "T000003"; tstatus "annotated"; tgenic "NA"; func_name_final "NA";
chr1 mitranscriptome exon 11993 12227 1000.0 + . exon_number "0"; tcat "pseudogene"; gene_id "G000002"; tss_id "TSS000003"; uce "FALSE"; transcript_id "T000003"; tstatus "annotated"; tgenic "NA"; func_name_final "NA";
chr1 mitranscriptome exon 12613 12721 1000.0 + . exon_number "1"; tcat "pseudogene"; gene_id "G000002"; tss_id "TSS000003"; uce "FALSE"; transcript_id "T000003"; tstatus "annotated"; tgenic "NA"; func_name_final "NA";
I would like to compare the lncRNAs from my analysis against the mitranscriptome GTF and keep only the truly novel lncRNAs, i.e. those that are not found in mitranscriptome.
For this I ran the following:
bedtools intersect -v -b mitranscriptome.v2.gtf -a myAnalysis.lncRNAs.unique.gtf > myAnalysis.lncRNAs.unique.NOT.IN.MITRANSCRIPTOME.gtf
Is the above usage of bedtools intersect the right way to get the novel ones?
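For comparison, here is a sketch of a stricter variant. `bedtools intersect -v` is applied to each GTF record independently and ignores strand unless `-s` is given. A transcript record can therefore be dropped because something falls inside one of its introns while its exon records are kept. The sketch compares exons only, on the same strand, and then removes every record of any transcript that has at least one overlapping exon. It assumes tab-separated GTF files; the function names and the intermediate file names (`my.exons.gtf`, `mit.exons.gtf`, `known.ids`) are made up for illustration. Whether exon-level, same-strand overlap is the right definition of "already known" is a judgment call, not something the tools decide.

```shell
# Sketch only: helper names are hypothetical, input file names come from the post.

# Keep only exon records from a tab-separated GTF.
exons_only() {
    awk -F'\t' '$3 == "exon"' "$1"
}

# Print the unique transcript_id attributes (with quotes) found on stdin.
transcript_ids() {
    grep -o 'transcript_id "[^"]*"' | sort -u
}

# Exon-level, strand-aware overlap: -u reports each A record that overlaps
# B at least once, -s requires the overlap to be on the same strand.
find_known_ids() {
    bedtools intersect -u -s -a "$1" -b "$2" | transcript_ids
}

# Drop every record of the GTF in $2 whose transcript_id is listed in $1.
# The closing quote in each pattern keeps "X.1" from matching "X.10".
drop_transcripts() {
    if [ -s "$1" ]; then
        grep -v -F -f "$1" "$2"
    else
        cat "$2"   # nothing overlapped, keep everything
    fi
}

# Usage (left commented so that sourcing this block has no side effects):
# exons_only myAnalysis.lncRNAs.unique.gtf > my.exons.gtf
# exons_only mitranscriptome.v2.gtf > mit.exons.gtf
# find_known_ids my.exons.gtf mit.exons.gtf > known.ids
# drop_transcripts known.ids myAnalysis.lncRNAs.unique.gtf \
#     > myAnalysis.lncRNAs.unique.NOT.IN.MITRANSCRIPTOME.gtf
```

Comparing the number of transcripts from this variant with the output of the original one-line command shows how much strand and intron overlaps affect the result.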