5 months ago
Varsha
I have a list of novel transcripts after filtering, along with the GTF file and FASTA file for these novel transcripts. I also have their chromosome locations. How do I calculate their GC content and find their exon counts? This is how my FASTA file looks:
>NET.31408.1 loc:17|336737-341315|+ exons:336737-337034,338681-339015,339449-339828,341216-341315 segs:1-298,299-633,634-1013,1014-1113
CACAAACCACCCTTGCATTCTGCCAGCCACCCCTCTGCAAAACTGGTGCTCCAGCCACCAGGACCTTGAG
GCATTGCAGCCGCCCAGCCTTGTCTCCGGCACCCTCTCACCTGGAACGCCTTTGGTTCACGCTGTCTACC
TCCTCCACCTGGGGGTGGCCCAGCACCACCTCCCCTGGGAATTCTCCAGCTCCTCCCATCAGGCTCCCAT
TCGGCTTGAGCCCACCGCCCTCCCGTCAGCATTTCATTCCGCCCGCATCCTCGGGGGCATTTACCTGTTA
CCCCGATGCCCAGACATGAAACGCAGGCCTGCTTCCATTTACGTGATGTATGTGGGTCTGGAAAGTCCCA
GGCAGGCAGAATCTTTGCAGAGGAAACCTGATTTCGGCTCCCACCTGGGAACTGCTTGTTGAAGGAGCCC
AAGAGAAACCTCTCCATGAAGCAGAGAAGCTTCTAGGGAAAAAGAAGCCTCAACCCTCCTCACCCGCTTG
GAAAAGGCCCAGTCCTCAGGTGTGCTGAGGGCGGTGCTCCAGGCCCCGGGGGGCAGCGTCCCACACCCCT
GCCTCCGCCAGCAGCTTCTGCACGGCCCAGCCCAGACTCCAGCTCCCAGGTGGCTCTCCGCGGGTCCTGC
CAGCCTGACCCTGCACTACCAAACTGGGAGAGGAAGAAGCCGCCTCCATGGGTGCTGCCCACCTGCCAGG
TGCCCGCCACTGGCTGACCAACTGGAATCATCACAAGCCCCAGAGGACGACGTGATCATCACTCCTTTCA
GAAAAGAAGAAACCAGCTCGAGAGGGGCAGCCACGTGCCCAAGGCCCCATAAGCTGGCACCAGGTGCCCA
GTTTGGCCCAACGGAGCTGGGCTGAGCCCAGGTGCTTTCTATCCCCCTCCTCCTCCCAAGGCGTCGGGTT
GCAGGTGCGGTGCCTACAGGTGCCTAACGAAAGCAATGAGCCGGGTATTCTCCGAGCACCTGCCACACAC
CCAGCAGCGGGGAGCACAGAGTTCCCAGAAACTAACTTCGCAGTTTCTGGTGAACGTGTCTGACCTCTCC
TACTGGACCAAACACTCCCTCAGAGCAGGATGCCTCCTGCCCATATGGTACTGAAAACTGTGG
>NET.31408.2 loc:17|336737-342935|+ exons:336737-337034,338681-339015,342853-342935 segs:1-298,299-633,634-716
CACAAACCACCCTTGCATTCTGCCAGCCACCCCTCTGCAAAACTGGTGCTCCAGCCACCAGGACCTTGAG
GCATTGCAGCCGCCCAGCCTTGTCTCCGGCACCCTCTCACCTGGAACGCCTTTGGTTCACGCTGTCTACC
TCCTCCACCTGGGGGTGGCCCAGCACCACCTCCCCTGGGAATTCTCCAGCTCCTCCCATCAGGCTCCCAT
TCGGCTTGAGCCCACCGCCCTCCCGTCAGCATTTCATTCCGCCCGCATCCTCGGGGGCATTTACCTGTTA
CCCCGATGCCCAGACATGAAACGCAGGCCTGCTTCCATTTACGTGATGTATGTGGGTCTGGAAAGTCCCA
GGCAGGCAGAATCTTTGCAGAGGAAACCTGATTTCGGCTCCCACCTGGGAACTGCTTGTTGAAGGAGCCC
AAGAGAAACCTCTCCATGAAGCAGAGAAGCTTCTAGGGAAAAAGAAGCCTCAACCCTCCTCACCCGCTTG
GAAAAGGCCCAGTCCTCAGGTGTGCTGAGGGCGGTGCTCCAGGCCCCGGGGGGCAGCGTCCCACACCCCT
GCCTCCGCCAGCAGCTTCTGCACGGCCCAGCCCAGACTCCAGCTCCCAGGTGGCTCTCCGCGGGTCCTGC
CAGAGAATTTATAGAGTCTCATTGACCAACCAGCCAGACATGATGCTAATCTGGGTTCCAAAAACAAGAA
ACACCACGACAGATCA
And the GTF file:
17 StringTie transcript 6670990 6676823 1000 - . gene_id "NET.31822"; transcript_id "NET.31822.1";
17 StringTie exon 6670990 6671996 1000 - . gene_id "NET.31822"; transcript_id "NET.31822.1"; exon_number "1";
17 StringTie exon 6676715 6676823 1000 - . gene_id "NET.31822"; transcript_id "NET.31822.1"; exon_number "2";
8 StringTie transcript 140349489 140350371 1000 + . gene_id "NET.78699"; transcript_id "NET.78699.2";
8 StringTie exon 140349489 140349957 1000 + . gene_id "NET.78699"; transcript_id "NET.78699.2"; exon_number "1";
8 StringTie exon 140350234 140350371 1000 + . gene_id "NET.78699"; transcript_id "NET.78699.2"; exon_number "2";
3 StringTie transcript 14136345 14137669 1000 + . gene_id "NET.53089"; transcript_id "NET.53089.5";
3 StringTie exon 14136345 14136680 1000 + . gene_id "NET.53089"; transcript_id "NET.53089.5"; exon_number "1";
3 StringTie exon 14137357 14137669 1000 + . gene_id "NET.53089"; transcript_id "NET.53089.5"; exon_number "2";
20 StringTie transcript 58657036 58659388 1000 - . gene_id "NET.49267"; transcript_id "NET.49267.2";
20 StringTie exon 58657036 58657089 1000 - . gene_id "NET.49267"; transcript_id "NET.49267.2"; exon_number "1";
20 StringTie exon 58658947 58659388 1000 - . gene_id "NET.49267"; transcript_id "NET.49267.2"; exon_number "2";
8 StringTie transcript 79826214 79827927 1000 - . gene_id "NET.77436"; transcript_id "NET.77436.1";
8 StringTie exon 79826214 79826726 1000 - . gene_id "NET.77436"; transcript_id "NET.77436.1"; exon_number "1";
8 StringTie exon 79827716 79827927 1000 - . gene_id "NET.77436"; transcript_id "NET.77436.1"; exon_number "2";
17 StringTie transcript 336737 341315 1000 + . gene_id "NET.31408"; transcript_id "NET.31408.1";
17 StringTie exon 336737 337034 1000 + . gene_id "NET.31408"; transcript_id "NET.31408.1"; exon_number "1";
17 StringTie exon 338681 339015 1000 + . gene_id "NET.31408"; transcript_id "NET.31408.1"; exon_number "2";
17 StringTie exon 339449 339828 1000 + . gene_id "NET.31408"; transcript_id "NET.31408.1"; exon_number "3";
17 StringTie exon 341216 341315 1000 + . gene_id "NET.31408"; transcript_id "NET.31408.1"; exon_number "4";
3 StringTie transcript 171240150 171244033 1000 + . gene_id "NET.56472"; transcript_id "NET.56472.1";
3 StringTie exon 171240150 171240201 1000 + . gene_id "NET.56472"; transcript_id "NET.56472.1"; exon_number "1";
3 StringTie exon 171243772 171244033 1000 + . gene_id "NET.56472"; transcript_id "NET.56472.1"; exon_number "2";
2 StringTie transcript 8701295 8702421 1000 - . gene_id "NET.41416"; transcript_id "NET.41416.1";
I have difficulty reading this (formatting lost in translation?), but unless you're trying to do this programmatically as an exercise, I would just use the available tools, maybe seqkit, fastqc, subread, or bedtools?
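If you do want to do it programmatically, here is a minimal Python sketch using only the standard library. It is a rough outline under some assumptions, not a tested pipeline: the function names (`read_fasta`, `gc_percent`, `exon_counts`) are made up for this example, GC content is computed over A/C/G/T only (so N bases are ignored), and every exon line is assumed to carry a `transcript_id` attribute as in the GTF excerpt above. Fields are split on any whitespace so that it also works on the space-flattened lines as pasted in the question.

```python
import re
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_percent(seq):
    """GC content in percent, counting only A/C/G/T (assumption: N is ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def exon_counts(gtf_lines):
    """Count 'exon' features per transcript_id in GTF lines."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        # Split on any whitespace; the 9th field keeps the whole attribute string.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


# Small in-memory demo. For real files, pass open file handles instead,
# e.g. read_fasta(open("novel.fa")) (file name is a placeholder).
demo_fasta = [">toy.1 loc:1|1-8|+", "GGCC", "AATT"]
demo_gtf = [
    '17 StringTie transcript 336737 341315 1000 + . gene_id "NET.31408"; transcript_id "NET.31408.1";',
    '17 StringTie exon 336737 337034 1000 + . gene_id "NET.31408"; transcript_id "NET.31408.1"; exon_number "1";',
    '17 StringTie exon 338681 339015 1000 + . gene_id "NET.31408"; transcript_id "NET.31408.1"; exon_number "2";',
    '17 StringTie exon 339449 339828 1000 + . gene_id "NET.31408"; transcript_id "NET.31408.1"; exon_number "3";',
    '17 StringTie exon 341216 341315 1000 + . gene_id "NET.31408"; transcript_id "NET.31408.1"; exon_number "4";',
]

for header, seq in read_fasta(demo_fasta):
    # The first whitespace-delimited token of the header is the transcript ID.
    print(header.split()[0], round(gc_percent(seq), 2))

for transcript_id, n_exons in exon_counts(demo_gtf).items():
    print(transcript_id, n_exons)
```

Since your FASTA headers already contain an `exons:` field, you could also cross-check the GTF-based count against the number of comma-separated ranges in that field.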