Hello, I have an annotation.gtf of a particular plant species and a fasta file of filtered transcripts of that plant. I need to filter out certain transcripts from that fasta file having less than two exons. My gtf file looks like this :
scaffold_1 AT_DB_2015 exon 29512 30439 . - . transcript_id "AT000010.1"; gene_id "AT000010"; gene_name "AT000010";
scaffold_1 AT_DB_2015 exon 30816 31164 . - . transcript_id "AT000010.1"; gene_id "AT000010"; gene_name "AT000010";
My fasta file of transcripts:
>TCONS_00000187
GCAGACAAAAATCAGGTAAATGGTAAAGTTTACTTGCAGAAACAACCacaaaaagaaaca
agcaagaaaGCCGACAGAAAACATTtggaaatttattaatgttTCAATGTTTACAAGGCT
GATGAATATACAATTAGACCTCAGGCTCAGGAGGGTCAACGGTCTCAGCAGGCGAAGGAT
CAGCTACCTCGGGGAGTGGAAGATCAGCTGTATTaggaggaggaggcatggacgA
>TCONS_00000188
GCTCTCCCCTACTTTTCGACCATAGATTTACTTCTTCAGCTGAACCgaaaatttaaaaga
ataagAGTGTAAGCAAAGTAATACTATTTagattaggggtgggcaaaatcaaATTCGGAT
TGAGTTCGGATCAATCTGATCAGATTTCAAATTGTTCGAGTTGGGTAAAAATTAATCTGA
TGCAACCCGAACTTACAACCCGACTCAATCCGATTAGATATATATTCAAATCGGATCAGA
TCAGGTAAAAAAATCAGATTGAATTCGGATATGATTGGGTTCGAATTACAACAATCAAAT
ACTAAACATTAATTTGCTATCTATTAAATTAGTAACTATATATACCATCAAAGCTTCAAA
Is there any awk command (or any command) to filter out transcripts having < 2 exons and to keep those having >= 2 exons? Any help will be highly appreciated. Thanks.
what's the link between fasta and gtf? You can group by gene/transcript and filter by exon number. Please post more lines and example output.
I am so sorry for the previous gtf file. The correct gtf file is this :
From here, I want to extract TCONS ids with exon numbers >= 2. The previous fasta file is the sequence information of all tcons of this gtf. Any command based solution will be highly helpful. Thank you.
Since you are running, Tuxedo suite, I assume that gffread is installed on your system. Run the following code and this would print transcript in the first column and number of exons in corresponding transcript would be in second column.
If you still have issues in running the code, please upload example gtf file some where and share the location. Parsing copy/pasted gtf is time consuming.
This is more robust than previous code, but you need to know where exonCount and other attributes are coming in gffread output:
Transcripts.gtf is taken from example files provided by gffread on github.
in R with dplyr:
base R:
This would print gene, transcript, and number of exons
Thank you so much, it worked for me!