Hi everyone,
I sequenced a cell line that carries an integrated plasmid with more than one relevant gene. I would like to add the plasmid sequence to a FASTA file as well as a GTF file, and I'm not sure how to do this correctly.
If I were only interested in one transcript, let's say GFP, as described in the example under this link (https://groups.google.com/forum/#!msg/rna-star/FGQRotrCB1Q/oQ2annphCQAJ), I would add something like this:
fasta:
">eGFP eGFP sequence"
gtf:
"eGFP AddedGenes exon 1 720 . + 0 gene_id "eGFP"; transcript_id "eGFP";"
But how would this look if I wanted to annotate two genes from my plasmid, e.g. the gene of interest (GOI) and the selection marker, as well as their UTRs? Would I treat the whole plasmid sequence as the "gene" in the GTF and the genes I'm interested in as transcripts, or would I handle the genes (selection marker and GOI) separately? In the latter case I'm not sure what to add to the FASTA file. Would I just add the sequences of my genes plus UTRs? And then I'm not sure what the coordinates of the respective elements would be. Maybe someone could help me with a schematic of how this should be built up correctly.
Thanks a lot!
You can do it as in your example (except that the sequence obviously goes below the header line in the FASTA file).
About UTRs, there is no standard on whether to include them or not. There are GFFs/GTFs with lots of feature types like UTR/exon/lncRNA/pseudogene/etc. and others with only region/gene/etc. This can also vary depending on the database and species.
I think the best thing you can do is look at the files from well-documented databases like GenBank/Ensembl/UCSC, for well-documented model organisms like E. coli, Saccharomyces cerevisiae, Homo sapiens, Arabidopsis thaliana, Caenorhabditis elegans, etc., and see what they look like.
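For the two-gene case, here is a rough sketch of one possible layout, not the only valid one. It assumes the whole plasmid goes into the FASTA as a single extra sequence and each gene gets its own GTF line, with 1-based coordinates counted along the plasmid. The name "myPlasmid", the placeholder sequence, the gene IDs and all coordinates are made up for illustration. The exon spans are assumed to already include whatever UTR you want counted, and the frame column is set to "." here.

```python
def fasta_record(name, sequence, width=60):
    """Return a FASTA record: header line, then the sequence wrapped below it."""
    lines = [f">{name}"]
    for i in range(0, len(sequence), width):
        lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"


def gtf_line(seqname, source, feature, start, end, strand, gene_id, transcript_id):
    """Return one tab-separated GTF line (1-based, inclusive coordinates)."""
    attributes = f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'
    fields = [seqname, source, feature, str(start), str(end),
              ".", strand, ".", attributes]
    return "\t".join(fields) + "\n"


# Hypothetical plasmid: placeholder sequence, made-up coordinates.
plasmid_name = "myPlasmid"
plasmid_seq = "ACGT" * 750  # 3000 bp placeholder, replace with the real sequence

# (gene_id, start, end, strand), positions relative to the plasmid sequence
genes = [
    ("GOI", 501, 1400, "+"),
    ("SelMarker", 1801, 2600, "+"),
]

fasta_text = fasta_record(plasmid_name, plasmid_seq)
gtf_text = "".join(
    gtf_line(plasmid_name, "AddedGenes", "exon", start, end, strand, gid, gid)
    for gid, start, end, strand in genes
)

print(gtf_text, end="")
```

The first GTF column has to match the FASTA header name exactly, and both texts would then be appended to the existing genome FASTA and GTF before building the index.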