Hey everyone,
I'm currently working on annotating TSS (Transcription Start Site) information, and I have come across a dilemma: should I annotate TSS by transcript or by gene? I would really appreciate your insights and opinions on this matter.
On a related note, I have been trying to implement a code to annotate a TSS bed file from non-canonical species using a GTF file independently of being a NCBI or Ensembl GTF. I would like to verify if my code is correct or if there are any improvements that can be made. Here's the code I have so far:
import argparse
def gtf_to_tss_bed(input_file, output_file):
tss_coordinates = set() # Set to store unique TSS coordinates
with open(input_file, 'r') as gtf, open(output_file, 'w') as bed:
for line in gtf:
if not line.startswith('#'):
fields = line.strip().split('\t')
if len(fields) >= 8 and fields[2] == 'transcript':
chrom = fields[0]
start = int(fields[3]) - 1 # Adjust to 0-based index
end = int(fields[4])
strand = fields[6]
transcript_id = fields[8].split('transcript_id "')[1].split('";')[0] # Extract transcript_id
if strand == '+':
tss = start
else:
tss = end - 1
coordinates = (chrom, tss, tss + 1)
if coordinates in tss_coordinates:
continue # Skip if coordinates already seen
tss_coordinates.add(coordinates) # Add coordinates to set
bed_fields = [chrom, str(tss), str(tss + 1), transcript_id, strand] # BED format: [chrom, start, end, transcript_id, strand]
bed.write('\t'.join(bed_fields) + '\n')
# Parse command-line arguments
parser = argparse.ArgumentParser(description='Convert GTF file to TSS BED file')
parser.add_argument('-g', '--gtf', help='Input GTF file', required=True)
parser.add_argument('-o', '--output', help='Output BED file', required=True)
args = parser.parse_args()
# Call the conversion function
gtf_to_tss_bed(args.gtf, args.output)
Thank you in advance,
Rafael
I've always had this doubt too, IMO probably annotating on transcript (Tx) TSS is the most useful but it also may depend on your biological question and on the data at hand.
For example, in the past I've observed for some ChIP-seq data and also for DNAm data that my signal was functionally associated with Tx TSSs (in the sense that, when exploring the data, there were specific enrichments/depletions for the factor or DNAm at the level of isoform features such as TSS) so to me it made sense to use the Tx level.
Moreover, after annotating at the Tx level, you still have the option of assigning each Tx to their corresponding gene. The only issue I can imagine with annotating many features to the Tx level (TSS, exon, intron...) is that maybe sometimes if there are too many overlapping Txs in a "gene body", in the end your final annotation (chosen maybe with some priority) would not be too informative of that structure. But again, this simplification also occurs when annotating to the gene level. Also, the concept of a "gene" and its delimitations is (to me) more obscure than the concept of a transcript. Variability in annotating to the Tx level may also be more influenced by the quality of the organisms' annotation in the sense of the confidence of all of the transcripts there described and so on.
Just some thoughts... it would be nice if someone can suggest cases when annotating to the gene level is the preferred option!
I think (generally) most TSSs for transcripts start at the same point within some margin of error. I would worry about overrepresenting some gene TSSs if the only difference between transcripts is alternative splicing.
At the end of the day, as long as it's abundantly clear in the methods section which of the two approaches was taken, and any downstream filtering needed, it shouldn't matter.
I was thinking that the best option is to do it by transcript due to this approach takes into account the alternative promoters and transcription start sites that may be specific to certain transcripts. It provides a more detailed and fine-grained representation of TSS usage within a gene, allowing for a deeper understanding of transcript-specific regulation. However, the TSS enrichment will be higher ¿? because you will be able to annotate more TSSs. Also for ATAC-seq or CHIP-seq studies, the TSS of the gene not always is the correct one for your tissue. It is true that if they share TSS, it is also not informative, but at least more informative than by gene.