I am using pybedtools to get promoter sequences for my genes of interest. I am almost done with the script. But I would also like to include gene names in the fasta headers. By default, it only include corrdinates in fasta headers.
Below is my script:
coords=Chr1 1000 2000 forward 1 +
a = pybedtools.BedTool(coords, from_string= True)
fastaseq = pybedtools.example_filename(genome_file)
a = a.sequence(fi=fastaseq, s=True)
promoterseq= open(a.seqfn).read()
output:
Chr1:1000-2000(+)
TTTGGGGTTATAAATTATTAGAAGTT......
Expected output:
genename:Chr1:1000-2000(+)
TTTGGGGTTATAAATTATTAGAAGTT......
I was wondering if there is a way to include the gene name in fasta header.
Thanks, R
Do you have a file or a dataset with gene names that correpond to these promoter coordinates? What nomenclature for gene names you want to use? Also tagging by transcript id (preferably RefSeq transcript) is a good idea from my experience because of extra data on something more stable relative to the assembly changes and it captures alternative TSSs better.
And I personally would add distance from TSS or translation start site relative to the transcript and with specific reference used sotred together alignments of transcripts to that particular reference. But this is just me =)