I am developing a script that will count the number of times a short nucleotide sequence hits non-coding regions of the human genome. Based on Google searches, BLAST+ appears to be the tool to use. It has a few cookbook recipes about masking a database with a FASTA file, which I want to leverage.
I want to know if there is a way to pull all known transcripts for the human genome, put a 50-100 bp buffer on the 5' and 3' ends (to avoid potential regulatory elements), and write those sequences to a file. I did not see anything on NCBI suggesting BLAST could do this task.
Does anyone have a suggestion on how to accomplish this task?
Thanks in advance.
You can download all cDNA sequences from Ensembl. I'm not sure what you mean by the buffer sequences, though.
ftp://ftp.ensembl.org/pub/release-93/fasta/homo_sapiens/cdna/README
Sej already pointed out that you can download cDNA sequences directly. Still, I do not see any biological basis for this "buffer". Do you mean untranslated regions, or gene promoters? Please leave a comment with some more details.
Yes, extending the sequence beyond the annotated gene boundaries is desirable, so that my mask file also subsumes any regulatory elements in the UTRs. I am trying to create a local BLAST database that represents benign DNA, where any alterations would be presumed silent. My strategy is to go through known genes and add extra base pairs to both ends, to also block regulatory elements that may be nearby. Also, cDNA is undesirable since I would like to avoid all introns as well.
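For what it's worth, the padding step described above can be sketched in a few lines of plain Python. This is only an illustration, assuming the gene coordinates have already been extracted from an annotation into BED-style (chrom, start, end) tuples (0-based, half-open). The function name `pad_and_merge`, the example coordinates, and the chromosome sizes are all hypothetical, not real annotation data.

```python
from collections import defaultdict


def pad_and_merge(intervals, flank=100, chrom_sizes=None):
    """Extend each (chrom, start, end) interval by `flank` bp on both sides,
    clamp to chromosome bounds, and merge overlapping or touching intervals.

    Coordinates are assumed to be 0-based, half-open (BED convention).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        new_start = max(0, start - flank)  # do not run off the chromosome start
        new_end = end + flank
        if chrom_sizes and chrom in chrom_sizes:
            new_end = min(new_end, chrom_sizes[chrom])  # nor off the end
        by_chrom[chrom].append((new_start, new_end))

    merged = []
    for chrom in sorted(by_chrom):
        cur_start, cur_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start <= cur_end:
                # Padded regions overlap or touch: grow the current block.
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged


# Hypothetical gene coordinates and chromosome sizes, for illustration only.
genes = [
    ("chr1", 1000, 2000),
    ("chr1", 2150, 3000),
    ("chr1", 50, 400),
    ("chr2", 500, 900),
]
sizes = {"chr1": 10000, "chr2": 950}

mask = pad_and_merge(genes, flank=100, chrom_sizes=sizes)
for chrom, start, end in mask:
    print(f"{chrom}\t{start}\t{end}")
```

Because whole-gene coordinates are used rather than cDNA, the padded intervals cover introns too. The printed BED lines could then be handed to a tool such as `bedtools getfasta` together with a genome FASTA to write the sequences to a file for the mask.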