Hello,
I used the UCSC Table Browser to generate a FASTA file where each entry is a CDS exon within a given region. The parameters I used are the following:
- group: Genes and gene predictions
- track: Ensembl genes
- table: ensGene
- region: position chr20:250000-1000000
- output format: sequence
- output file: myfile.fasta
Then:
- sequence type for Ensembl Genes : genomic
And in "sequence retrieval region options": CDS Exons (only) and "One FASTA record per region (exon, intron, etc.)".
Now in the resulting fasta some identical sequences occur several times, with the same range but a different ID, for instance:
>hg19_ensGene_ENST00000217233_0 range=chr20:368655-368945 5'pad=0 3'pad=0 strand=+ repeatMasking=none
ATGCGAGCC......
> (...)
>hg19_ensGene_ENST00000449710_0 range=chr20:368655-368945 5'pad=0 3'pad=0 strand=+ repeatMasking=none
ATGCGAGCC......
Why does this occur and how can I obtain a fasta where each entry is a unique range?
Thanks.
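That gene has multiple transcripts, which is what you're seeing. The ENST IDs in the headers are Ensembl transcript IDs, and with "One FASTA record per region" you get one record per CDS exon per transcript, so an exon shared by two transcripts shows up twice with the same range. To get unique ranges, either collapse the records that have the same range and strand afterwards, or download the whole genome and the annotation file and extract the sequences yourself with R or BioPerl/Biopython.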
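If you just want to collapse the duplicates in the file you already have, something like the following might do. It is only a sketch using the standard library: it assumes every header carries the range= and strand= fields as in your example, and it keeps the first record seen for each range/strand pair. The function names are made up for illustration, the sequences are the truncated ones from your post, and the third record is invented to show that distinct ranges are kept.

```python
import re

RANGE_RE = re.compile(r"range=(\S+)")
STRAND_RE = re.compile(r"strand=([+-])")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unique_ranges(records):
    """Keep only the first record for each (range, strand) pair."""
    seen = set()
    for header, seq in records:
        rng = RANGE_RE.search(header)
        strand = STRAND_RE.search(header)
        # Fall back to the whole header if the fields are missing,
        # so that such records are never merged by accident.
        key = (rng.group(1) if rng else header,
               strand.group(1) if strand else None)
        if key in seen:
            continue
        seen.add(key)
        yield header, seq


# Small in-memory demo; the last record is a made-up example.
demo = [
    ">hg19_ensGene_ENST00000217233_0 range=chr20:368655-368945 5'pad=0 3'pad=0 strand=+ repeatMasking=none",
    "ATGCGAGCC",
    ">hg19_ensGene_ENST00000449710_0 range=chr20:368655-368945 5'pad=0 3'pad=0 strand=+ repeatMasking=none",
    "ATGCGAGCC",
    ">hypothetical_record_1 range=chr20:370000-370008 5'pad=0 3'pad=0 strand=+ repeatMasking=none",
    "ATGAAATAG",
]

kept = list(unique_ranges(read_fasta(demo)))
for header, seq in kept:
    print(">" + header)
    print(seq)
```

To run it on the real file, pass an open file handle instead of the demo list, e.g. read_fasta(open("myfile.fasta")). Note that only the first transcript ID is kept for each range; if you need to know which transcripts share an exon, you would have to collect the IDs per key instead of skipping the later records.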