I have the following data:
Chromosome Start End XS XE TranscriptID GeneID Strand
36945 chr19 54754594 54756329 b'54754594,54756005,54756286,' b'54755063,54756202,54756329,' NR_110738 LOC101928804 -
36948 chr19 54769421 54771064 b'54769421,54770669,54771021,' b'54769891,54770937,54771064,' NR_110737 LOC101928804 -
36949 chr19 54769421 54771064 b'54769421,54770740,54771021,' b'54769891,54770937,54771064,' NR_110738 LOC101928804 -
36951 chr19 54785868 54835292 b'54785868,54816103,54834899,54835251,' b'54785899,54816541,54835167,54835292,' NR_110737 LOC101928804 -
36952 chr19 54785868 54835292 b'54785868,54816103,54834970,54835251,' b'54785899,54816541,54835167,54835292,' NR_110738 LOC101928804 -
Here you see transcripts from the same gene. Start is txStart and End is txEnd. For this gene (LOC101928804), can I say that the gene spans 54754594 to 54835292 (from the start of the first transcript to the end of the last transcript)? Or is this an oversimplification in some way?
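For reference, the min-start/max-end calculation described above can be sketched in a few lines of Python. This is only an illustration: the rows are copied from the table, and the function name gene_spans is made up here. It reports the span covered by the listed transcripts, which is not necessarily the annotated gene locus.

```python
# Transcript rows copied from the table above:
# (chromosome, txStart, txEnd, transcript ID, gene ID)
transcripts = [
    ("chr19", 54754594, 54756329, "NR_110738", "LOC101928804"),
    ("chr19", 54769421, 54771064, "NR_110737", "LOC101928804"),
    ("chr19", 54769421, 54771064, "NR_110738", "LOC101928804"),
    ("chr19", 54785868, 54835292, "NR_110737", "LOC101928804"),
    ("chr19", 54785868, 54835292, "NR_110738", "LOC101928804"),
]


def gene_spans(rows):
    """Return {(gene, chromosome): (min txStart, max txEnd)}.

    Hypothetical helper: it groups by gene and chromosome only, so
    separate placements of the same gene on one chromosome are merged.
    """
    spans = {}
    for chrom, start, end, _transcript, gene in rows:
        key = (gene, chrom)
        if key in spans:
            old_start, old_end = spans[key]
            spans[key] = (min(old_start, start), max(old_end, end))
        else:
            spans[key] = (start, end)
    return spans


spans = gene_spans(transcripts)
for (gene, chrom), (start, end) in spans.items():
    # Assumes UCSC-style 0-based, half-open coordinates, so end - start is the length.
    print(f"{gene} {chrom}:{start}-{end} span={end - start}")
```

Note that in the table the same transcript ID (NR_110738) appears at three non-overlapping positions, so a single min/max span would merge them into one interval.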
I would not do that. If all of your transcripts skip an exon at the 3' or 5' end, you will miss some information. What you can do is use biomaRt with your GeneID to get the gene position from the Ensembl annotation.
Thanks. Do you know if refgene has some gene start and end info? But perhaps that is a question for another thread.
ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz
Gene name information is under the gene= attribute.
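A rough sketch of how the gene-level records could be pulled out of that GFF in Python, assuming the standard nine-column GFF3 layout with the feature type in column 3 and the gene= key in the attributes column. The function name gene_records and the sample lines are invented for illustration; they are not real annotation.

```python
def gene_records(lines, gene_name):
    """Collect (seqid, start, end, strand) for 'gene' features named gene_name.

    Hypothetical helper. GFF3 coordinates are 1-based and inclusive.
    """
    hits = []
    for line in lines:
        # Skip comments, directives and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # Column 9 holds key=value pairs separated by semicolons.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if attrs.get("gene") == gene_name:
            hits.append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return hits


# Made-up lines in GFF3 layout, only to show the parsing.
sample = [
    "##gff-version 3\n",
    "chrZ\texample\tgene\t100\t900\t.\t-\t.\tID=gene-DEMO1;gene=DEMO1\n",
    "chrZ\texample\texon\t100\t200\t.\t-\t.\tID=exon-1;gene=DEMO1\n",
]
hits = gene_records(sample, "DEMO1")
print(hits)
```

For the real file, the same function should work on a handle from gzip.open(path, "rt"), with "LOC101928804" as the gene name. Sequence names in the RefSeq GFF are accessions such as NC_000019.10 rather than chr19, and txStart in the UCSC table is 0-based, so the start coordinates will differ by one.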
If this is the exon skipping you speak of, I do not see why it should matter: https://en.wikipedia.org/wiki/Exon_skipping
UCSC refgene probably does not include such mistakes, no?
How did you get this from RefGene?
These are not mistakes, just different types of transcript.