Question

Why are there multiple RefSeq sequences occurring with the same name?

2

9.8 years ago

gresserT ▴ 50

I want to run some algorithms on splice site mutations. This is what I have done (or at least tried) so far:

Download all RefSeq sequences in the NM_* category via the UCSC Table Browser in FASTA format.
Create index files with samtools and bwa.
Read the files in my Maven Java program with the HTSJDK library.

When I run my program, I get an exception from HTSJDK because there are multiple RefSeq entries with the same name:

Exception in thread "main" htsjdk.samtools.SAMException: Contig 'hg19_refGene_NM_001037501' already exists in fasta index.

in the following line of my code:

FastaSequenceIndex faIndex = new FastaSequenceIndex(new File("data/RefSeqSequencesGRCh37_NM.fa.fai"));

These RefSeq entries occur on both strands (+ and -) and at different positions, though usually not far from each other on the same chromosome. Their sequences also differ somewhat.

The Questions

1. Why are there multiple RefSeq sequences with the same name?
2. Is there a way HTSJDK can handle FASTA sequences with the same name?
3. Am I doing something completely wrong or inconvenient?

ClinVar RefSeq HTSJDK ucsc • 3.9k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.8 years ago by gresserT ▴ 50

Ram · Answer 1 · 2015-02-23

5

9.8 years ago

Devon Ryan 104k

1. If a gene apparently has multiple copies throughout the genome, then all of those copies can get the same ID. This is a reason to use Ensembl's database, since they'll have unique IDs there.
2. I don't use HTSJDK (I use HTSlib), but I would guess not. If you never need to actually get the sequence of any of these, then it might be worthwhile coding around this. However, I suspect that you do need the sequence, in which case it's often ambiguous what the correct sequence is.
3. No, just use Ensembl's database instead and save yourself the headaches.

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Devon Ryan 104k

1

Since I am using ClinVar, I only have RefSeq accession numbers.

I can't use HTSlib because it is for C and I want to extend a bigger Java project.

Shouldn't a Variant have a reference to a unique transcript?

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by gresserT ▴ 50

2

BTW, since you presumably do need the sequence, the FAI files are actually pretty simple to parse and retrieve sequence from. I suspect that you'll have to write your own parser/sequence extractor that will allow iterating over all instances of a gene in the file and either report all unique sequences or all compatible sequences.
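
A minimal sketch of such a parser, assuming the standard five-column `.fai` layout written by `samtools faidx` (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH) and an uncompressed FASTA file. The class `DuplicateTolerantFai` and its methods are hypothetical names made up for this illustration and are not part of HTSJDK. Instead of rejecting a repeated contig name, it keeps every row under that name so the caller can iterate over all instances:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not an HTSJDK class): a .fai reader that keeps duplicate contig names. */
final class DuplicateTolerantFai {

    /** One .fai row: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH. */
    static final class Entry {
        final String name;
        final long length;
        final long offset;
        final int lineBases;
        final int lineWidth;

        Entry(String[] f) {
            name = f[0];
            length = Long.parseLong(f[1]);
            offset = Long.parseLong(f[2]);
            lineBases = Integer.parseInt(f[3]);
            lineWidth = Integer.parseInt(f[4]);
        }
    }

    /** Groups index rows by contig name instead of failing on repeated names. */
    static Map<String, List<Entry>> readIndex(Path fai) throws IOException {
        Map<String, List<Entry>> byName = new LinkedHashMap<>();
        try (BufferedReader in = Files.newBufferedReader(fai, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) continue;
                Entry e = new Entry(line.split("\t"));
                byName.computeIfAbsent(e.name, k -> new ArrayList<>()).add(e);
            }
        }
        return byName;
    }

    /** Reads the full sequence that one index row points to in the FASTA file. */
    static String fetch(Path fasta, Entry e) throws IOException {
        StringBuilder seq = new StringBuilder((int) e.length);
        try (RandomAccessFile raf = new RandomAccessFile(fasta.toFile(), "r")) {
            raf.seek(e.offset); // OFFSET is the byte position of the first base
            byte[] buf = new byte[e.lineBases];
            long remaining = e.length;
            while (remaining > 0) {
                int n = (int) Math.min(remaining, e.lineBases);
                raf.readFully(buf, 0, n);
                seq.append(new String(buf, 0, n, StandardCharsets.US_ASCII));
                remaining -= n;
                raf.skipBytes(e.lineWidth - e.lineBases); // skip the line terminator
            }
        }
        return seq.toString();
    }
}
```

With the file from the question, `readIndex(Paths.get("data/RefSeqSequencesGRCh37_NM.fa.fai")).get("hg19_refGene_NM_001037501")` would return one `Entry` per occurrence, and calling `fetch` on each lets you compare the sequences and decide which ones to keep.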

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Devon Ryan 104k

1

Ah, I had missed the ClinVar tag in your post. I have to admit that I'm not familiar enough with ClinVar to offer much guidance. Realistically, those variants were most likely originally mapped against the genome and then simply annotated with gene information. Whether those original genome mappings are available in ClinVar, I don't know.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Devon Ryan 104k

Ram · Answer 2 · 2015-03-05

2

9.7 years ago

Kim ▴ 100

This is because the UCSC RefGene track is merely an alignment track of NCBI's known RefSeq dataset, not the genome placement of those records as is provided by NCBI. It is not at all surprising that close paralogs will have more than one alignment placement.
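
To see which accessions have more than one alignment placement, the rows of UCSC's refGene table can be grouped by accession. The following is a rough sketch, assuming the tab-separated `refGene.txt` layout in which the columns start with bin, name, chrom, strand, txStart, txEnd. The class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: finds accessions that occur on more than one refGene row. */
final class RefGenePlacements {

    /** Maps each multiply placed accession to its placements, formatted as chrom:txStart-txEnd(strand). */
    static Map<String, List<String>> multiPlaced(List<String> refGeneRows) {
        Map<String, List<String>> byAccession = new LinkedHashMap<>();
        for (String row : refGeneRows) {
            String[] f = row.split("\t");
            // Assumed refGene.txt columns: 0 bin, 1 name, 2 chrom, 3 strand, 4 txStart, 5 txEnd, ...
            String placement = f[2] + ":" + f[4] + "-" + f[5] + "(" + f[3] + ")";
            byAccession.computeIfAbsent(f[1], k -> new ArrayList<>()).add(placement);
        }
        // Keep only accessions that were aligned to two or more locations.
        byAccession.values().removeIf(placements -> placements.size() < 2);
        return byAccession;
    }
}
```

Feeding it the lines of `refGene.txt` (for example via `Files.readAllLines`) lists every accession that would produce a duplicate FASTA header, together with its competing placements.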

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.7 years ago by Kim ▴ 100

Ram · Answer 3 · 2015-03-05

1

9.7 years ago

Kim ▴ 100

NCBI's placement of RefSeq transcripts is available as a GFF file: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.28_GRCh38.p2/GCF_000001405.28_GRCh38.p2_genomic.gff.gz

Genomic and transcript FASTA files are also available at this location. (Note an update is coming soon)

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.7 years ago by Kim ▴ 100

0

This looks very nice, but do you by any chance know of a similar file with unspliced genes rather than transcripts (I want introns, too)?

ADD REPLY • link 9.7 years ago by Biomonika (Noolean) 3.2k

1

I haven't found anything so far. Looks like getting the whole genome (hg19/GRCh37 or hg38/GRCh38) from UCSC is the way to go.

I used the per-chromosome GRCh37 reference (chromFa.tar.gz from UCSC bigZips) together with the refGene file to get the transcript and exon positions.

Edit: I found a similar question: Convert Nm_ Mrna Position Into Corresponding Grch37 Genomic Dna Position?
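
The refGene-based approach above could be sketched as follows for the splice-site use case. This assumes the tab-separated `refGene.txt` layout in which column 8 is exonCount and columns 9 and 10 are comma-separated exonStarts and exonEnds, given in UCSC's 0-based, half-open genomic coordinates. The class name `RefGeneIntrons` is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: derives intron intervals from one refGene.txt row. */
final class RefGeneIntrons {

    /** Returns introns as {start, end} pairs in 0-based, half-open genomic coordinates. */
    static List<long[]> introns(String refGeneRow) {
        String[] f = refGeneRow.split("\t");
        // Assumed columns: 8 exonCount, 9 exonStarts, 10 exonEnds (comma-separated, trailing comma).
        int exonCount = Integer.parseInt(f[8]);
        String[] starts = f[9].split(",");
        String[] ends = f[10].split(",");
        List<long[]> introns = new ArrayList<>();
        for (int i = 0; i + 1 < exonCount; i++) {
            // An intron runs from the end of exon i to the start of exon i + 1.
            introns.add(new long[] {Long.parseLong(ends[i]), Long.parseLong(starts[i + 1])});
        }
        return introns;
    }
}
```

The intervals are listed in genomic order regardless of strand, so for a transcript on the minus strand the donor and acceptor ends of each intron are swapped relative to a plus-strand transcript. The intron sequence itself can then be cut out of the matching chromosome FASTA.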

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by gresserT ▴ 50