I want to run some algorithms on splice site mutations. This is what I have done (or at least tried) so far:
- Download all sequences of RefSeq from category NM_* via UCSC Table Browser in Fasta format.
- Create index files with Samtools and bwa.
- Read the files for my Maven Java program with the HTSJDK Library
As I run my program I get an Exception from HTSJDK because there are multiple RefSeq entries with the same name:
Exception in thread "main" htsjdk.samtools.SAMException: Contig 'hg19_refGene_NM_001037501' already exists in fasta index.
in the following line of my code:
FastaSequenceIndex faIndex = new FastaSequenceIndex(new File("data/RefSeqSequencesGRCh37_NM.fa.fai"));
These RefSeq entries are on both strands (+ and -) and have different positions. The sequences show some differences in the sequences, too. But usually not far from each other on the same chromosome.
The Questions
- Why are there multiple RefSeq sequences with the same name?
- Is there a way HTSJDK can handle fasta sequences with the same name?
- Am I doing something completely wrong or inconvenient?
Since I am using ClinVar, I do only have RefSeq accession numbers.
I can't use HTSlib because it is for C and I want to extend a bigger Java project.
Shouldn't a Variant have a reference to a unique transcript?
BTW, since you presumably do need the sequence, the FAI files are actually pretty simple to parse and retrieve sequence from. I suspect that you'll have to write your own parser/sequence extractor that will allow iterating over all instances of a gene in the file and either report all unique sequences or all compatible sequences.
Ah, I had missed the Clinvar tag in your post. I have to admit that I'm not familiar enough with Clinvar to offer much guidance. Realistically, those variants were most likely originally mapped against the genome and then simply annotated with gene information. Whether those original genome mappings are on Clinvar or not I don't know, however.