Entering edit mode
7.4 years ago
jjrin
▴
40
Hello I have been trying to run a program called Slide that identifies novel isoforms de novo. I gave it an input of a gtf file (required but not what I'm trying to look through) and a bam file with an index and a directory of all of the chromosome individual fasta files. However I keep getting this error: ValueError: invalid reference 6L
, even though I am completely sure that none of the chromosomes are called 6L, they are all properly named "chr6L" in both the BAM and gtf and fasta files. So I am unsure of what the problem is here. Thanks for your help!
Here is the entire error message:
Traceback (most recent call last):
File "slide.py", line 1901, in <module>
gene_linear_models = get_linear_models( genes.values(), samfile, read_type, frag_len, total_num_reads, choose_frag_len_dist, if_use_GC_correction, gene_name_specified, mode=mode )
File "slide.py", line 1262, in get_linear_models
bin_built = build_bins( gene, samfile, read_type, new_exons )
File "slide.py", line 605, in build_bins
for rl, read in process_reads( gene, samfile, read_type ):
File "slide.py", line 369, in process_reads
for read in samfile.fetch( gene.boundaries.chr, gene.boundaries.start, gene.boundaries.stop ):
File "pysam/calignmentfile.pyx", line 897, in pysam.calignmentfile.AlignmentFile.fetch (pysam/calignmentfile.c:10997)
File "pysam/calignmentfile.pyx", line 819, in pysam.calignmentfile.AlignmentFile.parse_region (pysam/calignmentfile.c:10455)
ValueError: invalid reference `6L`
Where did you get this program from?
It's possible that it is splitting the identifiers that have 'chr' in the name, and then it can't find the reference.
https://www.ncbi.nlm.nih.gov/pubmed/22135461
This is the paper the program is based from, I found it from the "A survey of best practices for RNA-seq data analysis" paper in isoform discovery. I found some code where it possibly splices out "chr" but I'm unsure of what to adjust.
class GeneBoundaries( dict ): Exon = namedtuple('Exon', ['chr', 'strand', 'start', 'stop'])
data is a list of items, split by spaces from each line in the file. data[0] is the chromosome id.
This part says if the chromosome id starts with 'chr', and 'ifchr' is False, take the 4th (python starts at 0) character in the chromosome id to the end of the chrom id. [3:]. if it does not start with chrom and ifchr is True, add 'chr' to the id.
The problem is I don't know what the variable ifchr is, but in your case it is evaluating to false, and it is trimming 'chr' off your ids.
Here is the code that creates the variable "ifchr"
### handle the required arguments ###
I think what is causing this problem is that some of the chromosome names in my reference fasta file is "Scaffold100" and there are over 400 thousand variants of this, this is likely what's causing "ifchr" to evaluate to false. But at the end of the fasta files there are indeed chromosomes named chr6L, etc. I believe that the bam files uses chr6L rather than 6L so there's no need to trim them. Would it be best just to remove these lines and keep them as is or is there another problem with them?
Since you don't know everything that's going on in the code, leave that alone, you don't know if variables/functions are called elsewhere or not. You should double check, and change if necessary all the ids to be identical across files, and not to violate the if/else statements.
I'm assuming this program should work out of the box for the type of files you're giving it. So, also double check how your files were created, and if there are and paramters or flags that you can pass to the program to alleviate issues.