This is a cross-post from here: https://gatkforums.broadinstitute.org/gatk/discussion/13395/sequences-at-index-0-dont-match-but-using-the-same-reference-genome#latest
I am trying to run CollectMultipleMetrics from Picard tools on a CRAM file but I get an "Sequences at index 0 don't match, but using the same reference genome". What can I do to resolve this?
Below what I have tried so far:
I am trying to run CollectMultipleMetrics on a CRAM file with the following code:
java -jar picard.jar CollectMultipleMetrics I=SRR1378155_SRR1378155.cram O=SRR1378155_SRR1378155.multiplemetrics R=/apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa
but I get the error
htsjdk.samtools.util.SequenceUtil$SequenceListsDifferException: Sequences at index 0 don't match: 0/248956422/chr1/M5=2648ae1bacce4ec4b6cf337dcae37816/UR=/apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa 0/248956422/chr1/M5=1f9b4bfb2d6193a45e52901b8aa4339e/UR=file:/apps/data/ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa.gz
I am not sure where the "UR=file:/apps/data/ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa.gz" comes from, as when I look in the CRAM header it shows "UR:/apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa" for the contigs.
I tried resorting using the reference from last post here: http://seqanswers.com/forums/archive/index.php/t-14635.html with
java -jar picard.jar ReorderSam INPUT=SRR1378155_SRR1378155.cram OUTPUT=SRR1378155_SRR1378155.sorted.cram REFERENCE=/apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa
And tried it suing the .gzipped reference file as well, but for both I got the same sequence don't match error.
When I use ValidateSAM on the CRAM file, I get "Exception in thread "main" htsjdk.samtools.cram.CRAMException: Contig chr1 not found in the reference file.", from answer here https://gatkforums.broadinstitute.org/gatk/discussion/7944/can-validatesamfile-check-on-cram-files I converted my CRAM to BAM file and then ran validateSAM, which gave me:
HISTOGRAM java.lang.String
Error Type Count
ERROR:MISSING_READ_GROUP 1
WARNING:RECORD_MISSING_READ_GROUP 124248788
I then added the read group to the BAM file and after that there are no errors found in the BAM file, but when I try to use CollectMultipleMetrics I again get the "Sequences at index 0 don't match" error.
First couple of lines with CRAM header, shows UR:/apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa:
@HD VN:1.5 SO:coordinate
@PG ID:STAR PN:STAR VN:STAR_2.5.1b CL:STAR --runThreadN 8 --genomeDir /apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/STAR/2.5.1b-foss-2015b/ --genomeLoad NoSharedMemory --readFilesIn /groups/umcg-biogen/tmp04/biogen/input/GTEx/sra/fastq/cart_prj8709_201809140901/SRR1378155_1.fastq.gz /groups/umcg-biogen/tmp04/biogen/input/GTEx/sra/fastq/cart_prj8709_201809140901/SRR1378155_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix /local/5210636/SRR1378155_SRR1378155. --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMismatchNmax 6 --quantMode GeneCounts --twopassMode Basic
@PG ID:scramble PN:scramble PP:STAR VN:1.14.6 CL:scramble -I bam -O cram -r /apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa -t 2 /local/5211312/SRR1378155_SRR1378155.fixmates.bam /groups//umcg-biogen/tmp04/biogen/input/GTEx/pipelines/results///cramFiles/SRR1378155_SRR1378155.cram
@CO user command line: STAR --outFileNamePrefix /local/5210636/SRR1378155_SRR1378155. --readFilesIn /groups/umcg-biogen/tmp04/biogen/input/GTEx/sra/fastq/cart_prj8709_201809140901/SRR1378155_1.fastq.gz /groups/umcg-biogen/tmp04/biogen/input/GTEx/sra/fastq/cart_prj8709_201809140901/SRR1378155_2.fastq.gz --readFilesCommand zcat --genomeDir /apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/STAR/2.5.1b-foss-2015b/ --genomeLoad NoSharedMemory --runThreadN 8 --outFilterMultimapNmax 1 --outFilterMismatchNmax 6 --twopassMode Basic --quantMode GeneCounts --outSAMunmapped Within
@SQ SN:chr1 LN:248956422 M5:2648ae1bacce4ec4b6cf337dcae37816 UR:/apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa
Hello,
apps
at the root level? What's the output ofls -l /apps
?/
in the path (Don't know if this is a problem).fin swimmer
Yes the path exists and the double / is not a problem. If the reference file can not be found it gives this error:
Exception in thread "main" htsjdk.samtools.SAMException: Error opening file: /fake/path/genome.fa
Hello again,
could you post the content of the index file for the reference genome (the one ending wit
.fai
)? If not already done index it bysamtools faidx /apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa
fin swimmer
head -n30 /apps/data//ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa.fai
This looks good.
The documentation for
CollectMultipleMetrics
doesn't mention that it supportscram
. But you wrote you also have converted it tobam
and added a ReadGroup to the header. Could you please provide the complete commands you have used and show the complete header of the finalbam
?fin swimmer
The problem was in the .dict file, which was apparently made using the gzipped reference genome fasta, while the CRAM was made using the unzipped reference genome fasta. I remade the .dict file with the unzipped genome fasta file and now it works.
Thanks for the help!