We have a WES data set which was done using the Agilent Mouse exome capture library kit. I wanted to download the target file and got, similar to this post, a folder with several bed files (_AllTracks.bed, _Covered.bed, _Padded.bed, _Regions.bed and a file named Targets.txt). I am not really sure what they are, but my problem is more than that.
When I try to run the command
gatk BedToIntervalList \
-I input/S0276129_Covered.bed \
-O input/S0276129_Covered.intervals \
--SEQUENCE_DICTIONARY ../reference/mm10/mm10.dict
I get the following error:
picard.PicardException: Start on sequence 'chr1' was past the end: 195471971 < 196469947
at picard.util.BedToIntervalList.doWork(BedToIntervalList.java:143)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
at org.broadinstitute.hellbender.Main.main(Main.java:292)
Based on the message, this tells me that the bed file contains coordinates that lie beyond the length given for chr1 in the dict file.
This is true: when I look at chromosome 1 in the bed file I see:
grep "chr1\s" input/S0276129_Covered.bed | tail
chr1 196986946 196987186 entg|Cr2,ens|ENSMUST00000082321,ref|NM_007...
chr1 196989335 196989485 entg|Cr2,ens|ENSMUST00000082321,ref|NM_007...
but the dict file shows
less ../reference/mm10/Sequence/WholeGenomeFasta/genome.dict
@HD VN:1.0 SO:unsorted
...
@SQ SN:chr1 LN:195471971 UR:file:/illumina/scratch/iGenomes/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa M5:c4ec915e7348d42648eefc1534b71c99
...
When I search for the gene Cr2, its coordinates are Chromosome 1: 195,136,811-195,176,716, which does not match the positions listed for it in the bed file.
Is there something wrong with the bed file from Agilent?
Any ideas what is happening?
Thanks
I have this same issue - did you ever end up finding a fix?
Wrong reference genome. The coordinates in the BED file come from a different assembly than the mm10 dict you are passing.
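One way to confirm a mismatch like this is to compare every BED interval against the contig lengths in the .dict file before running BedToIntervalList. Below is a minimal sketch in Python, not part of GATK or Picard. The function names read_dict_lengths and find_out_of_range are made up for illustration, and it assumes a tab-delimited .dict with SN and LN tags on each @SQ line and a BED file whose first three columns are contig, start and end.

```python
def read_dict_lengths(dict_path):
    """Return {contig name: length} from the @SQ lines of a sequence dictionary."""
    lengths = {}
    with open(dict_path) as handle:
        for line in handle:
            if not line.startswith("@SQ"):
                continue
            tags = {}
            # Header fields look like "SN:chr1" or "UR:file:/path"; split on the first colon only.
            for field in line.rstrip("\n").split("\t")[1:]:
                key, _, value = field.partition(":")
                tags[key] = value
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths


def find_out_of_range(bed_path, lengths):
    """Return (line_no, contig, start, end, reason) for BED records that do not fit the dict."""
    problems = []
    with open(bed_path) as handle:
        for line_no, line in enumerate(handle, start=1):
            # Skip blank lines and the optional header lines some vendor BED files carry.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split(None, 3)[:3]
            start, end = int(start), int(end)
            if chrom not in lengths:
                problems.append((line_no, chrom, start, end, "contig not in dict"))
            elif end > lengths[chrom]:
                # BED ends are exclusive, so end == contig length is still valid.
                reason = "past contig end (LN=%d)" % lengths[chrom]
                problems.append((line_no, chrom, start, end, reason))
    return problems


# Example use with the paths from the question:
# lengths = read_dict_lengths("../reference/mm10/mm10.dict")
# for problem in find_out_of_range("input/S0276129_Covered.bed", lengths):
#     print(*problem, sep="\t")
```

If many records on several chromosomes are flagged, the BED file was most likely built against another assembly. Clipping or dropping the offending intervals would not help in that case, because the remaining coordinates would still point at the wrong positions.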
As in a HG38 vs HG39 kind of difference - or as in using the Ensembl version vs the UCSC genome provided file, as I know they annotate differently? I am not quite sure of the source of the BED files from Agilent.
But after doing some checking, I do believe it's because my dict file is based on the Ensembl genome file while these BED files may be coming from somewhere else.
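For the Ensembl vs UCSC case, a quick sanity check is whether the contig names in the BED file exist in the dict at all, since UCSC-style files use names like chr1 while Ensembl-style files use 1. The following is a rough sketch under the same file-format assumptions as any standard .dict and BED file. The helper names dict_contigs, bed_contigs and naming_report are hypothetical and not part of any library.

```python
def dict_contigs(dict_path):
    """Collect the SN values from the @SQ lines of a sequence dictionary."""
    names = set()
    with open(dict_path) as handle:
        for line in handle:
            if line.startswith("@SQ"):
                for field in line.rstrip("\n").split("\t")[1:]:
                    if field.startswith("SN:"):
                        names.add(field[3:])
    return names


def bed_contigs(bed_path):
    """Collect the first-column contig names from a BED file."""
    names = set()
    with open(bed_path) as handle:
        for line in handle:
            # Skip blank lines and optional header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            names.add(line.split(None, 1)[0])
    return names


def naming_report(bed_names, dict_names):
    """Classify BED contigs that are missing from the dict by how they could be reconciled."""
    missing = bed_names - dict_names
    add_chr = {n for n in missing if "chr" + n in dict_names}
    strip_chr = {n for n in missing if n.startswith("chr") and n[3:] in dict_names}
    return {
        "add_chr": add_chr,        # BED says "1", dict says "chr1"
        "strip_chr": strip_chr,    # BED says "chr1", dict says "1"
        "unresolved": missing - add_chr - strip_chr,
    }
```

Matching names alone do not prove that the assemblies agree. In the original question both files use chr1 and the lengths still differ, so a length check is needed as well.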
There is no "HG39". There's hg19, hg38 and T2T-CHM13.
Please do not add answers unless you're answering the top level question. Instead, use Add Comment or Add Reply as appropriate. A moderator has moved your post to the right location this time, please be more careful in the future.