Error in converting bed file to interval using Picard
5.0 years ago
Assa Yeroslaviz ★ 1.9k

We have a WES data set that was generated with the Agilent Mouse exome capture library kit. I wanted to download the target file and got, similar to this post, a folder with several BED files (_AllTracks.bed, _Covered.bed, _Padded.bed, _Regions.bed) and a file named Targets.txt. I am not really sure what each of them is, but my problem goes beyond that.

When I try to run the command

gatk BedToIntervalList \
  -I input/S0276129_Covered.bed \
  -O input/S0276129_Covered.intervals \
  --SEQUENCE_DICTIONARY ../reference/mm10/mm10.dict

I get the following error:

picard.PicardException: Start on sequence 'chr1' was past the end: 195471971 < 196469947
        at picard.util.BedToIntervalList.doWork(BedToIntervalList.java:143)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
        at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
        at org.broadinstitute.hellbender.Main.main(Main.java:292)

Based on the message, the BED file contains coordinates on chr1 that lie beyond the chr1 length given in the dict file.

This is true. When I look at chromosome 1 in the BED file, I see:

grep "chr1\s" input/S0276129_Covered.bed |  tail
chr1    196986946       196987186       entg|Cr2,ens|ENSMUST00000082321,ref|NM_007...
chr1    196989335       196989485       entg|Cr2,ens|ENSMUST00000082321,ref|NM_007...

but the dict file shows

less ../reference/mm10/Sequence/WholeGenomeFasta/genome.dict 
@HD     VN:1.0  SO:unsorted
...
@SQ     SN:chr1 LN:195471971    UR:file:/illumina/scratch/iGenomes/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa   M5:c4ec915e7348d42648eefc1534b71c99
...

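When I search for the gene Cr2, its coordinates are Chromosome 1: 195,136,811-195,176,716, which fits within the chr1 length of 195,471,971 in the dict file, whereas the BED file places the Cr2 targets near 196,98x,xxx.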
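The manual grep/less comparison above can be automated. Below is a minimal sketch, assuming a tab-separated sequence dictionary with `@SQ` lines carrying `SN` and `LN` fields (as in the excerpt above) and a BED file whose first three columns are contig, start and end. The function names `read_dict_lengths` and `out_of_bounds` are hypothetical helpers for illustration, not part of Picard or GATK, and the `Cr2` name column in the demo is a shortened placeholder.

```python
from typing import Dict, Iterable, List, Tuple


def read_dict_lengths(dict_lines: Iterable[str]) -> Dict[str, int]:
    """Map contig name -> length using the @SQ lines of a sequence dictionary."""
    lengths: Dict[str, int] = {}
    for line in dict_lines:
        if not line.startswith("@SQ"):
            continue  # skip @HD and any other header records
        # Each field looks like TAG:value; split only on the first colon
        # so values such as UR:file:/path stay intact.
        fields = dict(
            f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:] if ":" in f
        )
        lengths[fields["SN"]] = int(fields["LN"])
    return lengths


def out_of_bounds(
    bed_lines: Iterable[str], lengths: Dict[str, int]
) -> List[Tuple[str, int, int, str]]:
    """Return BED records that do not fit the dictionary, with a reason."""
    problems: List[Tuple[str, int, int, str]] = []
    for line in bed_lines:
        # Skip blank lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        contig, start_s, end_s = line.split()[:3]
        start, end = int(start_s), int(end_s)
        if contig not in lengths:
            problems.append((contig, start, end, "contig not in dictionary"))
        elif end > lengths[contig]:
            # BED ends are exclusive, so end == length is still valid.
            problems.append((contig, start, end, "end past contig length"))
    return problems


# Demo with the values quoted in the post (no files needed).
demo_dict = ["@HD\tVN:1.0\tSO:unsorted", "@SQ\tSN:chr1\tLN:195471971"]
demo_bed = [
    "chr1\t196986946\t196987186\tCr2",
    "chr1\t196989335\t196989485\tCr2",
]
for problem in out_of_bounds(demo_bed, read_dict_lengths(demo_dict)):
    print(problem)
```

To run it on real files, pass open file handles instead of the demo lists, for example the `_Covered.bed` file and the `.dict` file used in the command above. If many records on several chromosomes are reported, that points to a coordinate-system mismatch rather than a few bad lines.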

Is there something wrong with the bed file from Agilent?

Any ideas what is happening?

Thanks

WES exome BedToIntervalList picard agilent • 2.2k views

I have this same issue - did you ever end up finding a fix?


wrong reference genome.


As in an HG38 vs HG39 kind of difference, or as in using the Ensembl version vs the UCSC-provided genome file? I know they annotate differently. I am not quite sure where Agilent's BED files come from.

But after some checking, I believe it's because my dict file is based on the Ensembl genome file, while these BED files may come from somewhere else.
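For the Ensembl vs UCSC case, a quick check is whether the contig names in the BED file and the dict file differ only by a "chr" prefix. Here is a minimal sketch, assuming you have already collected the contig names from both files into lists; `contig_name_report` is a hypothetical helper name, and it does not handle special cases such as `chrM` vs `MT`. Note that a naming mismatch usually explains "contig not found" style failures, while a "past the end" error on a contig that exists in both files more likely points to a different genome build.

```python
from typing import Dict, Iterable, List


def toggle_chr_prefix(name: str) -> str:
    """Return the name with a leading 'chr' removed, or added if absent."""
    return name[3:] if name.startswith("chr") else "chr" + name


def contig_name_report(
    bed_contigs: Iterable[str], dict_contigs: Iterable[str]
) -> Dict[str, List[str]]:
    """Classify BED contig names against the names in a sequence dictionary."""
    bed, ref = set(bed_contigs), set(dict_contigs)
    shared = bed & ref
    missing = bed - ref
    # Names that would match if only the 'chr' prefix were added or removed.
    prefix_mismatch = {c for c in missing if toggle_chr_prefix(c) in ref}
    return {
        "shared": sorted(shared),
        "prefix_mismatch": sorted(prefix_mismatch),
        "missing": sorted(missing - prefix_mismatch),
    }


# Demo: UCSC-style names in the BED file, Ensembl-style names in the dict.
report = contig_name_report(["chr1", "chr2", "chrX"], ["1", "2", "X"])
print(report)
```

If everything lands in `prefix_mismatch`, renaming contigs on one side may be enough. If the names are already shared and intervals still fall past the contig end, check which assembly the BED file was designed against.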


There is no "HG39". There's hg19, hg38 and T2T-CHM13.


Please do not add answers unless you're answering the top level question. Instead, use Add Comment or Add Reply as appropriate. A moderator has moved your post to the right location this time, please be more careful in the future.
