Question

Error in converting bed file to interval using Picard

5.3 years ago

Assa Yeroslaviz ★ 1.9k

We have a WES data set that was generated with the Agilent Mouse exome capture library kit. I wanted to download the target file and, similar to this post, got a folder with several BED files (_AllTracks.bed, _Covered.bed, _Padded.bed, _Regions.bed) and a file named Targets.txt. I am not really sure what each of them is, but my actual problem is a different one.

When I try to run the command

gatk BedToIntervalList \
  -I input/S0276129_Covered.bed \
  -O input/S0276129_Covered.intervals \
  --SEQUENCE_DICTIONARY ../reference/mm10/mm10.dict

I get the following error:

picard.PicardException: Start on sequence 'chr1' was past the end: 195471971 < 196469947
        at picard.util.BedToIntervalList.doWork(BedToIntervalList.java:143)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
        at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
        at org.broadinstitute.hellbender.Main.main(Main.java:292)

Based on the message, the BED file contains coordinates that lie beyond the end of chr1 as given in the dict file.

That is indeed the case. When I look at chromosome 1 in the BED file, I see:

grep "chr1\s" input/S0276129_Covered.bed |  tail
chr1    196986946       196987186       entg|Cr2,ens|ENSMUST00000082321,ref|NM_007...
chr1    196989335       196989485       entg|Cr2,ens|ENSMUST00000082321,ref|NM_007...

but the dict file shows

less ../reference/mm10/Sequence/WholeGenomeFasta/genome.dict 
@HD     VN:1.0  SO:unsorted
...
@SQ     SN:chr1 LN:195471971    UR:file:/illumina/scratch/iGenomes/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa   M5:c4ec915e7348d42648eefc1534b71c99
...
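The same comparison can be run over the whole BED file instead of one chromosome at a time. Below is a minimal Python sketch of that check, not part of Picard or GATK. The function names `read_dict_lengths` and `find_out_of_range` are made up for illustration, and the sketch assumes a tab-separated dict file with `SN:` and `LN:` tags on its `@SQ` lines, as in the excerpt above.

```python
def read_dict_lengths(dict_lines):
    """Return {contig: length} from the @SQ lines of a sequence dictionary."""
    lengths = {}
    for line in dict_lines:
        if not line.startswith("@SQ"):
            continue
        # Fields look like "SN:chr1" or "LN:195471971"; split on the first colon only.
        fields = line.rstrip("\n").split("\t")[1:]
        tags = dict(f.split(":", 1) for f in fields if ":" in f)
        lengths[tags["SN"]] = int(tags["LN"])
    return lengths


def find_out_of_range(bed_lines, lengths):
    """Return (contig, start, end, reason) for BED records that do not fit the dictionary."""
    problems = []
    for line in bed_lines:
        # Skip blanks, comments, and "browser"/"track" header lines that some BED files carry.
        if not line.strip() or line.startswith(("#", "browser", "track")):
            continue
        contig, start, end = line.split(None, 3)[:3]
        start, end = int(start), int(end)
        if contig not in lengths:
            problems.append((contig, start, end, "unknown contig"))
        elif end > lengths[contig]:
            # BED end coordinates are exclusive, so end == length is still valid.
            problems.append((contig, start, end, "past end"))
    return problems
```

With the files from this post it could be called as `find_out_of_range(open("input/S0276129_Covered.bed"), read_dict_lengths(open("../reference/mm10/mm10.dict")))`. A long list of "past end" records near the end of a chromosome would suggest that the BED file and the dict file describe different assemblies.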

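When I look up the gene Cr2, its coordinates are chromosome 1: 195,136,811-195,176,716. That range fits within the chr1 length in the dict file, but it does not match the Cr2 entries in the BED file.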

Is there something wrong with the bed file from Agilent?

Any ideas what is happening?

Thanks

WES exome BedToIntervalList picard agilent • 2.4k views

ADD COMMENT • link updated 9 months ago by Ram 45k • written 5.3 years ago by Assa Yeroslaviz ★ 1.9k

I have this same issue - did you ever end up finding a fix?

ADD REPLY • link 9 months ago by Rashid • 0

Wrong reference genome: the BED file and the dict file come from different genome assemblies.

ADD REPLY • link 9 months ago by Pierre Lindenbaum 165k

As in an HG38 vs HG39 kind of difference, or as in using the Ensembl version vs the UCSC-provided genome file? I know they annotate differently. I am not quite sure about the source of the BED files from Agilent.

But after doing some checking, I do believe it's because my dict file is based on the Ensembl genome file, while these BED files may be coming from somewhere else.
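A naming difference (for example `1` versus `chr1`) is a separate problem from the coordinate difference in the original question, and it can be checked on its own. The following is a rough Python sketch of such a check. The helper names are hypothetical, and the `chr` prefix comparison is only a heuristic: it will not pair up names like `chrM` and `MT`.

```python
def bed_contigs(bed_lines):
    """Collect the contig names used in the first column of a BED file."""
    names = set()
    for line in bed_lines:
        # Skip blanks, comments, and "browser"/"track" header lines.
        if not line.strip() or line.startswith(("#", "browser", "track")):
            continue
        names.add(line.split(None, 1)[0])
    return names


def dict_contigs(dict_lines):
    """Collect the SN: values from the @SQ lines of a sequence dictionary."""
    names = set()
    for line in dict_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def strip_chr(name):
    """Drop a leading 'chr' so UCSC-style and Ensembl-style names can be compared."""
    return name[3:] if name.startswith("chr") else name


def compare_contigs(bed_names, dict_names):
    """Return (BED contigs missing from the dict, those that differ only by 'chr')."""
    missing = sorted(bed_names - dict_names)
    dict_bare = {strip_chr(n) for n in dict_names}
    prefix_only = [n for n in missing if strip_chr(n) in dict_bare]
    return missing, prefix_only
```

If most of the missing names show up in the second list, the two files likely use different naming schemes. If the names all match but coordinates run past the chromosome ends, an assembly mismatch is the more likely cause.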

ADD REPLY • link 9 months ago by Rashid • 0

There is no "HG39". There's hg19, hg38 and T2T-CHM13.

ADD REPLY • link 9 months ago by Ram 45k

Please do not add answers unless you're answering the top level question. Instead, use Add Comment or Add Reply as appropriate. A moderator has moved your post to the right location this time, please be more careful in the future.

ADD REPLY • link 9 months ago by Ram 45k