Question

picard metrics without reference file?

0

Entering edit mode

8.0 years ago

lien ▴ 90

Hi all,

For a comparative study, we received BAM-files from a lot of different laboratories. We would like to have a look at the Fold_80_base_penalty and coverage statistics. Therefore, I used the CollectTargetedPcrMetrics part of Picard-metrics. This works fine for the BAM-files that we generated ourselves, for which we also have the correct reference sequence. However, when we try to run CollectTargetedPcrMetrics for BAM-files that we received and we do not have the correct reference sequence, we get a lot of errors. As a reference sequence, we use the standard reference sequence available here (human.g1k_v37.fa). I know for sure that the other BAM-files were also generated using hg19, but I presume there are some subtle differences in the actual reference sequences used. Below the error I get:

    java -jar picard.jar CollectTargetedPcrMetrics AMPLICON_INTERVALS=tumor_fpa.hg19.tech.target.interval_list TARGET_INTERVALS=tumor_fpa.hg19.tech.target.interval_list OUTPUT=/Participant23_sampleA_TSTPool_TST_A23_S1_pcr_metrics.txt INPUT=Part25_sampleA.bam REFERENCE_SEQUENCE=human_g1k_v37.fa VALIDATION_STRINGENCY=LENIENT
[Mon Dec 12 CET 2016] picard.analysis.directed.CollectTargetedPcrMetrics 
[Mon Dec 12 CET 2016] Executing as lien@lien on Linux 4.4.0-53-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15; Picard version: 2.4.1(7c4d36e011df1aec4689b51efcada44e92d1817f_1464389670) JdkDeflater
[Mon Dec 12 CET 2016] picard.analysis.directed.CollectTargetedPcrMetrics done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=504889344
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.util.SequenceUtil$SequenceListsDifferException: **Sequences at index 0 don't match:** 0/16571/M/M5=6b9e15a5937653b0006de435f91578c0 0/249250621/1/UR=file:/humgen/gsa-hpprojects/1kg/reference/human_g1k_v37.fasta/M5=1b22b98cdeb4a9304cb5d48026a85128
    at htsjdk.samtools.util.SequenceUtil.assertSequenceListsEqual(SequenceUtil.java:209)
    at htsjdk.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:271)
    at htsjdk.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:257)
    at picard.analysis.directed.CollectTargetedMetrics.doWork(CollectTargetedMetrics.java:114)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

We need to change already the .dict and .fai file generated from our reference sequence, but still this will not solve the problem, as I think there are small differences in the actual reference sequence they used.

Are there any solutions to work around this problem? To force Picard in using the reference sequence? Thanks!

picard without reference sequence • 2.6k views

ADD COMMENT • link updated 8.0 years ago by harold.smith.tarheel ★ 5.0k • written 8.0 years ago by lien ▴ 90

WouterDeCoster · Answer 1 · 2016-12-12

1

Entering edit mode

8.0 years ago

harold.smith.tarheel ★ 5.0k

You can identify the reference that was used for alignment from the BAM file header, download it, build the .pict and .fai files, and use it as the reference for Picard.

ADD COMMENT • link 8.0 years ago by harold.smith.tarheel ★ 5.0k

0

Entering edit mode

Hi Harold, Thanks for your answer, but I'm not sure it is present in every header. Below the output of samtools view for one of our samples:

samtools view -H Part23_sampleA.bam

  @HD   VN:1.4  SO:coordinate
@PG ID:Isis PN:Isis VN:2.5.1.3
@SQ SN:M    LN:16571    M5:6b9e15a5937653b0006de435f91578c0
@SQ SN:1    LN:249250621    M5:94f0d1b1622299238a1d8a0711dd06c7
@SQ SN:2    LN:243199373    M5:854e985b2e19c122b9f67f6453965693
@SQ SN:3    LN:198022430    M5:8abc85e73b1c75518a03743de2c2b14b
@SQ SN:4    LN:191154276    M5:2d37b4e662928cc6c58b84b2a4cc8648
@SQ SN:5    LN:180915260    M5:250f4e82a213269b0a0e4aebb0468470
@SQ SN:6    LN:171115067    M5:409088215d77f0bc72364b390430a5a7
@SQ SN:7    LN:159138663    M5:ef15cde5c82fb860694bf8f611807459
@SQ SN:8    LN:146364022    M5:8fbef8c3eaaac674cc6e690d1641464b
@SQ SN:9    LN:141213431    M5:791586215804a0eac537f38b9dd4cde2
@SQ SN:10   LN:135534747    M5:b4e45e54aa2f74160a3220c8f8e30a9a
@SQ SN:11   LN:135006516    M5:87a76a8bbb98a65b792822b8f148f4f9
@SQ SN:12   LN:133851895    M5:98b7a0aeea30cc16aeb5ffdaf43c8238
@SQ SN:13   LN:115169878    M5:90a7175dfa4e9c6d23fb0ec7b1f1e880
@SQ SN:14   LN:107349540    M5:98de20ecb80c00338fe4f978b0403d2b
@SQ SN:15   LN:102531392    M5:93273e983c0e06a2ec1334454952141e
@SQ SN:16   LN:90354753 M5:d252447e0439735eac934a27e9db136f
@SQ SN:17   LN:81195210 M5:ec9adf628757584300c3fb1a236a1c33
@SQ SN:18   LN:78077248 M5:1021a29f6559ff7d70a5e363f5b1daea
@SQ SN:19   LN:59128983 M5:74894b6505db2249cebf5f4b76a2f066
@SQ SN:20   LN:63025520 M5:c3c66d396289fe70950f844868c387df
@SQ SN:21   LN:48129895 M5:3e594b3726ece31187c113dd4ac85aa8
@SQ SN:22   LN:51304566 M5:9ba0d41360be59a22a837b7690cc80ac
@SQ SN:X    LN:155270560    M5:e6c43bd374da923944dea2d6e0e6049b
@SQ SN:Y    LN:59373566 M5:17ceeafc3a372967496cd94f2572c341
@RG ID:NGSA_A   PL:ILLUMINA SM:NGSA_A

This doesn't allow to find the reference sequence that was used, or am I wrong?

ADD REPLY • link updated 8.0 years ago by WouterDeCoster 47k • written 8.0 years ago by lien ▴ 90

0

Entering edit mode

Hmmm, my headers always include a 'CL' field as part of the @PG line that shows the command line (including reference) used for alignment. Yours doesn't, but it appears to have been generated using Illumina's ISIS pipeline. You may try to contact them and see if there's a straightforward method to match the ISIS version to reference genome version.

ADD REPLY • link 8.0 years ago by harold.smith.tarheel ★ 5.0k