Question

Picard ExtractIlluminaBarcodes error

7.4 years ago

mark.rose ▴ 50

Hello

I tried posting this question to the Broad's GATK help forum, as suggested in the picard documentation, but haven't yet gotten a response so I'm posting it here to the good people of Biostars. I'm using ExtractIlluminaBarcodes (picard version 2.18.7) for the first time and am encountering an error with the command:

java -jar picard.jar ExtractIlluminaBarcodes \
BASECALLS_DIR=/project/JIY3012/work/data/BaseCalls/ \
LANE=1 \
READ_STRUCTURE=250T8B250T \
BARCODE_FILE=/project/JIY3012/work/data/barcode_file \
METRICS_FILE=250T8B250T_metrics_output.txt \
NUM_PROCESSORS=36 \
MAX_MISMATCHES=0

This yields:

11:03:10.765 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/rosema1/BioInfo/bin/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Jun 14 11:03:10 EDT 2018] ExtractIlluminaBarcodes BASECALLS_DIR=/project/JIY3012/work/data/BaseCalls LANE=1 READ_STRUCTURE=250T8B250T BARCODE_FILE=/project/JIY3012/work/data/barcode_file METRICS_FILE=250T8B250T_metrics_output.txt MAX_MISMATCHES=0 NUM_PROCESSORS=36 MIN_MISMATCH_DELTA=1 MAX_NO_CALLS=2 MINIMUM_BASE_QUALITY=0 MINIMUM_QUALITY=2 COMPRESS_OUTPUTS=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Jun 14 11:03:10 EDT 2018] Executing as rosema1@usrebcs11.nafta.syngenta.org on Linux 2.6.32-696.18.7.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.7-SNAPSHOT
INFO 2018-06-14 11:03:10 ExtractIlluminaBarcodes Processing with 36 PerTileBarcodeExtractor(s).
[Thu Jun 14 11:03:10 EDT 2018] picard.illumina.ExtractIlluminaBarcodes done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2058354688
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

Exception in thread "main" picard.PicardException: Expected CycledIlluminaFileMap to contain 8 cycles but only 0 were found!

at picard.illumina.parser.CycleIlluminaFileMap.assertValid(CycleIlluminaFileMap.java:66)
at picard.illumina.parser.IlluminaDataProviderFactory.makeParser(IlluminaDataProviderFactory.java:407)
at picard.illumina.parser.IlluminaDataProviderFactory.makeDataProvider(IlluminaDataProviderFactory.java:292)
at picard.illumina.ExtractIlluminaBarcodes$PerTileBarcodeExtractor.<init>(ExtractIlluminaBarcodes.java:750)
at picard.illumina.ExtractIlluminaBarcodes.doWork(ExtractIlluminaBarcodes.java:317)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

Perhaps this has something to do with my READ_STRUCTURE string (250T8B250T). These libraries were sequenced with dual unique barcodes with UMIs. I am interested in processing them using single indices (hence my attempted use of 250T8B250T), dual unique indices (250T8B8B250T), and dual unique indices with UMIs (250T8B9M8B250T). I am not sure whether these READ_STRUCTURE values are correct, or whether they are the cause of the error. Note that I tried the other READ_STRUCTURE values I mentioned and got similar errors.
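For readers unfamiliar with the notation, a READ_STRUCTURE string is a sequence of count-plus-operator pairs. The sketch below is a hypothetical Python helper, not part of Picard, that splits such a string into its segments. It assumes the operators T (template), B (sample barcode), M (molecular barcode/UMI) and S (skip); the function name `parse_read_structure` is made up for this illustration.

```python
import re

# Operator meanings assumed for this sketch.
OPERATORS = {
    "T": "template",
    "B": "sample barcode",
    "M": "molecular barcode (UMI)",
    "S": "skip",
}

def parse_read_structure(read_structure):
    """Split e.g. '250T8B9M8B250T' into [(250, 'T'), (8, 'B'), ...]."""
    segments = re.findall(r"(\d+)([TBMS])", read_structure)
    # Reject strings containing characters the pattern did not consume.
    if not segments or "".join(n + op for n, op in segments) != read_structure:
        raise ValueError(f"not a valid read structure: {read_structure!r}")
    return [(int(n), op) for n, op in segments]

for length, op in parse_read_structure("250T8B9M8B250T"):
    print(f"{length:>4} cycles  {OPERATORS[op]}")
```

Summing the segment lengths gives the number of sequencing cycles the string claims the run contains, which is a useful number to compare against what the sequencer actually recorded.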

Additionally, my barcode file looks like this:

barcode_sequence_1 barcode_sequence_2 barcode_name library_name
CTGATCGTNNNNNNNNN ATATGCGC Dual Index UMI Adapter 1 GAR2161A459
ACTCTCGANNNNNNNNN TGGTACAG Dual Index UMI Adapter 2 GAR2161A460
TGAGCTAGNNNNNNNNN AACCGTTC Dual Index UMI Adapter 3 GAR2161A461
GAGACGATNNNNNNNNN TAACCGGT Dual Index UMI Adapter 4 GAR2161A462
CTTGTCGANNNNNNNNN GAACATCG Dual Index UMI Adapter 5 GAR2161A463
TTCCAAGGNNNNNNNNN CCTTGTAG Dual Index UMI Adapter 6 GAR2161A464
CGCATGATNNNNNNNNN TCAGGCTT Dual Index UMI Adapter 7 GAR2161A465
ACGGAACANNNNNNNNN GTTCTCGT Dual Index UMI Adapter 8 GAR2161A466
CGGCTAATNNNNNNNNN AGAACGAG Dual Index UMI Adapter 9 9
ATCGATCGNNNNNNNNN TGCTTCCA Dual Index UMI Adapter 10 10
GCAAGATCNNNNNNNNN CTTCGACT Dual Index UMI Adapter 11 11
(etc.)

I included all 384 barcodes as I am interested in observing any cross-talk that occurs.

Thank you for your help

Mark

picard extractilluminabarcodes demultiplex • 5.3k views

ADD COMMENT • link 7.4 years ago by mark.rose ▴ 50

@Mark: You may want to look at a tool written specifically for handling UMIs (UMI-tools). deML may also be a possible option.

ADD REPLY • link 7.4 years ago by GenoMax 154k

I was originally looking at UMI-tools but then switched to the picard/fgbio approach, as it is what is recommended by IDT, the supplier of the unique, dual index, UMI adapters that are being used in this study. If I can't get this approach to work, I will further explore your suggestions. Thanks

ADD REPLY • link 7.4 years ago by mark.rose ▴ 50

Expected CycledIlluminaFileMap to contain 8 cycles but only 0 were found!

That error seems to indicate that your file format for the barcode file is incorrect. Is that a tab delimited file?

Tab-delimited file of barcode sequences, barcode name and, optionally, library name. Barcodes must be unique and all the same length. Column headers must be 'barcode_sequence' (or 'barcode_sequence_1'), 'barcode_sequence_2' (optional), 'barcode_name', and 'library_name'.
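The requirements quoted above can be checked before running Picard. The following is a minimal sketch, not Picard's own validation: the function name `check_barcode_file` is hypothetical, the two sample rows are copied from the question, and reading "all the same length" as a per-column requirement is an assumption.

```python
import csv
import io

# Two rows copied from the question; columns are tab-separated.
SAMPLE = (
    "barcode_sequence_1\tbarcode_sequence_2\tbarcode_name\tlibrary_name\n"
    "CTGATCGTNNNNNNNNN\tATATGCGC\tDual Index UMI Adapter 1\tGAR2161A459\n"
    "ACTCTCGANNNNNNNNN\tTGGTACAG\tDual Index UMI Adapter 2\tGAR2161A460\n"
)

SEQUENCE_HEADERS = ("barcode_sequence", "barcode_sequence_1", "barcode_sequence_2")

def check_barcode_file(text):
    """Return a list of problems found; an empty list means none were detected."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    seq_cols = [c for c in header if c in SEQUENCE_HEADERS]
    problems = []
    if not seq_cols:
        problems.append("no barcode_sequence column (is the file tab-delimited?)")
    if "barcode_name" not in header:
        problems.append("missing barcode_name column")
    rows = list(reader)
    # Within each barcode column, every sequence should have the same length.
    for col in seq_cols:
        lengths = {len(row[col] or "") for row in rows}
        if len(lengths) > 1:
            problems.append(f"{col}: mixed barcode lengths {sorted(lengths)}")
    # The combination of barcode columns should be unique per row.
    combined = [tuple(row[c] for c in seq_cols) for row in rows]
    if seq_cols and len(set(combined)) != len(combined):
        problems.append("duplicate barcode combinations")
    return problems

print(check_barcode_file(SAMPLE))
```

A file that passes these checks can still fail in Picard for other reasons, so this only helps rule the barcode file in or out as a suspect.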

ADD REPLY • link 7.4 years ago by GenoMax 154k

Yes, it is tab-delimited. And I have tried a few different versions of it as well with no success, including only using 3 columns when trying to demultiplex based on a single index and reordering the columns to "barcode_name library_name barcode_sequence_1 barcode_sequence_2" as suggested in some online posts.

ADD REPLY • link 7.4 years ago by mark.rose ▴ 50

mark.rose · Accepted Answer · 2018-06-19

7.4 years ago

mark.rose ▴ 50

OK, I have discovered the cause of this problem. I did not have the right information about the length of the reads in this run and so my READ_STRUCTURE was wrong. If anyone encounters this error, refer to the file RunInfo.xml in the base directory of the sequencing run to make sure you have the correct information. Mine for instance is:

[rosema1@demeter ~]$ cat /data/SBI_Illumina_GA2_runs/180323_M00831_0294_000000000-BKBM5/RunInfo.xml

<RunInfo xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Version="2">
  <Run Id="180323_M00831_0294_000000000-BKBM5" Number="294">
    <Flowcell>000000000-BKBM5</Flowcell>
    <Instrument>M00831</Instrument>
    <Date>180323</Date>
    <Reads>
      <Read NumCycles="100" Number="1" IsIndexedRead="N"/>
      <Read NumCycles="17" Number="2" IsIndexedRead="Y"/>
      <Read NumCycles="8" Number="3" IsIndexedRead="Y"/>
      <Read NumCycles="50" Number="4" IsIndexedRead="N"/>
    </Reads>
    <FlowcellLayout LaneCount="1" SurfaceCount="2" SwathCount="1" TileCount="19"/>
  </Run>
</RunInfo>

From this you can derive the following READ_STRUCTURE values (note that read number 2 above is, in my case, the 8 bp i7 index plus the 9 bp UMI):

single index: 100T8B9S8S50T (or 100T8B17S50T)
dual index: 100T8B9S8B50T
dual index with UMI: 100T8B9M8B50T
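The same check can be scripted. The sketch below is a hypothetical helper, not part of Picard: it reads the per-read cycle counts from RunInfo.xml and compares the total with the cycles implied by a candidate READ_STRUCTURE. The XML is a trimmed copy of the file above, and the function names are made up for this example. Splitting the 17-cycle index read into 8B plus 9M (or 9S) is kit-specific knowledge that the file itself does not provide.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the RunInfo.xml shown above (namespace attributes omitted).
RUN_INFO = """<RunInfo Version="2">
  <Run Id="180323_M00831_0294_000000000-BKBM5" Number="294">
    <Reads>
      <Read NumCycles="100" Number="1" IsIndexedRead="N"/>
      <Read NumCycles="17" Number="2" IsIndexedRead="Y"/>
      <Read NumCycles="8" Number="3" IsIndexedRead="Y"/>
      <Read NumCycles="50" Number="4" IsIndexedRead="N"/>
    </Reads>
  </Run>
</RunInfo>"""

def read_cycles(xml_text):
    """Return [(num_cycles, is_index_read), ...] ordered by read number."""
    root = ET.fromstring(xml_text)
    reads = sorted(root.iter("Read"), key=lambda r: int(r.get("Number")))
    return [(int(r.get("NumCycles")), r.get("IsIndexedRead") == "Y") for r in reads]

def total_cycles(read_structure):
    """Sum the cycle counts in a READ_STRUCTURE string such as 100T8B9M8B50T."""
    return sum(int(n) for n in re.findall(r"(\d+)[TBMS]", read_structure))

run_total = sum(n for n, _ in read_cycles(RUN_INFO))
for candidate in ("250T8B250T", "100T8B9M8B50T"):
    verdict = "matches" if total_cycles(candidate) == run_total else "does NOT match"
    print(f"{candidate}: {total_cycles(candidate)} cycles, {verdict} the run ({run_total})")
```

A matching total is necessary but not sufficient: the segment boundaries also have to line up with the individual reads listed in the file.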

Mark

ADD COMMENT • link updated 7.4 years ago by GenoMax 154k • written 7.4 years ago by mark.rose ▴ 50

That is a pretty non-standard kit specific read architecture.

ADD REPLY • link 7.4 years ago by GenoMax 154k

Perhaps, but it appears to be correct in that: 1) it is written here by the sequencer during the run; and 2) I no longer get an error when running ExtractIlluminaBarcodes and the results produced are sane.

That said I am getting an error when trying to execute the next step in the prescribed process, IlluminaBasecallsToSam.

Exception in thread "main" picard.PicardException: Could not find a format with available files for the following data types: Position

(sigh, always a new problem)

ADD REPLY • link 7.4 years ago by mark.rose ▴ 50

A technician did set that run up with those parameters before the sequencer wrote the file :-)

If you are able to use a different tool then I recommend trying reformat.sh from BBMap suite. Something like:

reformat.sh in1=R1.fq.gz in2=R2.fq.gz out=file.sam   # or out=file.bam

ADD REPLY • link 7.4 years ago by GenoMax 154k

I was able to check with them and they confirmed this was the intended structure; they use it for QC runs. I was under the impression that this was the actual production run, which is why the read structure I originally specified was wrong.

I'm presuming your reformat.sh comment was meant to address my second problem. At this point I'm still in the pre-fastq stage of the process and would seemingly have to generate the fastqs first before I could use it.

ADD REPLY • link 7.4 years ago by mark.rose ▴ 50

Correct. I thought you had gone past the fastq creation step and were having trouble creating the sam file.

ADD REPLY • link 7.4 years ago by GenoMax 154k

IlluminaBasecallsToSam skips the formation of fastqs and goes directly to unaligned BAM files. These BAM files contain a tag (RX) for each read that specifies the associated UMI. From there (presuming I can find a way past this step) it's on to Picard's SamToFastq to generate fastqs. These fastqs contain no information on the UMIs. The fastqs are mapped, and then the unaligned BAMs are merged (Picard MergeBamAlignment) with the aligned BAMs to incorporate the UMI info into the aligned BAM. At this point you are ready to call consensus reads from the UMI-associated BAMs via the fgbio tool. This is the process recommended by IDT, the supplier of the unique, dual indexed adapters with UMIs that were used here.
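The reason for the merge step may be easier to see in a toy model. The sketch below does not use Picard or any BAM library. It treats records as plain Python dictionaries keyed by read name, with invented read names, positions and UMI sequences, to show how a tag carried only by the unaligned records can be rejoined to the mapping results.

```python
# Toy stand-ins for BAM records, keyed by read name. All names and values
# here are invented for illustration.
unaligned = {
    "read_1": {"RX": "ACGTACGTA"},
    "read_2": {"RX": "TTGCATGCA"},
}
aligned = {
    "read_1": {"chrom": "chr1", "pos": 1000},
    "read_2": {"chrom": "chr2", "pos": 2500},
}

def merge_by_read_name(aligned, unaligned):
    """Copy tags from the unaligned record onto the aligned record of the same name."""
    merged = {}
    for name, alignment in aligned.items():
        record = dict(alignment)                 # keep the mapping information
        record.update(unaligned.get(name, {}))   # add tags such as RX
        merged[name] = record
    return merged

merged = merge_by_read_name(aligned, unaligned)
print(merged["read_1"])
```

Because the fastqs passed to the mapper carry no UMI, the read name is the only key that links each alignment back to its RX tag.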

ADD REPLY • link 7.4 years ago by mark.rose ▴ 50