Combine subject with sample type in a scatter?
5.6 years ago
alanh ▴ 170

I have CWL that runs a pair of tumor-normal samples for a given subject.

For later variant calling, I want to set the read names to something like $(subjectName)_$(fastqs.sample)

The inputs are like this:

inputs:
  fastqs:
    type:
      type: array
      items:
        type: record
        fields:
          - name: sample
            type: string
          - name: files
            type:
              type: array
              items: File
  referenceFasta:
    type: File
  subjectName:
    type: string

steps:

  dna_align_and_sort:

    run: align_sort.cwl
    in:
      reference_fasta: referenceFasta
      fastq_files:
        source: fastqs
        valueFrom: $(self.files)
      sample_name:
        source: fastqs
        valueFrom: $(MAGIC LINE HERE)  # <---- WHAT GOES HERE?
    out:
      [fileInDir]
    scatter: [fastq_files, sample_name]
    scatterMethod: dotproduct

Can someone tell me what should go into the sample_name valueFrom to make this work?

I have successfully inserted the fastqs.sample as the name using $(self.sample) so I know the underlying code works.

cwl • 1.9k views

Can you elaborate on the kind of problem your example causes? Is it related to the scattering or the referencing of subjectName in the context of the step?


In a later step, the Mutect2 somatic variant caller seems to name the FORMAT data column in its output VCF using the name attached to the reads. In my example above, "$(self.sample)" is either "tumor" or "normal" depending on the sample type, so the reads get named "tumor" or "normal".

The problem occurs after that when I try to build a panel of normals (PON) from the normal samples, and if they're all named the same thing (Normal, Normal, Normal, etc), the PON creation step barfs because they're all the same names.


Can adding the subjectName solve this? Only a single subject name is given to the workflow. Wouldn't they still all have the same (albeit longer) name?

Do the fastq files have unique names? If so, you could add their nameroot to the sample names to distinguish between them.


I've tried a bunch of iterations here:
valueFrom: "$(subjectName)_$(self.sample)"
valueFrom: ${return inputs.subjectName.concat("_",fastqs.sample)}

and they all fail with various issues.


Try adding "subjectName" to the step's inputs (its in: section) and then referring to it this way:

valueFrom: "$(inputs.subjectName)_$(self.sample)"
5.6 years ago
alanh ▴ 170

To specifically put the answer in context, the correct method is to do the following:

steps:
  dna_align_and_sort:
    run: align_sort.cwl
    in:
      reference_fasta: referenceFasta
      fastq_files:
        source: fastqs
        valueFrom: $(self.files)
      subject_name: subjectName  # THIS brings the subjectName value into the step's local context
      sample_name:
        source: fastqs
        # inputs.subject_name below refers to the "# THIS ..." line above;
        # self is one fastqs record, whose field is named "sample"
        valueFrom: "$(inputs.subject_name)_$(self.sample)"
    out: 
      [fileInDir]
    scatter: [fastq_files, sample_name]
    scatterMethod: dotproduct
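
One thing worth checking when trying the step above: valueFrom on a step input and scatter are both optional CWL features, so the workflow has to declare them under requirements. A minimal sketch of the workflow header follows. The original post does not show its header, so the cwlVersion and the whole block are assumptions:

```yaml
cwlVersion: v1.0   # assumption: the post does not state the version
class: Workflow

requirements:
  # needed for scatter / scatterMethod on the step
  ScatterFeatureRequirement: {}
  # needed for valueFrom on workflow step inputs
  StepInputExpressionRequirement: {}
  # only needed when valueFrom uses JavaScript (operators, ${...} bodies),
  # not for plain parameter references such as
  # "$(inputs.subject_name)_$(self.sample)"
  InlineJavascriptRequirement: {}
```

The inputs: and steps: sections from the thread would go below this header unchanged.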
5.6 years ago
Tom ▴ 540

Regarding the issues with valueFrom: ${return inputs.subjectName.concat("_",fastqs.sample)} and similar constructs:

In this context, inputs does not reference the workflow's inputs, but the inputs of the step. Looking at the valueFrom section in this segment of the specification leads me to believe there is no way to reference the subjectName input of the workflow in the StepInputExpression. You would have to pass subjectName to your align_sort.cwl tool and concatenate the names there.

Another option would be this horrid workaround I use:

Add an input parameter to your align_sort.cwl:

[...]   
  inputs:
      namesource:
        type: string? #This might also be File etc.
[...]

You don't give it an input binding, so the tool itself will never use it. It's optional, so everything will run fine if you don't provide it to the tool. But you can pass subjectName to the workflow step as an input parameter and reference it in the JavaScript expression:

[...]
steps:
  dna_align_and_sort:
    run: align_sort.cwl
    in:
      namesource: subjectName
      sampleName: 
        valueFrom: $((inputs.namesource)+ (whatever.you.like))
[...]

How about this:

  sampleName: 
    valueFrom: |
      ${
        if (inputs.subjectName) {
          return inputs.subjectName + "_" + inputs.sampleName
        } else {
          return inputs.sampleName
        }
      }

As I mentioned: I don't think it is possible to reference inputs.subjectName unless subjectName is an input of the WorkflowStep. Also, why reference inputs.sampleName inside the expression that is supposed to yield inputs.sampleName? I feel like I'm fundamentally misunderstanding what is to be accomplished here.


So, trying to provide a more precise solution: you can get a combination of the workflow input subjectName and the sample field of the fastqs array (as requested in the opening post) by doing the following. Add this to the inputs section of align_sort.cwl:

[...]
  namesource:
    type: string?
[...]

Then modify the steps section of the workflow:

[...]
steps:

  dna_align_and_sort:

    run: align_sort.cwl
    in:
      namesource: subjectName
      reference_fasta: referenceFasta
      fastq_files:
        source: fastqs
        valueFrom: $(self.files)
      sample_name:
        source: fastqs
        valueFrom: $((inputs.namesource)+"_"+(self.sample))
    out:
      [fileInDir]
    scatter: [fastq_files, sample_name]
    scatterMethod: dotproduct
[...]

This should work from a technical perspective and achieve what you asked for in the opening post. But since subjectName is only a single string and you said the sample field is always just "tumor" or "normal", I doubt doing this will solve the problem. From my understanding, you need individual names for each sample.

How are the fastq files named? Maybe you could also add something like +"_"+(self.files[0].nameroot) to the end of the sample_name strings to distinguish them (files is an array of File, so nameroot has to be read from one element).
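
A sketch of what that could look like in the step. It assumes the namesource step input from above, that InlineJavascriptRequirement is declared in the workflow, and that the first file of each record has a name that is unique enough. Since files is an array of File, nameroot is taken from a single element:

```yaml
      sample_name:
        source: fastqs
        # self is one fastqs record after scattering; files[0] is its first File.
        # nameroot is the basename without its last extension, so
        # "a.fastq.gz" gives "a.fastq".
        valueFrom: $(inputs.namesource + "_" + self.sample + "_" + self.files[0].nameroot)
```

Whether the first file is the right one to use depends on how the fastq files are named and grouped, which the thread does not say.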


Thanks, this answered my question and it was hard to shift the contexts in my head.
