Question

Label fastq reads

0

7.0 years ago

L. A. Liggett ▴ 130

I am deriving consensus reads from FASTQ files and would like to keep track of some information during the derivation process. Is there a good way to do this? To simplify the problem: given a FASTQ file containing the following read, I want to associate some information with each base prior to alignment and variant calling, in this instance just a single number.

@SEQ_ID
GATC
+
!''*

@SEQ_ID
GATC
7452    <---- associated information
+
!''*

Then I would like this information to be accessible after alignment and variant calling. If it's possible, it would be convenient to have it populated into the resulting VCF file, but that is not necessary if there is a better way. Here is my idea of how a G>A change at the second nucleotide within the FASTQ file would look.

#CHROM POS    ID        REF  ALT     MYINFO
2      4370   rs6057    G    A       4

alignment sequencing • 1.8k views

ADD COMMENT • link updated 7.0 years ago by swbarnes2 14k • written 7.0 years ago by L. A. Liggett ▴ 130

0

Can you explain what the goal of this is?

ADD REPLY • link 7.0 years ago by WouterDeCoster 48k

0

The goal is a bit complicated, but essentially I have barcoded FASTQ reads that I am binning together and using to derive consensus sequences. I would like to retain some of the information in the binned reads, such as percent sequence agreement at each position. I would then use this information to inform statistics/confidence calculations for each identified variant.
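For concreteness, here is a minimal sketch of the kind of per-position agreement being described, assuming all reads in one barcode bin have the same length and need no realignment against each other. The function name `consensus_with_agreement` is hypothetical and not part of any existing tool.

```python
from collections import Counter


def consensus_with_agreement(reads):
    """Return (consensus, agreement) for the equal-length reads of one bin.

    agreement[i] is the percentage of reads (rounded to an integer) that
    carry the consensus base at position i.
    """
    if not reads:
        raise ValueError("empty bin")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads in a bin must have equal length")

    consensus = []
    agreement = []
    # zip(*reads) walks the bin column by column, one position at a time.
    for column in zip(*reads):
        # Most frequent base wins; ties go to the base seen first in the bin.
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base)
        agreement.append(round(100 * count / len(reads)))
    return "".join(consensus), agreement


seq, pct = consensus_with_agreement(["GATC", "GATC", "GTTC", "GATC"])
print(seq, pct)  # GATC [100, 75, 100, 100]
```

The list `pct` is the per-base value that would need to travel with the consensus read through alignment and variant calling.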

ADD REPLY • link 7.0 years ago by L. A. Liggett ▴ 130

score 1 · Answer 1 · 2018-05-31

1

7.0 years ago

GenoMax 151k

You will be violating the FASTQ format specification if you do this. None of the aligners/tools will work properly.

Your only option is probably to keep track of what you need independently and add it to the VCF afterwards. I am not sure that is even possible, since at some point in the variant calling process you are going to lose the FASTQ headers.

ADD COMMENT • link 7.0 years ago by GenoMax 151k

0

Right, I was thinking that there might be a way to track this information in an associated file, but if I am doing this pre-alignment, I do not know how to associate the read information with output variants.

ADD REPLY • link 7.0 years ago by L. A. Liggett ▴ 130

0

Your variants are going to be a compressed representation of a pileup at that position, so even if you somehow managed to carry your special info over, which value would you select for the base (or two) called as a SNP if your numeric info was different for each base?

Note: I suppose you could replace the real Q scores with some transformed representation of the values you want, so that they fit in the Sanger FASTQ scale, but then aligners/variant callers would not be using real Q scores.
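A rough sketch of that transformation, assuming the per-base values are integers from 0 to 100 (for example percent agreement) and that the full Phred+33 range of Q0 to Q93 is acceptable downstream. The helper names `encode_as_quality` and `decode_quality` are made up for illustration, and the linear scaling is just one possible choice.

```python
def encode_as_quality(values, max_value=100, max_q=93):
    """Map per-base values in [0, max_value] onto Phred+33 characters."""
    chars = []
    for value in values:
        if not 0 <= value <= max_value:
            raise ValueError(f"value out of range: {value}")
        # Linear rescale to 0..max_q, then shift by 33 as in Sanger FASTQ.
        q = round(value * max_q / max_value)
        chars.append(chr(q + 33))
    return "".join(chars)


def decode_quality(qual, max_value=100, max_q=93):
    """Approximate inverse of encode_as_quality (rounding loses resolution)."""
    return [round((ord(char) - 33) * max_value / max_q) for char in qual]


qual = encode_as_quality([100, 75, 100, 100])
print(qual)                  # ~g~~
print(decode_quality(qual))  # [100, 75, 100, 100]
```

Because 101 possible values are squeezed into 94 quality levels, some neighbouring values collapse onto the same character, so the round trip is only approximate in general. The caveat from the note still applies: any tool that reads these characters will treat them as real base qualities.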

ADD REPLY • link 7.0 years ago by GenoMax 151k

score 0 · Answer 2 · 2018-05-31

0

7.0 years ago

swbarnes2 14k

Most software will not care if you add things to the read name, so you can probably put your custom info there. Or you can convert your FASTQ into an unmapped BAM and add a custom tag there.
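As an illustration of the read-name route, here is a small sketch that appends per-base values to the read ID and parses them back later. The `|pa=` separator and both function names are hypothetical choices, not a convention any tool recognises. The payload avoids whitespace on the assumption that the aligner keeps only the part of the header before the first space as the read name.

```python
def tag_read_name(name, values, sep="|pa="):
    """Append comma-separated per-base values to a read name."""
    return f"{name}{sep}{','.join(str(value) for value in values)}"


def parse_read_name(tagged, sep="|pa="):
    """Split a tagged read name back into (original name, list of values)."""
    name, _, payload = tagged.partition(sep)
    values = [int(item) for item in payload.split(",")] if payload else []
    return name, values


tagged = tag_read_name("SEQ_ID", [100, 75, 100, 100])
print(tagged)                   # SEQ_ID|pa=100,75,100,100
print(parse_read_name(tagged))  # ('SEQ_ID', [100, 75, 100, 100])
```

One practical limit to keep in mind: the SAM specification caps read names at 254 characters, so this only fits short reads or a compact encoding of the values.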

ADD COMMENT • link 7.0 years ago by swbarnes2 14k

0

OP stated this requirement:

I want to associate some information with each base prior to alignment and variant calling, in this instance just a single number.

which is not going to be easy to do. The idea of adding a custom tag to the BAM is fine, but how that can be carried through variant calling is a valid question.

ADD REPLY • link 7.0 years ago by GenoMax 151k