If the index sequences in my Illumina FASTQ headers aren't the barcodes, then what are they?
4
6
Entering edit mode
8.4 years ago
jenn.drummond ▴ 100

Hi, everybody. I have FASTQ headers of the form

@FCC3KD2ACXX:6:1101:1545:2184#ATCACGATC/1

The "ATCACGATC" portion of these older-style headers is supposed to be the "index sequence", or the molecular barcode of a multiplexed sample, according to the Wikipedia article. But I know what the barcodes are, and that isn't one of them. All the barcodes for this project are 6-8bp, not 9, and there are "only" about 300 legitimate barcodes in the lane.

Overall, there are over 255,000 different individual ones of these index sequences on various different reads, out of about 178M reads in the file. Some of them even contain Ns, but they're all exactly 9bp long. And this particular sequence (ATCACGATC) is by far the most prevalent -- it's on 90% of the reads, so it can't be succeeding at separating anything out very specifically.

I'm coming in late to a project, and all I have to go on is the FASTQ files, the list of barcodes, and some wet-lab protocol docs that I'm not particularly qualified to interpret. Any idea what this odd extraneous-looking sequence is? If so, thanks in advance!

next-gen sequencing fastq illumina • 12k views
ADD COMMENT
0
Entering edit mode

Some of them even contain Ns, but they're all exactly 9bp long.

Therefore it's not barcode metadata, it's something that was sequenced. I'd imagine it's the first 9 bases of the sequenced DNA, that has been cut out of the SEQ and pasted into the QNAME, probably as a ghetto method for parsing 9bp barcodes in the past.

EDIT: I wonder if the machine has given different barcodes for pairs of the same read...?

ADD REPLY
0
Entering edit mode

@John: With an over-clustered flowcell it is possible to get N's in tag reads (one or more and in particularly bad cases all can be N's).

A good sequencing facility will not release this kind of data (unless for investigative purposes). Here OP has been left holding the bag with a mishmash of info and has to make sense of it all.

ADD REPLY
0
Entering edit mode

Right, but what i'm saying is that these 9bp come from the read, not from some metadata added programatically. For example, I assume the machines usually read the first few bases, try to align the barcodes it knows about to 'call' a reads barcode, then add the barcode it thinks it matches to the QNAME. This way even if a base has an N in the sequenced barcode, the barcode in the QNAME will be correct. Or if the first base was omitted from sequencing somehow, it's still matched as being barcode X. But this machine, since it's putting Ns in the QNAME, is probably just cutting the first 9bp out of the read, and sticking it in the header - losing the QUAL score in the process for all 9bp. Real lazy.

ADD REPLY
2
Entering edit mode

Sure. That 9 bp came from a real read. Even though OP has 6-8 barcodes someone chose to run this as a 9 cycle (or 10, one base at end is removed by default) run.

In case of Illumina technology both tag reads are read completely independently of the main R1/R2 reads (sequence goes R1 --> Index 1 --> Index 2 --> R2). During the run the machine has no way of matching barcodes to anything. De-multiplexing is handled by the CASAVA/bcl2fastq software during post-processing of the run.

As an aside: There is an easy way to get the index reads in fastq (with Q-scores) format as separate files (then one ends up with R1, R2, R3, R4 files, corresponding to the order of reads above) with bcl2fastq2 v.2.x with a command line option. With older versions of CASAVA there was some editing of run files required to do that.

Reason one gets N's in the tag reads is the same as the main reads. If the clusters become too fat/are too close, then the basecalling software is no longer able to call a base and will use N. This happens when a flowcell is over-clustered (and there also tag reads are more vulnerable than the main reads). It is possible to get perfectly good sequence on main reads but lose the tag reads, rendering the run useless.

ADD REPLY
0
Entering edit mode

Ah, i didn't know it was a totally separate read. I assumed it was sequenced with everything else and then cut out programatically afterward. A totally separate sequencing step eh? Gosh.

So now it makes sense to me - thank you both genomax2 and igor :)

ADD REPLY
1
Entering edit mode

The barcode read is a separate read. If you really want, you can get those quality scores.

ADD REPLY
0
Entering edit mode

Thanks for the excellent discussion, which I'm just now getting around to reading in detail, but continues to be relevant since I'm dealing with several different flavors of unpleasantly non-"standard" data. Every bit of crowdsourced clue I can get helps! This particular dataset did turn out to respond well to standard demultiplexing, but not all of them do. I had the pleasure of learning informatics in a top-of-the-line organization with excellent standards, where I had access to everyone from the lab techs to the people running the mapping pipeline in case questions arose. In my new situation...not so much. :)

ADD REPLY
4
Entering edit mode
8.4 years ago

Your sequence string 'ATCACGATC' is a perfect match to the TruSeq adapter index 1. The length (9bp) is atypical but not unheard-of. The length of the index read is often index-length-plus-one (i.e., seven cycles for a 6mer), so it appears that the sequencer was programmed for nine index cycles (8mer+1). The last nucleotide is typically trimmed from the data by default (--use-bases-mask = I8n), but the full length can be specified by the user (I9).

As @igor indicates, the amount of data suggests that the file was not demultiplexed. However, given that 90% of your data are a perfect match, it's also unlikely to contain multiple libraries. Best guess is that some other lanes of this flow cell did contain 8mer indices, and CASAVA doesn't allow lane-specific BCL-to-FASTQ conversion (so the data in all lanes have identical formats).

ADD COMMENT
0
Entering edit mode

This is an older run, so I can't speak for CASAVA 1.7. However, 1.8 allows lane-specific BCL-to-FASTQ conversion.

ADD REPLY
0
Entering edit mode

This would have been useful to know! Our facility always ran CASAVA multiple times to generate different data outputs. Just for my edification, what's the CASAVA 1.8 syntax to extract (for example) a 6bp index from lane 1 vs a 8bp index from lanes 2-8?

ADD REPLY
0
Entering edit mode

The latest version of bcl2fastq is 2 where you can specify exact lanes.

With bcl2fastq 1.8, you have to run it separately for each barcode size, so you do have to run it multiple times. However, you don't have to process each lane twice. In your sample sheet, only include lanes you want to process and it will ignore any lanes that are not mentioned in the sample sheet.

ADD REPLY
0
Entering edit mode

Okay, that's exactly what we did.

ADD REPLY
1
Entering edit mode
8.4 years ago

It's possible that they ARE your barcodes, and the metadata is wrong or you're looking at the wrong data.

If your reads are paired, you can determine the actual adapter sequence using BBMerge with the "outadapter" flag; the barcode will be embedded in the adapter sequence (may be reverse-complemented, due to the chemistry). So you might try that, to determine the barcodes empirically.

Also, the number of bases in your barcode and the number of bases sequenced in the barcode phase are not necessarily correlated. Someone may have run 9 cycles even though the barcodes were only 6 bases long. In that case, the last 3 bases would just be part of the adapter sequence and not technically the barcode.

ADD COMMENT
1
Entering edit mode
8.4 years ago
GenoMax 147k

What is in the sequence file should be absolute data (provided it has not been tampered with). If the barcodes you have do not match the data then you should check on the validity of barcodes first (also check that that list you have is not reverse complement of what is in the file, the barcode match may start 2-3 bp into what is in the reads, this is a common error some experimental people do).

If you are positive that the barcodes are correct then

  1. you may have a wrong data set - if none (or few) of your barcodes match what is there
  2. data you need is mixed with something else - you would need to parse reads that match your barcodes out. Like @Brian said you could only look for the significant bases (6-8) from the beginning of the tag read and ignore the rest.

But before doing any of this you should take a few reads (that match your barcode selections and that predominant barcode) and do quick blast @NCBI to confirm that they are at least from the genome they are supposed to be from. If not, you have a problem on your hands.

ADD COMMENT
0
Entering edit mode

What a horrible thought...and therefore a useful one for the future. I'll remember that as I go through further forensics!

ADD REPLY
1
Entering edit mode
8.4 years ago
igor 13k

I am somewhat concerned your FASTQ headers are not standard. Specifically, there is no instrument ID (see discussion here: Illumina Instrument Type from fastq? ). There may be other modifications that are not obvious that would make troubleshooting more difficult.

We don't know if the samples were demultiplexed. It's possible that they weren't. The FASTQ is just showing you the barcode sequence corresponding to each read. Does the filename contain a barcode? By default, it should.

The files are probably not demultiplexed because you say you have 178M reads in that file, but 300 samples. 178M reads is roughly what you would expect from a full lane of sequencing.

If we assume this is a demultiplexed file, don't expect all of the barcode sequences to match your barcode. When demultiplexing, you can specify the number of mismatches. It's common to allow 1 mismatch (Illumina's own BaseSpace service does that). That will yield a lot of barcodes that are not identical to your barcode sequence, but still correspond to it. What are your barcodes? How do they compare to that sequence your posted? What are the other barcode sequences you see?

Finally, this is really the question for your sequencing facility. We can only speculate here, but they can easily tell you exactly what they did.

ADD COMMENT
2
Entering edit mode

Strictly speaking, there is no standard. There is just a collection of unspoken rules. If there was a standard, there wouldn't be this problem :)

They also wouldn't be a need for every read to contain the same useless prefixes/suffixes. They might as well have made their machines write a QNAME of "Our_Dear_Leader_Illumina#[5 to 10 random digits to ruin compression]#barcode#current_exchange_rate_of_£_to_$/1"

ADD REPLY
0
Entering edit mode

There is no "standard", but the Illumina sequencers come with default settings that have been fairly consistent for the last few years.

ADD REPLY
0
Entering edit mode

Right, a collection of unspoken rules.

ADD REPLY
0
Entering edit mode

Spoken rules. Ask Illumina and they will tell you exactly how the headers are generated. It's not a secret.

ADD REPLY
1
Entering edit mode

I mean they are not spoken in the FASTQ specification. They are additional "rules" for how Illumina wants to encode per-read metadata using a format that was not designed for per-read metadata. Illumina's decision to do that rather than find another/proper way to encode per-read metadata with sequence and quality has ultimately led to the existence of this thread/problem. I mean, i'd give Illumina some credit, they're not a huge company with unlimited resources. It's not like they made $1.57 billion dollars of gross profit in the last year. Oh wait. No. They did.

"Yeah just stuff it in the template ID. They'll figure it out."

ADD REPLY
2
Entering edit mode

I am going to abandon this thread before we get to the phred33 vs phred64 debate. At least the header line doesn't affect most downstream tools.

ADD REPLY
0
Entering edit mode

You get to do that when you invent a new technology (and then make everyone else run with it).

ADD REPLY
0
Entering edit mode

This data looks to be from older versions of CASAVA (v.1.7 and before) which was the "standard" (for what it is worth) then.

ADD REPLY
0
Entering edit mode

Eh, I don't think it's fair to say Illumina defined the standard for the Sanger Institute's FASTQ format. It may be standard given the prevalence of the Illumina machines, but I think it's wrong to say it is the standard or even a standard.

The reason i'm making a big deal out of it is that I think a lot of confusion in Bioinformatics happens when people think there are some rule going on, but really there are no rules. Assumptions about data is the number 1 cause for mistakes, so even if the header looked like it followed the CASAVA format perfectly, no one should try to interpret it because the header only is supposed to have 1 job - to be unique among reads that come from the same sequenced fragment. CASAVA ruined that with /1 and /2, and all this psudo-metadata about barcodes/etc does is blow up the file size of everything.

ADD REPLY
0
Entering edit mode

To be explicit, I was referring to the style of the fastq header in the example posted in the OP as standard for CASAVA v.1.7 and before :-)

ADD REPLY
0
Entering edit mode

I haven't thought about that. I haven't worked with CASAVA 1.7 headers, so it wasn't obvious to me.

However, CASAVA 1.8 has been around since 2011. Has the OP been sitting on this data for 5 years?

ADD REPLY
0
Entering edit mode

I have worked with every public version of CASAVA :-)

Has the OP been sitting on this data for 5 years?

Someone was. OP inherited this project with minimal information.

ADD REPLY
0
Entering edit mode

Yep, you nailed it. :)

ADD REPLY

Login before adding your answer.

Traffic: 1571 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6