Question

Barcodes do not show up in overrepresented sequences in FastQC

0

21 months ago

e.r.zakiev ▴ 250

So, I have a Ribo-seq experiment with multiple samples and for each sample there are two barcodes, supplied by the sequencing lab, like this one: TACTCATA+GCCACAGG

I just ran FastQC on that bugger and the barcodes didn't pop up among the overrepresented sequences (nothing shows up as overrepresented, in fact).

The sequencing lab couldn't have trimmed the barcodes, since all reads are exactly 100 bases long.

Should I still try trimming them somehow? Is that even necessary? Am I confusing these unnecessarily with the adapter sequences (which should be trimmed indeed), and am, coincidentally, a total moron?

I'd normally trim adapters with bbduk.sh, but these are not adapters.

ribo-seq barcodes • 1.2k views

ADD COMMENT • link updated 21 months ago by GenoMax 150k • written 21 months ago by e.r.zakiev ▴ 250

score 4 · Accepted Answer · 2023-07-18

21 months ago

GenoMax 150k

Indexes (what you are referring to as barcodes above) are never part of the actual reads in Illumina sequencing (the order of sequencing is R1 --> I1 --> I2 --> R2), so you are not going to find them in the FastQC report. They are transferred to the FASTQ headers after demultiplexing, which is where you will find TACTCATA+GCCACAGG for the sample that had those indexes.
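To check this on your own data, something like the following minimal sketch pulls the index out of a read header. It assumes the usual Illumina header layout (CASAVA 1.8 and later), where the part after the space is `read:filtered:control:index`. The header string and the function name `index_from_header` are made up for illustration and do not come from the data in this thread.

```python
from typing import Optional


def index_from_header(header: str) -> Optional[str]:
    """Return the index field of an Illumina-style FASTQ header, or None.

    Assumes the layout '@<read id> <read>:<filtered>:<control>:<index>'.
    """
    if not header.startswith("@"):
        return None
    # Split the read identifier from the comment field at the first space.
    parts = header.rstrip("\n").split(" ", 1)
    if len(parts) < 2:
        return None
    fields = parts[1].split(":")
    if len(fields) < 4:
        return None
    # The fourth field holds the index, e.g. 'i7+i5' for dual indexing.
    return fields[3] or None


# Hypothetical header; only the index pair is taken from the question.
example = "@A00123:45:HXXXXXXXX:2:1101:1000:1000 1:N:0:TACTCATA+GCCACAGG"
print(index_from_header(example))
```

If the function returns the expected index pair for the first few headers of a file, the indexes are in the headers and not in the read sequences, so there is nothing to trim.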

ADD COMMENT • link 21 months ago by GenoMax 150k

0

OK, thanks! Alignment to the reference transcriptome with salmon did indeed yield mapping rates of 86%+, so it would have been strange if these barcodes were interfering with the reads at such a mapping rate.

By the way, since you touched upon the sequencing order: the sequencing lab provided all the files with the L002 index (i.e. there are no files with the L001 index; all files have L002 for some reason). I grilled them on why they did it like that, but they said "it's normal". Weirdly, they also added the R1 index to the file names, although the reads were clearly single-end to begin with.

ADD REPLY • link 21 months ago by e.r.zakiev ▴ 250

2

I think you may be confusing lanes with indexes. When a sample runs on multiple lanes, the "lane-specific" files will have L001/L002/L003/L004 in their names to signify the lane the data came from. It is possible to demultiplex the data so that it is not separated by lane (if a sample pool ran on multiple lanes). In that case you will not find any L00* in the file names.
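As a rough illustration of how those name components break down, here is a small sketch that splits a FASTQ file name into sample, lane, and read. It assumes the default naming pattern I would expect from Illumina demultiplexing, `<sample>_S<number>[_L<lane>]_<R1|R2|I1|I2>_001.fastq.gz`. The file names and the helper `parse_fastq_name` are hypothetical, and a lab may well rename files differently.

```python
import re
from typing import Optional

# Assumed pattern: <sample>_S<number>[_L<lane>]_<R1|R2|I1|I2>_001.fastq.gz
# The lane part is optional because data can be demultiplexed without
# splitting by lane.
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)"
    r"(?:_L(?P<lane>\d{3}))?"
    r"_(?P<read>[RI][12])_001\.fastq\.gz$"
)


def parse_fastq_name(name: str) -> Optional[dict]:
    """Split a FASTQ file name into its parts, or return None if it does not match."""
    match = FASTQ_NAME.match(name)
    if match is None:
        return None
    info = match.groupdict()
    # Convert 'L002' style lane numbers to integers; keep None when absent.
    if info["lane"] is not None:
        info["lane"] = int(info["lane"])
    return info


# Hypothetical file names for illustration.
print(parse_fastq_name("sampleA_S1_L002_R1_001.fastq.gz"))
print(parse_fastq_name("sampleA_S1_R1_001.fastq.gz"))
```

Here `L002` only says the data came from lane 2 of the flow cell, and `R1` is present even for single-end runs because it names the first (and only) read.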

If they did give you separate files containing the index sequence (I2), which some software legitimately requires, then that would be odd, especially if your samples were single-end and not dual-indexed. In that case the I2 file likely contains the same phantom sequence one sees when there are no indexes.

ADD REPLY • link 21 months ago by GenoMax 150k

0

Ah yes, sorry. I was indeed referring to the Illumina multiplexing lanes (L001/L002/L003/L004). I've never encountered I1/I2 indices in file names, as is

ADD REPLY • link 21 months ago by e.r.zakiev ▴ 250

0

You will encounter I* names with single cell data (10x).

ADD REPLY • link 21 months ago by GenoMax 150k