Question

How to strip barcodes from demultiplexed data

3.9 years ago

MboiTui ▴ 20

Dear BioStars community,

I am working for the first time with 'raw' sequencing data (FASTQ files). The data are single-end GBS reads produced with two restriction enzymes.

The sequencing centre provided the data already demultiplexed, but with the barcodes still present inline at the start of each read.

Here are the first two lines of one FASTQ file:

@HISEQ:658:CDPMCANXX:6:1101:8843:1997 1:N:0:
NACAGCAGACAGTGCAGTTTTACCTCAGAAACCACATATGCATGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGAT

The metadata file provides a barcode9I (GACAGCAGACAGTGC) and a barcode (GACAGCAGACAG) for this individual.

First of all, what is the difference between the two?
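One way to see how the two metadata entries relate is to compare them directly. The sketch below uses only the two sequences quoted above; the variable names are illustrative. The extra bases it prints would be consistent with barcode9I being the barcode plus the first bases of the restriction-site remainder, although the metadata file does not confirm that.

```shell
# Values copied from the metadata entries quoted above.
barcode="GACAGCAGACAG"
barcode9I="GACAGCAGACAGTGC"

# Strip the shorter barcode from the front of the longer entry;
# whatever remains is the extra sequence carried by barcode9I.
extra="${barcode9I#"$barcode"}"
echo "extra bases in barcode9I: $extra"
```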

Furthermore, how can I remove the barcodes as part of the process_radtags module, considering that the data has already been demultiplexed (i.e., one fastq file per individual)?

Cheers

stacks • 2.2k views

ADD COMMENT • link updated 3.9 years ago by GenoMax 147k • written 3.9 years ago by MboiTui ▴ 20

Are you following the Stacks manual (LINK)?

ADD REPLY • link 3.9 years ago by GenoMax 147k

Hello GenoMax,

Thanks for your answer. I apologize if my question was very broad and made it look like I was asking for someone to do my work. That was not my intention.

I have been looking at the manual, but I was a bit confused. It says not to include the barcode file if the data have already been demultiplexed, but doing that left the barcodes in the reads after QC (I am not even sure how well QC was being performed, if at all).

I struggled for a bit, and finally ended up with the following:

process_radtags -p ./rawdata/ -o ./samples/ -b ./metadata/barcodes_file -c -q -r --disable_rad_check --inline_null

I used the --disable_rad_check option because otherwise all sequences were discarded, despite good Phred scores. When I searched for the cut site sequences in my data, I could not find them, so I initially believed they had already been removed by the sequencing company.
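Because the barcodes are inline, any cut-site remainder would sit immediately after the barcode rather than at the very start of the read. A rough way to check for it is sketched below. The function name cutsite_at_offset, the file path in the usage comment, and the choice of TGCAG as the expected PstI remainder are assumptions for illustration, and the sketch assumes standard 4-line FASTQ records. The offset should match the barcode length for the sample in question.

```shell
# cutsite_at_offset OFFSET SITE
# Reads FASTQ on stdin and reports how many reads carry SITE immediately
# after the first OFFSET bases (i.e. right after an inline barcode).
cutsite_at_offset() {
    local offset="$1" site="$2"
    awk -v off="$offset" -v site="$site" '
        NR % 4 == 2 {                      # sequence lines only
            total++
            if (substr($0, off + 1, length(site)) == site) hits++
        }
        END { printf "%d of %d reads\n", hits, total }
    '
}

# Hypothetical usage: 12 bp barcode, assumed PstI remainder TGCAG.
# zcat rawdata/sample.fastq.gz | cutsite_at_offset 12 TGCAG
```

Checking by offset rather than by matching barcode plus cut site as one string avoids missing reads whose first base was called as N, like the one posted in the question.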

I then read a few posts (e.g., https://groups.google.com/g/stacks-users/c/LQ6cyOruXh8?pli=1), and they made me think that I am doing something wrong.

I believe the barcode9I sequence contains part of the cut site sequence. So I will now try with the barcode sequence (instead of the barcode9I sequence) and keep the restriction enzyme information (--renz_1 pstI --renz_2 sphI).

ADD REPLY • link 3.9 years ago by MboiTui ▴ 20

I have now run the following command:

process_radtags -p ./rawdata_try/ -o ./samples/ -b ./metadata/DFr19-4488_Barcodes2.txt -c -q -r --inline_null --renz_1 pstI --renz_2 sphI

It returned the following message. I believe the module is now running correctly, but I will inspect the outputs to confirm that:

Processing single-end data.
Using Phred+33 encoding for quality scores.
Found 1 input file(s).
Searching for single-end, inlined barcodes.
Loaded 117 barcodes (10-14bp).
Will attempt to recover barcodes with at most 1 mismatches.
Processing file 1 of 1 [1872148.FASTQ.gz]
  Processing RAD-Tags...1M...
  1558043 total reads; -0 ambiguous barcodes; -14 ambiguous RAD-Tags; +19514 recovered; -17815 low quality reads; 1540214 retained reads.
Closing files, flushing buffers...
Outputing details to log: './samples/process_radtags.rawdata_try.log'

1558043 total sequences
      0 barcode not found drops (0.0%)
  17815 low quality read drops (1.1%)
     14 RAD cutsite not found drops (0.0%)
1540214 retained reads (98.9%)

EDIT: All retained sequences now start with TGCAG, which I believe is part of the pstI cut site. Not sure why that would be the case.

When I ran the command with the barcode9I barcodes and with the --disable_rad_check option, that was not the case :/
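A likely explanation, offered as an assumption rather than something confirmed in this thread: process_radtags trims the inline barcode but leaves the restriction-site remainder at the start of each read. Applying both barcode lengths to the read posted in the question illustrates the difference. The variable names are illustrative only.

```shell
# Read copied from the question; lengths taken from the two metadata entries.
read_seq="NACAGCAGACAGTGCAGTTTTACCTCAGAAACCACATATGCATGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGAT"

after_barcode="${read_seq:12}"     # drop the 12 bp 'barcode'
after_barcode9I="${read_seq:15}"   # drop the 15 bp 'barcode9I'

echo "barcode stripped:   ${after_barcode:0:10}"
echo "barcode9I stripped: ${after_barcode9I:0:10}"
```

With the 12 bp barcode the remaining read begins with TGCAG. With the 15 bp barcode9I entry the first three of those bases are trimmed away as well, which would explain why the cut-site check failed until --disable_rad_check was added.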

ADD REPLY • link 3.9 years ago by MboiTui ▴ 20

I linked the manual just to make sure you had seen it and were following the procedure described.

"The sequencing centre provided the data already demultiplexed, but with the barcodes still present inline at the start of the read."

Looking at the FASTQ header you posted in the original question, it would appear that your data are not demultiplexed as far as Illumina indexes go. Is that correct? Normally there would be an index sequence at the end of the header, and it would look like 1:N:0:ATGCGTA.
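A quick way to check a file for this is sketched below. The function name header_index is hypothetical, and the sketch assumes the header has the usual space-separated comment field (for example 1:N:0:ATGCGTA) as its second word.

```shell
# header_index: reads FASTQ on stdin and prints the last colon-delimited
# field of the first header's comment part (an empty line if no index
# sequence is recorded there).
header_index() {
    awk 'NR == 1 { n = split($2, f, ":"); print f[n]; exit }'
}

# Hypothetical usage:
# zcat rawdata/sample.fastq.gz | header_index
```

An empty result only means that no index is written in the header; it does not say whether the reads were separated by their inline barcodes.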

ADD REPLY • link 3.9 years ago by GenoMax 147k

Being so new to these pipelines, I am not sure. This is what was stated when downloading the FASTQ files from the sequencing centre:

Files are provided demultiplexed and have been named by target ID. No filtering has been applied. The demultiplexing barcodes have not been stripped.

All sequences within one FASTQ file share the same inline barcode, and each FASTQ file is named according to sample ID. I received as many FASTQ files as I have sampled individuals.
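One way to confirm that every read in a file carries the same inline barcode is to tally the leading bases, as in the sketch below. The function name prefix_tally and the file path in the usage comment are hypothetical, and 4-line FASTQ records are assumed.

```shell
# prefix_tally K: reads FASTQ on stdin and prints each distinct leading
# K-mer with its read count, most frequent first.
prefix_tally() {
    local k="$1"
    awk -v k="$k" 'NR % 4 == 2 { print substr($0, 1, k) }' |
        sort | uniq -c | sort -k1,1nr | awk '{ print $2 "\t" $1 }'
}

# Hypothetical usage: top leading 12-mers in one sample file.
# zcat rawdata/sample.fastq.gz | prefix_tally 12 | head
```

A single dominant 12-mer, plus a few variants with an N or a miscalled base, would be consistent with one barcode per file.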

ADD REPLY • link 3.9 years ago by MboiTui ▴ 20