I have paired-end Illumina reads that contain barcode and primer sequences. The barcode and primer sequences themselves are given only in a .txt file. The experiment was as follows: the primer was used for PCR, and then the experiment tag (barcode) and the adapter were attached. So the reads have the following structure:
barcode_sequence-PCR_primer_sequence-fragment
I want to demultiplex the reads according to the barcode_sequence and then trim off the primer sequence. So far I have tried the following:
QIIME: split_libraries_fastq.py
I do not have barcode-read FASTQ files; I have only the sequences of the barcodes and primers. I constructed the mapping file:
#SampleID BarcodeSequence LinkerPrimerSequence Description
1 TCGCAGG AACCTGGTTGATCCTGCCAGT C4363F2_18.7.
2 CTCTGCA AACCTGGTTGATCCTGCCAGT C4363F2_19.7.
So I need to set --barcode_type not-barcoded. It then gave an error saying that I need to specify --sample_ids; as I had only one input file, I gave only one sample ID:
split_libraries_fastq.py -m mapping.txt -i Pool1_18S.fastq -o demultiplexed_output/ --barcode_type not-barcoded --sample_ids 1
I get one seqs.fna file in which every read has the following attached:
orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
Stacks: process_radtags
process_radtags -p /fastq -I -b /mapping_radtags.txt --inline_inline -o /demultiplexed_output
However, it asks me to specify the restriction enzyme used. But I do not have this information.
So, what I need: I have several experiments, each identified by a barcode, and I need to demultiplex them. I cannot just search for the exact barcode in the read and assign the read to that experiment: there may be a sequencing error in the barcode, so I need to allow a Hamming (or other) distance between the true barcode sequence and the sequence in the read. Which program can do this?
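To make the requirement concrete, here is a minimal Python sketch of the kind of matching I mean, not a replacement for a real tool. It assumes the barcode sits at the very start of each read and is followed by a primer of known length; the function names and the `max_mismatches` and `primer_len` parameters are hypothetical. A read is assigned only if exactly one barcode is within the allowed Hamming distance.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        yield header, seq, qual

def demultiplex(handle, barcodes, primer_len, max_mismatches=1):
    """Yield (sample, header, seq, qual); barcode and primer are clipped on a hit.

    Reads matching no barcode, or more than one, are yielded unclipped
    with sample set to None.
    """
    for header, seq, qual in read_fastq(handle):
        hits = []
        for sample, bc in barcodes.items():
            prefix = seq[:len(bc)]
            if len(prefix) == len(bc) and hamming(prefix, bc) <= max_mismatches:
                hits.append((sample, len(bc)))
        if len(hits) != 1:
            yield None, header, seq, qual
            continue
        sample, bc_len = hits[0]
        cut = bc_len + primer_len
        yield sample, header, seq[cut:], qual[cut:]
```

Called as `demultiplex(open("Pool1_18S.fastq"), {"1": "TCGCAGG", "2": "CTCTGCA"}, primer_len=21)`, this would yield one tuple per read; writing each sample to its own file is left out.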
Not an answer but a small comment.
Ideally, you should not simply use Hamming distance; you should use likelihood. If you have a mismatch with a quality score of 2 and a mismatch with a quality score of 40, the former is more likely under a given barcode than the latter, because a low-quality base call is much more likely to be a sequencing error. We published a paper about maximum-likelihood demultiplexing: https://grenaud.github.io/deML/
If you want to code a bit, you could modify it to incorporate barcode information into the likelihood computation then set a cutoff on the final likelihood for a sample. Furthermore, the likelihood of sample bleed-ins could be computed effortlessly.
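To illustrate the idea only, here is a minimal Python sketch of quality-aware barcode scoring. It is not deML's implementation: it assumes Phred-scaled qualities, independent positions, and that the three wrong bases are equally likely, and the `min_margin` cutoff and all function names are hypothetical.

```python
import math

def phred_to_error(q):
    """Phred score -> error probability, capped at 0.75 (a random base call)."""
    return min(10 ** (-q / 10), 0.75)

def barcode_log_likelihood(observed, quals, barcode):
    """Log-likelihood of the observed bases given that `barcode` was the true one.

    A match contributes log(1 - e) and a mismatch log(e / 3), where e is the
    error probability implied by the base quality.
    """
    total = 0.0
    for base, q, expected in zip(observed, quals, barcode):
        e = phred_to_error(q)
        total += math.log(1 - e) if base == expected else math.log(e / 3)
    return total

def assign(observed, quals, barcodes, min_margin=math.log(10)):
    """Return the most likely sample, or None if the runner-up is too close."""
    scored = sorted(
        ((barcode_log_likelihood(observed, quals, bc), name)
         for name, bc in barcodes.items()),
        reverse=True,
    )
    if len(scored) > 1 and scored[0][0] - scored[1][0] < min_margin:
        return None
    return scored[0][1]
```

With this scoring, a single mismatch at quality 2 costs about 1.6 log units while one at quality 40 costs about 10.3, which is the asymmetry described above.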
If your sequencing primer is internal to the barcode, how would you have sequenced the barcode? Which sequencing protocol (on the sequencer) was used?
In addition, don't remove your previous question here; that's not good practice.
I was told that they used the primer for PCR, then attached the barcodes and adapters and sequenced the library. What I find in my reads is that the barcode comes before the primer, which corresponds exactly to what I was told.
PCR primer ≠ sequencing primer. The sequencing primer anneals to the adapter (unless a custom primer was used), so the structure of your library is:
Yes, exactly; I corrected it.
If your barcodes are at the very beginning of the reads, then why are you having an issue demultiplexing? BBMap may be useful; see the answer "Demultiplexing fastq files with dual barcodes". There is also sabre.
After demultiplexing with BBMap, you can trim the PCR primer sequence from the reads with the same software:
I am having an issue because when the adapter was trimmed away, some of the first bases of the barcode may have been removed as well.
Then you are out of luck. You may have to go back and find the original dataset.
Can I just search for half of the barcode in the reads, for instance? This is already my original dataset.
If your barcodes were long enough to begin with, then you could; otherwise you would not be able to discriminate between samples, even if you found the remaining barcode fragment.
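One way to check this for a given barcode set is to compute the smallest pairwise Hamming distance among the barcode suffixes that would survive the trimming. The Python sketch below assumes the bases were lost from the 5' end, so only the last `keep` bases of each barcode remain; the function names are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def min_suffix_distance(barcodes, keep):
    """Smallest pairwise Hamming distance among the last `keep` bases of each barcode."""
    suffixes = [bc[-keep:] for bc in barcodes]
    return min(hamming(a, b) for a, b in combinations(suffixes, 2))

# The two barcodes shown in the mapping file above.
barcodes = ["TCGCAGG", "CTCTGCA"]
for keep in range(len(barcodes[0]), 0, -1):
    print(keep, min_suffix_distance(barcodes, keep))
```

A result of 0 means at least two samples can no longer be told apart at that length. To tolerate one sequencing error in the remaining fragment, the minimum distance needs to be at least 3.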
How long were the barcodes and how many were there? You could check whether you can identify the PCR primer sequence in the data and take the remaining barcode to the left of it. This would likely need some custom code and/or an awk-type solution.
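A rough Python sketch of that custom code follows. It assumes the primer starts within the first barcode-length bases of the read, and it compares whatever precedes the primer against the matching-length suffix of each barcode. The function names and the `min_remaining`, `max_mismatches` and `max_bc_mismatches` thresholds are hypothetical and would need tuning on real data.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_primer(read, primer, max_offset, max_mismatches=2):
    """Leftmost primer start in read[0..max_offset], or -1 if none is close enough."""
    for start in range(max_offset + 1):
        window = read[start:start + len(primer)]
        if len(window) < len(primer):
            break
        if hamming(window, primer) <= max_mismatches:
            return start
    return -1

def assign_by_primer_anchor(read, primer, barcodes, min_remaining=4,
                            max_bc_mismatches=0):
    """Return (sample, insert) using the barcode remnant left of the primer.

    Returns (None, read) if the primer is not found, the remnant is shorter
    than min_remaining, or the remnant fits zero or several barcodes.
    """
    longest = max(len(bc) for bc in barcodes.values())
    start = find_primer(read, primer, max_offset=longest)
    if start < min_remaining:  # also covers start == -1
        return None, read
    remnant = read[:start]
    hits = [
        name for name, bc in barcodes.items()
        if len(bc) >= len(remnant)
        and hamming(remnant, bc[-len(remnant):]) <= max_bc_mismatches
    ]
    if len(hits) != 1:
        return None, read
    return hits[0], read[start + len(primer):]
```

For example, with the barcodes and primer from the question, a read starting with GCAGG followed by the primer would be assigned to sample 1, since GCAGG is the 5-base suffix of TCGCAGG and not of CTCTGCA.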