Hello,
I have several .fq files in which every read begins with a 5 bp inline barcode, for example (the barcodes are marked between asterisks):
@gi|110640213|ref|NC_008253.1|_418_952_1:0:0_1:0:0_0/1
*CCAGG*CAGTGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCATCTGGTAGCGATGAT
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gi|110640213|ref|NC_008253.1|_31_476_0:0:0_0:0:0_1/1
*CAGAT*GGTTGGTGATTTTGGCGGGGGCAGAGAGGACGGTGGCCACCTGCCCCTGCCTGGCATTGCTTTCC
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gi|110640213|ref|NC_008253.1|_210_743_2:0:0_1:1:0_2/1
*CATTA*CCACCACCATCACCATTACCACAGGAAACGGTGCGGGCTGACGCGTACAGGAAACACCGAAAAAA
+
2222222222222222222222222222222222222222222222222222222222222222222222
I would like to replace these barcodes so that every read starts with the same sequence (here, AAAAA):
@gi|110640213|ref|NC_008253.1|_418_952_1:0:0_1:0:0_0/1
*AAAAA*CAGTGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCATCTGGTAGCGATGAT
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gi|110640213|ref|NC_008253.1|_31_476_0:0:0_0:0:0_1/1
*AAAAA*GGTTGGTGATTTTGGCGGGGGCAGAGAGGACGGTGGCCACCTGCCCCTGCCTGGCATTGCTTTCC
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gi|110640213|ref|NC_008253.1|_210_743_2:0:0_1:1:0_2/1
*AAAAA*CCACCACCATCACCATTACCACAGGAAACGGTGCGGGCTGACGCGTACAGGAAACACCGAAAAAA
+
2222222222222222222222222222222222222222222222222222222222222222222222
I want to make sure that only the first 5 bases of each read are modified, and nothing further along the read. The original barcode sequence might also occur inside a read, and I don't want to modify it there.
Do you know of an easy way to do this? Thanks!
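For illustration, here is a minimal Python sketch of one position-based way to do this. It assumes plain 4-line FASTQ records (no wrapped sequence lines), that the asterisks are not present in the real files, and that every read is at least as long as the barcode. The function name `replace_barcodes` and the file names in the usage comment are made up for the example.

```python
def replace_barcodes(src, dst, new_barcode="AAAAA"):
    """Overwrite the leading bases of every read with new_barcode.

    src and dst are text file objects. Only the sequence line of each
    4-line FASTQ record is touched, and only by position, so copies of
    the old barcode further inside a read are left alone.
    """
    n = len(new_barcode)  # number of leading bases to overwrite
    for i, line in enumerate(src):
        if i % 4 == 1:  # second line of each record is the sequence
            seq = line.rstrip("\r\n")
            line = new_barcode + seq[n:] + "\n"
        dst.write(line)


# Example usage (hypothetical file names):
# with open("reads.fq") as src, open("reads.recoded.fq", "w") as dst:
#     replace_barcodes(src, dst)
```

Because the replacement has the same length as the original barcode, the quality line still matches the read length and does not need to change.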
Works like a charm! Thanks Gabriel. I was trying things with awk but without success. This solves my issue. And yes, the asterisks are not part of the sequence :-)
You are most welcome. Please mark the question as answered :-)