Hello all, is there an existing script to obtain the list of the barcodes embedded in the read headers of my multiplexed fastq files? Thanks a lot!
I suppose your headers look like this:
@7001367R:585:HNHVHBCXY:1:1102:1267:2073 1:N:0:ATTACTCG+TATAGCCT
where ATTACTCG+TATAGCCT, the last colon-separated field, is your barcode pair (HNHVHBCXY is the flowcell ID). If your fastq files are in gzip format, do the following:
zcat file.fastq.gz | awk 'NR%4==1' | cut -d ':' -f10 | sort -u
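If you also want to know how often each barcode occurs, here is a minimal sketch. It assumes four-line fastq records and that the index sequences are the last colon-separated field of the header (ATTACTCG+TATAGCCT in the example above). The function name count_barcodes is made up for illustration, and gzip -dc is used in place of zcat only because it behaves the same on more systems.

```shell
# Hypothetical helper: count how often each barcode occurs in one gzipped fastq.
# Assumption: records are exactly four lines long and the barcode is the last
# colon-separated field of the header line.
count_barcodes() {
    gzip -dc "$1" |
        awk -F ':' 'NR % 4 == 1 { print $NF }' |  # header lines only, last field
        sort |                                    # group identical barcodes together
        uniq -c |                                 # prefix each barcode with its count
        sort -k1,1nr                              # most frequent barcode first
}
```

Running count_barcodes file.fastq.gz should then print one line per barcode with its count in front. If your headers are laid out differently, the $NF part is what needs to change.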
When you say a list, my mind immediately goes to Python, but you haven't tagged it, so here is an awk solution:
zcat file.fastq.gz | awk 'NR % 4 == 1' | awk -F ':' '{print $12}'
Your fastq header line is a little unusual, but this prints every header line (with records laid out as header, sequence, +, quality, the headers are the lines where NR % 4 == 1). It then sets the field separator to : and selects the column with the index info. The field number may need adjusting for your headers, and you will also pick up some extra text after the index.
Can you show the first 4 lines of your fastq? Either
Or
The solution will be something like
But to know exactly, we need an example from the header.
Sure! Here they are! Thank you!
Just a clarification: I don't want to cut them out from the header but just have a list of them by file (sample). Thanks for your good disposition!!
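For a list per file (sample) that leaves the fastq files untouched, something along these lines might work. It is only a sketch: it assumes gzipped files with four-line records and that the barcode is the last colon-separated field of each header line, which may not match your headers. The function name barcodes_by_file is hypothetical.

```shell
# Hypothetical helper: for every gzipped fastq given as an argument, print
# one line per distinct barcode as "file<TAB>barcode<TAB>count".
# Assumption: four-line records, barcode = last colon-separated header field.
# The input files are only read, never modified.
barcodes_by_file() {
    for fq in "$@"; do
        gzip -dc "$fq" |
            awk -F ':' -v name="$fq" '
                NR % 4 == 1 { seen[$NF]++ }   # tally barcodes on header lines
                END {
                    for (b in seen)
                        printf "%s\t%s\t%d\n", name, b, seen[b]
                }
            ' | sort                          # awk array order is undefined
    done
}
```

You could call it as barcodes_by_file *.fastq.gz > barcodes.tsv to get one table covering all samples.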
Please don't post answers instead of comments unless you are actually answering the question.
Also, to ensure readability of posts, please use the code markup button (the one labelled 101010) around code and input/output.
It is essentially the same answer as doctor.dee005's, but mine counts the number of occurrences of each barcode (uniq -c), and his is more elegant as it considers only header lines. I am assuming the barcodes are the TTCTGCCT|0|TGAACCTT|0 part.
Edit: the cut command probably has to be: