How to create a list of the barcodes located in the read headers of demultiplexed fastq files?
6.4 years ago
DVR ▴ 30

Hello all, is there an existing script that lists the barcodes embedded in the read headers of my multiplexed fastq files? Thanks a lot!

next-gen sequence • 3.8k views

Can you show the first 4 lines of your fastq? Either

head -n 4 file.fastq

Or

zcat file.fastq.gz | head -n 4

The solution will be something like

zcat file.fastq.gz | cut -f 2 | sort | uniq

But to know exactly, we need an example from the header.


Sure, here they are. Thank you!

@M01380:50:000000000-AV1DH:1:1101:13660:1636 1:N:0:M154:16S_V1V3 TTCTGCCT|0|TGAACCTT|0 CS1_534R_YM3_for|4|27|
GATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGGATCTGACCAGCTTGCTGGTTGGTGAGAGTGGCGAACGGGTGAGTAATGCGTGACCAACCTGCCCCATGCTCCAGAATAGCTCTTGGAAACGGGTGGTAATGCTGGATGCTCCAACTTGACGCATGTCTTGTTGGGAAGGGGTTTTGGGCATGGGGTGGGGGTGGGTCCCTTCCGGCTTTAGGGGGGGGTATGGGCCACCTTGGCCTTGGTGGGGGACCCGCCTGAGGGGGG
+
GFFGGG<FFEGGGGGGGGGGGGFGGGGGGGGGGGFGGGGGGGGGGGGGGGGEGGGGGGGGCECEGGGGGFFGGGGGG@=FGGCF+BFFGGGGEGGEGDDFFG@DDGGECGFG9FGFFGGGFGGGGGGFGGGFGGGGFEG<,DFFAFFGFGGGFCG:<DDGG9FG*:CE9CEF?EFF>*:>****3C7***3***3A8*2/:8C**1*)/21***0.9))7/)*+2*)7>C))1)0C)*97)74)*0)0*).)*(07)(((((0(((((((-(75

Just a clarification: I don't want to cut them out from the header but just have a list of them by file (sample). Thanks for your good disposition!!


Please don't post answers instead of comments unless you are answering the question.

Also, to keep posts readable, please use the code markup button (the one labelled 101010) around code and input/output.

zcat file.fastq.gz | cut -f 3 | sort | uniq -c

It is essentially the same answer as doctor.dee005's, but mine counts the number of occurrences of each barcode (uniq -c), and his is more elegant because it considers only header lines.

I am assuming the barcodes are the TTCTGCCT|0|TGAACCTT|0 part.

Edit: the cut command probably has to be:

cut -d " " -f 3
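
A minimal sketch that combines the two ideas in this thread (keep only header lines, then count), assuming the barcode is the third whitespace-separated field of the header, as in the record posted above. The function name count_barcodes is made up for illustration, and gzip -dc is used here as a portable stand-in for zcat.

```shell
# Hypothetical helper: list each barcode in a gzipped FASTQ with its read count.
# Assumption: the barcode is the 3rd whitespace-separated field of the header,
# e.g. TTCTGCCT|0|TGAACCTT|0 in the example record from this thread.
# NR % 4 == 1 keeps only the first line of every 4-line record (the header),
# so quality lines that happen to start with @ are never mistaken for headers.
count_barcodes() {
    gzip -dc "$1" |
        awk 'NR % 4 == 1 { print $3 }' |
        sort |
        uniq -c
}

# Usage (file name is a placeholder):
#   count_barcodes file.fastq.gz > file.barcodes.txt
```

Each output line is a read count followed by the barcode string. If the header layout differs, only the $3 in the awk program should need changing.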
6.4 years ago

I suppose your headers look like this:

@7001367R:585:HNHVHBCXY:1:1102:1267:2073 1:N:0:ATTACTCG+TATAGCCT

where HNHVHBCXY is your barcode. If your fastq files are gzip-compressed, do the following:

 zcat file.fastq.gz | awk 'NR%4==1' | cut -d ':' -f3 | sort | uniq



I'm pretty sure that is not the barcode (though I might be mistaken); according to Illumina, that's the flowcell ID. In principle the approach will still work if you change the field number:

http://support.illumina.com/content/dam/illumina-support/help/BaseSpaceHelp_v2/Content/Vault/Informatics/Sequencing_Analysis/BS/swSEQ_mBS_FASTQFiles.htm
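
For headers in the standard Illumina layout, where the index sequence is the last colon-separated item of the second space-separated field (for example 1:N:0:ATTACTCG+TATAGCCT), a sketch could look like the following. The name list_illumina_index is hypothetical, and the header produced by your own pipeline may not follow this layout, so check a few header lines first.

```shell
# Hypothetical helper: print the distinct index sequences in a gzipped FASTQ.
# Assumption: headers follow the standard Illumina layout, i.e.
#   @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
# so the index is the last ":"-separated item of the 2nd whitespace field.
list_illumina_index() {
    gzip -dc "$1" |
        awk 'NR % 4 == 1 { n = split($2, part, ":"); print part[n] }' |
        sort -u
}

# Usage (file name is a placeholder):
#   list_illumina_index file.fastq.gz
```

Taking the last item of the second field, instead of a fixed field number counted from the start of the line, avoids picking up the flowcell ID by mistake.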

6.4 years ago
drkennetz ▴ 560

When you say a list, my mind immediately goes to Python, but you haven't tagged it, so here is an awk solution:

zcat file.fastq.gz | awk 'NR % 4 == 1' | awk -F ':' '{print $11}'

Your fastq header line is a little unusual, but this keeps the first line of every four-line record (header, sequence, +, quality), which is the header (the NR % 4 == 1 bit). It then sets the field separator to : and prints the field holding the index info ($11 for the header you posted), though you will also pick up some extra text around the indexes. You will definitely get indexes!
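
Since the goal is one list per file (sample), a sketch along these lines prints a tab-separated table of sample, barcode and read count for any number of files. It assumes the barcode is the third whitespace-separated header field, as in the record posted above, and that file names end in .fastq.gz; barcode_table is a made-up name.

```shell
# Hypothetical helper: print "sample<TAB>barcode<TAB>count" for each file given.
# Assumptions: files are gzipped FASTQ named <sample>.fastq.gz, and the barcode
# is the 3rd whitespace-separated field of each header line.
barcode_table() {
    for fq in "$@"; do
        sample=$(basename "$fq" .fastq.gz)
        gzip -dc "$fq" |
            awk -v s="$sample" '
                NR % 4 == 1 { seen[$3]++ }    # count barcodes on header lines only
                END { for (b in seen) printf "%s\t%s\t%d\n", s, b, seen[b] }'
    done | sort
}

# Usage (file names are placeholders):
#   barcode_table sampleA.fastq.gz sampleB.fastq.gz > barcodes_by_sample.tsv
```

Counting inside awk avoids sorting every header line, which can matter for large files; only the short summary table is sorted at the end.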
