Parsing Fastq Files
Hi all,
I have FASTQ reads that look like this:
@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918 1:N:0:CGATGT
NACCCTAGAAATTATAAATCTCTTCAAGTGAGATTGTAAGGAGAAGGAGAAACTTGGTCTGGAATTTGTTATAAAAGCACTT
+
#1=DDFFFHHGGHIJJJJJIJJJJJJJJCHGHIIJJEFHIJIJJIIJIIIIJHHIJJFHIIJJJJJJJIJIJIJIIJHEHHHHFFFFFFEEEDEEEDCDDC
I aligned this FASTQ file to a reference genome using Bowtie. How can I identify the sample name from this record?
I have demultiplexed FASTQ files for each sample, and I also have a barcode information file in this format:
Sample name    Index sequence
BC1            CGATGT
BC2            CGATGA
When I try to retrieve the alignment information using $sam->features(), the seqID is returned as
@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918
How can I get the 1:N:0:CGATGT part from the alignment information?
Thanks,
Deeps
I'd suggest that you use SAM Read Groups to track samples. This would be done at the alignment stage....
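As a rough illustration of what read groups look like inside a SAM file, the sketch below builds an `@RG` header line and the matching per-read `RG:Z:` tag for one sample. The function names are hypothetical, and reusing the sample name as the read-group ID is an assumption; how these values are passed to the aligner depends on the aligner's own options.

```python
def rg_header_line(sample, barcode):
    """Build a SAM @RG header line.

    ID is the read-group identifier, SM the sample name and BC the
    barcode sequence. Using the sample name as the ID is an assumption.
    """
    return "\t".join(["@RG", f"ID:{sample}", f"SM:{sample}", f"BC:{barcode}"])


def rg_read_tag(sample):
    """Build the optional field that ties an alignment to its read group."""
    return f"RG:Z:{sample}"


print(rg_header_line("BC1", "CGATGT"))
print(rg_read_tag("BC1"))
```

With one read group per demultiplexed file, the sample can be read straight from each alignment's `RG` tag without touching the read name.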
If you want to keep the barcode in the SAM file, you can replace the space between the main header and the barcode section with a non-space character, so that
@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918 1:N:0:CGATGT
becomes
@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918:1:N:0:CGATGT
Here I used a colon (":"), so when you parse this header you can split on the colon and take the last field to get the barcode. In Python:
header="@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918:1:N:0:CGATGT"
barcode=header.rstrip("\n").split(":")[-1]
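Once the barcode is extracted, it still has to be matched to a sample name. A minimal sketch of that lookup follows, assuming the barcode file is whitespace-separated with the sample name first and the index sequence second, as in the question. `load_barcodes` and `sample_for_read` are hypothetical helper names, and the table is inlined here only to keep the example self-contained.

```python
BARCODE_TABLE = """\
sample name Index sequence
BC1 CGATGT
BC2 CGATGA
"""


def load_barcodes(text):
    """Return {index_sequence: sample_name} from whitespace-separated lines."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank lines and the multi-word column header
        sample, index_seq = fields
        mapping[index_seq] = sample
    return mapping


def sample_for_read(read_name, barcodes):
    """Take the last colon-separated field as the barcode and look it up."""
    barcode = read_name.rstrip("\n").split(":")[-1]
    return barcodes.get(barcode)  # None if the barcode is not in the table


barcodes = load_barcodes(BARCODE_TABLE)
name = "HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918:1:N:0:CGATGT"
print(sample_for_read(name, barcodes))  # BC1
```

Keying the dictionary on the index sequence keeps the per-read lookup constant-time, which matters when iterating over millions of alignments.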
Most mappers, e.g. BWA or Bowtie, truncate the read name at the first space, which is why the barcode section is missing from your alignments.
So if you preprocess your FASTQ file into this new format, you will save a lot of time. Otherwise, if you cannot modify the FASTQ reads, you can open the original FASTQ file and the SAM file at the same time, match the records between them, and parse the barcode out of the FASTQ header.
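Both options can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every FASTQ record is exactly four lines (no wrapped sequences), and the fallback matches records by read name rather than by line number, which avoids problems if the mapper drops or reorders reads. All function names are hypothetical.

```python
import io


def join_header(header, sep=":"):
    """Replace the first run of whitespace in a FASTQ header with sep."""
    parts = header.rstrip("\n").split(None, 1)
    return sep.join(parts)


def rewrite_fastq(src, dst, sep=":"):
    """Copy 4-line FASTQ records from src to dst, joining each header line."""
    for i, line in enumerate(src):
        if i % 4 == 0:  # header line of each record
            dst.write(join_header(line, sep) + "\n")
        else:
            dst.write(line)


def barcodes_by_read_name(src):
    """Fallback: map the truncated read name (without '@') to its barcode."""
    lookup = {}
    for i, line in enumerate(src):
        if i % 4 == 0:
            name, _, comment = line.rstrip("\n").partition(" ")
            lookup[name[1:]] = comment.split(":")[-1]
    return lookup


record = (
    "@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918 1:N:0:CGATGT\n"
    "NACC\n"
    "+\n"
    "#1=D\n"
)

out = io.StringIO()
rewrite_fastq(io.StringIO(record), out)
print(out.getvalue().splitlines()[0])

lookup = barcodes_by_read_name(io.StringIO(record))
print(lookup["HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918"])
```

For real data, `src` and `dst` would be open file handles instead of `io.StringIO` objects. Note that the fallback dictionary holds every read name in memory, so rewriting the headers up front is usually the lighter option for large files.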
Good suggestion. It helped me a lot.