Hello everyone,
I have a question regarding my single-cell RNA-seq data. I have the following paired-end data in fastq.gz format.
Read1 (contains 6bp UMI, followed by 6bp cell barcode info and the rest is a polyT stretch):
@J00182:79:HV2WWBBXX:6:1101:11160:38873 1:N:0:ACAGTG
GAGAAGACAGTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTT
+
AAAFFJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJ--A-----7AJJFAF-AJJJJJJJJJJJ<A<----A
Read2 is the normal read that I use for mapping to the reference. The corresponding paired-end mate of the read above looks like this:
@J00182:79:HV2WWBBXX:6:1101:11160:38873 2:N:0:ACAGTG
GCATACTTATTTCCAAACTTTTGGAAAAAGCATAATTTGACAAAAAAGAATACAATTTTTTGCTGTTTCAACCAC
+
A<<AFJFJJJJJJFJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
Now I would like to prepend the cell barcode and UMI from the Read1 sequence to the header of Read2, in the following format: @6bpCellbarcode_6bpUMI#Read2header
(with an underscore between the cell barcode and the UMI, and a hash between the UMI and the rest of the header).
Example output:
@ACAGTG_GAGAAG#J00182:79:HV2WWBBXX:6:1101:11160:38873 2:N:0:ACAGTG
GCATACTTATTTCCAAACTTTTGGAAAAAGCATAATTTGACAAAAAAGAATACAATTTTTTGCTGTTTCAACCAC
+
A<<AFJFJJJJJJFJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
ACAGTG is the cell barcode and GAGAAG is the UMI. Note that the order is flipped in the output: Read1 contains the UMI first and then the cell barcode, while the output I need has the cell barcode first and then the UMI.
Can someone please tell me how to do that?
As usual, thank you so much!
fastq.gz: is that two FASTQ files or one interleaved file?
Interleaved fastq files are an atrocity against nature.
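For the two-file case, the header rewrite described in the question could be sketched in Python roughly as follows. This is only a minimal sketch under a few assumptions: Read1 and Read2 are in separate gzipped files with records in the same order, every record spans exactly four lines, and the UMI occupies bases 1-6 of Read1 with the cell barcode in bases 7-12. The function names (`read_fastq`, `tag_read2`, `tag_files`) and the command-line file arguments are made up for illustration.

```python
import gzip
import sys
from itertools import islice

UMI_LEN = 6      # assumption from the question: first 6 bases of Read1
BARCODE_LEN = 6  # assumption from the question: next 6 bases of Read1


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a line iterator."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(record)


def tag_read2(r1, r2):
    """Return the Read2 record with @<barcode>_<UMI># prepended to its header."""
    h1, seq1, _, _ = r1
    h2, seq2, plus2, qual2 = r2
    # Mates are assumed to share the read name before the first whitespace
    # (Illumina-style headers as in the question; older /1 and /2 suffixes
    # would need a different check).
    if h1.split()[0] != h2.split()[0]:
        raise ValueError(f"mate mismatch: {h1} vs {h2}")
    umi = seq1[:UMI_LEN]
    barcode = seq1[UMI_LEN:UMI_LEN + BARCODE_LEN]
    # h2[1:] drops the leading '@' so it is not repeated after the '#'.
    return f"@{barcode}_{umi}#{h2[1:]}", seq2, plus2, qual2


def tag_files(r1_path, r2_path, out_path):
    """Write a gzipped copy of Read2 with barcode and UMI in each header."""
    with gzip.open(r1_path, "rt") as f1, \
         gzip.open(r2_path, "rt") as f2, \
         gzip.open(out_path, "wt") as out:
        for r1, r2 in zip(read_fastq(f1), read_fastq(f2)):
            out.write("\n".join(tag_read2(r1, r2)) + "\n")


if __name__ == "__main__":
    # Hypothetical usage: python tag_read2.py R1.fastq.gz R2.fastq.gz out.fastq.gz
    tag_files(sys.argv[1], sys.argv[2], sys.argv[3])
```

With the example pair from the question, `tag_read2` should give the header `@ACAGTG_GAGAAG#J00182:79:HV2WWBBXX:6:1101:11160:38873 2:N:0:ACAGTG`, with the sequence and quality lines of Read2 left untouched.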
Hi both of you! UMI tools worked perfectly! Thanks for the parse code too. It saved me some time to write my own! Thanks again!