I have Illumina paired-end microRNA-seq data (which I did not generate) that I need to analyze. I understand that with microRNA, extracting the UMIs is important because the molecules are so short. I have no information about these data apart from the experimental conditions and the FASTQ files. Looking at the FASTQ files, I am not sure whether UMIs were added.
Here is an example of my data:
more .R1.fastq
@A00124:542:H27NNDSX5:1:1101:19262:1000 1:N:0:CCTCTAAGTA+ACTGTAACGA
NATTAGGGGAGATTTCAACTGTAGGCACCATCAATATTGGATCGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCTCTAAGTAATCTCGTATG
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFF
@A00124:542:H27NNDSX5:1:1101:20039:1000 1:N:0:CCTCTAAGTA+ACTGTAACGA
NTACGTCGAGGATTACCAGCTTGTCAAACTGTAGGCACCATCAATTGCTTGTACTGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCTCTAAGTA
more .R2.fastq
@A00124:542:H27NNDSX5:1:1101:19262:1000 2:N:0:CCTCTAAGTA+ACTGTAACGA
ACACGATCCAATATTGATGGTGCCTACAGTTGAAATCTCCCCTAATAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACTGTAACGAGTGTAGATCTC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF
@A00124:542:H27NNDSX5:1:1101:20039:1000 2:N:0:CCTCTAAGTA+ACTGTAACGA
TCAGTACAAGCAATTGATGGTGCCTACAGTTTGACAAGCTGGTAATCCTCGACGTAAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACTGTAACGAG
Each read in R1.fastq contains the 3' adapter sequence (AACTGTAGGCACCATCAAT) and the Illumina universal adapter (AGATCGGAAGAG), separated by 12 seemingly random nucleotides. The R2.fastq reads do not appear to contain these sequences. Based on this, it seems to me that the R1.fastq reads are barcoded with a UMI and the R2.fastq reads are not. Is this right?
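One quick way to test this layout is to split each R1 read at the 3' adapter and compare the 12 nt that follow with the start of its mate in R2. The sketch below is only a sanity check on the two example pairs shown above. The helper names (`revcomp`, `split_r1`) are made up for illustration, and exact adapter matching is assumed.

```python
import re

ADAPTER_3P = "AACTGTAGGCACCATCAAT"  # 3' adapter quoted in the post
ILLUMINA = "AGATCGGAAGAG"           # start of the Illumina universal adapter
UMI_LEN = 12                        # assumed UMI length (from the read layout)


def revcomp(seq):
    """Reverse complement of a DNA string (N stays N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def split_r1(read):
    """Return (insert, umi) if the read looks like
    insert + 3' adapter + 12 nt + Illumina adapter, else None."""
    pattern = "(.+?)" + ADAPTER_3P + "(.{%d})" % UMI_LEN + ILLUMINA
    m = re.match(pattern, read)
    return (m.group(1), m.group(2)) if m else None


# The two example read pairs from the post (R1, R2)
pairs = [
    ("NATTAGGGGAGATTTCAACTGTAGGCACCATCAATATTGGATCGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCTCTAAGTAATCTCGTATG",
     "ACACGATCCAATATTGATGGTGCCTACAGTTGAAATCTCCCCTAATAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACTGTAACGAGTGTAGATCTC"),
    ("NTACGTCGAGGATTACCAGCTTGTCAAACTGTAGGCACCATCAATTGCTTGTACTGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCTCTAAGTA",
     "TCAGTACAAGCAATTGATGGTGCCTACAGTTTGACAAGCTGGTAATCCTCGACGTAAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACTGTAACGAG"),
]

results = []
for r1, r2 in pairs:
    insert, umi = split_r1(r1)
    # Does R2 begin with revcomp(UMI) followed by revcomp(3' adapter)?
    r2_has_umi = r2.startswith(revcomp(umi) + revcomp(ADAPTER_3P))
    results.append((insert, umi, r2_has_umi))
    print(insert, umi, r2_has_umi)
```

For these two pairs the check comes out true: R2 starts with the reverse complement of the 12 nt stretch, followed by the reverse complement of the 3' adapter. So the same information seems to be present in R2, just read from the opposite strand. Two reads are not enough to generalize, so it is worth running something like this over a few thousand pairs.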
I tried to extract the UMIs from these paired-end reads using UMI-tools with this command:
umi_tools extract --extract-method=regex --stdin ./Gff.R1.fastq \
--bc-pattern='.+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=2}(?P<umi_1>.{12})(?P<discard_2>.+)' --read2-in=./Gff.R2.fastq \
--stdout ./output/Gff_UMIextracted.R1.fastq --read2-out=./output/Gff_UMIextracted.R2.fastq \
--log ./output/Gff_UMIextracted.log
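To see what this pattern keeps and what it throws away, here is a rough Python emulation on the first example read. It is a sketch, not what UMI-tools runs internally. UMI-tools uses the third-party `regex` module, which understands the fuzzy `{s<=2}` syntax, whereas the standard-library `re` used here does not, so the mismatch allowance is dropped. As I understand the regex extraction method, groups named `umi_*` are moved to the read name, groups named `discard_*` are removed, and bases outside any named group stay in the read. The helper `emulate_extract` is hypothetical.

```python
import re

# Same pattern as in the command, without the fuzzy {s<=2} (exact match only)
PATTERN = re.compile(
    r".+(?P<discard_1>AACTGTAGGCACCATCAAT)(?P<umi_1>.{12})(?P<discard_2>.+)"
)


def emulate_extract(header, seq):
    """Mimic the effect of the pattern on one R1 record (header and sequence).

    Returns (new_header, kept_sequence), or None if the pattern does not match.
    """
    m = PATTERN.match(seq)
    if m is None:
        return None  # pattern did not match this read
    umi = m.group("umi_1")
    # Everything before discard_1 is outside any named group, so it is kept
    kept = seq[:m.start("discard_1")]
    # Append the UMI to the read ID (the part of the header before the first space)
    read_id, sep, rest = header.partition(" ")
    return read_id + "_" + umi + sep + rest, kept


header = "@A00124:542:H27NNDSX5:1:1101:19262:1000 1:N:0:CCTCTAAGTA+ACTGTAACGA"
seq = "NATTAGGGGAGATTTCAACTGTAGGCACCATCAATATTGGATCGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCTCTAAGTAATCTCGTATG"

new_header, kept = emulate_extract(header, seq)
print(new_header)
print(kept)
```

The pattern only describes read 1, so nothing in it tells the tool how to trim read 2.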
The command ran successfully, and after processing I got the results below.
more R1.fastq
@A00124:542:H27NNDSX5:1:1101:19262:1000_ATTGGATCGTGT 1:N:0:CCTCTAAGTA+ACTGTAACGA
NATTAGGGGAGATTTC
+
#FFFFFFFFFFFFFFF
@A00124:542:H27NNDSX5:1:1101:20039:1000_TGCTTGTACTGA 1:N:0:CCTCTAAGTA+ACTGTAACGA
NTACGTCGAGGATTACCAGCTTGTCA
more R2.fastq
@A00124:542:H27NNDSX5:1:1101:19262:1000_ATTGGATCGTGT 2:N:0:CCTCTAAGTA+ACTGTAACGA
ACACGATCCAATATTGATGGTGCCTACAGTTGAAATCTCCCCTAATAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACTGTAACGAGTGTAGATCTC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF
@A00124:542:H27NNDSX5:1:1101:20039:1000_TGCTTGTACTGA 2:N:0:CCTCTAAGTA+ACTGTAACGA
TCAGTACAAGCAATTGATGGTGCCTACAGTTTGACAAGCTGGTAATCCTCGACGTAAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACTGTAACGAG
The reads in the R1.fastq file have changed, but nothing appears to have changed in the R2.fastq file.
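A simple way to confirm what did and did not change in R2 is to compare a record before and after extraction, field by field. The sketch below does this for the first R2 record shown above, using the header and sequence only. The function name `changed_fields` is made up for illustration.

```python
def changed_fields(before, after):
    """Return the names of the fields that differ between two
    (header, sequence) tuples taken from a FASTQ record."""
    names = ("header", "sequence")
    return [n for n, b, a in zip(names, before, after) if b != a]


# First R2 record before and after umi_tools extract (copied from the post)
r2_before = (
    "@A00124:542:H27NNDSX5:1:1101:19262:1000 2:N:0:CCTCTAAGTA+ACTGTAACGA",
    "ACACGATCCAATATTGATGGTGCCTACAGTTGAAATCTCCCCTAATAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACTGTAACGAGTGTAGATCTC",
)
r2_after = (
    "@A00124:542:H27NNDSX5:1:1101:19262:1000_ATTGGATCGTGT 2:N:0:CCTCTAAGTA+ACTGTAACGA",
    "ACACGATCCAATATTGATGGTGCCTACAGTTGAAATCTCCCCTAATAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACTGTAACGAGTGTAGATCTC",
)

diff = changed_fields(r2_before, r2_after)
print(diff)
```

For this record only the header differs: the UMI taken from R1 has been appended to the read name, while the R2 sequence itself is untouched.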
I am wondering whether what I did is correct. I would appreciate it if someone could confirm or correct this.
Thank you.
It helps to know which kit was used to generate these libraries. Can you look at the methods section or ask your colleague? The kit vendor will often have guidelines for preprocessing the sequencing data.
Well, the person who was in charge of this project left the lab a long time ago and is not available.
No documentation, lab notebooks, anything?