I have single-end 75 bp miRNA reads with UMIs (Qiagen miRNA kit).
The FastQC report shows a high peak at 83-84 bp and the Illumina universal adapter.
After removing the adapter (5'-3': AACTGTAGGCACCATCAAT) and discarding reads shorter than 17 bp with cutadapt, the sequence length distribution peaks at 22-23 bp.
I know that miRNAs should be around 18-22 nt and the UMI is 12 nt long. Doesn't that mean I should see a peak around 30-34?
The code that I used was:
cutadapt -a AACTGTAGGCACCATCAAT --minimum-length 17 -o tri.fastq sample.fastq
I believe not. The head of the FASTQ file is as follows:
@NB551007:45:HNKVLBGX5:1:11101:18335:1071 1:N:0:GCCAAT
CTGGANGCGAGCCAACTGTAGGCACCATCAATNCCGTGCCCTCNAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAAT
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEE#EAEEEEEEEE#EEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB551007:45:HNKVLBGX5:1:11101:5844:1072 1:N:0:GCCAAT
CTGTANGCACCATCAATCGACGTGAACAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGT
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEAEEE< AE/AEEEEEEEEEEEEE
@NB551007:45:HNKVLBGX5:1:11101:23470:1072 1:N:0:GCCAAT
CGTGGNGAGGAACAATTCTGAGAACTGTAGGCACCATCAATGAACTCGAACCCAGATCGGAAGAGCACACGTCTGAACTCCAGT
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEE
@NB551007:45:HNKVLBGX5:1:11101:12496:1074 1:N:0:GCCAAT
TCGCTNCGATCTATTGAAAGTCGGCCCTCGACACAAGGGTTTGTAACTGTAGGCACCATCAATTCCCTTATTGCCAGATCGGAA
+
AAAAA#EEAEAEEEEE6EEEEE/EEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEE
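To make the read layout easier to see, here is a minimal Python sketch that splits a read at the first exact match of the adapter and reports what follows it. It assumes, based only on the reads above, that the 12 nt right after the adapter are the UMI and that the remainder is the Illumina adapter. `split_read` is a hypothetical helper for illustration, not part of cutadapt or UMI-tools, and unlike cutadapt it tolerates no mismatches.

```python
ADAPTER = "AACTGTAGGCACCATCAAT"  # adapter from the cutadapt command
UMI_LEN = 12                     # assumed UMI length


def split_read(seq, adapter=ADAPTER, umi_len=UMI_LEN):
    """Split a read into (insert, umi, rest) at the first exact adapter match.

    Returns None when the adapter is not found verbatim (e.g. an N inside it).
    """
    pos = seq.find(adapter)
    if pos == -1:
        return None
    after = pos + len(adapter)
    return seq[:pos], seq[after:after + umi_len], seq[after + umi_len:]


# First and third reads from the head of the FASTQ file
reads = [
    "CTGGANGCGAGCCAACTGTAGGCACCATCAATNCCGTGCCCTCNAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAAT",
    "CGTGGNGAGGAACAATTCTGAGAACTGTAGGCACCATCAATGAACTCGAACCCAGATCGGAAGAGCACACGTCTGAACTCCAGT",
]
for seq in reads:
    insert, umi, rest = split_read(seq)
    # Print insert length, insert, candidate UMI, and the start of what follows
    print(len(insert), insert, umi, rest[:13])
```

In both reads the piece after the candidate UMI starts with AGATCGGAAGAGC. Since `-a` in cutadapt removes the adapter and everything after it, under this assumed layout the trimmed reads would contain only the insert, without the UMI.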
In the downstream analysis I want to use UMI-tools for deduplication, so I need the UMI sequence in the read name. From what I have found, it looks like I can use fastp to remove the UMI from the read and move it to the read name.
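As a rough illustration of what "moving the UMI to the read name" would mean for my reads, the sketch below rewrites one FASTQ record so that the read ID ends in `_<UMI>`, which is the separator UMI-tools uses by default. It assumes the UMI is the 12 nt immediately after the adapter and uses exact matching only. `move_umi_to_name` is a hypothetical helper, not a fastp or UMI-tools function.

```python
ADAPTER = "AACTGTAGGCACCATCAAT"  # adapter from the cutadapt command
UMI_LEN = 12                     # assumed UMI length


def move_umi_to_name(header, seq, qual, adapter=ADAPTER, umi_len=UMI_LEN):
    """Return (header, seq, qual) with the UMI appended to the read ID.

    Returns None if the adapter or a full-length UMI is not found.
    """
    pos = seq.find(adapter)
    if pos == -1:
        return None
    start = pos + len(adapter)
    umi = seq[start:start + umi_len]
    if len(umi) < umi_len:
        return None
    # Only the read ID (text before the first space) gets the UMI suffix.
    read_id, _, comment = header.partition(" ")
    new_header = f"{read_id}_{umi}" + (f" {comment}" if comment else "")
    # Keep only the insert, with the matching slice of the quality string.
    return new_header, seq[:pos], qual[:pos]


# Third record from the head of the FASTQ file
record = move_umi_to_name(
    "@NB551007:45:HNKVLBGX5:1:11101:23470:1072 1:N:0:GCCAAT",
    "CGTGGNGAGGAACAATTCTGAGAACTGTAGGCACCATCAATGAACTCGAACCCAGATCGGAAGAGCACACGTCTGAACTCCAGT",
    "AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEE",
)
print("\n".join([record[0], record[1], "+", record[2]]))
```

The second read in my file would return None here because its adapter copy is truncated and contains an N, so a real tool with error-tolerant matching is still needed for the actual data.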
Now my question is: once I have done that, when trimming with cutadapt, should I also remove reads longer than, say, 40 bp, and keep only reads of 17-40 bp?
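For reference, the 17-40 window I have in mind would behave like the sketch below, which keeps sequences whose length falls in a closed range and tallies the lengths that survive. `filter_by_length` is a hypothetical helper for illustration only. In cutadapt itself I assume this would correspond to combining `--minimum-length 17` with `--maximum-length 40`.

```python
from collections import Counter


def filter_by_length(seqs, min_len=17, max_len=40):
    """Keep sequences with min_len <= length <= max_len."""
    return [s for s in seqs if min_len <= len(s) <= max_len]


# Inserts from three of the reads shown above, i.e. the part before the adapter
trimmed = [
    "CTGGANGCGAGCC",                                 # 13 nt, below the window
    "CGTGGNGAGGAACAATTCTGAG",                        # 22 nt, inside the window
    "TCGCTNCGATCTATTGAAAGTCGGCCCTCGACACAAGGGTTTGT",  # 44 nt, above the window
]
kept = filter_by_length(trimmed)
# Tally of read lengths that pass the filter
print(Counter(len(s) for s in kept))
```

With these three example inserts only the 22 nt one is kept.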