Hi Biostars!
I am processing PolyA site sequenced with long read data (both cDNA ONT, PacBio) using minimap2. I am asking you the following questions as I want to make sure I am doing the right thing when processing them.
I would like to align polyA site sequenced with LR using minimap2. Minimap has a lot of different flags. What parameters should I use for minimap for both ont and pacbio? I think splice:hq is used for pacbio and splice is used for ont but to your knowledge, what is a good parameter to run minimap given fastq file?
After minimap, is there some kind of parameter I should be checking about statistics for the sample to make sure minimap is doing the right thing? If it’s doing the right thing, my guess is minimap throws out anything fishy. How do I know minimap is doing good or bad? Is there any stats I should plot to see that? I am generally asking about what quality check needs to be done for long read and what kind of stats would be good to see and plot to make sure minimap2 is doing the right thing!
Thank you!
Can you clarify what this means exactly?
It's just long read fastq files focused on 3' end instead of full isoform to see different PolyA sites
Sounds like you must have already tried aligning the data. What did the result look like? I would think that the poly-A's should be soft-clipped and data aligned normally even with the default parameters.
Oh I mean I wasn't sure which parameter is optimized for ONT and PacBio. I did not try them yet and only found the below command options are commonly used. Are these enough and are there any quality check required afterwards?
ONT minimap2 -ax splice
PacBio minimap2 -ax splice:hq -uf
Presets already defined in
minimap2
should be good to start with.If your data is of good quality then there should be no issues with alignment percentages. That would be the first thing to check. You can also check on coverage using a program like
pandepth
and a BED file for gene models.Thank you GenoMax!
As it's focused on polyA, using soft-clipped flag, k-mer 14 for ONT, k-mer 15 for PacBio would be good.
After minimap2, my file would be a bam file. Then, do you recommend checking MAPQ >= 30 and coverage using tool like samtools, pandepth? If I check the coverage, what is a good coverage? I see having MAPQ >= 30 shows reads are mapped uniquely or with high certainty to the reference genome but not sure what I can gain from coverage
Confidence that a particular site is real. If you see more than a certain number of reads piling up there.
Thank you GenoMax! For my specific data, if I want to capture PolyA site, does it make sense to group them by tissues so I don’t have to reprocess everything from scratch? I would guess the answer is yes, but would like to hear your opinion! I am using IsoQuant to process minimap2 output to discover & quantify isoforms