Hello All,
I am new to RNA-seq analysis and am trying to figure out the best way to generate a gene-expression matrix from Picard output. I received the RNA dataset from a collaborator whose internal pipeline does the primary QC, alignment with STAR, and sequence manipulation with Picard. Each sample well is processed separately (N=384). I don't have access to the original aligned BAM files. What I have are the BAM files after Picard's UmiAwareMarkDuplicatesWithMateCigar utility was used to mark duplicates; these files are labelled *.aligned.duplicates_marked.bam. What would be the best way to generate a gene expression matrix for further analysis from these files? Should I remove the duplicates using "samtools view -b -F 0x400 mytest.bam > mytest_removed.bam" and then process the new files with featureCounts or htseq?
Appreciate all help and suggestions.
In general, removing duplicates in RNA-seq is not recommended or necessary; see the thread "Markduplicates in RNASEQ" and https://bioinformatics.stackexchange.com/questions/2282/removing-pcr-duplicates-in-rna-seq-analysis
That said, since your libraries appear to have UMIs, I am not sure how Picard handled them when marking duplicates.
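If it helps, here is a rough sketch of how you could check how many reads are actually flagged and then count directly on the duplicate-marked BAMs. It assumes samtools and featureCounts (Subread) are on your PATH and that the libraries are paired-end and unstranded, which I am guessing and you should confirm with your collaborator. The function names, the output name counts.txt, and the annotation file are placeholders of mine, not anything from your pipeline.

```shell
#!/bin/sh
# Sketch only: assumes samtools and featureCounts (Subread) are installed.
# All file names below are placeholders.

# Succeeds if a SAM FLAG value has the duplicate bit set (0x400 = 1024).
# This is the bit that "samtools view -F 0x400" filters out.
is_duplicate() {
    [ $(( $1 & 1024 )) -ne 0 ]
}

# Print total and duplicate-flagged record counts for one BAM file.
dup_summary() {
    total=$(samtools view -c "$1")
    dups=$(samtools view -c -f 0x400 "$1")
    echo "$1 total=$total duplicates=$dups"
}

# Count fragments per gene on the duplicate-marked BAMs as they are.
# $1 = GTF annotation, remaining arguments = BAM files.
# -p / --countReadPairs assume paired-end data; --countReadPairs needs a
# recent Subread release. Add --ignoreDup to skip reads flagged as
# duplicates instead of writing filtered BAM files first.
count_genes() {
    gtf=$1
    shift
    featureCounts -T 4 -p --countReadPairs -a "$gtf" -o counts.txt "$@"
}
```

You would call these as, for example, "dup_summary mytest.bam" and "count_genes annotation.gtf *.aligned.duplicates_marked.bam". Running count_genes once as written and once with --ignoreDup lets you compare both matrices without creating a second set of BAM files.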