Generating a gene expression matrix from Picard *duplicates_marked.bam file

6.4 years ago

halo22 ▴ 300

Hello All,

I am new to RNA-seq analysis and am trying to work out the best way to generate a gene expression matrix from Picard output. I received the RNA dataset from a collaborator whose internal pipeline performs primary QC, alignment with STAR, and sequence manipulation with Picard; each sample well is processed separately (N=384). I don't have access to the original aligned BAM files. Instead, I was given BAM files in which duplicates have been marked with the Picard tool UmiAwareMarkDuplicatesWithMateCigar; these files are named *.aligned.duplicates_marked.bam. What would be the best way to generate a gene expression matrix from these files for further analysis? Should I remove the duplicates with "samtools view -b -F 0x400 mytest.bam > mytest_removed.bam" and then process the new files with featureCounts or htseq?
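For reference, this is my understanding of what the 0x400 filter does, written as a minimal sketch rather than anything from the collaborator's pipeline. Bit 0x400 (decimal 1024) of the SAM flag marks a read as a PCR or optical duplicate, and "-F 0x400" drops every read that has that bit set. The function names and the example flag values below are made up for illustration.

```shell
#!/bin/sh
# SAM flag bit 0x400 (decimal 1024) = "PCR or optical duplicate".

# Succeeds (exit status 0) if the decimal SAM flag in $1 has the duplicate bit set.
# is_duplicate is a hypothetical helper name, not part of samtools.
is_duplicate() {
    [ $(( $1 & 1024 )) -ne 0 ]
}

# Mirrors what `samtools view -F 0x400` does for a single read:
# the read is kept only when the duplicate bit is NOT set.
kept_by_filter() {
    ! is_duplicate "$1"
}

# Illustrative flags: 99 and 147 are a typical properly paired read and its mate;
# 1123 and 1171 are the same two flags with the duplicate bit (1024) added.
for flag in 99 147 1123 1171; do
    if kept_by_filter "$flag"; then
        echo "$flag kept"
    else
        echo "$flag dropped (duplicate)"
    fi
done
```

So the samtools command above would not change anything about the reads that remain; it only discards the ones Picard flagged.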

Appreciate all help and suggestions.

rna-seq • 1.4k views

In general, removing duplicates in RNA-seq is neither recommended nor necessary; see "Markduplicates in RNASEQ" and https://bioinformatics.stackexchange.com/questions/2282/removing-pcr-duplicates-in-rna-seq-analysis

That said, since your libraries appear to have UMIs, I am not sure how Picard handled those when marking duplicates.
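If you do decide to count the duplicate-marked BAMs as they are, a minimal sketch could look like the one below. featureCounts counts duplicate-flagged reads unless you pass --ignoreDup, so leaving that option out keeps the duplicates. The file names genes.gtf and gene_counts.txt, the thread count, and the count_genes wrapper are placeholders I made up. I have also left out -p (paired-end) and -s (strandedness) because I don't know how your libraries were prepared; check both before trusting the counts.

```shell
#!/bin/sh
# Hypothetical inputs: adjust to the real annotation and output paths.
GTF="${GTF:-genes.gtf}"
OUT="${OUT:-gene_counts.txt}"

# Dry run by default: the command is printed instead of executed.
# Set RUN="" in the environment to actually run featureCounts.
RUN="${RUN-echo}"

# count_genes is a hypothetical wrapper; its arguments are the BAM files.
# Duplicate-flagged reads are counted because --ignoreDup is deliberately
# NOT passed. Add -p and/or -s here if the libraries need them.
count_genes() {
    $RUN featureCounts -T 4 -a "$GTF" -o "$OUT" "$@"
}

# One call over all wells gives one table with a count column per BAM file.
count_genes ./*.aligned.duplicates_marked.bam
```

The output table has one row per gene and one count column per input BAM (after a few annotation columns), which is the expression matrix you are after.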

6.4 years ago by GenoMax 148k