Generating a gene expression matrix from Picard *duplicates_marked.bam files
6.4 years ago
halo22 ▴ 300

Hello All,

I am new to RNA sequencing analysis and am trying to figure out the best way to generate a gene expression matrix from Picard output. I received the RNA dataset from a collaborator whose internal pipeline does the primary QC, alignment with STAR, and sequence manipulation with Picard. The samples are processed per sample well (N=384). I don't have access to the original aligned BAM files; I only have the BAM files that were processed with the Picard utility UmiAwareMarkDuplicatesWithMateCigar to mark duplicates. These files are labelled *.aligned.duplicates_marked.bam. What would be the best way to generate a gene expression matrix for further analysis from these files? Should I remove the duplicates using "samtools view -b -F 0x400 mytest.bam > mytest_removed.bam" and then process the new files with featureCounts or htseq?

Appreciate all help and suggestions.

rna-seq • 1.4k views

In general, removing duplicates in RNA-seq is not recommended or necessary. See: Markduplicates in RNASEQ and https://bioinformatics.stackexchange.com/questions/2282/removing-pcr-duplicates-in-rna-seq-analysis

That said, since your libraries appear to have UMIs, I am not sure how Picard handled those when marking duplicates.
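
For reference, the `samtools view -F 0x400` command in the question excludes every read whose SAM FLAG has the 0x400 (1024, "PCR or optical duplicate") bit set. Below is a minimal Python sketch of that bit test on SAM text lines, just to show what the filter would drop. The function names and the example reads are made up for illustration; this is not a replacement for samtools and says nothing about whether the marking in your files is UMI-aware.

```python
# SAM FLAG bit that duplicate-marking tools set on reads flagged as duplicates.
DUPLICATE_FLAG = 0x400  # decimal 1024


def is_marked_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def drop_marked_duplicates(sam_lines):
    """Yield header lines and alignments whose duplicate bit is NOT set.

    Mirrors the effect of an exclude filter on 0x400 for plain-text SAM lines.
    """
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines carry no FLAG field; pass them through unchanged.
            yield line
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if not is_marked_duplicate(flag):
            yield line


# Hypothetical records for illustration only.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*",
    "read2\t1024\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*",  # duplicate bit only
    "read3\t1040\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*",  # 1024 + 16 (reverse strand)
]

kept = list(drop_marked_duplicates(example))
print(len(kept))
```

Only the header and `read1` survive here, because `read2` and `read3` both carry the duplicate bit. Counting how many reads in each of your BAM files carry that bit is a cheap way to see how much data the filter would discard before you decide whether to apply it.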
