I am using STAR to map ChIP-seq paired-end reads and then Picard MarkDuplicates to remove duplicates. My problem is getting Picard to recognize the read names correctly. Here is an example of a read in one of the paired-end fastq files:
@SRR1463165.1 HWI-ST740:1:D0TMMACXX:5:1101:1162:2049/1
NATTNNAAAAGAATCACTAAGAGTTTTACAAAATTGGTTTTTAAAATGTTA
+
#089##2<985=8?<<<>>?<<@;:>8;>??<@?<8>=<??9??=???)=?
After mapping with STAR, the reads in my BAM file look like this:
SRR1463165.62872900 99 1 10060 60 51M = 10355 346 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAA 11:A+=A?DD?DD;C@EEDE39;<CC?B>E8:?)???:)9??@B9;;;B## NH:i:1 HI:i:1 AS:i:98 nM:i:1 RG:Z:CXH
The problem arises when I try to remove duplicates. I get this warning message:
Default READ_NAME_REGEX '[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*' did not match read name 'SRR1463164.80376006'.
What STAR parameter do I change to output the required information in the read name? Picard requires that the read name contain three variables (tile/region, x coordinate, y coordinate). STAR has an --outSAMreadID parameter, but its options don't allow me to customize a read name that's appropriate for Picard. Here is the current STAR command I'm running:
STAR --runThreadN 40 \
--genomeDir star_index \
--readFilesCommand gzip -cd \
--readFilesIn ${data}_1.fastq.gz ${data}_2.fastq.gz \
--outFilterMultimapNmax 1 \
--outFilterMismatchNmax 5 \
--alignIntronMax 1 \
--alignEndsType EndToEnd \
--outSAMmapqUnique 60 \
--outSAMattrRGline ID:CXH SM:sample \
--outSAMtype BAM SortedByCoordinate \
--outStd BAM_SortedByCoordinate > ${data}.bam
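To make the mismatch concrete, here is a minimal Python sketch that tests a few read names against the regex quoted in the warning. The helper name `parse_location` is hypothetical, and the sketch assumes the pattern has to match the whole read name; Picard's actual matching logic may differ between versions. The five-field name in the second call is made up from the tail of the FASTQ header above, purely for illustration.

```python
import re

# Regex copied verbatim from the Picard warning message.
DEFAULT_READ_NAME_REGEX = r"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*"


def parse_location(read_name, pattern=DEFAULT_READ_NAME_REGEX):
    """Return (tile, x, y) as ints if the whole read name matches, else None.

    Hypothetical helper; whole-string matching is an assumption.
    """
    match = re.fullmatch(pattern, read_name)
    if match is None:
        return None
    tile, x, y = (int(group) for group in match.groups())
    return tile, x, y


# Read name from the warning: it has no colon-separated fields.
print(parse_location("SRR1463164.80376006"))  # None

# Made-up five-field name taken from the tail of the FASTQ header.
print(parse_location("D0TMMACXX:5:1101:1162:2049"))  # (1101, 1162, 2049)

# Full instrument name from the FASTQ header: under whole-string matching
# the hyphen in "HWI-ST740" already falls outside [a-zA-Z0-9]+.
print(parse_location("HWI-ST740:1:D0TMMACXX:5:1101:1162:2049/1"))  # None
```

Under these assumptions, the SRR-style name can never supply tile/x/y, and even the full instrument name would need a different pattern than the default shown in the warning.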
Thanks for the follow-up.
Brilliant work! I'm having exactly the same trouble described here. It would be very kind of you to share the code with me. Many thanks in advance!