Question

Can I use soft-clipped RNA-seq reads to summarize gene counts?

5.2 years ago

tujuchuanli ▴ 130

Hi, all

I have mapped RNA-seq reads with STAR (human, hg19). Checking the output BAM file, I find many reads whose CIGAR strings contain soft clips, such as “13S90M47S”, “88M6S” and “7S86M”. Many of these reads nevertheless have very good flags, such as “99” and “147” or “83” and “163”, which indicate unique mapping.

My questions are:

1. Why are so many reads marked as soft-clipped? Is it related to the relatively low quality?
2. How do I get rid of these reads? I have tried trimming the 3' end of the reads by base quality using cutadapt with a quality cutoff of 20. However, it doesn't work.
3. Is it reasonable to use these uniquely mapped, soft-clipped reads to summarize gene counts?

Thanks

RNA-seq • 1.9k views

The soft-clipped sequences could be either poor quality (so the wrong bases were called) or contaminating sequences (e.g. adapters or barcodes). It is also possible that the reads simply do not match the reference well (i.e. at a given site, the actual sequence of your sample differs from the reference sequence; this is especially true for repetitive sequences and structural variation). Did you check the quality at the per-base level? What does the FastQC output look like? Why don't you look at what the soft-clipped bases are: do they represent a particular sequence?

It seems that there's soft clipping appearing on both the 5' and 3' ends of reads.
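To look at what the clipped bases actually are, something like the following could be used. It is only a minimal sketch: it assumes the alignments are available as SAM text lines (for example from samtools view), and the helper names soft_clipped_parts and tally_clips and the window size k are made up for illustration. It tallies the k bases nearest the aligned part on each side, so a recurring adapter or barcode should show up as a dominant entry. Note that SAM stores the sequence of reverse-strand alignments reverse-complemented, so a 3' adapter from such reads would appear on the left side as its reverse complement.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_parts(cigar, seq):
    """Return (left_clip, right_clip) sequences for one alignment."""
    # Hard-clipped bases (H) are not present in SEQ, so drop them first.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    left = seq[:ops[0][0]] if ops and ops[0][1] == "S" else ""
    right = seq[-ops[-1][0]:] if len(ops) > 1 and ops[-1][1] == "S" else ""
    return left, right


def tally_clips(sam_lines, k=12):
    """Count the k clipped bases adjacent to the aligned part, per side."""
    left_counts, right_counts = Counter(), Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        cigar, seq = fields[5], fields[9]
        if cigar == "*" or seq == "*":  # unmapped or sequence not stored
            continue
        left, right = soft_clipped_parts(cigar, seq)
        if len(left) >= k:
            left_counts[left[-k:]] += 1
        if len(right) >= k:
            right_counts[right[:k]] += 1
    return left_counts, right_counts
```

Feeding it sys.stdin from a samtools view pipe and printing left_counts.most_common(10) and right_counts.most_common(10) would give a quick first impression; if nothing dominates, contamination by a single adapter is less likely.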

If you want to get rid of all soft-clipped alignments, you could just go through the .bam file and filter out the alignments whose CIGAR string contains the soft-clip operation (S). But it is probably best to first figure out what those sequences actually are (manually inspect them and see whether it is reasonable to conclude that they are misassigned). Only then can you answer whether it's reasonable to use those reads for summarizing gene counts.
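The filtering itself could look roughly like this. Again a sketch under assumptions: the input is SAM text lines, and the names soft_clipped_bases and drop_soft_clipped as well as the max_clipped parameter are hypothetical. With max_clipped=0 every alignment containing an S operation is dropped; a looser threshold is my own addition, in case discarding every clipped read costs too much data.

```python
import re

# Matches each soft-clip operation in a CIGAR string, e.g. "13S".
SOFT_CLIP = re.compile(r"(\d+)S")


def soft_clipped_bases(cigar):
    """Total number of soft-clipped bases in a CIGAR string."""
    return sum(int(n) for n in SOFT_CLIP.findall(cigar))


def drop_soft_clipped(sam_lines, max_clipped=0):
    """Yield header lines and alignments with at most max_clipped clipped bases."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # keep the SAM header untouched
        elif soft_clipped_bases(line.split("\t")[5]) <= max_clipped:
            # Records with CIGAR "*" (unmapped) have no clips and pass through.
            yield line
```

Keep in mind that this works per record: for paired-end data it can remove one mate and keep the other, which some downstream counting tools may handle differently from intact pairs.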

Reply by dsull ★ 7.0k, 5.2 years ago

Thanks dsull,

Actually, almost half of the reads are marked as soft-clipped. I would lose half of my reads if I excluded them from gene counting, which I can hardly afford. The quality is indeed not very good, especially at the 3' end of the reads. However, trimming the ends with cutadapt at a quality cutoff of 20 didn't change much. I am also curious about the soft clipping appearing at both the 5' and 3' ends of reads.

Do you have any suggestions for getting rid of these soft clips with some kind of trimming method?

Reply by tujuchuanli ▴ 130, 5.2 years ago

With its default settings, the STAR aligner prefers to soft-clip the ends of reads rather than align them with mismatches. There is a setting to enforce end-to-end mapping: --alignEndsType EndToEnd.

I also often see soft-clipped reads and use them as they are reported, since the soft-clipped part does not participate in the alignment.
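To illustrate why the clipped part does not matter for where a read lands: in the SAM format, the S operation does not consume reference bases, and POS refers to the first aligned base. The sketch below (the function name aligned_blocks is hypothetical) derives the reference blocks covered by an alignment from POS and the CIGAR string, splitting at N operations (introns in spliced RNA-seq alignments). Counting tools generally assign reads by coordinates of this kind, although the exact overlap rules differ between tools.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(pos, cigar):
    """Return 1-based inclusive (start, end) reference blocks, split at N."""
    blocks = []
    ref = pos        # next reference position to be consumed
    start = None     # start of the block currently being built
    for n, op in CIGAR_OP.findall(cigar):
        n = int(n)
        if op in "MD=X":          # consume reference inside a block
            if start is None:
                start = ref
            ref += n
        elif op == "N":           # skipped region: close the current block
            if start is not None:
                blocks.append((start, ref - 1))
                start = None
            ref += n
        # S, I, H and P do not consume the reference and are ignored here.
    if start is not None:
        blocks.append((start, ref - 1))
    return blocks


# The clipped bases on either side do not change the covered interval:
print(aligned_blocks(100, "13S90M47S"))      # [(100, 189)]
print(aligned_blocks(100, "90M"))            # [(100, 189)]
print(aligned_blocks(100, "7S40M1000N46M"))  # [(100, 139), (1140, 1185)]
```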

Reply by michael.ante ★ 3.9k, 5.2 years ago

Thanks, michael.ante

I have tried mapping with --alignEndsType EndToEnd, and it ended up dramatically decreasing the mapping percentage.

Reply by tujuchuanli ▴ 130, 5.2 years ago