Question

Read length for single-end RNA-seq data analysis

0

17 months ago

vinaya • 0

Hi all,

I am fairly new to RNA-seq data analysis. I want to perform differential expression (DEG) analysis between two conditions, with 3 single-end samples per condition. All samples share the same problems according to FastQC: failure in per base sequence content, with some samples also failing duplication levels and getting a warning for per tile sequence quality. To remove the per base sequence content failure, I trimmed the first 10 bases and discarded reads shorter than 15 bases, which left reads of length 15-41 (the read length before trimming was 51). Trimming removed the per base sequence content and per sequence quality problems, but the read length is reduced. I have read somewhere in this space that the optimal length for single-end fastq reads for DEG analysis would be 50. What do you suggest?

In another case, again for DEG analysis, I have 3 single-end samples per dataset with a read length of 101. The FastQC problems were adapter content, over-represented sequences, a drop in per base sequence quality towards the end of the reads, per base sequence content failure, and GC content that was a warning, a failure, or good depending on the sample. I removed adapters, trimmed the first 10 bases, removed bases below quality 20, and discarded reads shorter than 15 bases. This left reads of length 15-91. The results were: adapters removed, a better GC plot, no change in duplication, and no more per base sequence content failure. Are read lengths of 15-41 and 15-91 okay to proceed with alignment? Kindly suggest.
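For readers following along, the trimming steps described above can be sketched roughly as below. This is only an illustration, not the poster's actual tool or command (neither is named in the post). The function name `trim_read` and its parameters are hypothetical, and it assumes that "removed bases below quality 20" means trimming low-quality bases from the 3' end.

```python
def trim_read(seq, quals, head=10, min_q=20, min_len=15):
    """Trim one read; return (seq, quals), or None if the read ends up too short.

    seq   : read sequence as a string
    quals : per-base Phred scores as a list of ints (same length as seq)
    """
    # Drop the first `head` bases (the region failing per base sequence content).
    seq, quals = seq[head:], quals[head:]

    # Assumption: quality trimming removes low-quality bases from the 3' end only.
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    seq, quals = seq[:end], quals[:end]

    # Discard reads shorter than the minimum length.
    if len(seq) < min_len:
        return None
    return seq, quals


# A 51-base read with good quality throughout keeps 51 - 10 = 41 bases,
# and a 101-base read keeps 101 - 10 = 91 bases.
print(len(trim_read("A" * 51, [30] * 51)[0]))
print(len(trim_read("A" * 101, [30] * 101)[0]))
```

This is where the upper bounds of 41 and 91 come from: they are simply the original read lengths minus the 10 trimmed bases, while the lower bound of 15 is the length filter.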

read-length fastq • 1.5k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 17 months ago by vinaya • 0

1

Rather than worrying about these things, which likely won't even make a difference, I would suggest you just proceed with your read mapping/alignment.

If you think all these complex trimming choices will actually make a difference, you can do your analysis with and without trimming and check if it actually does make a difference.
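One simple way to do that check, as a minimal sketch: compare per-gene counts from the trimmed and untrimmed runs. The function name `log_count_correlation` and the toy counts below are made up for illustration; real counts would come from whatever quantification tool you use.

```python
import math


def log_count_correlation(counts_a, counts_b):
    """Pearson correlation of log2(count + 1) over genes present in both tables."""
    genes = sorted(set(counts_a) & set(counts_b))
    if len(genes) < 2:
        raise ValueError("need at least two shared genes")

    x = [math.log2(counts_a[g] + 1) for g in genes]
    y = [math.log2(counts_b[g] + 1) for g in genes]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)

    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("counts are constant; correlation is undefined")
    return cov / math.sqrt(var_x * var_y)


# Toy numbers only, not real data.
untrimmed = {"geneA": 100, "geneB": 10, "geneC": 0, "geneD": 250}
trimmed = {"geneA": 95, "geneB": 12, "geneC": 0, "geneD": 240}
print(log_count_correlation(untrimmed, trimmed))
```

A correlation very close to 1 for every sample would suggest the trimming choices barely change the counts. The more direct test is to run the full differential expression analysis both ways and compare the resulting gene lists.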

ADD REPLY • link 17 months ago by dsull ★ 7.0k

0

I agree you shouldn't worry too much before aligning and analyzing. With RNA-seq, you usually use an aligner that effectively trims your reads during alignment (soft-clipping, local alignment), as STAR does by default. In contrast, if you were doing end-to-end alignment (usually with DNA-based sequencing), trimming may matter more for increasing mapping rates. It may also matter for other specific analyses (maybe repetitive regions or very short fragments).

I usually don't worry too much when flags show up in fastq QC data. I care more that, within the experiment, no sample is an outlier in terms of overrepresented sequences, GC bias, etc. For example, if one sample shows very high duplication (a higher duplication rate is normal for RNA-seq, but I mean something like 90%) while the other samples are lower, I might assume that library had an issue and may have low complexity. I would still map and analyze it, and this sample would usually show bad QC data after analysis as well.
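To make the soft-clipping point concrete, here is a small sketch that reads the CIGAR string of an alignment (the sixth column of a SAM record) and counts how many read bases were soft-clipped versus aligned. The helper names are hypothetical, and the CIGAR strings are invented examples, not output from any real run.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar):
    """Number of read bases the aligner soft-clipped (op 'S')."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "S")


def aligned_read_bases(cigar):
    """Number of read bases that take part in the alignment (ops M, I, =, X)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MI=X")


# Invented example: a 51-base read whose first 10 bases did not align.
print(soft_clipped_bases("10S41M"), aligned_read_bases("10S41M"))

# Invented spliced example: 'N' is the skipped intron and uses no read bases.
print(soft_clipped_bases("10S30M500N11M"), aligned_read_bases("10S30M500N11M"))
```

In both invented cases the aligner has set aside the first 10 bases on its own, which is the sense in which local alignment "trims" reads without a separate trimming step.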
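The "look for the outlier sample" idea could be sketched like this. Note that the duplication rate here is a plain exact-match definition, not FastQC's own estimate, and the `margin` threshold, sample names, and rates are all hypothetical values chosen for illustration.

```python
from statistics import median


def duplication_rate(reads):
    """Fraction of reads that are exact duplicates of another read's sequence."""
    if not reads:
        raise ValueError("no reads given")
    return 1 - len(set(reads)) / len(reads)


def flag_outliers(rates, margin=0.3):
    """Names of samples whose rate exceeds the median by more than `margin`."""
    med = median(rates.values())
    return [name for name, rate in rates.items() if rate - med > margin]


# Invented rates: one library is far more duplicated than the others.
rates = {"sample1": 0.40, "sample2": 0.45, "sample3": 0.90}
print(flag_outliers(rates))
```

The comparison is relative on purpose: a high duplication rate shared by all samples is expected for RNA-seq, whereas one sample far above the rest is the pattern worth a second look.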

ADD REPLY • link 17 months ago by rfran010 ★ 1.3k

0

So, should I not trim the adapters either? Which aligner would be suitable for trimming the reads? Is STAR not suitable?

ADD REPLY • link 17 months ago by vinaya • 0

1

STAR is fine, kallisto is fine, etc. Just run your analysis. In the few hours you spent asking questions on Biostars and obsessing over details that don't really matter, your analysis could have been completed already :)

ADD REPLY • link 17 months ago by dsull ★ 7.0k

0

Thanks, I did that already. For the reads of length 101, uniquely mapped reads are 72.83% after trimming but 80.3% before trimming. However, the multimapped reads are 18.83%. What should I do about this?

ADD REPLY • link 17 months ago by vinaya • 0

0

Looks fine and makes sense. Just go ahead and do gene expression analysis of your results now (make some PCA plots, get your log fold changes and p-values, etc.)

ADD REPLY • link 17 months ago by dsull ★ 7.0k

0

My annotation failed; it is showing NA. What do I do? I used hg38 for the reference genome and GTF file, but in the paper for my datasets the author used hg19 from UCSC. Did it fail because of the different reference genome?

ADD REPLY • link 17 months ago by vinaya • 0

0

You haven't explained what NA is, what paper you're referring to, or even what bioinformatics tools/commands you're running. Unfortunately, we can't help you if you don't provide sufficient information.

Also, doing your analysis is completely different than your original question about read length -- I suggest you create an entirely new thread on Biostars to ask your new question; you'll be able to get more support that way. :)

ADD REPLY • link 17 months ago by dsull ★ 7.0k

0

Okay. By NA, I meant not applicable. I used the annotate DESeq2 output table tool to annotate. I am following the RNA-seq analysis described in this paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0768-0

ADD REPLY • link 17 months ago by vinaya • 0