Question

prepDE.py read length choice for paired-end sequencing reads

3.7 years ago

1234gingko ▴ 50

I am doing RNA-seq analysis using StringTie, DESeq2, and DEXSeq. I generated the count files used as input to R for the DESeq2 and DEXSeq steps with the recommended Python script, prepDE.py. My paired-end sequencing reads are 50 bp on each end (nearly all exactly 50 bp), with about 300 bp of unsequenced 'inner distance' between the two 50 bp ends.

The count files, gene_count_matrix.csv and transcript_count_matrix.csv, were generated with read length = 100, but I'm not sure this is correct. The documentation says, "prepDE.py derives hypothetical read counts for each transcript from the coverage values estimated by StringTie for each transcript, by using this simple formula":

reads_per_transcript = coverage * transcript_len / read_len

I believe StringTie calculates coverage = (# bases mapped to locus) / (size of locus)

This would give me a count of 2 per paired-end read that maps to a locus. Is this correct? Does it matter, as long as all the counts are calculated the same way? Should the read length passed to prepDE.py be based on the 'insert size' of the sequenced data, or is the 100 bp read length the correct number to use? Or should I be using 50 instead?
Thanks for any advice.
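
The arithmetic behind this question can be sketched in a few lines of Python. This is only an illustration of the quoted formula together with the coverage definition assumed above, not the actual prepDE.py code; the function names, the 1000 bp transcript, and the 10 read pairs are made-up values, and any rounding the real script may apply is left out.

```python
def coverage(bases_mapped: int, transcript_len: int) -> float:
    """Per-base coverage as assumed in the post: mapped bases / locus size."""
    return bases_mapped / transcript_len


def reads_per_transcript(cov: float, transcript_len: int, read_len: int) -> float:
    """The formula quoted from the documentation, without any rounding."""
    return cov * transcript_len / read_len


# Hypothetical example: 10 read pairs of 2 x 50 bp, all bases mapped
# to a single 1000 bp transcript.
n_pairs = 10
end_len = 50
transcript_len = 1000

cov = coverage(n_pairs * 2 * end_len, transcript_len)

# Same coverage, two candidate values for the read length parameter.
count_l50 = reads_per_transcript(cov, transcript_len, 50)
count_l100 = reads_per_transcript(cov, transcript_len, 100)

print(f"coverage = {cov}")
print(f"read_len = 50  -> {count_l50} (counts each 50 bp end)")
print(f"read_len = 100 -> {count_l100} (counts each pair once)")
```

Under these assumptions, a read length of 50 yields 2 counts per pair (one per sequenced end), while 100 yields 1 count per pair, so the two choices differ by a constant factor of 2 for every transcript.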

P.S. I edited this post because I have 50 bp sequenced ends, not 100 bp, but I still have the same question. I followed an example workflow that should be correct, but I can't tell whether it is.

data genomics • 1.1k views

3.6 years ago by 1234gingko ▴ 50