I have RNASeq data from a de novo transcriptome assembly. The data is not given to me in SAM/BAM files, but instead in some kind of proprietary format where each contig is represented nucleotide by nucleotide, and for each nucleotide the number of hits in each sample of the experiment is given. So to sketch it:
Contig6
Nuc  Sample1  Sample2  Sample3  Sample4
A          1        0        3        4
C         21       15       18       17
This describes Contig6, for which the assembled sequence was "AC". To get from here to a matrix giving the number of reads per contig for each sample, my first reaction is to sum each column, divide by the sequence length, and round down to the nearest integer, yielding
rowname  Sample1  Sample2  Sample3  Sample4
Contig6       11        7       10       10
This example is made up; the real contigs are several hundred to several thousand nucleotides long, but that doesn't change the approach.
Is there any consensus about how to do this correctly? I'm fairly sure it introduces some kind of bias into the counts in several ways.
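To make the transformation concrete, here is a minimal Python sketch of what I mean. The exact file layout and the function name per_contig_counts are my own assumptions for illustration, since only the overall shape of the format is shown above (a lone contig-name line, a "Nuc ..." header line, then one row of per-sample hits per nucleotide):

    from collections import OrderedDict

    def per_contig_counts(path):
        """Collapse per-nucleotide hit counts into one row per contig:
        sum each sample's column, divide by contig length, round down."""
        counts = OrderedDict()

        def flush(contig, sums, length):
            # Store the finished contig as floor(column sum / contig length) per sample
            if contig is not None and sums is not None and length > 0:
                counts[contig] = [s // length for s in sums]

        contig, sums, length = None, None, 0
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                if not fields:
                    continue
                if len(fields) == 1:          # assumed: a lone token starts a new contig record
                    flush(contig, sums, length)
                    contig, sums, length = fields[0], None, 0
                elif fields[0] == "Nuc":      # assumed: per-contig column header, skip it
                    continue
                else:                         # nucleotide row: base followed by per-sample hits
                    hits = [int(x) for x in fields[1:]]
                    if sums is None:
                        sums = [0] * len(hits)
                    sums = [s + h for s, h in zip(sums, hits)]
                    length += 1
        flush(contig, sums, length)           # don't forget the last contig in the file
        return counts

For the sketched input above, per_contig_counts("hits.txt")["Contig6"] would give [11, 7, 10, 10], i.e. the row in the second table.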
Thanks a lot!
Is it possible for you to just get the raw data? That'd be vastly easier to deal with than some made-up format that someone dreamed up.
Hey Devon, this is literally what my data looks like; I have no access to the actual assembly.