Hi folks,
I have a de novo transcriptome assembly of a polyploid tree species, assembled with K=31 and a minimum contig length of 200 bp. The assembly contains almost 400K genes; after redundancy reduction with CD-HIT-EST (identity cut-off = 0.97), around 350K genes remain. Mapping about a quarter of the total reads back to the reduced assembly showed that the majority of read pairs (82%) align concordantly more than once. Do you think this would pose a problem if I aim to work at the gene level? I could try CD-HIT-EST with a cut-off of 0.95. Or would it be better to use Lace to stitch the different isoforms together and take it from there?
Thank you very much in advance for your suggestions and comments!
$ bowtie2 --local --no-unal -x cdhit_e97_Trinity_Famer_K31 -p 24 -q -1 cat_70x_R1.fq.gz -2 cat_70x_R2.fq.gz | samtools view -b | samtools sort -o 70x_bowtie2.bam
78850917 reads; of these:
  78850917 (100.00%) were paired; of these:
    2584872 (3.28%) aligned concordantly 0 times
    11430984 (14.50%) aligned concordantly exactly 1 time
    64835061 (82.22%) aligned concordantly >1 times
    ----
    2584872 pairs aligned concordantly 0 times; of these:
      201798 (7.81%) aligned discordantly 1 time
    ----
    2383074 pairs aligned 0 times concordantly or discordantly; of these:
      4766148 mates make up the pairs; of these:
        627598 (13.17%) aligned 0 times
        410463 (8.61%) aligned exactly 1 time
        3728087 (78.22%) aligned >1 times
99.60% overall alignment rate
[bam_sort_core] merging from 80 files and 1 in-memory blocks...
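
To compare runs (for example 0.97 vs. 0.95 clustering), it may help to pull the unique and multi-mapped concordant pair counts out of the bowtie2 summary and express the multi-mapped share among concordantly aligned pairs. The snippet below is only a rough sketch: it assumes the bowtie2 summary (which goes to stderr) was saved to a file, and both the function name summarise_bowtie2_log and the file name used in the usage line are made up for illustration.

```shell
# Sketch: summarise unique vs. multi-mapped concordant pairs from a bowtie2 log.
# Assumption: the log file contains the standard bowtie2 paired-end summary lines.
summarise_bowtie2_log() {
    awk '
        # The first field on each summary line is the pair count.
        /aligned concordantly exactly 1 time/ { uniq  = $1 }
        /aligned concordantly >1 times/       { multi = $1 }
        END {
            if (uniq + multi == 0) {
                print "no concordant pairs found"
                exit 1
            }
            # Share of multi-mapped pairs among all concordantly aligned pairs.
            printf "unique=%d multi=%d multi_fraction=%.4f\n", uniq, multi, multi / (uniq + multi)
        }
    ' "$1"
}

# Usage (hypothetical file name):
#   summarise_bowtie2_log 70x_bowtie2.log
```

Note that this only counts multi-mapping at the contig level. It does not tell whether the extra hits fall on isoforms of the same gene or on different genes, which is what matters for a gene-level analysis.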