Hi everyone,
I have Illumina RNA-seq data from individuals of two groups (two different phenotypes). After preprocessing the reads and running de novo assembly with Trinity, group 1 has assemblies of 45-55 MB (megabytes, FASTA format), while group 2 has assemblies of only 2-9 MB. This difference could be due to the high duplication rate in the group 2 raw reads (checked with the FastQC "deduplicate" module).
To find out whether these assemblies are usable for further analysis (e.g. differential expression, SNP discovery) or whether, in the unfortunate case, we have to start over from library preparation, I want to check what proportion of transcripts (at some arbitrary similarity threshold) is shared between the individual transcriptomes of the two groups. Which tool or method could help me do that? Any thoughts on how useful these data are (i.e. how much we can still get out of this bad data) are also welcome.
Thank you in advance for your suggestions!
I don't think I quite follow what you're trying to do, but could you BLAT one against the other with a specified e-value cutoff and look at the number of contigs with a hit?
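For what it's worth, counting "contigs with a hit" could look roughly like the sketch below. It assumes the alignments were written in the 12-column tabular format (what `blat -out=blast8` or BLAST `-outfmt 6` produce), where column 3 is percent identity and column 11 is the e-value. The function names and the cutoffs (e-value 1e-10, identity 90%) are placeholders of mine, not recommended values, so adjust them to whatever "arbitrary similarity" means for your data.

```python
def fasta_ids(lines):
    """Collect sequence IDs (first word after '>') from FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">") and len(line.strip()) > 1:
            ids.add(line[1:].split()[0])
    return ids


def queries_with_hit(lines, max_evalue=1e-10, min_identity=90.0):
    """Return query IDs with at least one hit passing both cutoffs.

    Expects 12 tab-separated columns per line (blast8 / outfmt 6 layout).
    The default cutoffs are arbitrary placeholders.
    """
    hits = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular record
        identity = float(fields[2])
        evalue = float(fields[10])
        if evalue <= max_evalue and identity >= min_identity:
            hits.add(fields[0])
    return hits


def shared_fraction(fasta_lines, hit_lines, **cutoffs):
    """Fraction of query contigs that have a passing hit in the other assembly."""
    ids = fasta_ids(fasta_lines)
    if not ids:
        return 0.0
    return len(queries_with_hit(hit_lines, **cutoffs) & ids) / len(ids)


# Example usage (file names are hypothetical):
# with open("sampleA.Trinity.fasta") as fa, open("A_vs_B.blast8") as tab:
#     print(shared_fraction(fa, tab))
```

Since the assemblies differ so much in size, it is probably worth running this in both directions (A against B and B against A): a small assembly can be almost fully contained in a large one while the reverse fraction stays low.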
Thanks for your suggestion, I will BLAT them against each other. Sorry for my clumsy explanation. In short, I have RNA-seq data from different individuals (same tissue and species), and some of the datasets are much smaller than the others after deduplication (because of PCR artifacts). I would therefore like to examine the proportion of similar transcripts they share, to see whether it is possible to continue with further analysis, e.g. SNP discovery.