I have 2 different samples for which I sequenced the transcriptomes (Illumina): sample A and sample B.
I pooled all the single-end reads (up to 300 bp) from the 2 samples together and did a de novo assembly with CLC.
Now I am trying to retrieve from which sample the contigs (up to 3,000 bp) come from.
To do that I aligned (bowtie2):
- reads from sample A vs indexed contigs
- reads from sample B vs indexed contigs
For each samples independently, I filtered the respective SAM files with samtools (samtools view -F 4 <sample-specific_SAM.tab>
) to obtain the IDs and then the sequences of the contigs for which reads have been mapped to.
As a control, I know that some contigs are specific to each samples (sample A has 50 unique contigs that cannot be found in sample B; sample B has 100 unique contigs that cannot be found in sample A).
However, when I looked at the sample-specific contigs, I can retrieve the 150 unique contigs (50 from A + 100 from B) in both samples.
I tried different bowtie2 parameters, but obtain nearly the same statistics every time (alignment rate; length interval; N50; N90) for both samples:
- using contigs with / without exact duplicate sequences
- end_to_end / local alignment
- allowing gaps or not (to disable gaps I use
--gbar <length of the longest reads>
) - keeping / removing secondary alignments from the SAM files
I don't know what to try anymore. It would be very helpful if someone could share his/her expertise or suggest another strategy.
Thanks!
Are you sure that all those contigs are 100% different from each other? If not, it is still possible for some of the reads from sample A to map onto unique contigs of sample B and vis versa due to small mismatches.
Yes, I also tried by removing the exact duplicate sequences in the contig dataset. But some unique sample A and sample B contigs could be similar. So, what you mean is that the contigs (or at least the read/contig matching regions) are not different enough for segregating them precisely? What about discriminating them by alignment score or other metrics?
Is there a way/metric to assess the similarity between 2 datasets (here transcriptomes)?
So if your sequence are not completely different, there will still be situations where the reads can still align to the other contig. If you want to find the similarity between two sequences, you can use the edit distance to see how similar they are