Question

How to retrieve the sample origin of contigs after assembly of pooled reads?

1

Entering edit mode

9.6 years ago

lqdo2000 ▴ 20

I have 2 different samples for which I sequenced the transcriptomes (Illumina): sample A and sample B.

I pooled all the single-end reads (up to 300 bp) from the 2 samples together and did a de novo assembly with CLC.

Now I am trying to retrieve from which sample the contigs (up to 3,000 bp) come from.

To do that I aligned (bowtie2):

reads from sample A vs indexed contigs
reads from sample B vs indexed contigs

For each samples independently, I filtered the respective SAM files with samtools (samtools view -F 4 <sample-specific_SAM.tab>) to obtain the IDs and then the sequences of the contigs for which reads have been mapped to.

As a control, I know that some contigs are specific to each samples (sample A has 50 unique contigs that cannot be found in sample B; sample B has 100 unique contigs that cannot be found in sample A).

However, when I looked at the sample-specific contigs, I can retrieve the 150 unique contigs (50 from A + 100 from B) in both samples.

I tried different bowtie2 parameters, but obtain nearly the same statistics every time (alignment rate; length interval; N50; N90) for both samples:

using contigs with / without exact duplicate sequences
end_to_end / local alignment
allowing gaps or not (to disable gaps I use --gbar <length of the longest reads>)
keeping / removing secondary alignments from the SAM files

I don't know what to try anymore. It would be very helpful if someone could share his/her expertise or suggest another strategy.

Thanks!

Assembly reads contigs alignment • 3.0k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by lqdo2000 ▴ 20

0

Entering edit mode

Are you sure that all those contigs are 100% different from each other? If not, it is still possible for some of the reads from sample A to map onto unique contigs of sample B and vis versa due to small mismatches.

ADD REPLY • link 9.6 years ago by Sam ★ 4.8k

0

Entering edit mode

Yes, I also tried by removing the exact duplicate sequences in the contig dataset. But some unique sample A and sample B contigs could be similar. So, what you mean is that the contigs (or at least the read/contig matching regions) are not different enough for segregating them precisely? What about discriminating them by alignment score or other metrics?

ADD REPLY • link 9.6 years ago by lqdo2000 ▴ 20

0

Entering edit mode

Is there a way/metric to assess the similarity between 2 datasets (here transcriptomes)?

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by lqdo2000 ▴ 20

0

Entering edit mode

So if your sequence are not completely different, there will still be situations where the reads can still align to the other contig. If you want to find the similarity between two sequences, you can use the edit distance to see how similar they are

ADD REPLY • link 9.6 years ago by Sam ★ 4.8k