Question

Starting With Illumina Paired-End Reads Manipulation

3

13.8 years ago

Kamila ▴ 70

Hello,

I am new to Illumina sequencing and I am not an advanced user of the programs required to analyse a large sequencing dataset. However, I have ~6 million reads and I need to "do" something with them to complete my PhD. I would therefore be very grateful if someone could help me and give me some advice.

I have ~6 million 76-bp paired-end reads: ~3 million in read1 and ~3 million in read2. The first thing I did was check the quality of the reads. I ran FastQC on read1 and read2, and the quality report showed that the reads are of good quality, except that there is a high sequence duplication level (60%!). I tried to remove duplicated sequences using the Galaxy web tool FASTX-collapse, but the problem is that Galaxy changes the original names of the reads and drops the /1 and /2 suffixes (indicating paired ends) that will be needed later by the assembly and MEGAN programs.

Can anyone help me please?

Kamila

Edit, copied from your answer: OK, thank you all for your interest in my topic. Yes, it is true that I poorly understand what I am doing, but I am a molecular biologist and I don't have a degree in bioinformatics, statistics, or any computer-related field. I don't want to describe my situation with my supervisor here; I now have two ways out of my situation: give up on my PhD, or do everything I can to finish.

Sorry Michael that I didn't give all of this information; I didn't know that it was so important. Here are my answers:

* Where are the sequences sampled from, describe the organism, sampling site, tissue, etc.

The DNA was isolated from bacteriophages isolated from a sputum sample of a hospital patient.

* Is a single organism that the sample is coming from, or a Meta-genome/transcriptome

It is a metagenome; it will contain all phages/viruses present in that sample.

* What kind of nucleotide (RNA, DNA), is it RNA-seq data, genomic DNA?

Metagenomic DNA.

* Protocols of nucleotide extraction

DNA was extracted using a proteinase K/CTAB protocol and amplified using the MDA technique (this could be the reason why there are so many duplicates).

* Is there a reference genome to align the reads to?

My idea is that the reads could be aligned to a reference genome chosen on the basis of the BLAST results, e.g. if most reads hit Streptococcus phage Dp-1, it could be used as the reference genome.

* Or is it a de-novo assembly of the genomic sequence that is required?

De novo; I have already learned how to use the Velvet assembler.

Also, I apologise for my poor English.

illumina paired metagenomics next-gen sequencing • 11k views

ADD COMMENT • link updated 13.8 years ago by Mike ▴ 10 • written 13.8 years ago by Kamila ▴ 70

8

I pity you, really, because this gives us a disastrous impression of your supervision situation. "Do something with this random data I throw at you" doesn't sound like a good understanding of the field. On the other hand, aligning some reads with the help of this forum would not constitute a PhD. I suggest you reformulate your question by answering all the items in my answer below.

ADD REPLY • link 13.8 years ago by Michael 55k

0

What is your application? FastQC will report high sequence duplication for certain applications; this is not necessarily a problem.

ADD REPLY • link 13.8 years ago by User 59 13k

0

It is important to remove duplicates to obtain good-quality contigs; it will also reduce file sizes and the time needed for BLAST runs.

ADD REPLY • link 13.8 years ago by Kamila ▴ 70

0

This is really a question for your supervisor. In any event, without knowing the source of the reads, there's not much anyone can do to help you.

ADD REPLY • link 13.8 years ago by User 3084 ▴ 10

0

I wish my supervisor could help me with this..

ADD REPLY • link 13.8 years ago by Kamila ▴ 70

0

This should be a comment rather than an answer.

ADD REPLY • link 13.8 years ago by Michael Schubert ★ 7.1k

0

It will take too long to BLAST even 3 million of your reads. You need to use a short-read aligner, such as BWA, Maq, etc.

ADD REPLY • link 13.8 years ago by User 3084 ▴ 10

0

Uh, virus metagenomics is not really my field; I hope some experts are around. I will retag for now.

ADD REPLY • link 13.8 years ago by Michael 55k

score 3 · Answer 1 · 2011-02-10

We need to know the following info to help effectively:

* Where are the sequences sampled from, describe the organism, sampling site, tissue, etc.
* Is a single organism that the sample is coming from, or a Meta-genome/transcriptome
* What kind of nucleotide (RNA, DNA), is it RNA-seq data, genomic DNA?
* Protocols of nucleotide extraction
* Is there a reference genome to align the reads to?
* Or is it a de-novo assembly of the genomic sequence that is required?

You see, there are so many possibilities....

score 2 · Answer 2 · 2011-02-10

2

13.8 years ago

Markp ▴ 40

This seems to be an appropriate response to your supervisor http://www.youtube.com/watch?v=Fl4L4M8m4d0

ADD COMMENT • link 13.8 years ago by Markp ▴ 40

0

+1 because I really had fun. Lots of my colleagues will feel understood elsewhere, even if it does not correspond to the policy of this site.

ADD REPLY • link 13.8 years ago by toni ★ 2.2k

0

No, it doesn't, but I hope it helps to keep the mood up.

ADD REPLY • link 13.8 years ago by Michael 55k

score 2 · Answer 3 · 2011-02-10

2

13.8 years ago

Kamila ▴ 70

I just want to say that I solved the problem with duplicates using a program called cd-hit-454. The program is very fast and doesn't change the read names.

ADD COMMENT • link 13.8 years ago by Kamila ▴ 70

0

That appears to be specific to 454 generated data, and you stated that you are using Illumina.

ADD REPLY • link 11.6 years ago by xapple ▴ 230
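
For readers facing the same naming problem, here is a minimal Python sketch of pair-aware duplicate removal that leaves read names (including the /1 and /2 suffixes) untouched. It is not what cd-hit-454 or the FASTX tool does internally: it only drops pairs whose read1 and read2 sequences are both exact copies of an earlier pair. It assumes plain four-line FASTQ records and that both files list the mates in the same order; the function names `read_fastq`, `dedup_pairs` and `format_fastq` are made up for this illustration.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines.

    Assumes the simple layout of exactly four lines per record.
    """
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return  # end of input (or a truncated trailing record)
        header, seq, _plus, qual = (line.rstrip("\n") for line in record)
        yield header[1:], seq, qual  # drop the leading '@', keep the rest as-is


def dedup_pairs(reads1, reads2):
    """Keep the first occurrence of each (read1 sequence, read2 sequence) pair.

    Names are passed through unchanged, so /1 and /2 survive.
    Assumes reads1 and reads2 are in the same order (mate i with mate i).
    """
    seen = set()
    kept = []
    for r1, r2 in zip(reads1, reads2):
        key = (r1[1], r2[1])  # exact sequence match on both mates
        if key in seen:
            continue
        seen.add(key)
        kept.append((r1, r2))
    return kept


def format_fastq(record):
    """Turn a (name, sequence, quality) tuple back into FASTQ text."""
    name, seq, qual = record
    return f"@{name}\n{seq}\n+\n{qual}\n"
```

With real data one would pass two open file handles to `read_fastq` and write each kept mate back out with `format_fastq`. All unique pairs are held in memory, and near-identical duplicates (for example, copies differing by a sequencing error) are not detected, so treat this as a starting point only.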

score 1 · Answer 4 · 2011-02-10

1

13.8 years ago

Kamila ▴ 70

Why should I have a control? There are many publications about metagenomics done without any controls, e.g. in "Metagenomic Analysis of Human Diarrhea: Viral Detection and Discovery" they analyse viruses in the faeces of hospital patients with diarrhoea and they don't analyse healthy individuals' faeces as a control.

Funny, but I think I just found the answer to my question in this article! In this article they use this program to remove duplicate sequences: http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html

ADD COMMENT • link 13.8 years ago by Kamila ▴ 70

0

Well, if removing duplicates is the only thing required maybe, but I had the impression you were asking for more general advice on how to analyse your data?

ADD REPLY • link 13.8 years ago by Michael 55k

0

Yes Michael, I will be very grateful for any ideas.

ADD REPLY • link 13.8 years ago by Kamila ▴ 70

score 1 · Answer 5 · 2011-02-10

Since it's a metagenomic project, I would first try to characterize the viral populations you have in the sample.

One "simple" thing I would try is to BLAST all those reads against the NCBI nt database to annotate them. This kind of massive analysis can be achieved with cloud computing.

Once you have all your reads annotated with known sequences, you could study the distribution of the populations you have.
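
As a rough illustration of that last step, the sketch below tallies how many reads have each database sequence as their best hit. It assumes the BLAST results were saved in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, bit score in column 12); the function names and the example IDs used with them are hypothetical, and "best hit by bit score" is just one simple way of assigning a read.

```python
from collections import Counter


def best_hits(blast_lines):
    """Return {read name: (subject id, bit score)}, keeping the top-scoring hit per read.

    Expects tab-separated BLAST tabular lines with 12 columns;
    blank lines and '#' comment lines are skipped.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def subject_counts(blast_lines):
    """Count how many reads have each subject sequence as their best hit."""
    return Counter(subject for subject, _ in best_hits(blast_lines).values())
```

Calling `subject_counts(open_file).most_common(10)` on a results file would list the ten subjects that attract the most reads, which is one crude way to see which known phages dominate the sample. Reads with no hit at all never appear in the tabular output, so they have to be counted separately.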

If you don't feel capable of doing this kind of analysis, you could try to collaborate with someone. I'm pretty sure that some people would like to analyze your data.

Another kind of analysis would be to try to assemble all the reads and identify viral genomes. I don't know how difficult this can be (I don't know much about viral metagenomics), but a de novo assembly of metagenomes from Illumina reads sounds more complicated to me.

Searching quickly in PubMed, I found this paper, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2919852/?tool=pubmed, which may help you.

score 1 · Answer 6 · 2011-02-15

1

13.8 years ago

Mike ▴ 10

Forrest Rohwer has done lots of work analysing phage metagenomic samples, including some in a healthcare setting with Cystic Fibrosis. There are likely to be plenty of analysis approaches that would be appropriate to your work in his publications. http://coralandphage.org/

Good luck ;)

ADD COMMENT • link 13.8 years ago by Mike ▴ 10

score 0 · Answer 7 · 2011-02-10

0

13.8 years ago

Sarah ▴ 20

Uhhhh... you don't seem to have an experiment there, in that you have a test condition (phlegm from a sick guy) but you don't seem to have a control (phlegm from a healthy guy).

I agree that the best bet might be a survey approach, but it might not have much chance of a high level publication without a control (on its own) so don't spend more time on it than it is worth.

Having a bad supervisor prepares one for the real world better than being coddled.

ADD COMMENT • link 13.8 years ago by Sarah ▴ 20

0

Why should I have a control? There are many publications about metagenomics without any controls, e.g. in "Metagenomic Analysis of Human Diarrhea: Viral Detection and Discovery" they analyse viruses in the faeces of hospital patients with diarrhoea and they don't use healthy individuals' faeces as a control.

Funny, but I think I just found the answer to my question in this article! In this article they use this program to remove duplicate sequences: http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html

ADD REPLY • link 13.8 years ago by Kamila ▴ 70

score 0 · Answer 8 · 2011-02-11

Marina Manrique, thank you for your answer. With the help of a friend (who is not a scientist) I learned how to run BLAST on the university server and I have got some preliminary results. I also learned how to use the Velvet assembler. Basically, 'everything that seems impossible today becomes possible tomorrow'. Now I want to do everything exactly how it should be done, in order to make it publishable and complete my PhD. My supervisor doesn't want to collaborate with anyone; he says that "this is easy". Instead of getting to the end of my PhD I find myself fighting with my supervisor and seeking help on a forum. All I dream of is to finish and find a job within a group doing 'real' metagenomics. But without any publications I don't have a chance of getting one.