Starting With Illumina Paired-End Reads Manipulation
9
3
13.8 years ago
Kamila ▴ 70

Hello,

I am new to Illumina sequencing and not an advanced user of the programs required to analyse a large sequencing dataset. However, I have ~6 million reads and I need to "do" something with them to complete my PhD, so I would be very grateful if someone could help me and give me some advice.

I have ~6 million 76-bp paired-end reads: ~3 million in read1 and ~3 million in read2. The first thing I did was check the quality of the reads. I ran FastQC on read1 and read2, and the quality report showed that the reads are of good quality, except for a high sequence duplication level (60%!). I tried to remove the duplicated sequences using the Galaxy web tool FASTX-collapse, but the problem is that Galaxy changes the original names of the reads and drops the /1 and /2 suffixes (indicating the paired ends) that will be needed later for assembly and for MEGAN.

Can anyone help me please?

Kamila

Edit, copied from your answer: OK, thank you all for your interest in my topic. Yes, it is true that I poorly understand what I am doing, but I am a molecular biologist and I don't have a degree in bioinformatics, statistics, or any computer-related field. I don't want to describe my situation with my supervisor here. I now have two ways out: give up on my PhD, or do everything I can to finish.

Sorry Michael that I didn't give all of this information; I didn't know it was so important. Here are my answers:

* Where are the sequences sampled from, describe the organism, sampling site, tissue, etc.

The DNA was isolated from bacteriophages isolated from a sputum sample of the hospital patient.

* Is the sample coming from a single organism, or is it a meta-genome/transcriptome?

It is a metagenome; it will contain all phages/viruses present in that sample.

* What kind of nucleotide (RNA, DNA), is it RNA-seq data, genomic DNA?

Metagenomic, DNA.

* Protocols of nucleotide extraction

DNA was extracted using a proteinase K/CTAB protocol and amplified using the MDA technique (this could be the reason why there are so many duplicates).

* Is there a reference genome to align the reads to?

My idea is that the reads could be aligned to a reference genome chosen on the basis of the BLAST results, e.g. if most reads hit Streptococcus phage Dp-1, it could be used as the reference genome.

* Or is it a de-novo assembly of the genomic sequence that is required?

De novo; I have already learned how to use the Velvet assembler.

Also, I apologise for my poor English.

illumina paired metagenomics next-gen sequencing • 11k views
8

I pity you, really, because this gives us a disastrous impression of your supervision situation. "Do something with this random data I throw at you" doesn't sound like a good understanding of the field. On the other hand, aligning some reads with the help of this forum would not constitute a PhD. I suggest you reformulate your question by answering all the items in my answer below.

0

What is your application? FastQC will report high sequence duplication for certain applications; this is not necessarily a problem.
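To get a feel for what a duplication figure like this means, here is a minimal Python sketch that counts exact duplicates among a list of read sequences. It is an assumption-laden illustration, not FastQC's method: FastQC computes its estimate differently, so the numbers will not match its report exactly. The function name `duplication_fraction` is made up for this example.

```python
from collections import Counter
from typing import Iterable


def duplication_fraction(sequences: Iterable[str]) -> float:
    """Fraction of reads that are exact copies of an earlier read.

    Plain exact-match counting; this is NOT how FastQC estimates duplication.
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every distinct sequence counts once as "original"; the rest are duplicates.
    return 1.0 - len(counts) / total


# Four reads, three distinct sequences: one read in four is a duplicate.
print(duplication_fraction(["ACGT", "ACGT", "TTTT", "GGGG"]))  # 0.25
```

For amplified material (MDA, PCR) a high value is expected; whether it matters depends on what you do next with the reads.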

0

It is important to remove duplicates to obtain good-quality contigs; it will also reduce file sizes and the time needed for BLAST runs.

0

This is really a question for your supervisor. In any event, without knowing the source of the reads, there's not much anyone can do to help you.

0

I wish my supervisor could help me with this...

0

This should be a comment rather than an answer.

0

It will take too long to BLAST even 3 million of your reads. You need to use a short-read aligner, such as BWA or Maq.

0

Uh, virus metagenomics is not really my field; I hope some experts are around. I will retag for now.

3
13.8 years ago
Michael 55k

We need to know the following info to help effectively:

  • Where are the sequences sampled from, describe the organism, sampling site, tissue, etc.
  • Is the sample coming from a single organism, or is it a meta-genome/transcriptome?
  • What kind of nucleotide (RNA, DNA), is it RNA-seq data, genomic DNA?
  • Protocols of nucleotide extraction
  • Is there a reference genome to align the reads to?
  • Or is it a de-novo assembly of the genomic sequence that is required?

You see, there are so many possibilities...

2
13.8 years ago
Markp ▴ 40

This seems to be an appropriate response to your supervisor: http://www.youtube.com/watch?v=Fl4L4M8m4d0

0

+1 because I really had fun. Lots of my colleagues will feel understood elsewhere, even if it does not correspond to the policy of this site.

0

No, it doesn't, but I hope it helps to keep the mood up.

2
13.8 years ago
Kamila ▴ 70

I just want to say that I solved the problem with duplicates using a program called cd-hit-454. The program is very fast and doesn't change the read names.
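For anyone who wants to see what "remove duplicates without changing read names" can look like for paired files, here is a minimal Python sketch. It is not what cd-hit-454 does (that program clusters near-identical sequences); this only drops pairs whose read1 and read2 sequences are both exactly identical to a pair already seen. It assumes plain 4-line FASTQ records and that the two files list mates in the same order. The names `parse_fastq` and `dedup_pairs` and the demo reads are made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: header line, sequence, separator line, quality string.
FastqRecord = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from unwrapped 4-line FASTQ text, skipping blank lines."""
    it = (line.rstrip("\n") for line in lines if line.strip())
    # Pulling four items at a time from the same iterator groups lines into records.
    return zip(it, it, it, it)


def dedup_pairs(
    reads1: Iterable[FastqRecord], reads2: Iterable[FastqRecord]
) -> Tuple[List[FastqRecord], List[FastqRecord]]:
    """Keep the first occurrence of each (read1 sequence, read2 sequence) pair.

    Headers are kept untouched, so /1 and /2 suffixes survive.
    """
    seen = set()
    kept1: List[FastqRecord] = []
    kept2: List[FastqRecord] = []
    for r1, r2 in zip(reads1, reads2):
        key = (r1[1], r2[1])  # compare sequences only, not names or qualities
        if key in seen:
            continue
        seen.add(key)
        kept1.append(r1)
        kept2.append(r2)
    return kept1, kept2


# Hypothetical demo data: read_b is an exact duplicate pair of read_a;
# read_c shares read1 with read_a but has a different mate, so it is kept.
r1_text = "@read_a/1\nACGT\n+\nIIII\n@read_b/1\nACGT\n+\nIIII\n@read_c/1\nACGT\n+\nIIII\n"
r2_text = "@read_a/2\nTTGA\n+\nIIII\n@read_b/2\nTTGA\n+\nIIII\n@read_c/2\nCCCC\n+\nIIII\n"

kept1, kept2 = dedup_pairs(
    parse_fastq(r1_text.splitlines()), parse_fastq(r2_text.splitlines())
)
print([rec[0] for rec in kept1])  # ['@read_a/1', '@read_c/1']
print([rec[0] for rec in kept2])  # ['@read_a/2', '@read_c/2']
```

Treating the pair as the unit keeps the two output files in sync, which collapsing each file separately does not guarantee.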

0

That appears to be specific to 454-generated data, and you stated that you are using Illumina.

1
13.8 years ago
Kamila ▴ 70

Why should I have a control? There are many publications about metagenomics done without any controls, e.g. in "Metagenomic Analysis of Human Diarrhea: Viral Detection and Discovery" they analyse viruses in the faeces of hospital patients with diarrhoea and they don't analyse healthy individuals' faeces as a control.

Funny, but I think I just found the answer to my question in this article! They use this program to remove duplicate sequences: http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html

0

Well, if removing duplicates is the only thing required, maybe, but I had the impression you were asking for more general advice on how to analyse your data?

0

Yes Michael, I would be very grateful for any ideas.

1
13.8 years ago
Marina Manrique ★ 1.3k

Since it's a metagenomic project, I would first try to characterize the viral populations you have in the sample.

One "simple" thing I would try is to BLAST all the reads against the NCBI nt database to annotate them. This kind of massive analysis can be done with cloud computing.

Once all your reads are annotated with known sequences, you could study the distribution of the populations you have.
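As a rough idea of how the "distribution of populations" step could be done once BLAST has finished, here is a minimal Python sketch. It assumes the results were saved in BLAST's default 12-column tabular format (query ID in column 1, subject ID in column 2, bit score in column 12), keeps only the best-scoring hit per read, and counts reads per subject. The function names and the subject names `phage_A` and `phage_B` are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def best_hits(lines: Iterable[str]) -> Dict[str, str]:
    """Map each query read to the subject of its highest bit-score hit.

    Assumes default 12-column tabular BLAST output, tab-separated.
    """
    best: Dict[str, Tuple[float, str]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    return {query: subject for query, (_, subject) in best.items()}


def subject_counts(lines: Iterable[str]) -> Counter:
    """Count how many reads have each subject as their best hit."""
    return Counter(best_hits(lines).values())


# Hypothetical demo rows: read1 hits both subjects, phage_A with the higher score.
demo = "\n".join([
    "read1/1\tphage_A\t98.7\t76\t1\t0\t1\t76\t100\t175\t1e-30\t135",
    "read1/1\tphage_B\t90.0\t70\t7\t0\t1\t70\t5\t74\t1e-20\t100",
    "read2/1\tphage_A\t100.0\t76\t0\t0\t1\t76\t300\t375\t1e-35\t141",
    "read3/1\tphage_B\t95.0\t76\t4\t0\t1\t76\t10\t85\t1e-28\t120",
])

print(subject_counts(demo.splitlines()).most_common())
# [('phage_A', 2), ('phage_B', 1)]
```

Counting best hits only is a simplification; reads with several equally good hits, or no hit at all, need a decision of their own, which is the kind of problem tools like MEGAN are designed for.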

If you don't feel able to do this kind of analysis, you could try to collaborate with someone. I'm pretty sure that some people would like to analyze your data.

Another kind of analysis would be to assemble all the reads and identify viral genomes. I don't know how difficult this is (I don't know much about viral metagenomics), but a de novo assembly of metagenomes from Illumina reads sounds more complicated to me.

Searching quickly in PubMed, I found this paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2919852/?tool=pubmed Maybe it helps you.

1
13.8 years ago
Mike ▴ 10

Forrest Rohwer has done lots of work analysing phage metagenomic samples, including some in a healthcare setting with Cystic Fibrosis. There are likely to be plenty of analysis approaches that would be appropriate to your work in his publications. http://coralandphage.org/

Good luck ;)

0
13.8 years ago
Sarah ▴ 20

Uhhhh... you don't seem to have an experiment there, in that you have a test condition (phlegm from a sick guy) but no control (phlegm from a healthy guy).

I agree that the best bet might be a survey approach, but on its own it might not have much chance of a high-level publication without a control, so don't spend more time on it than it is worth.

Having a bad supervisor prepares one for the real world better than being coddled.

0

Why should I have a control? There are many publications about metagenomics without any controls, e.g. in "Metagenomic Analysis of Human Diarrhea: Viral Detection and Discovery" they analyse viruses in the faeces of hospital patients with diarrhoea and they don't use healthy individuals' faeces as a control.

Funny, but I think I just found the answer to my question in this article! They use this program to remove duplicate sequences: http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html

0
13.8 years ago
Kamila ▴ 70

Marina Manrique, thank you for your answer. With the help of my friend (who is not a scientist) I learned how to run BLAST on the university server and I have got some preliminary results. I also learned how to use the Velvet assembler. Basically, 'everything that seems impossible today becomes possible tomorrow'. Now I want to do everything exactly as it should be done, in order to make it publishable and complete my PhD. My supervisor doesn't want to collaborate with anyone; he says that "this is easy". Instead of getting to the end of my PhD, I find myself fighting with my supervisor and seeking help on a forum. All I dream of is to finish and find a job in a group doing 'real' metagenomics. But without any publications I don't have a chance of getting that.

0

Good luck then. When you don't know how to go on (and your supervisor doesn't seem to help), look for papers related to your problem and check their materials and methods; sometimes they're useful :)
