Hello,
I am new to Illumina sequencing and I am not an advanced user of the programs required to analyse a large sequencing dataset. However, I have ~6 million reads and I need to "do" something with them to complete my PhD. I would therefore be very grateful if someone could help me and give me some advice.
I have ~6 million 76-bp paired-end reads: ~3 million in read1 and ~3 million in read2. The first thing I did was check the quality of the reads. I ran FastQC on read1 and read2, and the quality report showed that the reads are of good quality, except for a high sequence duplication level (60%!). I tried to remove the duplicated sequences using the Galaxy web tool FASTX-collapse; however, the problem is that Galaxy changes the original names of the reads and drops the /1 and /2 suffixes (indicating the paired ends), which will be needed later for assembly and for MEGAN.
Can anyone help me please?
Kamila
Edit, copied from your answer: OK, thank you all for your interest in my topic. Yes, it is true that I poorly understand what I am doing, but I am a molecular biologist and I don't have a degree in bioinformatics, statistics, or any computer-related field. I don't want to describe my situation with my supervisor here; I now have two ways out of my situation: give up on my PhD or do everything I can to finish.
Sorry, Michael, that I didn't give all of this information; I didn't know it was so important. Here are my answers:
* Where are the sequences sampled from, describe the organism, sampling site, tissue, etc.
The DNA was isolated from bacteriophages isolated from a sputum sample of the hospital patient.
* Is the sample coming from a single organism, or is it a metagenome/transcriptome?
It is a metagenome; it will contain all phages/viruses present in that sample.
* What kind of nucleotide (RNA, DNA), is it RNA-seq data, genomic DNA?
Metagenomic DNA.
* Protocols of nucleotide extraction
DNA was extracted using a proteinase K/CTAB protocol and amplified using the MDA technique (this could be the reason why there are so many duplicates).
* Is there a reference genome to align the reads to?
My idea is that the reads could be aligned to a reference genome chosen on the basis of the BLAST results, e.g. if most reads hit Streptococcus phage Dp-1, it could be used as the reference genome.
* Or is it a de-novo assembly of the genomic sequence that is required?
De novo; I have already learned how to use the Velvet assembler.
Also, I apologise for my poor English.
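The reference-selection idea described above (use the genome that most reads hit) can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions: the BLAST results are in tab-separated tabular format with the read ID in column 1 and the subject ID in column 2, hits for each read are listed best-first, and only the first hit per read is counted. The function name `best_reference` and the IDs in the demo are hypothetical.

```python
import io
from collections import Counter

def best_reference(blast_tab):
    """Return (subject_id, n_reads) for the subject that is the first
    listed hit of the most reads, or None if there are no hits.

    Assumes tab-separated lines: column 1 = read ID, column 2 = subject ID.
    """
    seen_reads = set()
    counts = Counter()
    for line in blast_tab:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        read_id, subject_id = fields[0], fields[1]
        if read_id in seen_reads:
            continue  # count only the first listed hit for each read
        seen_reads.add(read_id)
        counts[subject_id] += 1
    return counts.most_common(1)[0] if counts else None

if __name__ == "__main__":
    # Toy input with made-up read and subject IDs, for illustration only.
    demo = io.StringIO(
        "read1\tphage_A\t98.7\n"
        "read1\tphage_B\t91.2\n"
        "read2\tphage_A\t99.1\n"
        "read3\tphage_B\t97.4\n"
    )
    print(best_reference(demo))
```

For a metagenome, the most-hit genome may still account for only a fraction of the reads, so the count returned alongside the ID is worth comparing with the total number of reads.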
I pity you, really, because this gives us a disastrous impression of your supervision situation. "Do something with this random data I throw at you" doesn't sound like a good understanding of the field. On the other hand, aligning some reads with the help of this forum would not constitute a PhD. I suggest you reformulate your question by answering all the items in my answer below.
What is your application? FastQC will report high sequence duplication for certain applications; this is not necessarily a problem.
It is important to remove duplicates to obtain good-quality contigs; it will also reduce file sizes and the time needed for BLAST runs.
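One way to remove duplicates without losing the /1 and /2 read names is to collapse at the level of the pair and write the original headers back out unchanged. The following is a minimal sketch, not a replacement for a dedicated tool. It assumes both FASTQ files use plain four-line records and list the pairs in the same order, it treats a pair as a duplicate only when both mate sequences match exactly, and it keeps the first copy it sees. The function names are hypothetical.

```python
def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, not needed
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def dedup_pairs(r1_in, r2_in, r1_out, r2_out):
    """Write the first occurrence of each distinct (read1, read2) sequence
    pair, keeping the original headers. Returns (pairs_kept, pairs_total)."""
    seen = set()
    kept = total = 0
    for (h1, s1, q1), (h2, s2, q2) in zip(read_fastq(r1_in), read_fastq(r2_in)):
        total += 1
        key = (s1, s2)
        if key in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(key)
        r1_out.write(f"{h1}\n{s1}\n+\n{q1}\n")
        r2_out.write(f"{h2}\n{s2}\n+\n{q2}\n")
        kept += 1
    return kept, total

# Example call (file names are placeholders):
# with open("read1.fastq") as a, open("read2.fastq") as b, \
#      open("read1.dedup.fastq", "w") as oa, open("read2.dedup.fastq", "w") as ob:
#     print(dedup_pairs(a, b, oa, ob))
```

All distinct sequence pairs are held in memory, which should be workable for about 3 million 76-bp pairs but would need a different approach for much larger datasets. Because only exact matches are collapsed, duplicates that differ by a sequencing error are kept.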
This is really a question for your supervisor. In any event, without knowing the source of the reads, there's not much anyone can do to help you.
I wish my supervisor could help me with this..
This should be a comment rather than an answer.
It will take too long to BLAST even 3 million of your reads. You need to use a short-read aligner, such as BWA or Maq.
Uh, virus metagenomics is not really my field; I hope some experts are around. I will retag for now.