Hello Everybody,
I was trying to process my NGS data before further analysis, but there is a problem in my data that I cannot understand. For your information, I downloaded public paired-end Illumina HiSeq 2000 NGS data in FASTQ format. In the FASTQ directory, I have two files, DRR003655_1.fastq.bz2 and DRR003655_2.fastq.bz2.
I assume that DRR003655_1.fastq.bz2 contains the forward reads and DRR003655_2.fastq.bz2 the reverse reads of the Illumina paired-end data. After quality control with FastQC, the data in both files looked pretty good: the reads have no adapter contamination and all reads are the same length. Therefore I did not do any further processing such as adapter removal or trimming with Trimmomatic.
I converted both files directly to FASTA and uploaded them to Galaxy for further analysis.
However, my problem is that when I try to interlace the two FASTA files using the Galaxy tool FASTA interlacer, selecting DRR003655_1.fasta as the left-hand mates and DRR003655_2.fasta as the right-hand mates, Galaxy reports an error in my data.
It warns that the program could not find the paired mates and that there is a problem in my data. I checked the read names. For DRR003655_1.fastq.bz2 the first read name is
DRR003655.1 FCD0RCJACXX:5:1101:1180:2119
and for DRR003655_2.fastq.bz2 the first read name is
DRR003655.1 FCD0RCJACXX:5:1101:1180:2119
I then tried the rename sequences tool on both FASTA files and interlaced them again, but it shows the same problem. Interestingly, when I use the FASTA joiner tool to join the two files, the program gives me a large number of joined reads plus some single reads. I really don't understand what is going on with my reads. I should mention that I am quite new to bioinformatics and am just trying to learn some basic tools using Galaxy.
Can anyone tell me what the problem is and how I can solve it? My ultimate goal is simply to interlace the paired-end reads, then use sequence sampling to reduce the number of reads and do some further analysis.
All comments and help are highly appreciated :))
The first problem is that you converted your FASTQ files to FASTA format. This is rarely required, and you lost the per-base quality scores in the process. Analyze the data as FASTQ where possible.
The following assumes that you are able to use the command line (on any OS, with Java available). While it is not completely clear what you are trying to do eventually, you can use the reformat.sh program from the BBMap suite to interleave your paired-end reads like this:
reformat.sh in1=DRR003655_1.fastq.bz2 in2=DRR003655_2.fastq.bz2 out=DRR003655_int.fastq.bz2 verifypaired=t
(I think the verifypaired flag should work to test whether your reads are in the proper order in your two files.) You may be able to sample reads in the same step (take a look at the help for the sampling parameters of reformat.sh).
Yes, I also thought that converting my reads changed something, but I simply wanted to reduce the size of my files. There is some problem with my FTP file transfer program; I couldn't connect with it. But as I remember, I also tried a FASTQ file with fewer reads, after extracting a portion of my read data, and it showed the same problem. Maybe this time I will try the SRA Toolkit to extract the reads.
About the BBMap suite, I will try it today. Could you please tell me whether it is possible to use it on Windows? I don't have any experience with BBMap. Is it difficult to install?
Many, many thanks for your reply and comments :))
You can use BBMap on Windows as long as you install Java. There is no installation needed for BBMap: download the software, uncompress it and use it. Take a look at this SeqAnswers thread for lots of help with BBMap (running it on Windows requires a slightly different syntax from the one I included above). Ask if you run into any issues.
What's your actual end goal? There's rarely a reason to convert fastq->fasta, for example.
Dear Devon,
My end goal is to run the data through the RepeatExplorer pipeline :))
Does the Galaxy server you're using offer Jupyter as an interactive environment? The error you're getting is because of the read names in the two files, which for some unknown reason the interlacer tool doesn't seem to be handling properly. The easiest solution, then, is to not use that tool and use something else instead. It's pretty trivial to write a read interlacer, so if your Galaxy instance supports Jupyter as an interactive environment I can hack together some code.
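To give an idea of what such an interlacer could look like, here is a minimal Python sketch. It is not the Galaxy tool's method and not anything from BBMap. It assumes plain 4-line FASTQ records, mates in the same order in both files, and read names that match on the first word of the header (an optional /1 or /2 suffix is ignored). The function names (open_text, read_fastq, read_id, interleave_fastq) are made up for this example.

```python
import bz2
import itertools


def open_text(path, mode="rt"):
    """Open a plain or bzip2-compressed file in text mode."""
    if str(path).endswith(".bz2"):
        return bz2.open(path, mode)
    return open(path, mode)


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples, assuming 4-line records."""
    while True:
        record = [handle.readline().rstrip("\n") for _ in range(4)]
        if not record[0]:
            return  # end of file
        if not record[0].startswith("@") or not record[2].startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {record[0]}")
        yield tuple(record)


def read_id(header):
    """Return the first word of the header without '@' and without a /1 or /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def interleave_fastq(path1, path2, out_path):
    """Write mates alternately (read 1, read 2, ...) and return the number of pairs."""
    pairs = 0
    with open_text(path1) as fh1, open_text(path2) as fh2, \
            open_text(out_path, "wt") as out:
        for rec1, rec2 in itertools.zip_longest(read_fastq(fh1), read_fastq(fh2)):
            if rec1 is None or rec2 is None:
                raise ValueError("the two files contain different numbers of reads")
            if read_id(rec1[0]) != read_id(rec2[0]):
                raise ValueError(f"mate names differ: {rec1[0]} vs {rec2[0]}")
            out.write("\n".join(rec1 + rec2) + "\n")
            pairs += 1
    return pairs
```

With the files from this thread, a call such as interleave_fastq("DRR003655_1.fastq.bz2", "DRR003655_2.fastq.bz2", "DRR003655_int.fastq.bz2") would stop with an error at the first pair whose names disagree, which also tells you whether the two files are really in the same order.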
Dear Devon,
I really don't know what environment it uses, because I am not an experienced bioinformatician like you. I am also not sure which approach would solve the problem :((