-Download the reads from the SRA archive in the merged format (failed)
-Merge the reads using pandaseq (done)
-Use fastq-dump --split function (impossible to install SRA toolkit correctly)
Since I'm having some problems with the reads merged using pandaseq, I want to try other strategies.
Do you have any suggestions?
Also, do you know how to install SRA-toolkit correctly?
Is it possible to just use a sudo command in Ubuntu instead of downloading the zipped folder?
It is definitely possible to install SRA toolkit correctly. The download page is here and installation instructions are here. It boils down to: 1) downloading the correct binaries; 2) unpacking the archive; 3) adding the archive's bin directory to the $PATH variable.
Alternatively, after unpacking the archive the contents of the bin directory can be moved to another directory that is already part of $PATH for your account. Type echo $PATH to find out what directories are already included. A partial list of my $PATH directories looks like this:
You could move the binaries to any of the directories separated by colons, but not all of them are meant for random programs. For example, from the SRA toolkit's bin directory you could issue this command:
sudo mv * /usr/local/bin
After that you may need to log out and back in, or open a new terminal window; typing which fastq-dump should then output something like /usr/local/bin/fastq-dump. From that point on, it is a matter of reading about the program's options and downloading the files as interleaved.
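If you would rather not move files with sudo, option 3) above can be sketched as follows. This is only an illustration: the directory layout and the dummy fastq-dump script are stand-ins made up for the example, not the real toolkit, so replace toolkit_bin with the bin directory of the archive you actually unpacked.

```shell
# Stand-in for the unpacked toolkit: a temp dir holding a dummy executable.
# (Hypothetical layout; point toolkit_bin at your real sratoolkit bin directory.)
toolkit_bin=$(mktemp -d)/sratoolkit/bin
mkdir -p "$toolkit_bin"
printf '#!/bin/sh\necho "dummy fastq-dump"\n' > "$toolkit_bin/fastq-dump"
chmod +x "$toolkit_bin/fastq-dump"

# Prepend the bin directory to PATH for the current shell session only.
PATH="$toolkit_bin:$PATH"
export PATH

# command -v prints the executable the shell will now pick up.
command -v fastq-dump
```

To make the change permanent, the same PATH line (with the real directory) would usually go into a shell startup file such as ~/.bashrc. Since this needs no sudo, it should also be usable on machines where you have no admin rights.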
After a while, I managed to install SRA-toolkit properly in the correct directory following your suggestions.
I also went through the 'Quick Toolkit Configuration' in order to allow remote access.
However, when I try to download the reads for the accession number using fastq-dump, I get this error:
Failed to call external services.
I used the prefetch command and it worked, so I don't know what is going on...
Do you have any other suggestions?
Do you know if it is possible to install the SRA-toolkit on an HPC system?
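Since prefetch worked, one thing that may be worth trying is to run fastq-dump on the local .sra file instead of the accession, so that it does not have to resolve anything remotely. I can't promise this avoids the error. The sketch below only prints the two commands unless RUN=1 is set. The helper names run and fetch_fastq are made up for this example, and the ./ACCESSION/ACCESSION.sra path is an assumption: where prefetch stores the file depends on the toolkit version and configuration.

```shell
# Echo a command (dry run) or execute it when RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Two-step download for one accession: prefetch, then convert the local file.
# Assumption: prefetch wrote ./ACC/ACC.sra; adjust the path if yours differs.
fetch_fastq() {
    acc="$1"
    outdir="${2:-fastq}"
    run prefetch "$acc" || return 1
    run fastq-dump --split-files --gzip --outdir "$outdir" "./$acc/$acc.sra"
}

# Dry run: prints the commands without touching the network.
fetch_fastq SRR1509742
```

With --split-files you get separate _1 and _2 files, which can then be combined into one file per sex in whatever way the pipeline expects.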
What do you mean by "merge": append to each other, or interleave in alternating order R1/R2/R1/R2... in one file? You can download data directly as fastq from sra-explorer.info, by the way. Please add some details.
Thank you for replying,
By 'merge' I mean append to each other (F and R), not in alternating order R1/R2, etc.
This will be the input file for the pipeline, and it's mandatory to use just two fastq files, one from male (m.fastq) and one from female (f.fastq) samples (not 4: m.F.fastq and m.R.fastq; f.F.fastq and f.R.fastq).
I already used sra-explorer.info but it's the same as downloading the file from the ENA archive...
Have you ever used USEARCH or PEAR to combine F and R fastq files?
This is an example of what I did using pandaseq:
pandaseq -F -f SRR1509742_1.fastq.gz -r SRR1509742_2.fastq.gz -d rbfkms -u unmerged_pandaseq.fa 2> pandastat.txt 1> merged_mandaseq_pacbio.fastq
Hi! So, I'm still a little confused about the format you want the reads in. There are two possible options:
1) Do you need the reads concatenated into a single file? For example, if you have 10,000 forward and 10,000 reverse reads, you want to generate a file that will have 20,000 reads (the first half with the forward reads, the second half with the reverse reads).
If this is the case, you can concatenate the two files using cat (or zcat if they are compressed).
2) Or, do you need each read pair merged? I think this is what you are trying to achieve by using Pandaseq, which takes each pair of forward and reverse reads, finds some overlap between them, and merges them. This means that if you have 10,000 forward and 10,000 reverse reads, in theory you will obtain 10,000 merged reads, assuming that all of the reads have enough overlap to be merged. If you are doing this, you need to be sure that the reads actually overlap; if not, you will not obtain many reads after merging.
Without knowing exactly what the input for redkmer is, I think what you need in this case is option 1, to concatenate your read files.
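Option 1 can be sketched like this, using two tiny made-up FASTQ files in place of the real ones (the file names follow the m.F/m.R naming used earlier in the thread; the reads are invented). Whether the pipeline accepts gzipped input is something I have not checked, so both variants are shown.

```shell
# Build two tiny gzipped FASTQ files standing in for the real forward/reverse files.
workdir=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' | gzip > "$workdir/m.F.fastq.gz"
printf '@r1/2\nCCAT\n+\nIIII\n@r2/2\nGGTA\n+\nIIII\n' | gzip > "$workdir/m.R.fastq.gz"

# Variant A: append the compressed files as they are.
# Concatenated gzip members form a valid .gz file, so nothing is decompressed.
cat "$workdir/m.F.fastq.gz" "$workdir/m.R.fastq.gz" > "$workdir/m.fastq.gz"

# Variant B: decompress both into one plain-text file (much larger on disk).
gzip -dc "$workdir/m.F.fastq.gz" "$workdir/m.R.fastq.gz" > "$workdir/m.fastq"

# Each FASTQ record is 4 lines, so reads = lines / 4.
n_gz=$(gzip -dc "$workdir/m.fastq.gz" | awk 'END { print NR / 4 }')
n_txt=$(awk 'END { print NR / 4 }' "$workdir/m.fastq")
echo "reads in m.fastq.gz: $n_gz, reads in m.fastq: $n_txt"
```

Both outputs hold all forward reads followed by all reverse reads. Variant A keeps the result roughly the size of the two inputs combined, which may matter if disk space is a concern.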
This is what they say in the paper:
'For the short-read libraries, data must be generated from both male and female samples independently and provided in fastq format as a single file (paired-end reads can be merged into one file for each sex).'
I will try both strategies (cat and pandaseq/SRA-toolkit).
When I merged the files using pandaseq, the input sizes were:
maleF.fastq.gz = 13 GB
maleR.fastq.gz = 13 GB
after merging:
male.merged.fastq = 34 GB
unmergedmale.fastq = 2 GB
I think that pandaseq worked properly, since most of the reads were in the merged file. Have you ever used paired-end reads concatenated with zcat (F and R) for alignments? I was just worried about the size of the input file if I use zcat (it will be more than 60 GB).
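One caveat on judging the merge by file size: the inputs are gzipped and the merged output is not, so 13 GB in versus 34 GB out is hard to compare directly. Counting reads is probably a safer check. Below is a minimal sketch on toy data; count_reads is a helper written for this example, and the file contents are invented.

```shell
# Count FASTQ records (4 lines each) in a plain or gzipped file.
count_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { print NR / 4 }'
}

# Toy data (made-up reads): 4 forward reads in, 3 merged reads out.
demo=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGT\n+\nIIII\n@r4\nACGT\n+\nIIII\n' \
    | gzip > "$demo/maleF.fastq.gz"
printf '@r1\nACGTAC\n+\nIIIIII\n@r2\nACGTAC\n+\nIIIIII\n@r3\nACGTAC\n+\nIIIIII\n' \
    > "$demo/male.merged.fastq"

# The forward file has one read per pair, so its count is the number of pairs.
pairs=$(count_reads "$demo/maleF.fastq.gz")
merged=$(count_reads "$demo/male.merged.fastq")
pct=$(( 100 * merged / pairs ))
echo "merged $merged of $pairs pairs (${pct}%)"
```

On the real files this streams through the data without writing anything to disk, so it costs time but no extra space.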