Hi everyone,
first of all: I'm quite new to Linux and to assembly in general.
I received my next-generation sequencing files from Eurofins. The method was Illumina paired-end 2×150. Each of the two FASTQ files contains almost 10 million sequences.
Somehow I managed to run SPAdes with my two files.
Dataset parameters:
Multi-cell mode (you should set '--sc' flag if input data was obtained with MDA (single-cell) technology or --meta flag if processing metagenomic dataset)
Reads:
Library number: 1, library type: paired-end
orientation: fr
left reads: ['xx1.fastq']
right reads: ['xx2.fastq']
interlaced reads: not specified
single reads: not specified
merged reads: not specified
Read error correction parameters:
Iterations: 1
PHRED offset will be auto-detected
Corrected reads will be compressed
Assembly parameters:
k: automatic selection based on read length
Repeat resolution is enabled
Mismatch careful mode is turned ON
MismatchCorrector will be used
Coverage cutoff is turned OFF
Other parameters:
Dir for temp files: xx/tmp
Threads: 16
Memory limit (in Gb): 15
(1) Are those parameters right? I don't understand the difference between paired-end, mate-pair and interlaced mode. We ordered paired-end sequencing and I received two files called xx_1.fastaq.gz and xx_2.fastaq.gz. Since I got two files, I think they are not interlaced, am I right? What about the other modes? Another point: are my files fr, rf or ff? I don't even know where to get this information from.
(2) If my parameters are right and SPAdes ran through on my files, I want to map the result with Bowtie2. I indexed my reference genome. But which files from SPAdes should I take for that? I assume contigs.fasta, but what exactly is scaffolds.fasta, and what are all the other files? I ran bowtie2 with the following command and got this output:
$ bowtie2 -x yy_REFERENCE -f -U xx/results/contigs.fasta -S yy/SAM/alignment_contigs.sam
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
(ERR): bowtie2-align exited with value 134
What does that mean? What is error value 134? I'm running Ubuntu 18.04 with 16 GB of RAM. I'm afraid it means that I have too little RAM; is that possible? I also have access to several clusters with more CPUs and RAM, but I have absolutely no clue how to run anything on them.
I had no problems running bowtie2 with the Lambda phage example files. It is also hard for me to find information about all this, so I would really appreciate it if any of you could recommend good books, papers or tutorials.
I know these are a lot of questions, but I hope you can help me with them.
I'm looking forward to your answers. Kiluah
(1) Mate-pair describes how the library is prepared; the two mate reads can be around 2 kb apart. Paired-end sequencing produces read pairs that are a few hundred bases apart at most. With both techniques you end up with two files (one for the forward reads, one for the reverse reads). Interlaced means that these two files are combined into a single one.
So I guess you have paired-end reads that are not interlaced, as you describe it.
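To make "interlaced" concrete, here is a minimal Python sketch of how two paired files would be merged into one interlaced file. It is only an illustration: the read names and sequences are made up, and the helper names `read_fastq` and `interleave` are hypothetical, not part of SPAdes or any other tool.

```python
from itertools import islice


def read_fastq(lines):
    """Yield FASTQ records as 4-line tuples (header, sequence, '+', qualities)."""
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(left_lines, right_lines):
    """Alternate records: read 1 of a pair, then read 2 of the same pair.

    Assumes both inputs list the pairs in the same order; zip() stops at
    the shorter input, so unequal files are silently truncated here.
    """
    out = []
    for left, right in zip(read_fastq(left_lines), read_fastq(right_lines)):
        out.extend(left)
        out.extend(right)
    return out


# Made-up toy reads standing in for the two paired files.
left = ["@pair1/1", "ACGT", "+", "IIII", "@pair2/1", "GGCC", "+", "IIII"]
right = ["@pair1/2", "TTAA", "+", "IIII", "@pair2/2", "CCGG", "+", "IIII"]

interleaved = interleave(left, right)
print("\n".join(interleaved))
```

If your data already comes as two separate files, as here, you do not need to do this; the sketch only shows what the single-file layout looks like.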
As far as I remember, in classic Illumina paired-end sequencing the first read of the fragment is sequenced in the sense (forward) orientation and the second one from the antisense (reverse) strand.
So here you have FR; you can confirm this by checking the Illumina library preparation kit.
(2) Scaffolds are contigs joined together by runs of N bases, so for mapping I would go for the contigs file, as Bowtie2 will do an end-to-end comparison.
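As a small illustration of the "contigs joined by N bases" idea, this Python sketch splits a scaffold back into its contig-like pieces at runs of N. The sequence is made up and `split_scaffold` is a hypothetical helper, not something SPAdes provides.

```python
import re


def split_scaffold(scaffold_seq, min_gap=1):
    """Split a scaffold into pieces at runs of at least `min_gap` N bases."""
    pattern = "[Nn]{%d,}" % min_gap
    # re.split can return empty strings at the ends; drop them.
    return [piece for piece in re.split(pattern, scaffold_seq) if piece]


# Made-up scaffold: three pieces joined by gaps of 10 and 5 Ns.
scaffold = "ACGTACGT" + "N" * 10 + "GGGTTTCC" + "N" * 5 + "TTAACC"

contigs = split_scaffold(scaffold)
print(contigs)
```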
As for your Bowtie2 issue: what was the command line used to generate the index, and what is the size of your contigs.fasta file? It looks like a classic memory issue, but 16 GB of RAM should be enough even for genomes like mouse or human.
What is the exact definition for scaffold?
Is it better to annotate contigs or scaffolds
Trinity strand specific: RF or FR
https://galaxyproject.org/tutorials/ngs/
Hello Bastien,
the command for building the index was the same as in the Lambda example:
yy/bowtie2-build xx/reference/organism.fa organism_REFERENCE
After that I received the same six files as in the example. My contigs.fasta is 6.78 MB and has around 3000 nodes. I read somewhere that bowtie2 expects a FASTA/FASTQ file with a single line per sequence, but the SPAdes output is wrapped at 60 characters per line. Might this be the problem? I will try to change it and run bowtie2 again.
My NGS data is from a bacterium with a genome of ~5.5 Mb. So according to you it shouldn't be a RAM problem, right? But what is the problem then?
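For reference, unwrapping a FASTA file so that every sequence sits on a single line could be sketched in Python as below. Whether the line wrapping actually matters to bowtie2 is not established in this thread; `unwrap_fasta` is a hypothetical helper and the records are made up.

```python
def unwrap_fasta(lines):
    """Return FASTA lines with each record reduced to a header line plus one sequence line."""
    out = []
    seq_parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record's sequence.
            if seq_parts:
                out.append("".join(seq_parts))
                seq_parts = []
            out.append(line)
        else:
            seq_parts.append(line)
    if seq_parts:
        out.append("".join(seq_parts))
    return out


# Made-up wrapped records (wrapped at 4 characters instead of 60).
wrapped = [">NODE_1", "ACGT", "ACGT", "AC", ">NODE_2", "GGG"]

unwrapped = unwrap_fasta(wrapped)
print("\n".join(unwrapped))
```

On a real file you would pass an open file handle instead of the list and write the returned lines to a new file.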
Please use the grey "add reply" button to add a reply to a comment. This keeps the thread readable and well organized. As you can see, I moved it, but it's not perfect.
Try to run bowtie2 without the -U option.
Still the same issue. Same error code:
Generally that error signifies problems related to memory. Have you tested the bowtie2 program with a small dataset? Take a couple of contigs (from contigs.fasta) and try running the program to see if it works. If it does not, then you will need to find alternate hardware.
The contigs.fasta file is 6.78 MB; that should be OK for Bowtie2.
You are right. It runs if I cut all contigs >20 kb. So it seems the contigs themselves are too long. My longest contig is almost 500 kb.
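The kind of length filtering described above could be sketched in Python like this. It is only an assumption about how one might do it, not the procedure actually used here; `parse_fasta` and `split_by_length` are hypothetical helpers, and the toy cutoff of 8 bases stands in for the 20 kb limit.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs; wrapped sequence lines are joined.

    Sequence lines appearing before the first header are ignored.
    """
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            name, parts = line[1:], []
        elif name is not None:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def split_by_length(records, max_len):
    """Separate records into those at most `max_len` bases long and the rest."""
    kept, dropped = [], []
    for name, seq in records:
        (kept if len(seq) <= max_len else dropped).append((name, seq))
    return kept, dropped


# Made-up contigs; NODE_1 is the only one longer than the toy cutoff.
fasta = [">NODE_1", "ACGTACGT", "ACGTAC", ">NODE_2", "GGGTTT", ">NODE_3", "TTAACCGG"]

kept, dropped = split_by_length(parse_fasta(fasta), max_len=8)
print("kept:", [name for name, _ in kept])
print("dropped:", [(name, len(seq)) for name, seq in dropped])
```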
If I remove some of the longest contigs I get this:
So I tried to remove even more, and once all contigs are <20 kb I get this:
But the alignment rate is pretty low, isn't it? I only removed the first 32 contigs. Is there a way to ask bowtie2 to show me just the aligned sequences?
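One way to look at only the aligned records is to filter the SAM output on the FLAG field: bit 0x4 marks an unmapped record. Below is a minimal Python sketch of that filter; the SAM lines are made up and `aligned_records` is a hypothetical helper. As far as I know, bowtie2 also has a `--no-unal` option that suppresses unaligned records directly, but check the manual of your version.

```python
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def aligned_records(sam_lines):
    """Yield SAM alignment lines whose FLAG does not have the unmapped bit set."""
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if not int(fields[1]) & UNMAPPED:
            yield line


# Made-up SAM content: NODE_2 is unmapped (FLAG 4), the others are mapped.
sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "NODE_1\t0\tref\t100\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "NODE_2\t4\t*\t0\t0\t*\t*\t0\t0\tGGGTTTCC\tIIIIIIII",
    "NODE_3\t16\tref\t500\t42\t8M\t*\t0\t0\tTTAACCGG\tIIIIIIII",
]

aligned = list(aligned_records(sam))
aligned_names = [line.split("\t")[0] for line in aligned]
print(aligned_names)
```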
As swbarnes2 said in an answer, why do you want to align your FASTA rather than your FASTQ files?
Give BWA-MEM or minimap2 a try if you still want to align your contigs.
http://lh3.github.io/2018/04/02/minimap2-and-the-future-of-bwa