Hello,
With paired-end sequencing we get two files, R1.fq and R2.fq.
Is R1.fq the first strand of the DNA and R2.fq the second strand?
And how do the reads overlap? Must we merge the two reads from R1.fq and R2.fq into one read?
written 9.1 years ago by midox (updated 2.4 years ago by Ram)
Hello,
I have some questions, please. How can we know the "known distance" between paired-end reads? And if we have two files, Forward and Reverse, is the Reverse file always the reverse-complement strand? Can we never find reverse-complement reads in the Forward file?
Thanks
written 8.8 years ago by midox (updated 2.4 years ago by Ram)
Known distance is usually a best guess based on how long the target sequences are supposed to be: library insert size or length from timed PCR.
Yes, reverse is from the reverse complement. Here's a video that describes Illumina paired end sequencing:
written 8.8 years ago by anp375 (updated 2.4 years ago by Ram)
How do I find the distance? I only have the paired-end files.
written 8.8 years ago by midox (updated 5.0 years ago by Ram)
If you look here, inside this forum, you can find the answer.
To find the distance, you need to map the reads to a reference and get the SAM/BAM mapping file.
There are tools that let you work out the distance between read mates from that mapping.
You should also say what kind of paired reads you have (SOLiD, Illumina, ...).
About merging: it depends on what you are going to do with the reads (what kind of analysis). In my view, read mapping and assembly are more accurate with paired-end reads than with merged reads, but again it depends on what you want to do.
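For the distance question, here is a minimal Python sketch of the idea, assuming the reads have already been mapped and the alignments are available as SAM text. It reads the TLEN column (column 9 of a SAM record) once per properly paired read pair and reports the median. The function name `insert_sizes` and the three example pairs are made up for illustration; a real file would come from your mapper.

```python
import statistics

def insert_sizes(sam_lines):
    """Collect |TLEN| once per properly paired read pair from SAM text lines."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):           # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:               # not a complete alignment record
            continue
        flag = int(fields[1])
        tlen = int(fields[8])              # column 9: observed template length
        # 0x2 = properly paired, 0x40 = first read of the pair
        if flag & 0x2 and flag & 0x40 and tlen != 0:
            sizes.append(abs(tlen))
    return sizes

def record(name, flag, pos, mate_pos, tlen):
    """Build a hypothetical SAM record ('*' stands for omitted SEQ and QUAL)."""
    cols = [name, flag, "chr1", pos, 60, "100M", "=", mate_pos, tlen, "*", "*"]
    return "\t".join(str(c) for c in cols)

example = [
    "@SQ\tSN:chr1\tLN:10000",
    record("r1", 99, 100, 400, 400), record("r1", 147, 400, 100, -400),
    record("r2", 83, 900, 500, -500), record("r2", 163, 500, 900, 500),
    record("r3", 99, 2000, 2350, 450), record("r3", 147, 2350, 2000, -450),
]

sizes = insert_sizes(example)
print(statistics.median(sizes))            # 450
```

Counting only the first read of each pair avoids counting every fragment twice, since both mates carry the same template length with opposite signs.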
written 9.1 years ago by midox (updated 5.0 years ago by Ram)
All your rX.1 reads are your forward reads, stored in one file, and the rX.2 reads are the reverse ones. You can use an assembler (SPAdes, Ray, ...) and specify on the command line which file contains the forward reads and which the reverse reads. The choice of assembler depends on the type of organism you are studying: some assembly tools perform better on viruses, others on bacteria. Before going into any assembly step, did you preprocess your reads? Any filtering? Any trimming?
But I want to understand how the assembly works when the reads are paired-end (two files).
For example, if I want to overlap reads, do I take the forward and the reverse reads as they are and look for overlaps?
Or do I make the reverse complement of R2.fq?
Thanks
written 9.1 years ago by midox (updated 5.0 years ago by Ram)
Just give your reads to an assembler. Don't preprocess them unless the assembler requires it. If you want to understand how paired-end assembly works, read the paper for the assembler you use. Different assemblers work in different ways. Choose your assembler first, and then ask questions.
written 9.1 years ago by Kamil (updated 5.0 years ago by Ram)
Thank you for your help.
I'm not talking about RNA. I have DNA data and I want to do an assembly, but first I need to understand the R1.fq and R2.fq files and how to find the overlaps.
With paired-end sequencing, you sequence both ends of the same shotgun fragment. It is just as you have drawn. The middle part, represented in your diagrams as dots, remains unknown.
The principles of assembly do not differ much from those for single-end sequencing (where only one end of the shotgun fragment is sequenced), except that the assembler takes the information from both ends into account at the same time.
This is highly advantageous. With both ends of each shotgun fragment sequenced, you have simply doubled the amount of information available for the assembly. The fact that both paired sequences come from the same fragment, and are separated by a known distance, places constraints on the assembler and allows a better placement of the sequences in the final draft of the genome.
So I have both the R1.fq and R2.fq files, and I just find the overlaps between reads in the two files without making the reverse complement of R2.fq? I use both files as they are?
Thank you
written 9.1 years ago by midox (updated 5.0 years ago by Ram)
Exactly. The assembler will take care of it. But you need to tell the program that this is paired data. Most of the time this is done simply by writing one file followed by the other on the command line.
I know that I can tell the assembler what type of files I have.
But suppose, for example, that I want to write an assembler myself.
I want to understand paired-end reads and how they work.
Tell me if I'm wrong: if I assemble paired reads, I must take both the R1.fq and R2.fq files and find the overlaps between the reads of the two files?
I'm no expert, but I want to understand the basics. I have read many articles, but none of them addresses such questions.
written 9.1 years ago by midox (updated 5.0 years ago by Ram)
When you sequence with NGS (Next Generation Sequencing) systems, you break the DNA or cDNA into pieces (the shotgun fragments) and run a step that selects for a given size (let's say 600 bases, plus or minus about 50 bases).
Then, by sequencing 100 bases from both ends, you know that there is a central part, approximately 400 +/- 50 bases long, whose sequence remains unknown. This is not a big deal, because that region will be covered by other shotgun fragments, as long as the DNA was broken at random. Simply by chance, other sequenced ends will cover that gap.
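The arithmetic behind that example, as a small Python sketch. The function name `inner_distance` is only illustrative, and the second call (a 500-base fragment with 300-base reads) is a hypothetical case added to show what happens when the reads are longer than half the fragment.

```python
def inner_distance(fragment_len, read_len):
    """Unsequenced bases between the two mates; negative means the mates overlap."""
    return fragment_len - 2 * read_len

print(inner_distance(600, 100))   # 400: the unknown middle part in the example
print(inner_distance(500, 300))   # -100: the two mates overlap by 100 bases
```

A negative value is the situation in which merging the two mates into one longer read becomes possible at all; with a positive value there is nothing to merge, because the mates do not touch.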
You only have to change your thinking a little. The assembler will try to fit everything together by looking for overlaps between these pieces. If only one end is sequenced, I know this is easy for you to understand.
Now you need to consider that you actually have a block of two paired sequences from the same DNA fragment, separated by a certain distance, that needs to play the same game.
Each end can be saved in a different file. The internal name of each read allows the assembler to recognize its corresponding mate in the other file, if present. That is why you can have two separate files.
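A rough Python sketch of how mates can be matched by name across the two files. It assumes plain four-line FASTQ records and names that differ only by a trailing /1 or /2; the helper names (`read_fastq`, `base_name`, `pair_mates`) and the tiny example reads are invented for illustration, and real assemblers may do this differently.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from FASTQ text lines (4 lines per record)."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                              # '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual   # drop the leading '@'

def base_name(name):
    """Drop any description and a trailing /1 or /2 so both mates share one key."""
    name = name.split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name

def pair_mates(r1_lines, r2_lines):
    """Return (pairs, orphans): R1 reads matched to their R2 mate, and unmatched R1 names."""
    r2_by_name = {base_name(n): (n, s, q) for n, s, q in read_fastq(r2_lines)}
    pairs, orphans = [], []
    for n, s, q in read_fastq(r1_lines):
        mate = r2_by_name.get(base_name(n))
        if mate is None:
            orphans.append(n)                 # mate not present in the other file
        else:
            pairs.append(((n, s, q), mate))
    return pairs, orphans

r1 = ["@r1/1", "ACGT", "+", "IIII", "@r2/1", "GGCC", "+", "IIII"]
r2 = ["@r1/2", "TTGA", "+", "IIII"]
pairs, orphans = pair_mates(r1, r2)
print(len(pairs), orphans)                    # 1 ['r2/1']
```

Looking mates up by name, rather than by position in the file, is what lets a read whose mate was filtered out be detected as an orphan.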
Here there was an overlap between r1.1 and r3.1, so we can assemble them.
We also have an overlap between r1.2 and r2.1.
In this case, there is an overlap between reads from F1.fq and reads from F2.fq. Do we assemble them in the same way as r1.1 and r3.1 (without creating the reverse complement of r1.2, taking the sequence as it is in the file)?
Thank you
written 9.1 years ago by midox (updated 5.0 years ago by Ram)
I don't see in your example that r1.2 overlaps with r2.1.
I see that r2.1 overlaps with the reverse of r3.2, and the latter with r1.2. So a contig can be formed from r2.1, the reverse of r3.2 and the reverse of r1.2, in this order.
So to make an assembly from paired-end data, we must reverse the rX.2 reads (the second file) in order to find the overlaps between all reads in the two files?
It is necessary to have the reads in the same direction.
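To make the "same direction" point concrete, here is a toy Python sketch that looks for a suffix-to-prefix overlap between two reads, trying the second read both as stored and reverse-complemented. The sequences are invented, `min_len` (assumed to be at least 1) is an arbitrary threshold, and real assemblers use far more efficient methods than this brute-force comparison.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def best_overlap(a, b, min_len=3):
    """Try b as given ('+') and reverse-complemented ('-'); return (length, orientation)."""
    fwd = suffix_prefix_overlap(a, b, min_len)
    rev = suffix_prefix_overlap(a, reverse_complement(b), min_len)
    return (fwd, "+") if fwd >= rev else (rev, "-")

read_a = "ATGGCGTACC"        # hypothetical read from the first file
read_b = "TAATCCGGTA"        # hypothetical read from the second file, as stored

print(best_overlap(read_a, read_b))   # (4, '-'): overlap only after reverse-complementing
```

As stored, the two reads share no overlap; once the second read is reverse-complemented to TACCGGATTA, its first four bases match the last four of the first read.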
One more thing. Some assemblers require that you keep your paired reads in a single FASTQ file. Most often, they are in separate files.
If you look at this wiki page on FASTQ, you will see that every read name contains the spatial assignment (the read location) of the sequence in the sequencer, which allows paired ends to be recognized, the presence or absence of barcodes, and whether the sequence is the first read or its mate, marked with /1 and /2 respectively.
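As an illustration, here is a small Python sketch that splits a read name of the older Illumina style (instrument, lane, tile, x and y coordinates, then `#index` and `/1` or `/2`) into its parts. The regular expression is an assumption about that one layout only; newer Illumina headers put the mate number in a separate field after a space, and this pattern would simply return None for them.

```python
import re

# Assumed layout: @instrument:lane:tile:x:y#index/mate
HEADER = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)/(?P<mate>[12])$"
)

def parse_header(line):
    """Return the header fields as a dict, or None if the line has another layout."""
    match = HEADER.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y", "mate"):
        fields[key] = int(fields[key])
    return fields

info = parse_header("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(info["lane"], info["tile"], info["mate"])   # 6 73 1
```

Two reads are mates when everything before the final /1 or /2 is identical.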
As for your original questions, from my own experience: 1) R1 holds reads from both strands, and so does R2. The only thing you know is that read x in R1 corresponds to read x in R2, each on a different strand. 2) You can align each read against a reference genome. In that case, you need to reverse-complement half the data. You have no a priori information about which read to reverse, so you need to align both ways. 3) You can merge first and then align. You would still need to check both alignment directions.
In this case, how can we know the relationship between the paired-end reads, in order to find the relationship between the contigs after assembly and create scaffolds?
written 8.8 years ago by midox (updated 5.0 years ago by Ram)
Assemblers use the information from paired-end sequences to form contigs.
Gaps and repeated sequences are responsible for the appearance of many contigs.
But assemblers cannot join or organize contigs into scaffolds once the assembly is done. They simply cannot do it, and this is why you end up with many different contigs.
To organize contigs into scaffolds you need a different strategy, such as mate-pair reads, long sequencing reads (PacBio, long Illumina reads, Nanopore) or a comparison with a trusted genome.
Can we not do the scaffolding with only the paired-end reads?
Thanks
written 8.8 years ago by midox (updated 5.0 years ago by Ram)
Look in this forum and on the Illumina web pages for information about mate-pair reads, a different kind of paired sequence in which the two reads from the same fragment retain long-distance genomic information (usually several kb).
Assemblers cannot go beyond forming contigs because of the limitations of short shotgun sequences and the presence of repeated sequences (hard to manage) and gaps. If an assembler using paired-end data gives you many separate contigs, it is because it cannot go beyond that. Assembly from second-generation sequencers that use short shotgun reads is far from perfect and very limited. An assembler trying to assemble a simple 4.5 Mb E. coli genome with 100X coverage of Illumina reads can give you between 150 and 250 different contigs.
This is why you need to overcome these limitations with mate-pair reads, long-read sequencing or, if possible, by comparing and ordering the contigs against a trusted reference genome, if one is available.
So with paired-end reads I have to download the mate pairs?
written 8.8 years ago by midox (updated 5.0 years ago by Ram)
If you download paired-end reads, you are downloading both mates from each fragment, that is, the sequences from both ends of the same fragment. You can also be given single-end sequences, in which only one of the two ends is sequenced.
But don't get confused. The two reads of a paired-end fragment are called mates because they are related to each other: they are the two ends of the same fragment, separated by less than 300 to 500 bp.
A mate pair is different, however. It is obtained with a completely different protocol and corresponds to sequences from a single fragment that is several kilobases long. So the mates in a mate-pair fragment are separated by several kb.
Check the Illumina pages for the protocols used to obtain paired-end and mate-pair sequences.
If you find these answers useful, acknowledge them by voting. If you consider the subject closed, do the same, so people know that this thread contains useful information. No votes, no interest in reading this.
I downloaded paired-end reads with length 300 bp, but I have no mate pairs.
What do I do in this case?
written 8.8 years ago by midox (updated 5.0 years ago by Ram)
You do not have many choices. You can only assemble with these paired-end data.
Use different assemblers such as MIRA (OLC method) and Velvet or SPAdes (de Bruijn graph), and compare them by measuring assembly metrics such as N50.
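For reference, N50 can be computed from nothing more than the list of contig lengths. This is a minimal Python sketch using the common definition (the length of the contig at which the running total, taken from the longest contig downwards, first reaches half of the assembly size); the two sets of contig lengths are made-up numbers standing in for the output of two assemblers.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0                                  # empty assembly

assembly_a = [100, 80, 50, 20, 10]            # hypothetical contig lengths
assembly_b = [40, 40, 40, 40, 40, 40, 20]     # same total size, more fragmented

print(n50(assembly_a), n50(assembly_b))       # 80 40
```

Both toy assemblies have the same total length, so the higher N50 of the first one reflects only that it is less fragmented. N50 says nothing about whether the contigs are correct.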
What should I do in this case? Do I have to download a mate-pair file?
Can I get a link to download mate pairs for E. coli?
Thanks
written 8.8 years ago by midox (updated 5.0 years ago by Ram)
You need to prepare your mate pairs from the same genome you are sequencing as paired-end.
If you want to improve the assembly, you have a hard task. The only thing I believe you can do is use different assemblers in the hope that one is better than the others. Use MIRA, Velvet, SPAdes, etc.
No, I want to do scaffolding. I built contigs with an assembler, but now I need to do the scaffolding, and I do not want to use a scaffolding tool. I want to do it myself, but I do not know how!
You advised me to use mate pairs, but I do not know how to get them. This is my problem.
"You need to prepare your mate pairs from the same genome you are sequencing as paired-end"
I did not prepare the paired-end reads; I downloaded the files. What do I do for mate pairs?
Thanks
written 8.8 years ago by midox (updated 2.4 years ago by Ram)
You need to prepare your mate-pair library at the same time as your paired-end fragments. If you check the Illumina information, you will notice that the two follow different protocols.
If you don't have mate-pair sequences, you can't do anything. You need to prepare the mate pairs at the same time, from the same genome, and sequence everything together.
You don't mention what your genome is. Maybe there is a trusted reference genome you can compare against, with programs like Mauve that let you organize your contigs into scaffolds.
These two organisms have good, trusted genomes you can use as references.
Download and install Mauve, read its instructions, and use its tool for ordering the contigs from your assembly by comparison with these reference genomes.
Do you have any references, please?
http://lmgtfy.com/?q=figure+out+internal+distance+in+paired+end+reads