Question

How Can I Use Mate-Pair Sequences For SOAPdenovo?

13.1 years ago

toshnam ▴ 650

Hi all,

I want to assemble paired-end sequences and mate-pair sequences (HiSeq2000) together using SOAPdenovo.

On the SOAPdenovo home page, mate-pair usage is described as follows: "Mate-pair relationship could be indicated in two ways: two sequence files with reads in the same order belonging to a pair, or two adjacent reads in a single file (FASTA only) belonging to a pair." (http://soap.genomics.org.cn/)

How can I convert a raw mate-pair FASTQ file into the proper format for SOAPdenovo assembly? Is there a conversion script?

Thanks.

Updated 6.2 years ago by Biostar 20 • written 13.1 years ago by toshnam ▴ 650

score 2 · Answer 1 · 2011-10-17

13.1 years ago

Fabian Bull ★ 1.3k

The important thing is to set reverse_seq to 1.

Example config:

max_rd_len=100
[LIB]
avg_ins=2000
reverse_seq=1
asm_flags=3
rank=1
q1=/path/to/fastq_read_1.fq
q2=/path/to/fastq_read_2.fq

This sets the maximum read length to 100 and the average insert size to 2000. asm_flags=3 declares that the reads are used for both contig assembly and scaffolding. The rank parameter matters when you have multiple libraries: it sets the order in which they are used for scaffolding. fastq_read_1.fq and fastq_read_2.fq are FASTQ files containing the two ends of each pair in the same order. If you have single reads to add (for example, reads whose mates were discarded during quality filtering), use q. If you have FASTA files instead of FASTQ files, use f1, f2 and f in the same way.
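Since the question was about assembling paired-end and mate-pair reads together, here is a minimal Python sketch that writes a config with one [LIB] section per library. It only uses the keys already shown in this thread. The helper names (lib_block, build_config), the file paths, the insert sizes, and the rank values are placeholders of mine, not something from the SOAPdenovo documentation, so adjust them to your own data.

```python
def lib_block(avg_ins, reverse_seq, rank, q1, q2, asm_flags=3):
    """Return one [LIB] section as text, using only keys shown in this thread."""
    lines = [
        "[LIB]",
        f"avg_ins={avg_ins}",
        f"reverse_seq={reverse_seq}",
        f"asm_flags={asm_flags}",
        f"rank={rank}",
        f"q1={q1}",
        f"q2={q2}",
    ]
    return "\n".join(lines)


def build_config(max_rd_len, libs):
    """Join the global max_rd_len line with one [LIB] section per library."""
    sections = [f"max_rd_len={max_rd_len}"]
    sections += [lib_block(**lib) for lib in libs]
    return "\n".join(sections) + "\n"


# Hypothetical libraries: a short-insert paired-end set (forward-reverse)
# and a mate-pair set (reverse-forward, hence reverse_seq=1).
config = build_config(100, [
    dict(avg_ins=200, reverse_seq=0, rank=1,
         q1="/path/to/pe_1.fq", q2="/path/to/pe_2.fq"),
    dict(avg_ins=2000, reverse_seq=1, rank=2,
         q1="/path/to/mp_1.fq", q2="/path/to/mp_2.fq"),
])
print(config)
```

Save the printed text to a file and pass that file to SOAPdenovo as its config. Giving the mate-pair library the higher rank is my assumption here, on the idea that the short-insert library should be used for scaffolding first.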

Thank you for your help. "reverse_seq=1" is the solution for using mate-pair sequences! Below is the relevant comment from the SOAPdenovo home page: "There are two types of paired-end libraries: a) forward-reverse, generated from fragmented DNA ends with typical insert size less than 800 bp; b) reverse-forward, generated from circularizing libraries with typical insert size greater than 2 Kb. User should set parameter for tag "reverse_seq" to indicate this: 0, forward-reverse; 1, reverse-forward."

Reply • 13.1 years ago by toshnam ▴ 650

score 1 · Answer 2 · 2011-10-17

AFAIK you can use a mate-pair Illumina dataset the same way as a regular PE dataset, using the /1 and /2 delimiters. However, you need to make sure the reads point in the right direction! MP libraries generate read pairs like A<---->B, and you need to modify them into A---><---B by reverse-complementing the reads; otherwise you get negative mapping distances. I think that is all.
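If you go this manual route, a minimal Python sketch of the reverse-complement step could look like the one below. It assumes plain 4-line FASTQ records (no wrapped sequences) and only the bases A, C, G, T and N; the function names are my own. Note that this would be done instead of setting reverse_seq=1 as described in the other answer, not in addition to it, since flipping the reads and also declaring them reverse-forward would presumably undo the correction.

```python
# Translation table for complementing bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def revcomp_fastq(lines):
    """Yield FASTQ lines with every record reverse-complemented.

    The quality string is reversed so that each score still lines up
    with its base. Assumes plain 4-line records.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            yield header
            yield reverse_complement(seq)
            yield plus
            yield qual[::-1]
            record = []
```

To convert a file, iterate over an open input handle and write each yielded line plus a newline to the output handle. Both mate files need the same treatment, and the read order must stay unchanged.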

Be sure to filter your MP dataset beforehand, since mate-pair data is known to contain many adapter artefacts.

my 2ct

Ram · Answer 3 · 2011-10-17

You create a config file like this (following the manual at http://soap.genomics.org.cn/soapdenovo.html#comm2):

max_rd_len=125
[LIB]
avg_ins=200
asm_flags=3
reverse_seq=0
rank=1
q1=/home/jvh/data/SequenceAssembly/nobackup/SRA/SRP000220/SRX000429/SRR001665_1.fastq
q2=/home/jvh/data/SequenceAssembly/nobackup/SRA/SRP000220/SRX000429/SRR001665_2.fastq

In the config file, you define libraries with [LIB], and each library can contain different readsets. Above I have defined a readset consisting of two files, containing paired reads in the inward configuration, or as the manual says, forward-reverse ( --> <-- ). How your reads are oriented depends on how they were produced. The reads must be in the same order in both files, and no read should be missing; otherwise SOAPdenovo will not work correctly.
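Since the same-order requirement is easy to break during quality filtering, a quick sanity check before assembly can help. The following is a minimal Python sketch, assuming plain 4-line FASTQ records and read names that differ only by a trailing /1 or /2 (or by a description after whitespace). The function names are hypothetical and this is not part of SOAPdenovo.

```python
from itertools import zip_longest


def read_id(header):
    """Strip '@', any description after whitespace, and a trailing /1 or /2."""
    name = header.strip()[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def headers(lines):
    """Yield the header line of each 4-line FASTQ record, skipping blank lines."""
    nonblank = (line for line in lines if line.strip())
    for i, line in enumerate(nonblank):
        if i % 4 == 0:
            yield line


def first_mismatch(lines1, lines2):
    """Return (record_number, id1, id2) for the first disagreement, else None.

    A missing read at the end of either file shows up as None for that id.
    """
    pairs = zip_longest(headers(lines1), headers(lines2))
    for n, (h1, h2) in enumerate(pairs, start=1):
        id1 = read_id(h1) if h1 is not None else None
        id2 = read_id(h2) if h2 is not None else None
        if id1 != id2:
            return (n, id1, id2)
    return None
```

Open both files and pass the two handles to first_mismatch; a result of None means every record in the first file has its mate at the same position in the second.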