Question

How To Generate A Paired End File Suitable For Bwa

0

12.1 years ago

GPR ▴ 390

A question about BWA and paired-end reads. In TopHat, one inputs the paired-end fastq file twice and the tool understands you have paired-end reads. I see that with BWA one must align each end of the pair separately. Can somebody give me a hint on how to generate these two matching files from my paired-end one? Thanks. G.

bwa • 6.0k views

ADD COMMENT • link updated 12.1 years ago by Ashutosh Pandey 12k • written 12.1 years ago by GPR ▴ 390

0

I don't understand your question. Do you have one fastq file with mixed pairs that you want to split for BWA, or do you not know how to run BWA with two paired-end fastq files?

ADD REPLY • link 12.1 years ago by JC 13k

0

The first: I have one paired-end fastq file and want to split the pairs.

ADD REPLY • link 12.1 years ago by GPR ▴ 390

0

A simple script can help you. Can you show us the input? (Just to be sure of the format.)

ADD REPLY • link 12.1 years ago by JC 13k

0

I have *.dat files.

ADD REPLY • link 12.1 years ago by GPR ▴ 390

1

*.dat? Do you have fastq/fq or sam/bam? In a fastq file, mixed pairs can come in two formats (as far as I know):

1) each read is reported independently:

  @read_1
  ACATTCATTCATCTAT
  +
  BBBBBBBBBBBBBBBB
  @read_2
  TGCATGCAGCATGGCC
  +
  BBBBBBBBBBBBBBBB

2) both mates are fused into a single record:

@read_12
ACATTCATTCATCTATTGCATGCAGCATGGCC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

But 2) can be tricky, depending on whether the pairs are f-f or f-r.
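For format 2), a minimal Python sketch of cutting a fused record in the middle could look like the following. It assumes both mates have the same length; the function name, the `/1` and `/2` header suffixes, and the `reverse_second` switch (a guess at how an f-r layout might be handled) are all hypothetical and should be checked against how the file was actually produced.

```python
# Sketch only: assumes each fused record holds two mates of equal length.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def split_fused(header, seq, qual, reverse_second=False):
    """Cut one fused FASTQ record into two (header, seq, qual) mates.

    reverse_second is a hypothetical option for f-r layouts: it
    reverse-complements the second mate and reverses its qualities.
    """
    if len(seq) != len(qual) or len(seq) % 2:
        raise ValueError(f"cannot halve record {header}")
    half = len(seq) // 2
    seq2, qual2 = seq[half:], qual[half:]
    if reverse_second:
        seq2 = seq2.translate(COMPLEMENT)[::-1]
        qual2 = qual2[::-1]
    # The /1 and /2 suffixes are an assumed naming convention.
    mate1 = (header + "/1", seq[:half], qual[:half])
    mate2 = (header + "/2", seq2, qual2)
    return mate1, mate2


if __name__ == "__main__":
    m1, m2 = split_fused("@read_12", "ACATTCATTCATCTATTGCATGCAGCATGGCC", "B" * 32)
    for name, seq, qual in (m1, m2):
        print(name, seq, qual, sep="\n")
```

Whether the second mate needs to be reverse-complemented depends on the library layout, so it is worth aligning a handful of reads both ways before committing to one setting.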

ADD REPLY • link 12.1 years ago by JC 13k

0

I am not aware of this format. If you are using Unix, can you run something like "head -10 yourdatfile" and copy and paste the output here?

ADD REPLY • link 12.1 years ago by Ashutosh Pandey 12k

0

OK, sorry. It's FASTQ right out of the Illumina instrument. I said *.dat because that's how the file names end. As I say, they are FASTQ.

ADD REPLY • link 12.1 years ago by GPR ▴ 390

1

As JC mentioned above, if the files look something like this:

@read_1
ACATTCATTCATCTAT
+
BBBBBBBBBBBBBBBB

@read_2
TGCATGCAGCATGGCC
+
BBBBBBBBBBBBBBBB

Then you can use grep in Unix. For example, with GNU grep,

grep -A3 --no-group-separator '^@.*_1$' Filewithmixedreads.txt > Filewithreadsfrom_1end.fastq

should work, as long as no quality line happens to start with @ and end with _1.

Similarly, you should be able to get the reads ending with _2 into another file.
I hope this helps; otherwise, please print some lines from the file for us to see.

Thanks.
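If grep turns out to be fragile (FASTQ quality lines are allowed to start with @), a record-based split is a possible alternative. This is only a sketch: it assumes strict 4-line records with no blank lines between them and headers ending in _1 or _2, and the function name and file names are made up for illustration.

```python
import io
from itertools import islice


def split_by_suffix(src, out1, out2):
    """Send records whose header ends in _1 to out1 and in _2 to out2.

    Assumes strict 4-line FASTQ records (header, sequence, '+', quality).
    Returns the number of records written to (out1, out2).
    """
    n1 = n2 = 0
    while True:
        record = list(islice(src, 4))  # read exactly one record
        if not record:
            break
        if len(record) < 4 or not record[0].startswith("@"):
            raise ValueError("input is not made of 4-line FASTQ records")
        name = record[0].rstrip()
        if name.endswith("_1"):
            out1.writelines(record)
            n1 += 1
        elif name.endswith("_2"):
            out2.writelines(record)
            n2 += 1
        else:
            raise ValueError(f"header without _1/_2 suffix: {name}")
    return n1, n2


if __name__ == "__main__":
    # In-memory demo; the second quality line starts with '@' on purpose.
    mixed = io.StringIO(
        "@read_1\nACATTCATTCATCTAT\n+\nBBBBBBBBBBBBBBBB\n"
        "@read_2\nTGCATGCAGCATGGCC\n+\n@BBBBBBBBBBBBBBB\n"
    )
    end1, end2 = io.StringIO(), io.StringIO()
    print(split_by_suffix(mixed, end1, end2))
```

With real files, the three arguments would be ordinary file objects, for example `open("Filewithmixedreads.txt")` for the input and two files opened with mode `"w"` for the outputs.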

ADD REPLY • link 12.1 years ago by Ashutosh Pandey 12k

0

My reads actually look like this:

+
CCCFFFFFHHHHHIHGGHFIJJIIJJIJIEHHJJJFGJIJIIIIJJJIGCGHIIJHEHIJJB=?DFBCCBEEEEDA
@HWI-ST974:67:C0545ACXX:2:1101:1628:32111 1:N:0:
GTTGGAGCAGGCCCGCAAGGCCGAAGAGGTGCAGGCCTGGGCGCAGCGCAAGGAGCGGGAAGTGCTGCAGCTGCAG
+
@BCFFFFFHHHGHJIJJJJJIJJJJJGJJFHIGJJIJJJJIGGHFFEDDDDDDDDDDDDDBBCDDEDDDDDDDDD>
@HWI-ST974:67:C0545ACXX:2:1101:1521:32121 1:N:0:
GTGACTGTCGTGTCCTCGTCGACCTCCTTCTCCTGTCGCTCCAGATCCGCCTCAATCTCCTTGAGCTCTTCCAGCT

Thanks!

ADD REPLY • link 12.1 years ago by GPR ▴ 390

0

That looks like single-end reads, or the first mate of a paired-end run (1:N:0). Are you sure that you have mixed paired-end reads? Actually, from your example, your first sequence maps to MRPS26 (chr20:3027308-3027383) and the second to SPC24 (chr19:11258704-11258779) in hg19, without spanning.
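One way to check this on the whole file is to tally the read number that this Illumina header style carries right after the space (the leading 1 in 1:N:0). The sketch below assumes strict 4-line records and headers of that style; the function name is made up. If only "1" shows up, the file holds just one end of the pairs and the second end should be in a separate file.

```python
from collections import Counter


def count_mates(lines):
    """Count FASTQ records per read number taken from the header.

    Assumes 4-line records and headers such as
    '@HWI-ST974:67:C0545ACXX:2:1101:1628:32111 1:N:0:', where the field
    after the space starts with the read number (1 or 2).
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4:  # skip sequence, '+' and quality lines
            continue
        fields = line.split()
        if len(fields) < 2:
            counts["unknown"] += 1  # header without a read-number field
        else:
            counts[fields[1].split(":")[0]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "@HWI-ST974:67:C0545ACXX:2:1101:1628:32111 1:N:0:",
        "GTTGGAGC",
        "+",
        "@BCFFFFF",  # a quality line may start with '@'
        "@HWI-ST974:67:C0545ACXX:2:1101:1521:32121 1:N:0:",
        "GTGACTGT",
        "+",
        "CCCFFFFF",
    ]
    print(count_mates(sample))
```

Because the position within each 4-line record decides what counts as a header, the quality line beginning with @ is not mistaken for one.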

ADD REPLY • link 12.1 years ago by JC 13k

0

I am pretty sure these are paired-end reads; I have actually analysed this data with TopHat-Cufflinks. So I guess in my data each read was reported independently, and using grep is a good option. Thanks so much!

ADD REPLY • link 12.1 years ago by GPR ▴ 390

score 0 · Answer 1 · 2012-11-05

For BWA, in the first step you run bwa aln on one file containing all the reads from one end of the pairs (the forward, or _1, reads). You then repeat the same step with the reads from the other end (one file containing all the reverse, or _2, reads). Each of these steps produces one .sai file, so two in total. bwa sampe then uses these two files to produce one SAM file.

With placeholder file names (ref.fa is the reference, already indexed with bwa index):

1) bwa aln ref.fa reads_1.fastq > 1.sai
2) bwa aln ref.fa reads_2.fastq > 2.sai
3) bwa sampe ref.fa 1.sai 2.sai reads_1.fastq reads_2.fastq > aln.sam
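The three steps above can also be scripted. The following Python sketch builds the command lines and, only when asked to, runs them with bwa taken from the PATH. All file names and both function names are placeholders, and it assumes the reference has already been indexed with bwa index.

```python
import subprocess


def bwa_paired_commands(ref, fq1, fq2, prefix):
    """Return (argv, output_path) pairs for bwa aln (twice) and bwa sampe.

    ref is assumed to be a FASTA file already indexed with 'bwa index'.
    The output names derived from prefix are an arbitrary choice.
    """
    sai1, sai2 = f"{prefix}_1.sai", f"{prefix}_2.sai"
    return [
        (["bwa", "aln", ref, fq1], sai1),
        (["bwa", "aln", ref, fq2], sai2),
        (["bwa", "sampe", ref, sai1, sai2, fq1, fq2], f"{prefix}.sam"),
    ]


def run_commands(commands):
    """Run each command in order, redirecting its stdout to the given file."""
    for argv, out_path in commands:
        with open(out_path, "wb") as out:
            subprocess.run(argv, stdout=out, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them; call run_commands() to execute.
    for argv, out_path in bwa_paired_commands(
        "ref.fa", "reads_1.fastq", "reads_2.fastq", "sample"
    ):
        print(" ".join(argv), ">", out_path)
```

Keeping command construction separate from execution makes it easy to inspect what will run before any alignment starts.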

Hope this helps you.