Question

How To Map The Fastq With Paired-End Seq Combined In One File

2

Entering edit mode

11.9 years ago

camelbbs ▴ 710

HI,

I found some rnaseq in encode were sequenced in paired-end but were combined in just one file like this:

@ERR030882.73513047 HWI-BRUNOP16X_0001:3:67:4997:91733#0 length=100
GTTAGGGAGGTTATGGAGGTTAGGGAGGTTATGGAGGTTATGGAGGTTAGCCTCGGTCTCCACCATAGCCTCCACCTCGGTCTCCTCCATAGCCTCCTCG
+ERR030882.73513047 HWI-BRUNOP16X_0001:3:67:4997:91733#0 length=100
HHHHHHHHHHHHHHHHHHHDFBFFFEHHHHDGGG?HHDHAGFG7GC9C8:HHHHHHHHHHHHHHHIHHHHHHHHHHHHHGHHHHHHHHFHHHHHHHHHHD

They sequenced as 2X50bp but combine both end in one file. How to handle this type of data. Should I treat them as single end in mapping?

Thanks a lot !

rna-seq • 9.4k views

ADD COMMENT • link updated 11.9 years ago by Geparada ★ 1.5k • written 11.9 years ago by camelbbs ▴ 710

0

Entering edit mode

you are only showing a single record above

ADD REPLY • link 11.9 years ago by Istvan Albert 101k

0

Entering edit mode

This is the only record. The description is: 50bpPEmRNASeqFCAs51sequence, 2x50 paired end mRNA-seq READ1, ~ 0.5% phiX DNA spiked in, Performer: ILLUMINA-CA

ADD REPLY • link 11.9 years ago by camelbbs ▴ 710

0

Entering edit mode

I'd double-check your metadata. It would be very hard to create a single fastq file from two paired-end files. If these are publicly available data, can you share the links to the metadata and the files that you believe are incorrect?

ADD REPLY • link 11.9 years ago by Sean Davis 27k

0

Entering edit mode

It's GSM759515 in GEO. http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759515

Thanks

ADD REPLY • link 11.9 years ago by camelbbs ▴ 710

0

Entering edit mode

One can obtain fastq files directly from EBI ENA. The accession number for the sequencing data is ERX011212. A search for that at the EBI ENA website will lead to http://www.ebi.ac.uk/ena/data/view/ERX011212 where you can download the fastq files directly. Note that SRA and ENA mirror one another, so fastq files for SRA can almost always be obtained through going to EBI.

ADD REPLY • link 11.9 years ago by Sean Davis 27k

0

Entering edit mode

thanks a lot...

ADD REPLY • link 11.9 years ago by camelbbs ▴ 710

score 5 · Answer 1 · 2013-01-11

5

Entering edit mode

11.9 years ago

lh3 33k

awk 'NR%2{print>"read1.fq";print>"read2.fq"}NR%2==0{print substr($0,1,50)>"read1.fq";print substr($0,51)>"read2.fq"}' in.fq

ADD COMMENT • link 11.9 years ago by lh3 33k

Istvan Albert · Answer 2 · 2013-01-11

This happens a lot in SRA, you should split the file into two fastq contaning 50bp each. Something like this:

reads1.fastq
@ERR030882.73513047 HWI-BRUNOP16X_0001:3:67:4997:91733#0 length=50
GTTAGGGAGGTTATGGAGGTTAGGGAGGTTATGGAGGTTATGGAGGTTAG
+ERR030882.73513047 HWI-BRUNOP16X_0001:3:67:4997:91733#0 length=100
HHHHHHHHHHHHHHHHHHHDFBFFFEHHHHDGGG?HHDHAGFG7GC9C8:

reads2.fastq
@ERR030882.73513047 HWI-BRUNOP16X_0001:3:67:4997:91733#0 length=50
CCTCGGTCTCCACCATAGCCTCCACCTCGGTCTCCTCCATAGCCTCCTCG
+ERR030882.73513047 HWI-BRUNOP16X_0001:3:67:4997:91733#0 length=100
HHHHHHHHHHHHHHHIHHHHHHHHHHHHHGHHHHHHHHFHHHHHHHHHHD

Reads this http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump

And try to use one of this commands to extract 2 fastq files from the SRA file:

fastq-dump --split-files SRR443885.sra

or

fastq-dump --split-3 SRR306633.lite.sra

score 1 · Answer 3 · 2013-01-14

I'm currently working with paired-end reads that are merged in the same way ...but each merged read is 155 nt length, so I don't have to split the reads a the half.

@SRR594435.4067 HWI-ST333_0042:5:1:16857:2275 length=155
GAACGTTGTCTTGCATTCTGTGGCAACTGCTTTAGCAAGGAGGGTCTTTCCAGTGCCAGGTGGGCCAACCATCAGCACAAAAAAGTTGCTTCAGGAAGCAGTGGTGTTACCAATGTTGATGCCAGAATTCTTTAAGGGCATTAGGAGACCCTGGA
+SRR594435.4067 HWI-ST333_0042:5:1:16857:2275 length=155
]Y]a\___b_\cc\b```^L`ZXXXZQO_UUVVVIOUVITSRMU]\\WaUYXO`WWX^BBBBBBBBBBBBBBBBBBBBBBYT^YLbbb\`b^Yc\bZU]\VOQQNQ_\K_ZXZZS_JZ`XUIOVLZ`^ZVVUVY^BBBBBBBBBBBBBBBBBBBB

I figured out that the 80 first nt came from the forward and the other 75 from the reverse, so I couldn't use fastq-dump. In the morning I wrote a code to split the reads at any position... may be useful if anyone have the same problem.

import sys
from Bio import SeqIO
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
from Bio.SeqRecord import SeqRecord


def main (fastq):
    """ Splits the reads at N position. It generate two files rd1 and rd2 """

    N = 80    

    f = open(fastq)
    rd1 = open('rd1', 'w')
    rd2 = open('rd2', 'w')

    for read in SeqIO.parse(f, "fastq"):

        rd1_seq = read.seq[:N]
        rd2_seq = read.seq[N:]

        rd1_id = read.id + "_1"
        rd2_id = read.id + "_2"

        rd1_Q = read.letter_annotations["phred_quality"][:N]
        rd2_Q = read.letter_annotations["phred_quality"][N:]

        reads_rd1 = SeqRecord( rd1_seq, rd1_id, description = "" )
        reads_rd1.letter_annotations["phred_quality"] = rd1_Q    

        reads_rd2 = SeqRecord( rd2_seq, rd2_id, description = "" )
        reads_rd2.letter_annotations["phred_quality"] = rd2_Q    

        rd1.write(reads_rd1.format("fastq"))
        rd2.write(reads_rd2.format("fastq"))


if __name__ == '__main__':
    main(sys.argv[1])