10.2 years ago
Chirag Nepal
Hi there,
I downloaded a publicly available dataset of 100 bp paired-end reads (with an insert size of 200 nt). I downloaded the FASTQ to map to the genome, but it seems the authors have merged the reads. I ran BLAT on a few examples, and only about 100 nt of each read maps as a contiguous stretch.
Fastq example
@SRR893106.1 1 length=202
CATAGGGTGCTCCGGCTCCAGCGTCTCGCAATGCTATCGCGTGCACACCCCCCAGACGAAAATACCAAATGCATGGAGAGCTCCCGTGAGTGGTTAATAGGGGGAGCCTATCATATATCTCCCTACCAACAAACCTACCCACCCTTAACAGCACATAGTACATAAAGCCATTTACCGTACATAGCACATTACAGTCAAATCC
+SRR893106.1 1 length=202
@@@FF?B:CFHHHIJGIIIIIEIHGGIIIJGGGGGGDABBB8;;CDAEHHFFD:?9?BBBDB>ACA:CD:>CCDDBC<(8?>:@?B8>@?:A@ABC3>3@>?<**ACACCAC;::>:;>5
@SRR893106.2 2 length=202
AGACAGATACTGCGACATAGGGTGCTCCGGCTCCAGCGTCTCGCAATGCTATCGCGTGCACACCCCCCAGACGAAAATACCAAATGCATGGAGAGCTCCCGGGGGTAGCTAAAGTGAACTGTATCCGACATCTGGTTCCTACTTCAGGGCCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGA
+SRR893106.2 2 length=202
CCCFFFFFGHHHHJJJJIJJJIAHHEIIIJJIJJJJGIDDEGAA@GGGIIHHHEFF**BBDCCCDDCDCDCDEDDCBCDACDDDD@@CFBDDFHHGHHCGHHIJJJJJJJJIJIJJJJJJIIIIJJJJJJJJJIJJIJJJIJJIJJJJJJJJIHHHHF?@ECECEDCDDEDDDDDCCCDDDBD?@?
I checked the sequences with FastQC, which also suggests the authors have merged the reads.
https://www.dropbox.com/s/clpmi7cktr2w7l1/Screen%20Shot%202014-09-18%20at%2017.35.31.png?dl=0
Are there any existing tools to separate the reads, or can anyone suggest how to do it?
Thanks in advance!
Cheers
Did you download the SRA file and then forget to use the --split-3 option?
Looks like. FASTQ files are available for each end separately at the ENA: http://www.ebi.ac.uk/ena/data/view/SRR893106
Thanks, matted!
ENA has the correct file format, which can be found here.
This is what I used to download the SRA file:
And the default parameters of fastq-dump.
--split-3 is an option on which tool? fastq-dump?
If you don't do that you'll get merged reads like this. Anyway, as matted said, it's usually easier to see if ENA has the FASTQ files first.
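If re-downloading is not convenient, the merged records can also be split after the fact. Below is a minimal Python sketch, not a tested pipeline. It assumes each record is simply mate 1 followed by mate 2, both of equal length (length=202 in the example would mean 2 x 101), and that each record occupies exactly four lines. The function names (read_fastq, split_record, split_fastq) and the /1 and /2 name suffixes are my own choices for illustration, not part of any existing tool.

```python
from typing import Iterator, Optional, TextIO, Tuple

Record = Tuple[str, str, str]  # (read name, sequence, quality string)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield (name, sequence, quality) from a plain four-line FASTQ stream."""
    while True:
        header = handle.readline()
        if header == "":          # end of file
            return
        header = header.strip()
        if not header:            # tolerate stray blank lines
            continue
        seq = handle.readline().rstrip("\r\n")
        handle.readline()         # the '+' separator line is ignored
        qual = handle.readline().rstrip("\r\n")
        # Keep only the first token of the header, e.g. "SRR893106.1"
        yield header[1:].split()[0], seq, qual


def split_record(name: str, seq: str, qual: str,
                 mate_len: Optional[int] = None) -> Tuple[Record, Record]:
    """Cut one concatenated record into two mates.

    Assumption: mate 1 comes first. If mate_len is not given, the record
    is cut exactly in the middle.
    """
    if len(seq) != len(qual):
        raise ValueError(f"{name}: sequence and quality lengths differ")
    if mate_len is None:
        if len(seq) % 2 != 0:
            raise ValueError(f"{name}: odd length, pass mate_len explicitly")
        mate_len = len(seq) // 2
    mate1 = (name + "/1", seq[:mate_len], qual[:mate_len])
    mate2 = (name + "/2", seq[mate_len:], qual[mate_len:])
    return mate1, mate2


def split_fastq(src: TextIO, out1: TextIO, out2: TextIO,
                mate_len: Optional[int] = None) -> int:
    """Write mates to two FASTQ streams; return the number of records split."""
    count = 0
    for name, seq, qual in read_fastq(src):
        for out, (n, s, q) in zip((out1, out2),
                                  split_record(name, seq, qual, mate_len)):
            out.write(f"@{n}\n{s}\n+\n{q}\n")
        count += 1
    return count
```

To use it, open the merged file for reading and two output files for writing, then pass the three handles to split_fastq. Records whose sequence and quality lengths disagree raise an error rather than being split silently, so a damaged file shows up early. Dumping again with --split-3, or taking the per-end files from ENA, is still the safer route because it does not depend on the equal-length assumption.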