Question

Using SRA Toolkit and Samtools to look at reads mapping to one region in public database

0

Entering edit mode

7.1 years ago

garbuzov ▴ 70

Hi everyone, I'm new to querying public data, so bear with me. I am interested in chip-seq reads mapping to a specific region of the genome and I want to quickly query lots of different data sets for this region.

From googling around this is the method I came up with. I tried it first on a positive control region which should have lots of mapped reads according to the paper. I get zero reads mapped.

Here are my commands:

/#for prdm1 on H3K27ac in day 6 pgc:

sam-dump --aligned-region 10:44170511-44216948 --output-file Prdm1_SRR1539456.sam SRR1539456

/#convert sam to bam:

samtools view -S -b Prdm1_SRR1539456.sam > Prdm1_SRR1539456.bam

/#count the number of mapped reads:

samtools flagstat Prdm1_SRR1539456.bam

When I call flagstat this is what I get: 115516081 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary

0 + 0 supplementary

0 + 0 duplicates

0 + 0 mapped (0.00% : N/A)

0 + 0 paired in sequencing

0 + 0 read1

0 + 0 read2

0 + 0 properly paired (N/A : N/A)

0 + 0 with itself and mate mapped

0 + 0 singletons (N/A : N/A)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

The file Prdm1_SRR1539456.bam is 4.2G so it is not empty... what am I doing wrong? I'm really stumped.

Thanks in advance!

ChIP-Seq sra alignment • 2.4k views

ADD COMMENT • link 7.1 years ago by garbuzov ▴ 70

score 0 · Answer 1 · 2017-10-19

0

Entering edit mode

7.1 years ago

ATpoint 85k

The data you download are simply not aligned. You will have to download them as SRA, e.g. using the prefetch application, followed by fastq-dump or directly as fastq via the European Nucleotide Archive), which mirrors most NCBI data. The ENA offers download via Aspera (check the manual on the ENA pages), which allow downloads with up to 100Mb/s.

ADD COMMENT • link 7.1 years ago by ATpoint 85k

0

Entering edit mode

So, to clarify, you are saying the SRA file I am working with is not aligned so I have to do the alignment myself. I have to download the entire SRA file (That's what I'm trying to avoid.) From looking at the GEO how can I tell if the SRA is aligned?

ADD REPLY • link 7.1 years ago by garbuzov ▴ 70

0

Entering edit mode

Yes, the SRA archive typically contains the raw data from an experiment. To be honest, I've never seen that SRA contains aligned BAM or SAM files. This apparently exists sometimes (according to some brief google), but I have no recommendation on how to check that other than looking at the summary page for your accession number. I personally would also not trust any alignment I did not run myself, because you never know what exactly they aligned it to, so if you want to build a project on your data, I would recommend the following:

Sort out the accession codes you want and check the download paths on the ENA, e.g. for your file, the path would be: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR153/006/SRR1539456/SRR1539456.fastq.gz

Download the Aspera Connect client and follow the install manual. For download with aspera, you have to modify the path from ENA:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR153/006/SRR1539456/SRR1539456.fastq.gz becomes

era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR153/006/SRR1539456/SRR1539456.fastq.gz

Download with the following command:

 ASPERA_ENA="ascp -QT -l 300m -P33001 -i /full/path/to/your/asperaweb_id_dsa.openssh"
$ASPERA_ENA era-fasp@fasp.sra.ebi.ac.uk:**vol1/fastq/SRR153/006/SRR1539456/SRR1539456.fastq.gz /target/dir

Check the manual on ENA for more details. From there, you will have to align the stuff and manually filter for the regions you are interested in.

ADD REPLY • link 7.1 years ago by ATpoint 85k