Hi everyone, I'm new to querying public data, so bear with me. I am interested in chip-seq reads mapping to a specific region of the genome and I want to quickly query lots of different data sets for this region.
From googling around this is the method I came up with. I tried it first on a positive control region which should have lots of mapped reads according to the paper. I get zero reads mapped.
Here are my commands:
/#for prdm1 on H3K27ac in day 6 pgc:
sam-dump --aligned-region 10:44170511-44216948 --output-file Prdm1_SRR1539456.sam SRR1539456
/#convert sam to bam:
samtools view -S -b Prdm1_SRR1539456.sam > Prdm1_SRR1539456.bam
/#count the number of mapped reads:
samtools flagstat Prdm1_SRR1539456.bam
When I call flagstat this is what I get: 115516081 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
The file Prdm1_SRR1539456.bam is 4.2G so it is not empty... what am I doing wrong? I'm really stumped.
Thanks in advance!
So, to clarify, you are saying the SRA file I am working with is not aligned so I have to do the alignment myself. I have to download the entire SRA file (That's what I'm trying to avoid.) From looking at the GEO how can I tell if the SRA is aligned?
Yes, the SRA archive typically contains the raw data from an experiment. To be honest, I've never seen that SRA contains aligned BAM or SAM files. This apparently exists sometimes (according to some brief google), but I have no recommendation on how to check that other than looking at the summary page for your accession number. I personally would also not trust any alignment I did not run myself, because you never know what exactly they aligned it to, so if you want to build a project on your data, I would recommend the following:
Sort out the accession codes you want and check the download paths on the ENA, e.g. for your file, the path would be: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR153/006/SRR1539456/SRR1539456.fastq.gz
Download the Aspera Connect client and follow the install manual. For download with aspera, you have to modify the path from ENA:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR153/006/SRR1539456/SRR1539456.fastq.gz becomes
era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR153/006/SRR1539456/SRR1539456.fastq.gz
Download with the following command:
Check the manual on ENA for more details. From there, you will have to align the stuff and manually filter for the regions you are interested in.