How to determine if SRA file is single or paired end?

18

10.0 years ago

davedeto ▴ 260

I want to run a batch script to align reads from a number of samples in a GEO accession. Some are single-end and some are paired-end, but the metadata in the series matrix file does not indicate which is which. I could manually convert to FASTQ and inspect the files, but I'd like an automated way to do this. I know the SRA file must store metadata describing where each spot should be split, but I can't figure out how to get at it. The only thing that looks like it might be what I want is the sra-stat program in the SRA Toolkit; however, I can't find any documentation on its output, and the default text output is just a cryptic series of numbers separated by colons and pipes.

I could always run sra-stat with the -s option, output as XML, and find the answer there, but this requires the routine to go through the entire file, which takes a while. I could also just run fastq-dump with the --split-files option and look to see if I get one or two files as a result, but this also seems like a bit of a hack. Is there a better way?

It feels like there should be some header information in the file that I could quickly access.

sequencing • 20k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 10.0 years ago by davedeto ▴ 260

10

10.0 years ago

Kamil ★ 2.3k

You might be interested to try my script:

	#!/usr/bin/env bash
	# sra-paired.sh
	# Kamil Slowikowski
	# April 23, 2014
	#
	# Check if an SRA file contains paired-end sequencing data.
	#
	# See documentation for the SRA Toolkit:
	# http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump

	sra_paired() {
	  local SRA="$1"
	  # Dump the first spot (-X 1) to stdout (-Z), split into its reads (--split-spot),
	  # with the read number appended to each name (-I), then count distinct read numbers.
	  local x=$(
	    fastq-dump -I -X 1 -Z --split-spot "$SRA" 2>/dev/null \
	      | awk '{if(NR % 2 == 1) print substr($1,length($1),1)}' \
	      | uniq \
	      | wc -l
	  )
	  # Numeric comparison, so whitespace-padded wc output still matches.
	  [[ $x -eq 2 ]]
	}

	if [[ "$1" == "" ]]; then
	  echo "usage: sra-paired.sh file.sra"
	  exit 1
	fi

	if sra_paired "$1"; then
	  echo "$1 contains paired-end sequencing data"
	else
	  echo "$1 does not contain paired-end sequencing data"
	fi

ADD COMMENT • link 10.0 years ago by Kamil ★ 2.3k

4

Using the --split-spot option of fastq-dump is a great idea. Your approach above works well, but I think davedeto's solution is slightly simpler, so I have adapted it here:

	srr="SRR3184279"
	# One FASTQ record is 4 lines, so a single-end spot prints exactly 4 lines.
	numLines=$(fastq-dump -X 1 -Z --split-spot "$srr" | wc -l)
	if [ "$numLines" -eq 4 ]
	then
	  echo "$srr is single-end"
	else
	  echo "$srr is paired-end"
	fi

ADD REPLY • link 8.0 years ago by jabelsky ▴ 40

0

cool! very simplified, thanks ! :)

ADD REPLY • link 6.3 years ago by Geparada ★ 1.5k

6

10.0 years ago

davedeto ▴ 260

Kamil's suggestion to just use -X 1 and look at the first read was great! Thanks.

I made this into a Python function and thought I'd share it in case anyone else wants to use it.

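	import os
	import subprocess as sp

	def isPairedSRA(filename):
	    """Return True if the SRA file holds paired-end reads, False if single-end."""
	    filename = os.path.abspath(filename)
	    try:
	        # Dump only the first spot (-X 1) to stdout (-Z), split into its reads.
	        contents = sp.check_output(
	            ["fastq-dump", "-X", "1", "-Z", "--split-spot", filename],
	            universal_newlines=True,  # return str rather than bytes
	        )
	    except sp.CalledProcessError as e:
	        raise Exception("Error running fastq-dump on " + filename) from e

	    # A FASTQ record is 4 lines: one record means single-end, two mean paired-end.
	    if contents.count("\n") == 4:
	        return False
	    elif contents.count("\n") == 8:
	        return True
	    else:
	        raise Exception("Unexpected output from fastq-dump on " + filename)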
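
Since the original goal was a batch run over many samples, here is a rough sketch of how the same first-spot check might be applied to a list of runs. The function names (`layout_from_fastq`, `layout_of_run`, `classify_runs`) are made up for illustration, and it assumes `fastq-dump` is on the PATH. Spots with more than two reads (for example, a technical read plus two biological reads) are reported as "unknown" rather than guessed.

```python
import subprocess


def layout_from_fastq(fastq_text):
    """Classify the text printed by `fastq-dump -X 1 -Z --split-spot`."""
    n_lines = len(fastq_text.splitlines())
    # A FASTQ record is 4 lines: 4 lines = one read, 8 lines = two reads.
    if n_lines == 4:
        return "single"
    if n_lines == 8:
        return "paired"
    return "unknown"


def layout_of_run(run):
    """Run fastq-dump on one accession or .sra path and classify its first spot."""
    result = subprocess.run(
        ["fastq-dump", "-X", "1", "-Z", "--split-spot", run],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    if result.returncode != 0:
        return "error"
    return layout_from_fastq(result.stdout)


def classify_runs(runs, classify=layout_of_run):
    """Map each run to 'single', 'paired', 'unknown', or 'error'."""
    return {run: classify(run) for run in runs}
```

Something like `classify_runs(["SRR3184279"])` would then give a dictionary that an alignment script could branch on. Keeping the line-counting logic separate from the subprocess call makes it possible to check the classifier without the SRA Toolkit installed.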

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 10.0 years ago by davedeto ▴ 260

0

Hi @davedeto,

I liked your Python script for distinguishing single-end from paired-end data. I am very new to Python, and in my case I have FASTQ files generated from Illumina sequencing. The paired-end read files are named samplename_1_sequence.txt.gz and the single-end read files are named samplename_sequence.txt.gz. If I want to apply the above script to my filenames, how would I change it?

Kindly guide me

ADD REPLY • link 5.9 years ago by EVR ▴ 610
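
For FASTQ files that are already on disk, fastq-dump is not needed; the layout can be read off the filename. The following is only a sketch under the naming scheme described in the question. It additionally assumes that the second mate is named samplename_2_sequence.txt.gz, which the question does not state, and the function name `layout_from_filename` is hypothetical.

```python
import os
import re

# Assumed naming scheme (the "_2_" mate name is a guess, not from the thread):
#   <sample>_1_sequence.txt.gz / <sample>_2_sequence.txt.gz  -> paired-end mates
#   <sample>_sequence.txt.gz                                 -> single-end
MATE_PATTERN = re.compile(r"^.+_[12]_sequence\.txt\.gz$")
SINGLE_PATTERN = re.compile(r"^.+_sequence\.txt\.gz$")


def layout_from_filename(path):
    """Return 'paired', 'single', or 'unknown' based on the file name alone."""
    name = os.path.basename(path)
    # Check the mate pattern first, because it is a special case of the single one.
    if MATE_PATTERN.match(name):
        return "paired"
    if SINGLE_PATTERN.match(name):
        return "single"
    return "unknown"
```

One caveat: a single-end sample whose own name ends in `_1` or `_2` would be misread as paired, so it may be worth also checking that the matching mate file exists.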

3

8.7 years ago

zpliu ▴ 60

Another simple way is to check the SRR ID of your sample in SRA Run Browser: http://trace.ncbi.nlm.nih.gov/Traces/sra/

"Browse" -> "Run Browser" -> then input your ID

The LAYOUT result will tell you. Also, the 'Reads' label shows 1 read for single end, and 2 reads for paired end.
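
If the run information has been exported to a CSV file instead of being read in the browser, the same layout field can be pulled out in bulk. This is a minimal sketch that assumes the export has `Run` and `LibraryLayout` columns with values such as SINGLE or PAIRED; check the header of your own file, since the column names are an assumption here, and `layouts_from_runinfo` is a hypothetical helper name.

```python
import csv
import io


def layouts_from_runinfo(csv_text):
    """Map run accession -> library layout from run-info CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    layouts = {}
    for row in reader:
        run = (row.get("Run") or "").strip()
        layout = (row.get("LibraryLayout") or "").strip().upper()
        # Skip rows without an accession and any repeated header rows.
        if run and run != "Run":
            layouts[run] = layout
    return layouts
```

With the file contents read into a string, `layouts_from_runinfo(text)` returns a dictionary keyed by run accession, which avoids touching the sequence data at all.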

ADD COMMENT • link 8.7 years ago by zpliu ▴ 60
