command to extract SRA fastq data summary
2
0
Entering edit mode
16 months ago
Nelo ▴ 20

Hi,

I was trying to calculate the total read length of all the sample present in bioproject in command utility as:

esearch -db bioproject -query "PRJNA438426" | efetch -format docsum | xtract -pattern DocumentSummary -element RunTotalBases

which is giving me no output.

In another way

grep -c "^@" SRR6943205_1.fastq SRR6943205_2.fastq 
SRR6943205_1.fastq:18361521
SRR6943205_2.fastq:18361521

This is giving me the desired output but only when they are downloaded in fastq format in my system.

I have downloaded the SRA data in fastq format long ago, but they are already compressed (.fastq.gz). So in order to calculate the read length using second code snippet, I have to gunzip them all which is time as well as resource consuming for me now.

third option is to check every single fastqc html output of each sample individually which is also time taking.

I would greatly appreciate if someone could help me get me design a linux command to get my desired output.

Many thanks!

edirect NCBI SRA • 938 views
ADD COMMENT
2
Entering edit mode
16 months ago
bk11 ★ 3.0k

I have downloaded the SRA data in fastq format long ago, but they are already compressed (.fastq.gz). So in order to calculate the read length using second code snippet, I have to gunzip them all which is time as well as resource consuming for me now.

You can use your gz file and find number of reads and read length in your *fastq.gz file as follows-

zcat Your_SRR_R1.fastq.gz |awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c > R1_length.txt
cat R1_length.txt #(Will show you number of reads in *R1.fastq.gz in first column and read length in the second column)
20299247 28

zcat Your_SRR_R1.fastq.gz |awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c > R2_length.txt
cat R2_length.txt #(Will show you number of reads in *R2.fastq.gz in first column and read length in the second column)
20299247 91
ADD COMMENT
0
Entering edit mode

Thank you for providing this code. Its really help me. Is there any way to count GC% of the sample using its accession?

ADD REPLY
0
Entering edit mode

To calculate GC%, you can do-

zcat Your_SRR_R1.fastq.gz |awk '(NR%4==2) {A1+=length($0);gsub(/[AT]/,"");A2+=length($0);}END{print A2/A1;}' >r1_gc-pct.txt
cat r1_gc-pct.txt
0.482669 #(ie 48.26%)
ADD REPLY
0
Entering edit mode
16 months ago
GenoMax 148k

You need to do the following:

$ esearch -db bioproject -query "PRJNA438426" | elink -target sra | efetch -format runinfo | cut -d "," -f1,5
Run,bases
SRR6943215,6590673000
SRR6943216,6714951000
SRR6943213,7042552800
SRR6943214,6215158500
SRR6943219,7075158000
SRR6943220,6898378200
ADD COMMENT

Login before adding your answer.

Traffic: 1998 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6