Question

Measuring sequence length in "bp"

10.1 years ago

dpidad • 0

Given a sequence as below:

@SRR211279.25468524 HWUSI-EAS404_106009863:7:120:17892:21339 length=200
CCAACCTCTACCCATNACCCAGTTCCGAAGTTGCTTCCACATTTTCAGGTATCTTTATAGNNATGCTCCAGTCCTCATTTGCCATTTTTGGTAANANTTANCTNTGTANTCTCCGNNNTNNNCNCTNGCNATNTNANANNNTTCANTNNNNNNNNNNNNNNNANNNNNANTNANANATTTCNGAGCCCCCCCAGANGCAG
+SRR211279.25468524 HWUSI-EAS404_106009863:7:120:17892:21339 length=200
IIIGIIIIIIGGGGG%DEEEEDBDEIHIIIHIIIIIIIHIDHIHHIIIGGIGIIIGEEEE%%;==><;>>IIIIHIIIIIIIIIIIIIIIGDDD%8%;;8%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

What is the length of this sequence: 200 bases, 200 base pairs (bp), or 100 bp? Which tool is recommended for measuring sequence length in "bp" for a FASTA/FASTQ file?

genome sequence • 5.0k views

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 10.1 years ago by dpidad • 0

Ram · Answer 1 · 2015-06-12

10.1 years ago

Devon Ryan 105k

Technically, "base pairs" refers to double-stranded sequences. In practice, however, "base pairs" and "bases" are used interchangeably for read lengths, so the length is 200 either way. BTW, wc -c will count the number of characters in a line for you (you'll need to subtract 1 from the result to account for the trailing newline).

Devon, you can use echo -n (no need to subtract 1) :-)

echo -n ACGT | wc -c
4

dpidad, try FastQC to measure the lengths of the reads in your FASTQ file. More solutions here: Sequence Length Distribution From A Fastq File
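
For a quick check without extra tools, the read-length distribution can also be tallied in a few lines of Python. This is only a minimal sketch: it assumes the usual four-lines-per-record FASTQ layout with unwrapped sequence lines, and the function names (read_lengths, length_distribution) and the toy records are made up for illustration.

```python
import io
from collections import Counter


def read_lengths(handle):
    """Yield the sequence length of each record in a FASTQ stream.

    Assumes exactly four lines per record (header, sequence, '+', qualities),
    i.e. sequences are not wrapped over several lines.
    """
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of every record holds the bases
            yield len(line.rstrip("\r\n"))  # drop the newline before counting


def length_distribution(handle):
    """Return a Counter mapping read length -> number of reads."""
    return Counter(read_lengths(handle))


# Toy in-memory FASTQ; for a real file use: with open("reads.fastq") as handle: ...
example = io.StringIO(
    "@read1\nACGTN\n+\nIIIII\n"
    "@read2\nACG\n+\nIII\n"
    "@read3\nTTGCA\n+\nIIIII\n"
)
dist = length_distribution(example)
for length, count in sorted(dist.items()):
    print(f"{length}\t{count}")
```

Stripping the line ending before calling len() is the same correction as subtracting 1 from wc -c. N characters are counted as bases, just as in the 200-base read above.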

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 10.1 years ago by PoGibas 5.1k

Just goes to show, regardless of how long I've been using the CLI, there's always something useful to learn! :o)

ADD REPLY • link 10.1 years ago by Devon Ryan 105k

Thanks for the clarification.

I got this read file from ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR211/SRR211279/SRR211279.sra. Extracting it with fastq-dump gave me SRR211279.fastq, in which every read is 200 bp long. However, I came across a paper that refers to "SRR211279 (25.23M 100bp paired-end reads generated by Illumina GAIIx) from the Washington University Genome.
Where can I get the SRR211279 100 bp paired-end read files?

I'm using these files with SOAP3-dp, which needs two read files for a paired-end run. I'm new to these topics, so any pointers would be helpful.

ADD REPLY • link 10.1 years ago by dpidad • 0

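You forgot the --split-files or --split-spot option. Without one of them, fastq-dump concatenates the two 100 bp mates of each spot into a single 200-base record, which is why your reads look 200 bp long. With --split-files, each mate is written to its own file, which is what a paired-end aligner needs. fastq-dump is not the best designed program in the world, since it really should do this automatically.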
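
To make the concatenation concrete, here is a minimal Python sketch of what splitting a spot amounts to. It is not how fastq-dump is implemented (the tool takes the mate boundaries from the SRA metadata, so re-dumping with the split option is the better route). The sketch simply assumes both mates have the same length, and the split_spot function and the /1 and /2 name suffixes are illustrative choices.

```python
def split_spot(header, seq, qual, mate_len=None):
    """Split one concatenated FASTQ record into two mates.

    Assumption: if mate_len is not given, both mates are equally long,
    so the record is cut exactly in the middle. The bases are only sliced;
    their orientation is left unchanged.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if mate_len is None:
        if len(seq) % 2:
            raise ValueError("odd record length; pass mate_len explicitly")
        mate_len = len(seq) // 2
    name = header.split()[0]  # keep only the read ID, e.g. @SRR211279.25468524
    mate1 = (f"{name}/1", seq[:mate_len], qual[:mate_len])
    mate2 = (f"{name}/2", seq[mate_len:], qual[mate_len:])
    return mate1, mate2


# Toy 8-base record standing in for a 200-base spot (2 x 4 instead of 2 x 100)
m1, m2 = split_spot("@toy.1 length=8", "ACGTTTGA", "IIIIHHHH")
for name, bases, quals in (m1, m2):
    print(f"{name}\n{bases}\n+\n{quals}")
```

Applied to the record in the question, the cut would fall after base 100, giving two 100-base mates.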

ADD REPLY • link 10.1 years ago by Devon Ryan 105k