Question

Calculate length of sequences from fasta file

1

Entering edit mode

7.0 years ago

KVC_bioinfo ▴ 600

Hello all,

I want to calculate the length of sequences from a huge FASTA file. I have read names of which I need to calculate the length from FASTA file. I know how to do it all reads using the following command.

awk '/^>/ {if (seqlen) print seqlen;print;seqlen=0;next} {seqlen+=length($0)}END{print seqlen}' file.fasta

However, I am not sure, how to do it for a specific number of reads. Could someone help me here?

fasta • 16k views

ADD COMMENT • link updated 6.3 years ago by Biostar 20 • written 7.0 years ago by KVC_bioinfo ▴ 600

0

Entering edit mode

it's not clear what is a "read" in this context.

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Read: meaning sequence name

>read1

atgctatcat

>read2

atgctactagtcga

ADD REPLY • link 7.0 years ago by KVC_bioinfo ▴ 600

0

Entering edit mode

However, I am not sure, how to do it for a specific number of reads

just use head ?

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

The simple way is like ./seqkit fx2tab -l file.fasta > file_seq_with_length.txt

And further you can extract particular column using awk e.g. awk '{print $1, $3}' file_seq_with_length.txt > extrcted_length.txt Usually, your first column will contain Ids and the last column will contain sequence length.

Hoping it may help someone.... :)

ADD REPLY • link 5.6 years ago by archana.bioinfo87 ▴ 210

0

Entering edit mode

OP wants % GC content in addition to length @ archana.bioinfo87. Probably following command with seqkit would work:

$ seqkit fx2tab -lg test.fa | grep -f ids.txt

test.fa being input fasta and ids.txt being text file with ids without ">".

ADD REPLY • link 5.6 years ago by cpad0112 21k

3

Entering edit mode

7.0 years ago

Istvan Albert 102k

Use bioawk, it treats a FASTA/FASTQ record a single awk record, that way you can count the sequences while computing their cumulative length

https://github.com/lh3/bioawk

then counting the cumulative size of the first ten sequences would be

cat reads.fq | bioawk -c fastx 'NR < 11 { size += length($seq) } END { print size }'

ADD COMMENT • link 7.0 years ago by Istvan Albert 102k

3

Entering edit mode

7.0 years ago

cpad0112 21k

output:

$ seqtk subseq test.fa ids.txt | infoseq stdin -only -name -length -pgc
Display basic information about sequences
Name           Length %GC    
read2          14     42.86  
read3          6      50.00

Input : - huge fasta file: test.fa

$ cat test.fa 
>read1
atgctatcat
>read2
atgctactagtcga
>read3
atgctg
>read4
atgc

Sequences of interest (ids.txt)

$ cat ids.txt 
read3
read2

BTw, what do you mean by "specific number of reads" ? EMBOSS tools doesn't recognize n esp when calculating gc content. However it does fairly good job in counting characters.

ADD COMMENT • link 7.0 years ago by cpad0112 21k

3

Entering edit mode

6.4 years ago

Kevin Blighe 88k

Another solution (just thought to post it since I did it now):

awk -F "" '/^>/{ if(NR!=1) {print len}; printf $0"\t"; len=0}; \
  !/^>/{len+=NF}' \
  gencode.v28.transcripts.fa |\
  head -35

>ENST00000456328.2|ENSG00000223972.5|...|RP11-34P13.1-002|DDX11L1|1657|processed_transcript|    1657
>ENST00000450305.2|ENSG00000223972.5|...|RP11-34P13.1-001|DDX11L1|632|transcribed_unprocessed_pseudogene|   632
>ENST00000488147.1|ENSG00000227232.5|...RP11-34P13.2-001|WASH7P|1351|unprocessed_pseudogene|    1351
>ENST00000619216.1|ENSG00000278267.1|...|MIR6859-1-201|MIR6859-1|68|miRNA|  68
>ENST00000473358.1|ENSG00000243485.5|...|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA| 712
>ENST00000469289.1|ENSG00000243485.5|...|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA| 535
>ENST00000607096.1|ENSG00000284332.1|...|MIR1302-2-201|MIR1302-2|138|miRNA| 138
>ENST00000417324.1|ENSG00000237613.2|...|RP11-34P13.4-001|FAM138A|1187|lincRNA| 1187
>ENST00000461467.1|ENSG00000237613.2|...|RP11-34P13.4-002|FAM138A|590|lincRNA|  590
>ENST00000606857.1|ENSG00000268020.3|...|RP11-34P13.17-001|OR4G4P|840|unprocessed_pseudogene|   840
>ENST00000642116.1|ENSG00000240361.2|...|RP11-34P13.6-002|OR4G11P|1414|processed_transcript|    1414
>ENST00000492842.2|ENSG00000240361.2|...|RP11-34P13.6-001|OR4G11P|939|transcribed_unprocessed_pseudogene|   939
>ENST00000641515.2|ENSG00000186092.6|...|RP11-34P13.5-001|OR4F5|2618|protein_coding|    2618
>ENST00000335137.4|ENSG00000186092.6|...|OR4F5-201|OR4F5|1054|protein_coding|   1054
>ENST00000466430.5|ENSG00000238009.6|...|RP11-34P13.7-001|RP11-34P13.7|2748|lincRNA|    2748
>ENST00000477740.5|ENSG00000238009.6|...|RP11-34P13.7-003|RP11-34P13.7|491|lincRNA| 491
>ENST00000471248.1|ENSG00000238009.6|...|RP11-34P13.7-002|RP11-34P13.7|629|lincRNA| 629
>ENST00000610542.1|ENSG00000238009.6|...|AL627309.1-205|RP11-34P13.7|723|lincRNA|   723
>ENST00000453576.2|ENSG00000238009.6|...|RP11-34P13.7-004|RP11-34P13.7|336|lincRNA| 336
>ENST00000495576.1|ENSG00000239945.1|...|RP11-34P13.8-001|RP11-34P13.8|1319|lincRNA|    1319
>ENST00000442987.3|ENSG00000233750.3|...|RP11-34P13.10-001|RP11-34P13.10|3812|processed_pseudogene| 3812
>ENST00000494149.2|ENSG00000268903.1|...|RP11-34P13.15-001|RP11-34P13.15|755|processed_pseudogene|  755
>ENST00000595919.1|ENSG00000269981.1|...|RP11-34P13.16-001|RP11-34P13.16|284|processed_pseudogene|  284

ADD COMMENT • link 5.6 years ago by Kevin Blighe 88k

score 2 · Accepted Answer · 2017-11-29

2

Entering edit mode

7.0 years ago

Joe 21k

I would just subset the sequences you need by name using any one of dozens of tools that have been posted on here before, in to a different file, and then just calculate all the lengths in the new file.

ADD COMMENT • link 7.0 years ago by Joe 21k

1

Entering edit mode

Thanks. It made my job easy

ADD REPLY • link 7.0 years ago by KVC_bioinfo ▴ 600

0

Entering edit mode

Be sure to select something as an answer if it helped solve your problem.

ADD REPLY • link 7.0 years ago by Joe 21k

0

Entering edit mode

sure. Thank you. will do that

ADD REPLY • link 7.0 years ago by KVC_bioinfo ▴ 600