Calculate length of sequences from fasta file
4
Hello all,
I want to calculate the length of sequences from a huge FASTA file. I have read names of which I need to calculate the length from FASTA file. I know how to do it all reads using the following command.
awk '/^>/ {if (seqlen) print seqlen;print;seqlen=0;next} {seqlen+=length($0)}END{print seqlen}' file.fasta
However, I am not sure, how to do it for a specific number of reads. Could someone help me here?
fasta
• 16k views
I would just subset the sequences you need by name using any one of dozens of tools that have been posted on here before, in to a different file, and then just calculate all the lengths in the new file.
Use bioawk, it treats a FASTA/FASTQ record a single awk record, that way you can count the sequences while computing their cumulative length
https://github.com/lh3/bioawk
then counting the cumulative size of the first ten sequences would be
cat reads.fq | bioawk -c fastx 'NR < 11 { size += length($seq) } END { print size }'
output:
$ seqtk subseq test.fa ids.txt | infoseq stdin -only -name -length -pgc
Display basic information about sequences
Name Length %GC
read2 14 42.86
read3 6 50.00
Input :
- huge fasta file: test.fa
$ cat test.fa
>read1
atgctatcat
>read2
atgctactagtcga
>read3
atgctg
>read4
atgc
Sequences of interest (ids.txt)
$ cat ids.txt
read3
read2
BTw, what do you mean by "specific number of reads" ? EMBOSS tools doesn't recognize n esp when calculating gc content. However it does fairly good job in counting characters.
Another solution (just thought to post it since I did it now):
awk -F "" '/^>/{ if(NR!=1) {print len}; printf $0"\t"; len=0}; \
!/^>/{len+=NF}' \
gencode.v28.transcripts.fa |\
head -35
>ENST00000456328.2|ENSG00000223972.5|...|RP11-34P13.1-002|DDX11L1|1657|processed_transcript| 1657
>ENST00000450305.2|ENSG00000223972.5|...|RP11-34P13.1-001|DDX11L1|632|transcribed_unprocessed_pseudogene| 632
>ENST00000488147.1|ENSG00000227232.5|...RP11-34P13.2-001|WASH7P|1351|unprocessed_pseudogene| 1351
>ENST00000619216.1|ENSG00000278267.1|...|MIR6859-1-201|MIR6859-1|68|miRNA| 68
>ENST00000473358.1|ENSG00000243485.5|...|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA| 712
>ENST00000469289.1|ENSG00000243485.5|...|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA| 535
>ENST00000607096.1|ENSG00000284332.1|...|MIR1302-2-201|MIR1302-2|138|miRNA| 138
>ENST00000417324.1|ENSG00000237613.2|...|RP11-34P13.4-001|FAM138A|1187|lincRNA| 1187
>ENST00000461467.1|ENSG00000237613.2|...|RP11-34P13.4-002|FAM138A|590|lincRNA| 590
>ENST00000606857.1|ENSG00000268020.3|...|RP11-34P13.17-001|OR4G4P|840|unprocessed_pseudogene| 840
>ENST00000642116.1|ENSG00000240361.2|...|RP11-34P13.6-002|OR4G11P|1414|processed_transcript| 1414
>ENST00000492842.2|ENSG00000240361.2|...|RP11-34P13.6-001|OR4G11P|939|transcribed_unprocessed_pseudogene| 939
>ENST00000641515.2|ENSG00000186092.6|...|RP11-34P13.5-001|OR4F5|2618|protein_coding| 2618
>ENST00000335137.4|ENSG00000186092.6|...|OR4F5-201|OR4F5|1054|protein_coding| 1054
>ENST00000466430.5|ENSG00000238009.6|...|RP11-34P13.7-001|RP11-34P13.7|2748|lincRNA| 2748
>ENST00000477740.5|ENSG00000238009.6|...|RP11-34P13.7-003|RP11-34P13.7|491|lincRNA| 491
>ENST00000471248.1|ENSG00000238009.6|...|RP11-34P13.7-002|RP11-34P13.7|629|lincRNA| 629
>ENST00000610542.1|ENSG00000238009.6|...|AL627309.1-205|RP11-34P13.7|723|lincRNA| 723
>ENST00000453576.2|ENSG00000238009.6|...|RP11-34P13.7-004|RP11-34P13.7|336|lincRNA| 336
>ENST00000495576.1|ENSG00000239945.1|...|RP11-34P13.8-001|RP11-34P13.8|1319|lincRNA| 1319
>ENST00000442987.3|ENSG00000233750.3|...|RP11-34P13.10-001|RP11-34P13.10|3812|processed_pseudogene| 3812
>ENST00000494149.2|ENSG00000268903.1|...|RP11-34P13.15-001|RP11-34P13.15|755|processed_pseudogene| 755
>ENST00000595919.1|ENSG00000269981.1|...|RP11-34P13.16-001|RP11-34P13.16|284|processed_pseudogene| 284
Login before adding your answer.
Traffic: 2324 users visited in the last hour
it's not clear what is a "read" in this context.
Read: meaning sequence name
>read1
atgctatcat
>read2
atgctactagtcga
just use
head
?The simple way is like ./seqkit fx2tab -l file.fasta > file_seq_with_length.txt
And further you can extract particular column using awk e.g. awk '{print $1, $3}' file_seq_with_length.txt > extrcted_length.txt Usually, your first column will contain Ids and the last column will contain sequence length.
Hoping it may help someone.... :)
OP wants % GC content in addition to length @ archana.bioinfo87. Probably following command with seqkit would work:
test.fa being input fasta and ids.txt being text file with ids without ">".