Question

convert paired fastq files to fasta with EMBOSS:seqret or FASTX-toolkit:fastq_to_fasta

0

Entering edit mode

6.1 years ago

marongiu.luigi ▴ 730

Dear all,

I have converted a BAM file into paired fastq files with:

bamToFastq --bam {file}.bam -fq {file}_1.fq -fq2 {file}_2.fq

Now I need to obtain a fasta file from these. I can see that EMBOSS' seqret and FASTX-Toolkit's fastq_to_fasta can do that; the manual for the operation are given for instance here:

seqret -sequence reads.fastq -outseq reads.fasta
fastq_to_fasta ... -i {file}.fq -o {file}.fa

However, the syntax os for a single fastq file input. How do I apply them on paired fastq files?

fasta fastq conversion EMBOSS FASTX • 4.7k views

ADD COMMENT • link updated 6.1 years ago by finswimmer 16k • written 6.1 years ago by marongiu.luigi ▴ 730

score 2 · Answer 1 · 2018-11-20

2

Entering edit mode

6.1 years ago

GenoMax 148k

Convert them independently/individually.

Otherwise use reformat.sh from BBMap suite.

reformat.sh in=file.fq out=file.fa

I have not tried it but reformat.sh in1=R1.fq in2=R2.fq out1=R1.fa out2=R2.fa may work.

ADD COMMENT • link 6.1 years ago by GenoMax 148k

0

Entering edit mode

Won't a loose information by doing it independently?

ADD REPLY • link 6.1 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

Won't a loose information by doing it independently?

You are only stripping away the quality scores not touching the sequence. If your files are not in sync then you would want to use repair.sh first.

ADD REPLY • link 6.1 years ago by GenoMax 148k

score 1 · Answer 2 · 2018-11-20

1

Entering edit mode

6.1 years ago

finswimmer 16k

Hello marongiu.luigi ,

you can use samtools fasta to convert bam directly to fasta. To do this you need to sort the bam file by readname:

$ samtools sort -n input.bam > namesorted.bam
$ samtools fasta -1 1.fa -2 2.fa namesorted.bam

fin swimmer

ADD COMMENT • link 6.1 years ago by finswimmer 16k

0

Entering edit mode

I tried that but, in my hands, this gave me a pile of lines where simply the header @<something> was converted into ><something>; the result was, therefore, a multi-fasta. That is:

$ head 1.fa
>DHGDKXP1:247:C3MF6ACXX:5:1101:1166:95391
ATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGAC
>DHGDKXP1:247:C3MF6ACXX:5:1101:1234:39592
GCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGG

I need instead to merge all the fastq lines into a single stretch to generate a single-entry fasta file.

ADD REPLY • link 6.1 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

I need instead to merge all the fastq lines into a single stretch to generate a single-entry fasta file.

That is not what you were asking in original question. Two programs you had originally mentioned only convert fastq to fasta format. Nothing else. This is a completely different need.

What does this mean? You want to merge individual R1/R2 reads and then make a longer representation in fasta format? Or you want to assemble all that sequence data?

ADD REPLY • link 6.1 years ago by GenoMax 148k

0

Entering edit mode

Or maybe get the consensus sequence?

ADD REPLY • link 6.1 years ago by finswimmer 16k

0

Entering edit mode

That would be great but the methods I have used don't work. For instance I have used:

samtools mpileup -d8000 -f {ref.fa} {bam_output_srt} |  bcftools call -c \
| vcfutils.pl vcf2fq | seqtk seq -aQ64 [-q20] -n N > {outfile.fa}

and

samtools mpileup -AuDS -Q 10 -q 10 -d 10000 -f {ref} {bam_output_srt} \
| bcftools view -cg - | vcfutils.pl vcf2fq | awk \
'BEGIN{P=1}{if(P==1||P==2){gsub(/^[@]/,">");print}; if(P==4)P=0; P++}' \
| sed -n '/+/q;p' > {out_file.fa}

But the output is empty. Somethign worng with the syntax perhaps?

ADD REPLY • link 6.1 years ago by marongiu.luigi ▴ 730

1

Entering edit mode

Which version of samtools/bcftools are you using? The second one looks like 0.1.19 (which is outdated for 5 years now).

To get the consensus sequence you have to do variant calling first to incoorperate the variants into the reference sequence to get the consensus sequence.

Update to the current samtools/bcftools version (v1.9) first.

$ bcftools mpileup -f reference.fa input.bam|bcftools call -m -v -Ov -o variants.vcf
$ bcftools consensus -f reference.fa variants.vcf > consensus.fa

fin swimmer

ADD REPLY • link 6.1 years ago by finswimmer 16k

0

Entering edit mode

$ samtools --version
samtools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.
$ bcftools --version
bcftools 1.7
Using htslib 1.7-2
Copyright (C) 2016 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

ADD REPLY • link 6.1 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

I updated bcftools

$ bcftools --version
bcftools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

but I don't have a vcf for the reference file. When I launched the first command I got:

[E::hts_open_format] Failed to open file 504-147_variants.vcf
Failed to open 504-147_variants.vcf: No such file or directory
[mpileup] 1 samples in 1 input files

Since no vcf file was created in this step, I could not launch the second command.

ADD REPLY • link 6.1 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

Oops, sorry. There was in -o missing before the output file name. Edited it.

ADD REPLY • link 6.1 years ago by finswimmer 16k

0

Entering edit mode

So, I could create the vcf file but then bcftools consensus wanted it bgzipped so I had to use bgzip 504-147_variants.vcf with bgzip installed with sudo apt install tabix. However, bcftools created a fasta of the reference, not the bam file:

$ bcftools consensus -f ~/refSeq/virChr.fa 504-147_variants.vcf.gz > 504-147.fa
$ head -n 3 504-147.fa
>V dna:chromosome chromosome:GRCh38:V:1:353664355:1 REF
TAGTATTACCCTCGGCACCTTGGAACCAGGAACCACCTTGGAACCCACCCGAGATGGCGG
AGGCTGAAGGCCCGGCGGGGGGAAAGACGCGGCGGCCCCGTGAAGCACCAGCGGCCCGCT

and this is not a multi-fasta with reference plus query since there is only one sequence:

$ grep '>' 504-147.fa
>V dna:chromosome chromosome:GRCh38:V:1:353664355:1 REF

Here, 504-147 is the subset of the original BAM file I am investigating and V dna is the custom genome I am using.

ADD REPLY • link 6.1 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

Hello marongiu.luigi ,

I would suggest we go back to start. Please describe what data you have and what you like to find out. Be careful: This is not the same as describing how you think you can solve the problem.

These things should always be in your initial post:

Describing the data you have (if possible with examples)
Describing the goal of the investigation of this data
What you have tried so far and where you have problems

Doing so it is much easier for people to help you, because now one can also tell you, if the way you are trying to go, is the right one.

fin swimmer

ADD REPLY • link 6.1 years ago by finswimmer 16k

0

Entering edit mode

I agree that this post has been split in two. The question I asked was How do I apply them [EMBOSS and/or FASTX] on paired fastq files [both at the same time]?. The answer to that is no. For this, I simply needed the syntax of the command, so I deemed not relevant to give actual data. The second half of the post, the part pointed out by you, that is get the consensus sequence is more relevant for this previous post of mine. Sorry for any inconveniences.

ADD REPLY • link 6.1 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

Hello again marongiu.luigi ,

the answer on a "How do I?"-question can never be no. If you think the the get the consensus sequence isn't that relevant here, then what is your question? Or better, which task are you trying to solve?

fin swimmer

ADD REPLY • link 6.1 years ago by finswimmer 16k

0

Entering edit mode

My question was if I could use either of these two tools to process the paired files at once. As genomax pointed out, they Convert fastq to fasta format. Nothing more or less., which I understand meaning no, they can't be used to process both paired files at once. Hence the next question is: how do I process both paired files at once? whose answer is -- among others -- by doing a consensus fasta. But this is now another question that I don't think belongs here anymore.

ADD REPLY • link 6.1 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

I thought I defined the task with the line

Now I need to obtain a fasta file from these.

I know that they convert a single fastq into a single fasta, that is why I was asking whether I can instead

apply them [these tools] on paired fastq files?

Sorry I wasn't clear: I have a pair of fastq file, how do I make a single fasta? I know there might be a gap between the two paired fastq files, this is something I can live with. If it not possible to merge paired fastq into a single fasta, that would be all right.

ADD REPLY • link 6.1 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

Actually, also seqret and fastq_to_fasta do the same thing. This is the output of the fasta file generated by fastq_to_fasta:

$ head -n 5 F2F.fa
>DHGDKXP1:247:C3MF6ACXX:5:1101:1166:95391/1
ATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGAC
>DHGDKXP1:247:C3MF6ACXX:5:1101:1234:39592/1
GCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGG
>DHGDKXP1:247:C3MF6ACXX:5:1101:1320:15306/1
GGTTAATCGTGCCAAGAAAAGCGGCATGGTCAATATAACCAGTAGTGTTAACAGTCGGGAGAGGAGTGGCATTAACACCATCCTTCATGAACTTAATCCAC

and this is that of seqret:

$ head -n 4 SQT.fa
>DHGDKXP1:247:C3MF6ACXX:5:1101:1166:95391/1
ATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGT
ATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGAC
>DHGDKXP1:247:C3MF6ACXX:5:1101:1234:39592/1
GCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTC
GCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGG

ADD REPLY • link 6.1 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

That is because those tools are designed to do just that. Convert fastq to fasta format. Nothing more or less.

ADD REPLY • link 6.1 years ago by GenoMax 148k