Question

Is there a way to convert a FASTQ file to a FASTA file?

5.7 years ago

as2779 • 0

Hello.

I am trying to use RepeatModeler to identify transposable elements in the genome of C. remanei. I have a FASTQ file that came from a genome analysis, and I'm trying to convert it to a FASTA file with the following format:

> name
ACGCTGCGT..... (sequence)

When I looked around on this site, I saw commands that convert FASTQ to FASTA. However, I tried two such commands and got the same output from both. For example, the first few lines of my input are:

@NB551191:275:HMT7LBGX7:1:11101:1614:1054 1:N:0:ATCACG
TAAATNAGATCATTTTTGTAGAGAAAAANGANGGCTTNCGAATGGTATGAAAATCTCTGTGATCCGTCAAAAACTGACTGAGTTCTGATAAAAAATGTATTGGCAGAAAATACCACTTGGACCAAATCTCAAAAATTGACGGAAATGTCAC
+
AAAAA#EEEEEEEEEEEEEEEEEAEEEE#EE#EEEEE#EEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEE/AAAEEEEEAEEEAA<EEAEEEEAEEEEEEEAAEEEE

@NB551191:275:HMT7LBGX7:1:11101:18472:1054 1:N:0:ATCACG
TTTCCNGAAAACGCATCCAGCATTGTTTNACNTCATTNGAGAGCTGAAAATTTTCAAACCTGTATTTTCCAATCGCATAATAACTCGTGTCTCCTTCTCCATAATCCGTGGGAAGCTTTCAACTCAATAAATTTTAGGAAAAAAGTTTATT
+
AAAAA#EEEE6EEE/AEEEEEEEEEEEE#AE#EEEEE#EEEEEEEEEEEE/EEEAEE/EEEEEEEEEEEAE/EEEAAAEAEEEEEEEE/EEEEEEEEEAEE/EE<<EEEAEEEAEE<<<EA/EEAA</AEEEAAEAEEEA/EEEA/EAEAA

> (ad infinitum)

And when I use the command to convert to FASTA, I get this output:

>NB551191:275:HMT7LBGX7:1:11101:1614:1054 1:N:0:ATCACG
TAAATNAGATCATTTTTGTAGAGAAAAANGANGGCTTNCGAATGGTATGAAAATCTCTGTGATCCGTCAAAAACTGACTGAGTTCTGATAAAAAATGTATTGGCAGAAAATACCACTTGGACCAAATCTCAAAAATTGACGGAAATGTCAC
>NB551191:275:HMT7LBGX7:1:11101:18472:1054 1:N:0:ATCACG
TTTCCNGAAAACGCATCCAGCATTGTTTNACNTCATTNGAGAGCTGAAAATTTTCAAACCTGTATTTTCCAATCGCATAATAACTCGTGTCTCCTTCTCCATAATCCGTGGGAAGCTTTCAACTCAATAAATTTTAGGAAAAAAGTTTATT
> (ad infinitum)

This is not the format I want; I want a FASTA file that only contains 1 description and the rest of the file be the sequence. Is it possible to obtain this from a FASTQ file, and if so, how? If not, how should I run the data through RepeatModeler? Thank you for your help!

genome FASTA FASTQ RepeatModeler • 3.4k views

updated 5.7 years ago by swbarnes2 15k • written 5.7 years ago by as2779 • 0

Unfortunately, what you want to do is not correct. The FASTQ file holds sequencing reads from your genome, which means the genome was fragmented into many short sequences. I guess what you want is to first assemble your reads into contigs and then use those to predict/detect repetitive elements.

5.7 years ago by JC 13k

Ah, that makes sense. Is there some sort of tool to assemble contigs from FASTQ files? I'm also moderately proficient in Python and Java, if there are a few simple lines of code that I can write to do this.

5.7 years ago by as2779 • 0

No. Do not reinvent the wheel. Google for programs that will do what you want. You can likely find relevant answers on Biostars.

5.7 years ago by swbarnes2 15k

Are you sure this is what you need to do?

If so, search is your friend. See, for example: "HOw to merge multifasta sequence into a single sequence having only one header?"

5.7 years ago by Brice Sarver ★ 3.8k

I want a FASTA file that only contains 1 description and the rest of the file be the sequence.

You could simply drop all lines that start with > by piping your file through grep -v "^>" and then add the header you want at the top.

But these would still be individual reads and would not represent the sequence of the genome, which I assume is what you ultimately want to use with RepeatModeler?
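
The same filtering can be sketched in Python for anyone who prefers it to grep. This is only an illustration: the function name and the default header text are made up, and the result is still just the reads listed one after another under a single header, not a genome sequence.

```python
def single_header_fasta(fasta_lines, header="all_reads"):
    """Drop every '>' line and emit one header followed by the sequence lines.

    Equivalent in spirit to running grep -v "^>" and adding a header by hand.
    The default header text is a placeholder, not a required name.
    """
    yield ">" + header
    for line in fasta_lines:
        line = line.rstrip("\n")
        # Skip blank lines and the per-read header lines.
        if line and not line.startswith(">"):
            yield line


example = [">read1", "ACGT", ">read2", "GGCC"]
print("\n".join(single_header_fasta(example, header="my_sample")))
```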

5.7 years ago by GenoMax 153k

Yes, you are correct. I want the sequence of the whole genome (as a FASTA file) given a FASTQ file. How can I obtain this?

5.7 years ago by as2779 • 0

score 0 · Answer 1 · 2019-12-09

5.7 years ago

swbarnes2 15k

The procedure you were given does convert FASTQ to FASTA, so a format conversion is not actually what you want.
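
For reference, a conversion of that kind only keeps the header and sequence lines of each record and discards the quality lines. A minimal Python sketch, assuming every record occupies exactly four lines (as in the input shown in the question); the function name is hypothetical and chosen only for illustration:

```python
from itertools import islice


def fastq_to_fasta(lines):
    """Yield FASTA lines from an iterable of FASTQ lines.

    Assumes strict 4-line records: @header, sequence, +, quality.
    Blank lines between records are ignored.
    """
    it = (line.rstrip("\n") for line in lines if line.strip())
    while True:
        record = list(islice(it, 4))
        if not record:
            break
        if len(record) < 4 or not record[0].startswith("@"):
            raise ValueError("malformed FASTQ record near: %r" % record[:1])
        header, sequence = record[0], record[1]
        yield ">" + header[1:]  # replace the leading '@' with '>'
        yield sequence          # the '+' and quality lines are dropped


example = ["@read1 1:N:0:ATCACG", "TAAATNAGAT", "+", "AAAAA#EEEE"]
print("\n".join(fastq_to_fasta(example)))
```

This produces one FASTA entry per read, which is exactly the output the question describes.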

The reads in a FASTQ file are almost certainly unplaced. You can't just string them together in order and get a sequence that makes sense. I think you want a consensus sequence, so you need to look up how to make one. You can either do a de novo assembly, or align the reads to a reference and make a consensus sequence that takes into account the points where your reads differ from the reference.
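
To make the idea of a consensus concrete, here is a toy Python sketch. It assumes the reads have already been placed, i.e. each read comes with a 0-based start position, which in practice is the output of an aligner and is not something a FASTQ file contains. The function name and the input layout are hypothetical, and real work should rely on established assembly, alignment, and consensus-calling tools rather than code like this.

```python
from collections import Counter


def consensus(placed_reads, length, gap="N"):
    """Majority-vote consensus from reads with known start positions.

    placed_reads: iterable of (start, sequence) pairs; start is 0-based.
    length: length of the consensus to build.
    Positions covered by no read are filled with `gap`.
    """
    columns = [Counter() for _ in range(length)]
    for start, seq in placed_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            # Ignore bases outside the target range and uncalled bases (N).
            if 0 <= pos < length and base != "N":
                columns[pos][base] += 1
    # Take the most frequent base per column; ties are not resolved
    # in any biologically meaningful way here.
    return "".join(col.most_common(1)[0][0] if col else gap for col in columns)


reads = [(0, "ACGTAC"), (2, "GTACGG"), (4, "TCGGTT"), (4, "ACGGTT")]
print(consensus(reads, 12))  # prints ACGTACGGTTNN
```

At position 4 three reads say A and one says T, so the consensus takes A; the last two positions have no coverage and stay N.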

5.7 years ago by swbarnes2 15k

Hi, thank you for the response! I'm still new to the field of computational biology. How do I make a consensus sequence? Will doing so let me create a FASTA file?

5.7 years ago by as2779 • 0