Question

art_illumina use custom error profile to simulate reads

0

Entering edit mode

7.0 years ago

jjmmii • 0

As of the latest version of art_illumina, a popular software to simulate reads with realistic Illumina quality scores, it doesn't have a built-in error profile for MiSeq pair end 300bp. So I have generated my own error profile for that, but how do I specify art_illumina to use this error profile? Because the -ss option only recognizes built-in profiles. I tried:

$ art_illumina -ss ./miseq_pe300_error_profile.txt -l 300 -i 99_otus.fasta -f 1 -na -amp -p -o simulated_pe300_gg_99_otus_16S

but it says it doesn't recognize:

ART does not has a built-in profile for the given sequencing system: ./miseq_pe300_error_profile
List of all built-in profiles are: (...)

I thought of hacking the script and add this custom profile as a built-in but there should be a more convenient way to do this?

software error • 3.1k views

ADD COMMENT • link updated 6.6 years ago by h.mon 35k • written 7.0 years ago by jjmmii • 0

0

Entering edit mode

Have you tried using an older version of the software such as ART-GreatSmokyMountains-04-17-2016 that has a built-in MiSeq profile?

ADD REPLY • link 7.0 years ago by Sej Modha 5.3k

0

Entering edit mode

No, in fact the latest version that I'm using does have MiSeq profile but only up to 250bp. I need a MiSeq 300bp profile. I don't think older versions of art_illumina would have the 300bp profile because it's quite new.

ADD REPLY • link 7.0 years ago by jjmmii • 0

0

Entering edit mode

Edited to include correct URL - Well, in that case, you might want to look into other tools such as BBMap - https://sourceforge.net/projects/bbmap/

ADD REPLY • link 7.0 years ago by Sej Modha 5.3k

0

Entering edit mode

Official BBMap repository is at: https://sourceforge.net/projects/bbmap/

ADD REPLY • link 7.0 years ago by GenoMax 147k

0

Entering edit mode

Thanks @genomax! I've updated my previous comment.

ADD REPLY • link 7.0 years ago by Sej Modha 5.3k

0

Entering edit mode

Unfortunately bbmap requires either a pre-made seqs.qual file or a constant given quality score. The next step of my pipeline (QIIME 2) cannot process a fastq with a constant quality score. And I don't know how to write a script to generate a pre-made seqs.qual file with realistic error profiles for input pair-end reads that are joined already (so they are of different lengths). I finally resorted to using unprocessed FASTQ files (which means redoing the preprocessing steps that have already been done in the FASTA files). Anyway thanks for attempting to help.

ADD REPLY • link 7.0 years ago by jjmmii • 0

0

Entering edit mode

Have you thought about writing to authors of ART to see if they can help you?

Some random thoughts: Is the profile for 300 bp significant different than the one for 250? This profile is going to be subject to type of libraries being sequenced (e.g. diversity, sequencer) so is that going to make a huge difference for whatever you are trying to do. There are plenty of MiSeq datasets now available. Can you use a sampling of them instead of generating simulated data?

ADD REPLY • link 7.0 years ago by GenoMax 147k

0

Entering edit mode

Yes, I wrote to them but no reply :(

Sorry I didn't make myself clear in my question. My ultimate goal isn't to simulate reads. I'm trying to "steal" the quals from simulated reads and add them into the FASTA files with real data. You made a good point that I can steal them from real datasets. But I forgot the second problem which is my FASTA files are joined pair-end reads, so they're actually not 300bp but the size of the 16S amplicon. So I need to find a real dataset of joined sequences in FASTQ format and the amplicon size has to match and hope that every size has a match (because joined sequences' lengths vary by a few base pairs) or write a script to add/subtract a few base pairs. This effort becomes greater than just starting from raw FASTQ files...

One party that I didn't email to yet is the provider of these FASTA files if they have the FASTQ files as well :P So I'll write to them now and see how it goes.

I think of the MiSeq 300 profile as an extension of the 250 profile, as in, the first 250 bp have the same profiles, but from 250 to 300 the quality score continues to drop.

Thanks for listening, haha.

ADD REPLY • link 7.0 years ago by jjmmii • 0

1

Entering edit mode

There should be plenty of datasets that can be merged in that size range. You may need to look around on EBI-ENA. If you can get the raw fastq files then that is the simplest way.

ADD REPLY • link 7.0 years ago by GenoMax 147k

0

Entering edit mode

Ok! Thanks genomax and Sej Modha.

ADD REPLY • link 7.0 years ago by jjmmii • 0

score 1 · Answer 1 · 2018-05-08

1

Entering edit mode

6.6 years ago

h.mon 35k

If you generated custom profiles, the documentation says the parameters to use them are:

    -1   --qprof1   the first-read quality profile
    -2   --qprof2   the second-read quality profile

ADD COMMENT • link 6.6 years ago by h.mon 35k