MutScan (https://github.com/OpenGene/MutScan)

Question

Tutorial: MutScan: Detect important mutations by scanning FastQ files directly

8.2 years ago

chen ★ 2.5k

MutScan (https://github.com/OpenGene/MutScan)

Ultra-sensitive
More than 20x faster than a conventional pipeline (e.g. BWA + Samtools + GATK/VarScan/Mutect)
Very easy to use, with nothing else required: no alignment, no reference assembly, no variant calling, no pileup...
Beautiful HTML report
Multi-threading support
Supports both single-end and paired-end data
For paired-end data, MutScan tries to merge each pair and performs quality adjustment and error correction

Download

# download over HTTP
wget https://github.com/OpenGene/MutScan/archive/master.zip

# or clone with git
git clone https://github.com/OpenGene/MutScan.git

Build

cd MutScan
make

Usage

usage: mutscan -1 <read1_file> -2 <read2_file> -m <mutation_file> -h <html_report_file> -t <thread>  
options:
  -1, --read1       read1 file name (string)
  -2, --read2       read2 file name (string)
  -m, --mutation    optional, mutation file name (string)
  -h, --html        optional, filename of html report, no html report if not specified (string)
  -?, --help        print this message
  -t, --thread      thread number, default 4 (int)

The plain-text result, which contains the detected mutations and their supporting reads, is printed to standard output. Use > to redirect it to a file:

mutscan -1 <read1_file_name> -2 <read2_file_name> -m <mutation_file_name> > result.txt

To generate an HTML report, pass the -h argument:

mutscan -1 <read1_file_name> -2 <read2_file_name> -m <mutation_file_name> -h report.html

Single-end and paired-end

For single-end sequencing data, omit the -2 argument:

mutscan -1 <read1_file_name> -m <mutation_file_name>
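
When MutScan is driven from a script, the command lines above can be assembled programmatically. The following is a minimal sketch that uses only the flags listed under Usage (-1, -2, -m, -h, -t). The function name build_mutscan_command, its parameter names, and the assumption that the mutscan executable is on PATH are illustrative and not part of MutScan.

```python
import shlex
from typing import List, Optional


def build_mutscan_command(
    read1: str,
    read2: Optional[str] = None,
    mutation_file: Optional[str] = None,
    html_report: Optional[str] = None,
    threads: int = 4,
    executable: str = "mutscan",  # assumption: mutscan is on PATH
) -> List[str]:
    """Assemble a mutscan argument list from the options shown under Usage."""
    cmd = [executable, "-1", read1]
    if read2 is not None:          # paired-end data only
        cmd += ["-2", read2]
    if mutation_file is not None:  # documented as optional
        cmd += ["-m", mutation_file]
    if html_report is not None:    # no HTML report unless a filename is given
        cmd += ["-h", html_report]
    cmd += ["-t", str(threads)]
    return cmd


if __name__ == "__main__":
    cmd = build_mutscan_command("R1.fq", read2="R2.fq", html_report="report.html")
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run, with stdout pointed at an open file to capture the plain-text result in the same way as the > redirection.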

Mutation file

The mutation file is a CSV file with four columns: name, left_seq_of_mutation_point, mutation_seq, and right_seq_of_mutation_point.

#name, left_seq_of_mutation_point, mutation_seq, right_seq_of_mutation_point
NRAS-neg-1-115258748-2-c.34G>A-p.G12S-COSM563, GGATTGTCAGTGCGCTTTTCCCAACACCAC, T, TGCTCCAACCACCACCAGTTTGTACTCAGT
NRAS-neg-1-115252203-2-c.437C>T-p.A146V-COSM4170228, TGAAAGCTGTACCATACCTGTCTGGTCTTG, A, CTGAGGTTTCAATGAATGGAATCCCGTAAC
BRAF-neg-7-140453136-15-c.1799T>A -V600E-COSM476, AACTGATGGGACCCACTCCATCGAGATTTC, T, CTGTAGCTAGACCAAAATCACCTATTTTTA
EGFR-pos-7-55241677-18-c.2125G>A-p.E709K-COSM12988, CCCAACCAAGCTCTCTTGAGGATCTTGAAG, A, AAACTGAATTCAAAAAGATCAAAGTGCTGG
EGFR-pos-7-55241707-18-c.2155G>A-p.G719S-COSM6252, GAAACTGAATTCAAAAAGATCAAAGTGCTG, A, GCTCCGGTGCGTTCGGCACGGTGTATAAGG
EGFR-pos-7-55241707-18-c.2155G>T-p.G719C-COSM6253, GAAACTGAATTCAAAAAGATCAAAGTGCTG, T, GCTCCGGTGCGTTCGGCACGGTGTATAAGG
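
To check a custom mutation file before scanning, the four-column layout can be parsed with a few lines of Python. This is a sketch under stated assumptions: the Mutation class and read_mutations function are hypothetical helpers, not part of MutScan, and the parser assumes that lines starting with # are comments and that names contain no commas.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Mutation(NamedTuple):
    name: str
    left: str    # left_seq_of_mutation_point
    center: str  # mutation_seq
    right: str   # right_seq_of_mutation_point

    @property
    def mutant_sequence(self) -> str:
        """Full sequence a read would contain if it carries the mutation."""
        return self.left + self.center + self.right


def read_mutations(handle) -> Iterator[Mutation]:
    """Yield one Mutation per data row, skipping blank and '#' lines."""
    for row in csv.reader(handle):
        if not row or row[0].lstrip().startswith("#"):
            continue
        fields = [f.strip() for f in row]  # the example rows use ", " separators
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {row!r}")
        yield Mutation(*fields)


# One row copied from the example above.
EXAMPLE = """#name, left_seq_of_mutation_point, mutation_seq, right_seq_of_mutation_point
EGFR-pos-7-55241677-18-c.2125G>A-p.E709K-COSM12988, CCCAACCAAGCTCTCTTGAGGATCTTGAAG, A, AAACTGAATTCAAAAAGATCAAAGTGCTGG
"""

if __name__ == "__main__":
    for mutation in read_mutations(io.StringIO(EXAMPLE)):
        print(mutation.name, len(mutation.mutant_sequence))
```

Raising on rows that do not have exactly four columns makes formatting mistakes visible before a scan is started.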

A default CSV file containing important actionable cancer gene targets is provided in mutation/cancer.csv. To use this mutation file directly, omit the -m <mutation_file_name> argument:

mutscan -1 <read1_file_name> -2 <read2_file_name>

HTML output

If the -h or --html argument is given, an HTML report is generated and written to the given filename.

In the report, the color of each base indicates its quality, and the quality value is shown on mouse-over.

fastq mutation cancer target • 7.4k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 8.2 years ago by chen ★ 2.5k

Cool. How does this work? Are you doing some kind of fuzzy k-mer alignment to target genes?

ADD REPLY • link 8.2 years ago by Damian Kao 16k

Yes. Basically this is an implementation of a string-searching algorithm for sequences, with support for error tolerance, quality handling, and other sequence-related features.

ADD REPLY • link 8.2 years ago by chen ★ 2.5k
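
The idea of error-tolerant string searching mentioned in the reply above can be illustrated with a simple sliding-window search that allows a bounded number of substitutions. This is only a sketch of the general technique, not MutScan's actual algorithm: it ignores base qualities, indels, and reverse complements, and the function names are hypothetical.

```python
from typing import List


def hamming_within(a: str, b: str, max_mismatches: int) -> bool:
    """True if equal-length strings a and b differ at no more than max_mismatches positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False  # stop early once the budget is exceeded
    return True


def find_with_mismatches(read: str, pattern: str, max_mismatches: int = 1) -> List[int]:
    """Return every start offset in read where pattern matches with a bounded number of substitutions."""
    hits = []
    for start in range(len(read) - len(pattern) + 1):
        window = read[start:start + len(pattern)]
        if hamming_within(window, pattern, max_mismatches):
            hits.append(start)
    return hits


if __name__ == "__main__":
    # The pattern CGTT is not present exactly, but CGTA at offset 1 is one substitution away.
    print(find_with_mismatches("ACGTACGT", "CGTT", max_mismatches=1))
```

This brute-force version costs time proportional to read length times pattern length per read. A practical tool would need indexing or k-mer seeding to scan deep sequencing data quickly.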

Am I correct in assuming that this tool is designed for human samples only?

ADD REPLY • link 8.2 years ago by harold.smith.tarheel ★ 5.0k

No, you can specify any sequence in the mutation list CSV file.

ADD REPLY • link 8.2 years ago by chen ★ 2.5k

Can we use this for RNAseq data?

ADD REPLY • link 8.2 years ago by Ron ★ 1.2k

Sure, it is just sequence. But protein sequence is not supported yet.

ADD REPLY • link 8.2 years ago by chen ★ 2.5k

How do I open this program? I don't understand. Please help me.

ADD REPLY • link 4.6 years ago by andrre1520 • 0

score 5 · Answer 1 · 2016-09-28

If you're looking for some feedback:

1) Output a VCF file. You can also output a CSV file, or have them as selectable options, but at the very least give the option of outputting a properly formatted VCF file. If you want to gain traction as a tool you need to conform to widely adopted standards and fit into people's workflows.

2) In your CSV file, jamming all of the extra annotation info into the name field can be useful as a shortcut, and I can see the appeal of it creating a unique ID, but it also makes the file less useful in the end, particularly if I'm dealing with a CSV file with potentially a large number of variants in it. If I'm going to load that into Excel, I want the info in that name field to be in separate columns. And while I can split on additional characters when I import the data, that creates an unnecessary extra step and makes sharing the raw CSV file with less computationally savvy colleagues less appealing.

3) The HTML report looks nice but doesn't seem to highlight the "TestMutation" very obviously, at least when I glance at the report.

4) If the mutation file is optional and it doesn't use any sort of reference, is the extra info about the mutation that you have in the name field added? I'm assuming it is using this mutation file as the "reference" for annotation?
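
The extra import step described in point 2 can be sketched in Python for readers who want the name field as separate columns. The column labels below are assumptions inferred from the example rows in the tutorial (for instance, that the fifth part is an exon number). They are not documented by MutScan, and gene symbols that themselves contain a hyphen would break this simple split.

```python
from typing import Dict, List

# Assumed meaning of the dash-separated parts, inferred from the example rows only.
NAME_FIELDS: List[str] = [
    "gene", "strand", "chromosome", "position",
    "exon", "cds_change", "protein_change", "cosmic_id",
]


def split_name(name: str) -> Dict[str, str]:
    """Split a MutScan-style mutation name into labelled parts."""
    parts = [p.strip() for p in name.split("-")]
    if len(parts) != len(NAME_FIELDS):
        raise ValueError(
            f"expected {len(NAME_FIELDS)} dash-separated parts, got {len(parts)}: {name!r}"
        )
    return dict(zip(NAME_FIELDS, parts))


if __name__ == "__main__":
    record = split_name("EGFR-pos-7-55241677-18-c.2125G>A-p.E709K-COSM12988")
    for key, value in record.items():
        print(f"{key}\t{value}")
```

Each resulting dictionary can be written out with csv.DictWriter to produce a file with one column per annotation.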

Otherwise this looks interesting and I have a bunch of data to test it on.

score 2 · Answer 2 · 2016-09-28

I'm very skeptical that this will perform anywhere near as well as a 'normal' SNP calling pipeline. I don't doubt I'm probably missing something though. Maybe a brain. Can you confirm that:

Adapters and poor-quality bases will be removed and not called as variants, even though the adapter sequence isn't known to your algorithm?
The optical/pcr duplicates will be marked and discarded?
All the usual post-mapping work that informs SNP calling (such as realignment around indels) is done, or is not an issue due to the way you make your graphs?
You can detect frameshift mutations anywhere in the gene body without specifying exactly what that mutation should look like?
How can you produce VCFs without mapping to the genome? I suppose you cannot. Which suggests this would be incompatible with all the other down-stream variant calling tools out there.

Personally I think the 20x faster claim is like comparing apples to oranges. My gut-instinct is to think there's no way this approach, with less information about the genome, can possibly detect variants as well as the 'normal' method. Worse, I suspect it will be used inappropriately by people looking to cut corners.

On the positive side, I suppose to even check 1 entry in the CSV file, you have to build the entire graph. If you could write out the graph after building it, I can see a number of time-memory tradeoff techniques that could be developed in the future. This would also be the go-to tool for looking at variants in organisms without sequenced/annotated genomes. Although it's debatable how your CSV file would look for an organism without a sequenced genome...

score 2 · Answer 3 · 2016-09-29

8.2 years ago

harold.smith.tarheel ★ 5.0k

Since I didn't get a response to my comment/questions, I'll repost as an answer:

1) I can envision how the tool might add annotations to variants that are listed in the optional mutation file. But how does it treat de novo mutations, or cases where the mutation file is not provided?

2) Absent a reference and mutation file, how do you call a homozygous variant?

3) You state that the tool is useful for ultra-low frequency mutations. How do you discriminate those from common sequencing errors?

ADD COMMENT • link 8.2 years ago by harold.smith.tarheel ★ 5.0k

This tool is not a variant caller. It just helps eliminate false negatives in low-frequency mutation detection.
In some applications, such as circulating tumor DNA sequencing, the mutated reads are usually very few and may not be detected by normal pipelines.
In this case, this tool can scan the important mutation loci (like EGFR L858R, which makes patients sensitive to EGFR TKI treatment) to check for false negatives.
If you have experience with ctDNA sequencing, you may understand why I developed this.

ADD REPLY • link 8.2 years ago by chen ★ 2.5k

score 1 · Answer 4 · 2016-09-28

I'm wondering if this approach can be adapted to detect closely related strains in metagenomics samples. If, for instance, you have a region of a relatively conserved gene, you will see some variation in it when you have close strains or species. Since mapping is usually not an option and assembly would probably miss these kinds of variations, using such a tool would give very important input for how to assemble the sample (error tolerance etc.). Any thoughts?

score 0 · Answer 5 · 2016-09-28

8.2 years ago

chen ★ 2.5k

This tool can be very useful for cancer somatic mutation detection, especially for detecting ultra-low frequency mutations from deep sequencing data.

This tool can be used directly in liquid biopsy, like ctDNA sequencing.

ADD COMMENT • link 8.2 years ago by chen ★ 2.5k

Yes, in theory it works, but in practice you need to prove it works to high accuracy and works better than other alternatives.

ADD REPLY • link 8.2 years ago by lh3 33k

Thanks!

We're currently testing it on ~1000 cfDNA samples. I will post the results once it's done.

ADD REPLY • link 8.2 years ago by chen ★ 2.5k

Hi Chen,

Were you able to test out 1000 cfDNA samples? I would be very interested in knowing the answer.

Thanks

ADD REPLY • link 7.4 years ago by caspase8mach ▴ 30

score 0 · Answer 6 · 2018-01-17

Hi Chen,

I was trying to run the stable release of MutScan on my data by executing the following command:

./mutscan -1 my_reads.fq -m my_mutations.csv -S 1

I was expecting to see output for variants for which at least a single read matches the mutation specified in my_mutations.csv. But I obtained the following:

No mutation will be scanned
Scanning 0 mutations...
Loaded all of 1000 reads
MutScan didn't find any mutation.

However, I'm sure that there are multiple reads supporting the mutation.

Could you provide some toy data files so that I can run the program, or a command that can be executed on the data already present in the /testdata directory of the repository?

Thank you in advance