Question

How to use machine learning on FASTQ/FASTA data?

2

7.5 years ago

inkprs ▴ 70

I have a master file F1.fasta and it has to be split into 2 files: F2.fasta and F3.fasta.

A record (both ID and sequence) goes into F3.fasta if its sequence contains a run of more than 6 consecutive A's or T's, or if more than 40% of its bases are [Aa] or more than 40% are [Tt]. Everything else goes into F2.fasta. This removes some false positives. See this: How do I use a regex to remove a line from a FASTA file if it has more than 6 'A' or 6 'T' sequentially present in a line?

I have a fourth file, F4.txt, which has one pattern per line, e.g. 'ATATTTTTTA', 'AAATAAAT'. Based on the patterns in this file, I want to use machine learning to decide whether each record goes into F2.fasta or F3.fasta.

I am looking for a Python-based solution (PySpark, scikit-learn, etc.).

Are there any machine learning libraries out there to work with Fastq/Fasta data?

machine learning fasta fastq sequencing • 3.4k views

ADD COMMENT • link 7.5 years ago by inkprs ▴ 70

3

Why? - Give some context as to your end game.

ADD REPLY • link 7.5 years ago by andrew.j.skelton73 6.6k

1

Umm, why is this phrased in terms of machine learning? It's just straight programming, there's no actual learning taking place here.
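To illustrate the "straight programming" point, here is a minimal sketch of the split rule in plain Python. It assumes that "more than 6" means a run of 7 or more identical A's or identical T's (case-insensitive), that the 40% threshold is applied to the whole sequence, and that multi-line sequences are joined before testing. The function names (`parse_fasta`, `is_flagged`, `split_records`) are made up for this example and are not from any library.

```python
import re
from io import StringIO

# Assumption: "more than 6" means a run of 7+ identical A's or 7+ identical T's.
RUN_OF_A_OR_T = re.compile(r"A{7,}|T{7,}", re.IGNORECASE)


def parse_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_flagged(seq, max_fraction=0.40):
    """Return True if the record should go to F3 (long A/T run or >40% A or T)."""
    if not seq:
        return False
    if RUN_OF_A_OR_T.search(seq):
        return True
    upper = seq.upper()
    n = len(upper)
    return upper.count("A") / n > max_fraction or upper.count("T") / n > max_fraction


def split_records(records):
    """Split (header, sequence) pairs into (keep, flagged) lists, i.e. F2 and F3."""
    keep, flagged = [], []
    for header, seq in records:
        (flagged if is_flagged(seq) else keep).append((header, seq))
    return keep, flagged


if __name__ == "__main__":
    # Tiny in-memory stand-in for F1.fasta; replace with open("F1.fasta").
    demo = StringIO(">r1\nACGTACGTACGT\n>r2\nACGAAAAAAAGC\n>r3\nATATATGATATC\n")
    keep, flagged = split_records(parse_fasta(demo))
    print("F2:", [h for h, _ in keep])
    print("F3:", [h for h, _ in flagged])
```

Writing the two lists out as F2.fasta and F3.fasta is then just a loop over `keep` and `flagged`; nothing is learned from data at any point.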

ADD REPLY • link 7.5 years ago by Devon Ryan 104k

2

Are there any machine learning libraries out there to work with Fastq/Fasta data?

ADD REPLY • link 7.5 years ago by inkprs ▴ 70

0

I don't think the headline matches the body of the question, and the problem you describe does not sound like machine learning.

A record (both ID and sequence) goes into F3.fasta if its sequence contains a run of more than 6 consecutive A's or T's, or if more than 40% of its bases are [Aa] or more than 40% are [Tt]. Everything else goes into F2.fasta. This removes some false positives.

The above is totally deterministic, so not machine learning.

I have a fourth file, F4.txt, which has one pattern per line, e.g. 'ATATTTTTTA', 'AAATAAAT'. Based on the patterns in this file, I want to use machine learning to decide whether each record goes into F2.fasta or F3.fasta.

'Machine learning' is too broad; you first need to describe your problem in the framework of machine learning. What is the algorithm going to learn? What are the training and test data, and what is the ground truth?
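To make those questions concrete, here is a rough sketch of what a supervised framing could look like, written in plain Python. It is only an illustration under stated assumptions: the labelled sequences below are invented, the label meaning (1 = false positive) is a guess at the intent, and `pattern_features` and `train_test_split` are hypothetical helpers written for this example. The only items taken from the question are the two example patterns.

```python
import random


def pattern_features(seq, patterns):
    """Binary feature vector: 1 if the pattern occurs in the sequence, else 0."""
    upper = seq.upper()
    return [1 if p.upper() in upper else 0 for p in patterns]


def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and split it into (train, test) lists."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# The two example patterns quoted in the question (one per line in F4.txt).
patterns = ["ATATTTTTTA", "AAATAAAT"]

# Hypothetical ground truth: sequences someone has labelled by hand.
# Assumed meaning: 1 = false positive (would go to F3), 0 = keep (F2).
labelled = [
    ("GGC" + "ATATTTTTTA" + "CG", 1),
    ("CC" + "AAATAAAT" + "GG", 1),
    ("ACGTGCGTACGGCA", 0),
    ("GGCCGTACGATCGA", 0),
]

# What the algorithm would learn from: (feature vector, label) pairs.
examples = [(pattern_features(seq, patterns), label) for seq, label in labelled]

# Training data to fit on, and held-out test data to evaluate on.
train, test = train_test_split(examples)
```

Feature vectors and labels in this shape are what a classifier would be trained on. Without a labelled set like `labelled`, there is nothing for an algorithm to learn, and plain pattern matching against F4.txt already does the job.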

I am looking for a Python-based solution (PySpark, scikit-learn, etc.).

This is irrelevant; don't choose your tool first and then go looking for a problem.

ADD REPLY • link 7.5 years ago by Michael 55k