I have a master file F1.fasta and it has to be split into 2 files: F2.fasta and F3.fasta.
The criterion for loading a record (both ID and sequence) into F3.fasta is that it contains a run of more than 6 consecutive A's or T's, or that more than 40% of its bases are [Aa] or more than 40% are [Tt]. Everything else is loaded into F2.fasta. This removes some false positives. See this: How do I use a regex to remove a line from a FASTA file if it has more than 6 'A' or 6 'T' sequentially present in a line?
I have a fourth file, F4.txt, which has one pattern per line, e.g. 'ATATTTTTTA', 'AAATAAAT'. Based on the patterns in this file, I want to use machine learning to decide whether to load data into F2.fasta or F3.fasta.
I am looking for a Python-based solution (PySpark, scikit-learn, etc.).
-
Are there any machine learning libraries out there to work with Fastq/Fasta data?
-
Why? - Give some context as to your end game.
Umm, why is this phrased in terms of machine learning? It's just straight programming, there's no actual learning taking place here.
Are there any machine learning libraries out there to work with Fastq/Fasta data?
I don't think the headline is in accordance with the body of the question, and the problem you describe does not sound like machine learning.
The above is totally deterministic, so not machine learning.
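To illustrate why this is plain programming, here is a minimal sketch of the rule in standard-library Python. It makes some assumptions that the question does not confirm: "more than 6" is read as a run of 7 or more identical A's or T's, the 40% threshold is applied per record (not per line) and separately to A and to T, and matching is case-insensitive. The function names `is_flagged`, `read_fasta`, and `split_fasta` are made up for this example.

```python
import re

# Assumption: "more than 6" means a run of 7 or more identical A's or T's.
HOMOPOLYMER = re.compile(r"A{7,}|T{7,}", re.IGNORECASE)


def is_flagged(seq, max_fraction=0.40):
    """Return True if a sequence meets the F3 criteria as read above."""
    seq = seq.upper()
    if not seq:
        return False
    if HOMOPOLYMER.search(seq):
        return True
    n = len(seq)
    # Assumption: the 40% threshold is checked for A and for T separately.
    return seq.count("A") / n > max_fraction or seq.count("T") / n > max_fraction


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(src_path, kept_path, flagged_path):
    """Write flagged records to flagged_path and all others to kept_path."""
    with open(src_path) as src, \
            open(kept_path, "w") as kept, \
            open(flagged_path, "w") as flagged:
        for header, seq in read_fasta(src):
            out = flagged if is_flagged(seq) else kept
            out.write(f"{header}\n{seq}\n")
```

With the file names from the question, the call would be `split_fasta("F1.fasta", "F2.fasta", "F3.fasta")`. Note that the sketch writes each sequence on a single line, so any line wrapping in F1.fasta is not preserved.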
'Machine learning' is too broad; you first need to describe your problem in the framework of machine learning. What is the algorithm going to learn? What are the training and test data, and what is the ground truth?
This is irrelevant; don't choose your tool first and then search for a problem.