I am a student in Computer Science and want to understand what a FASTA file is.
I have downloaded "blood SRR403015 393663256" (not that I know what that means) from https://trace.ncbi.nlm.nih.gov/Traces/sra/?view=search_seq_name&exp=SRX118102&run=&m=search&s=seq.
I am going to construct a data structure that holds the sequences in that file. However, I realize that there are duplicates. For example, the sequence "TTTCGAAGCATCTTTTGGGCAAACTTCTTTCTCAGGCGCTTGATCTTCA" occurs 11 times. And if this holds for every sequence, the file has a lot of duplicates!
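This is roughly how I counted the duplicates (a quick Python sketch; I'm assuming the download is plain FASTA, and "SRR403015.fasta" is just what I named the file locally):

```python
from collections import Counter

counts = Counter()
with open("SRR403015.fasta") as f:    # just my local name for the download
    seq_parts = []
    for line in f:
        line = line.strip()
        if line.startswith(">"):      # a ">" header line starts a new record
            if seq_parts:
                counts["".join(seq_parts)] += 1
            seq_parts = []
        else:                         # sequence lines (possibly wrapped)
            seq_parts.append(line)
    if seq_parts:                     # count the last record too
        counts["".join(seq_parts)] += 1

print(counts["TTTCGAAGCATCTTTTGGGCAAACTTCTTTCTCAGGCGCTTGATCTTCA"])  # 11 in my case
```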
- Why does the file contain duplicates?
- Can I delete the duplicates, or are they needed to know how many occurrences of a specific gene the file has? I mean, my data structure is going to be used for queries like "Is gene X in here?" (see the small sketch after these questions).
- What kind of sequences are they?
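To make the second question concrete, this is the kind of use I have in mind; the query string below is just a placeholder for "gene X", and whether collapsing duplicates into a set like this is acceptable is exactly what I'm unsure about:

```python
# If only membership ("is gene X in here?") matters, duplicates could be
# collapsed into a set -- but then "this sequence occurs 11 times" is lost.
unique_seqs = {
    "TTTCGAAGCATCTTTTGGGCAAACTTCTTTCTCAGGCGCTTGATCTTCA",
    # ...the remaining sequences parsed from the file
}

gene_x = "TTTCGAAGCATCTTTTGGGCAAACTTCTTTCTCAGGCGCTTGATCTTCA"  # placeholder query
print(gene_x in unique_seqs)  # True
```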
I hope you can help me with the above. Thanks in advance.
The articles are a bit too technical for me to understand fully. However, allow me to ask the following:
If you concatenate the sequences, you're creating artificial k-mers that aren't present in the data. You probably don't want to do that. The file you linked to is part of a single experiment (it's one of presumably many samples).
I am following this article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4804353/, where the author creates k-mers (k = 20) from all the files he refers to. Those files are just experiments, and the file I linked to is actually from the same list of files he uses. What I had assumed is that he concatenates the files into one big RNA string and then produces the k-mers. Those k-mers are then hashed into a Bloom filter, and the Bloom filter is inserted into a binary tree.
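To check whether I have that step right after your comment, here is a simplified sketch of how I now picture it: take the 20-mers of each read of a single file separately (no concatenation across reads or files) and insert them into one Bloom filter for that file; in the paper that filter then becomes a leaf of the binary tree. The filter size, hash scheme, and example read below are placeholders I made up, not values from the paper:

```python
import hashlib

K = 20  # k-mer length used in the article

def kmers(read, k=K):
    """Yield the overlapping k-mers of one read (no concatenation of reads)."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

class BloomFilter:
    """A tiny Bloom filter: a bit array plus a few hash positions per item."""

    def __init__(self, num_bits=1_000_000, num_hashes=3):  # made-up sizes
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive num_hashes bit positions from one SHA-256 digest of the item.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # May give false positives, never false negatives.
        return all(self.bits[pos] for pos in self._positions(item))

# One Bloom filter per file; the reads below stand in for the parsed FASTA records.
reads = ["TTTCGAAGCATCTTTTGGGCAAACTTCTTTCTCAGGCGCTTGATCTTCA"]
bf = BloomFilter()
for read in reads:
    for km in kmers(read):
        bf.add(km)

print("TTTCGAAGCATCTTTTGGGC" in bf)  # query one 20-mer -> True
```

Is that roughly the right picture for a single experiment, leaving aside the implementation details the paper actually uses?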
He says something about canonical k-mers (page 5, "Building Bloom filters").
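If I understand that part correctly, a canonical k-mer is just the lexicographically smaller of a k-mer and its reverse complement, so that both strands map to the same key before hashing. A tiny sketch of what I think it means (please correct me if this is wrong):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Reverse complement of a DNA k-mer (A<->T, C<->G, read backwards)."""
    return kmer.translate(COMPLEMENT)[::-1]

def canonical(kmer):
    """Return the lexicographically smaller of the k-mer and its reverse
    complement, so both forms of the same k-mer hash identically."""
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc

print(canonical("TTTCGAAGCATCTTTTGGGC"))  # same result for its reverse complement
```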