A layman trying to understand what a FASTA file is
3.7 years ago

I am a student in Computer Science and want to understand what a FASTA file is.

I have downloaded: blood SRR403015 393663256 (not that I know what it is) from: https://trace.ncbi.nlm.nih.gov/Traces/sra/?view=search_seq_name&exp=SRX118102&run=&m=search&s=seq.

I am going to construct a data structure that holds the sequences in that file. However, I realize that there are duplicates. For example, the sequence "TTTCGAAGCATCTTTTGGGCAAACTTCTTTCTCAGGCGCTTGATCTTCA" occurs 11 times. If this holds for every sequence, the file has a lot of duplicates!

  1. Why does the file contain duplicates?
  2. Can I delete duplicates, or are they needed to know how many times a specific gene occurs in the file? My data structure is going to be used for queries like "Is gene X in here?".
  3. What kind of sequences are they?

I hope you can help me with the above. Thanks in advance.

RNA-Seq gene next-gen-sequencing • 1.4k views
ADD COMMENT
3.7 years ago
ATpoint 85k

What you have there is an RNA-seq dataset; the common raw data format would be fastq, not fasta. There is plenty of documentation available on the web that explains what RNA-seq, fasta and fastq are, and what you can do with them. I am not going to write all of that down here. For starters you should get familiar with RNA-seq, e.g.

https://www.annualreviews.org/doi/abs/10.1146/annurev-biodatasci-072018-021255

https://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html

...plus the plentiful material available online. No, do not remove duplicates. Please google for duplicates + RNA-seq once you understand what the files are; this has been discussed many times before. To make the data "usable" you have to align the reads to a genome/transcriptome, which is covered in the linked articles.

In short:

1) Why duplicates: because the same piece of (c)DNA has been sequenced multiple times. This can be either a true duplicate or a PCR artifact.

2) It is slightly more complicated than "is my gene in there"; I think that will become clearer after reading the linked articles.

3) RNA-seq
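
To illustrate the point about keeping duplicates, here is a minimal Python sketch that stores each distinct read once together with its count, so the abundance information is not lost. The function name `count_reads` and the toy reads are assumptions made for illustration; real RNA-seq quantification works on aligned reads, not on raw string counts.

```python
from collections import Counter

def count_reads(reads):
    """Map each distinct read sequence to the number of times it occurs."""
    return Counter(reads)

# Toy input: the first sequence is the one quoted in the question,
# the second is made up for this example.
reads = [
    "TTTCGAAGCATCTTTTGGGCAAACTTCTTTCTCAGGCGCTTGATCTTCA",
    "ACGTACGTAC",
    "TTTCGAAGCATCTTTTGGGCAAACTTCTTTCTCAGGCGCTTGATCTTCA",
]

counts = count_reads(reads)
print(len(counts))            # number of distinct sequences
print(counts.most_common(1))  # the most frequent sequence and its count
```

Storing counts instead of deleting repeats keeps memory usage low while still letting you answer "how often" as well as "is it there".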

ADD COMMENT

The articles are a bit too technical for me to understand them fully. However, allow me to ask the following:

  1. The file I have sent contains lots of smaller RNA strings (around 70 bp). Why are there so many small sequences?
  2. Most importantly: let's say I build a large data structure (DS) that contains 10 files on blood (like the one I have linked to). Could I go ahead and query my DS for, say, RNA from a mouse? And what happens if I increase my DS to also hold files on brain RNA, and so on?
ADD REPLY
  1. You have 70 bp strings because that is what has been sequenced. You start with long RNA molecules, which are fragmented in the process of making libraries (collections of smaller pieces) and then sequenced. Since these fragments come from longer molecules, you will find that many of them match a longer reference in an overlapping fashion (setting aside splicing, if you know what that is).
  2. It is probably more efficient to create a database (e.g. a suffix trie or a similar structure) of the reference RNA and then align your strings to it, which is what most aligners do. There are plenty of NGS data aligners, so take a look at what is available first.
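
As a rough sketch of the "index the reference, then look up the reads" idea, the Python below builds a naive suffix array over a toy reference and finds exact occurrences of a read by binary search. The names `build_suffix_array` and `find_exact` and the toy reference are assumptions for illustration. Real aligners use far more compact indexes and also tolerate mismatches and splicing, which this sketch does not.

```python
def build_suffix_array(ref):
    """Naive suffix array: start positions ordered by the suffix they begin."""
    return sorted(range(len(ref)), key=lambda i: ref[i:])

def find_exact(ref, sa, read):
    """Return the sorted start positions where `read` occurs exactly in `ref`."""
    m = len(read)
    lo, hi = 0, len(sa)
    # Binary search for the first suffix whose length-m prefix is >= read.
    while lo < hi:
        mid = (lo + hi) // 2
        if ref[sa[mid]:sa[mid] + m] < read:
            lo = mid + 1
        else:
            hi = mid
    # All matching suffixes are adjacent in the suffix array.
    hits = []
    while lo < len(sa) and ref[sa[lo]:sa[lo] + m] == read:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)

ref = "ACGTACGTTAGC"  # toy reference, made up for this example
sa = build_suffix_array(ref)
print(find_exact(ref, sa, "ACGT"))  # positions of an exact match
print(find_exact(ref, sa, "CCCC"))  # empty list: the read is absent
```

The index is built once per reference, after which each read lookup costs a binary search rather than a scan of the whole reference.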
ADD REPLY
  1. OK. I am building something called a Sequence Bloom Tree. The idea is to take an experiment, compute all of its k-mers, and hash them into a Bloom filter. So, what I do is: I take the file, concatenate all sequences (because the file is divided into many sequences, as mentioned), and then get all k-mers for the Bloom filter. Am I allowed to concatenate the sequences into one big sequence and then split it into k-mers?
  2. Is the file I linked to one experiment, or is it many experiments in one file?
ADD REPLY

If you concatenate the sequences you're creating artificial k-mers that aren't present in the data. You probably don't want to do that. The file you linked to is a part of a single experiment (it's one of presumably many samples).
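
A small Python sketch makes the problem concrete: k-mers that span the junction between two reads exist only in the concatenated string. The helper `kmers` and the two toy reads are assumptions for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

reads = ["ACGTA", "GGCTT"]  # toy reads, made up for this example
k = 3

# k-mers extracted from each read separately
per_read = set().union(*(kmers(r, k) for r in reads))

# k-mers extracted after gluing the reads together
concatenated = kmers("".join(reads), k)

# k-mers that exist only because of the concatenation
artificial = concatenated - per_read
print(sorted(artificial))  # ['AGG', 'TAG']
```

Each junction can contribute up to k-1 such spurious k-mers, so extracting k-mers read by read avoids the problem.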

ADD REPLY

I am following this article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4804353/, where he creates k-mers (k=20) from all the files that he refers to. Those files are just experiments, and the file I've sent is actually from the same list of files he uses. What I have assumed is that he concatenates the files into one big RNA string and then produces the k-mers. Those k-mers are then hashed to a Bloom filter, and the Bloom filter is inserted into a binary tree.
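
For reference, hashing k-mers into a Bloom filter can be sketched in Python as below. This is not the article's implementation: the class `BloomFilter`, the bit-array size, the number of hash functions, and the use of SHA-256 are all assumptions chosen to keep the example short. The k-mers are taken from each read on its own, with k=20 as in the article.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, false positives possible."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def kmers(seq, k):
    """Return all length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Sizes are arbitrary choices for this toy example.
bf = BloomFilter(num_bits=10_000, num_hashes=3)
read = "TTTCGAAGCATCTTTTGGGCAAACTTCTTTCTCAGGCGCTTGATCTTCA"
for km in kmers(read, 20):
    bf.add(km)

print(read[:20] in bf)  # True: an inserted k-mer is always reported present
```

A query k-mer that was never inserted is usually reported absent, but a Bloom filter can return false positives, so a "present" answer is only probabilistic.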

He says something about canonical kmers (page 5, "Building bloom filters")
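
"Canonical" k-mer usually means the lexicographically smaller of a k-mer and its reverse complement, so that a sequence and its opposite strand map to the same entry. Assuming that is the sense used in the article, a minimal Python sketch (function names are hypothetical) looks like this:

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical(kmer):
    """Return the lexicographically smaller of kmer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

print(canonical("TTTC"))  # GAAA
print(canonical("GAAA"))  # GAAA (both strands give the same canonical form)
```

Hashing `canonical(kmer)` instead of `kmer` means the strand a read came from does not affect which entries are set.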

ADD REPLY
3.7 years ago
patelk26 ▴ 320

FASTA is a text-based format for representing either nucleotide or peptide sequences. In this case, the sequences are nucleic acid sequences. It is not uncommon to find that, in many organisms, a significant fraction of the genomic DNA is highly repetitive. In human DNA, more than two-thirds of the sequence has been estimated to consist of repetitive elements.
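
To make the format concrete: each FASTA record is a header line starting with ">" followed by one or more lines of sequence. A minimal Python parser sketch, where the function name `parse_fasta` and the example records are made up for illustration:

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)

example = """>read1 hypothetical header
ACGTACGT
ACGT
>read2
TTGCA
"""

records = list(parse_fasta(example.splitlines()))
print(records)
```

For real work, an established parser such as Biopython's `SeqIO` is a safer choice than a hand-written one.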

ADD COMMENT

aaaah okay. I just added a question (#2). Do you know anything about that?

ADD REPLY
