Question

A layman trying to understand what a FASTA file is

0

3.7 years ago

GoldenRetriever ▴ 40

I am a student in Computer Science and want to understand what a FASTA file is.

I have downloaded: blood SRR403015 393663256 (not that I know what it is) from: https://trace.ncbi.nlm.nih.gov/Traces/sra/?view=search_seq_name&exp=SRX118102&run=&m=search&s=seq.

I am going to construct a data structure that holds the sequences in that file. However, I noticed that there are duplicates. For example, the sequence "TTTCGAAGCATCTTTTGGGCAAACTTCTTTCTCAGGCGCTTGATCTTCA" occurs 11 times. If this holds for every sequence, the file has a lot of duplicates!

Why does the file contain duplicates?
Can I delete duplicates, or are they needed to know how many times a specific gene occurs in the file? My data structure is going to be used for queries like "Is gene X in here?".
What kind of sequences are these?

I hope you can help me with the above. Thanks in advance.

RNA-Seq gene next-gen-sequencing • 1.4k views

Updated 21 months ago by Ram 44k • written 3.7 years ago by GoldenRetriever ▴ 40

score 3 · Answer 1 · 2021-03-17

3

3.7 years ago

ATpoint 85k

What you have there is an RNA-seq dataset. The common raw data format for that would be FASTQ, not FASTA. There is plenty of documentation on the web that explains what RNA-seq, FASTA and FASTQ are, and what you can do with them. I am not going to write all of that down here. For starters, you should get familiar with RNA-seq, e.g.:

https://www.annualreviews.org/doi/abs/10.1146/annurev-biodatasci-072018-021255

https://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html

...plus the plentiful material available online. No, do not remove duplicates. Please google "duplicates + RNA-seq" once you understand what the files are; this has been discussed many times before. To make the data "usable" you have to align the reads to a genome or transcriptome, which is covered in the linked articles.

In short:

1) Why duplicates: because the same piece of (c)DNA has been sequenced multiple times. This can be either a true duplicate or a PCR artifact.

2) It is slightly more complicated than "is my gene in there"; I think that will become clear after reading the linked articles.

3) These are RNA-seq reads.
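To illustrate point 1, here is a minimal sketch that counts how often each read sequence occurs instead of deleting the repeats. It assumes a plain FASTQ layout of exactly four lines per record (header, sequence, `+`, qualities); the function name `fastq_sequences` and the toy records are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator


def fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each four-line FASTQ record."""
    for i, line in enumerate(lines):
        # Assumed record layout: @header, sequence, +, qualities
        if i % 4 == 1:
            yield line.strip()


# Hypothetical toy records; a real file would be read with open(path).
toy_fastq = """@read1
ACGTACGT
+
IIIIIIII
@read2
TTGCATGC
+
IIIIIIII
@read3
ACGTACGT
+
IIIIIIII
""".splitlines()

# Keep the multiplicity of every sequence rather than discarding repeats.
counts = Counter(fastq_sequences(toy_fastq))
for seq, n in counts.most_common():
    print(seq, n)
```

Keeping the counts means the information about how often something was sequenced is still there if a later analysis needs it.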

3.7 years ago by ATpoint 85k

0

The articles are a bit too technical for me to understand them fully. However, allow me to ask the following:

The file I linked contains lots of short RNA strings (around 70 bp). Why are there so many short sequences?
Most importantly: let's say I build a large data structure (DS) that contains 10 files on blood (like the one I linked to). Could I then query my DS for, say, RNA from a mouse? And what happens if I extend my DS to also hold files on brain RNA, and so on?

3.7 years ago by GoldenRetriever ▴ 40

0

You have 70 bp strings because that is what was sequenced. You start with long RNA molecules, which are fragmented during library preparation (a library is a collection of smaller pieces) and then sequenced. Since these fragments come from longer molecules, many of them will match a longer reference in an overlapping fashion (setting aside splicing, if you know what that is).
It is probably more efficient to build a database of the reference RNA (e.g. a suffix trie or a similar index structure) and then align your strings against it, which is what most aligners do. There are plenty of NGS aligners, so take a look at what is available first.
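As a rough sketch of the "index the reference, then look up the reads" idea: the code below uses a plain k-mer dictionary and exact matching only. That is a deliberate simplification and not how real aligners work internally (they use compressed indexes and tolerate mismatches). The reference string, the reads and the function names are all made up.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def locate_read(read, reference, index, k):
    """Return reference positions where the whole read matches exactly."""
    hits = []
    # Use the first k-mer of the read as a seed, then verify the full read.
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


reference = "ATGGCGTACGTTAGCCGATAGGCTA"  # made-up "transcript"
k = 4
index = build_kmer_index(reference, k)

# Two overlapping fragments of the reference and one unrelated string.
reads = ["GCGTACGTTA", "CGTTAGCCGA", "TTTTTTTTTT"]
for read in reads:
    print(read, locate_read(read, reference, index, k))
```

The first two reads land at positions 3 and 8, so they overlap on the reference, which is the overlapping pattern described above.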

3.7 years ago by GenoMax 147k

0

OK. I am building something called a Sequence Bloom Tree. The idea is to take an experiment, compute all of its k-mers, and hash them into a Bloom filter. What I do is take the file, concatenate all sequences (because the file is divided into many sequences, as mentioned), and then extract all k-mers for the Bloom filter. Am I allowed to concatenate the sequences into one big sequence and then split it into k-mers?
Is the file I linked to one experiment, or is it many experiments in one file?

3.7 years ago by GoldenRetriever ▴ 40

0

If you concatenate the sequences you're creating artificial k-mers that aren't present in the data. You probably don't want to do that. The file you linked to is a part of a single experiment (it's one of presumably many samples).
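A small sketch of why concatenation is a problem, using two made-up reads and k=4 (the helper name `kmers` is hypothetical):

```python
def kmers(seq, k):
    """Return the set of all k-mers in a single sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


reads = ["ACGTAC", "GGATCC"]  # made-up reads
k = 4

# k-mers extracted read by read: only k-mers that were really sequenced.
per_read = set().union(*(kmers(r, k) for r in reads))

# k-mers extracted after gluing the reads together.
concatenated = kmers("".join(reads), k)

# Everything in this set spans the junction between the two reads.
artificial = concatenated - per_read
print(sorted(artificial))
```

Each junction can contribute up to k-1 k-mers that were never sequenced, so extracting k-mers one read at a time avoids the issue.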

3.7 years ago by Devon Ryan 104k

0

I am following this article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4804353/, where the author creates k-mers (k=20) from all the files he refers to. Those files are just experiments, and the file I linked is actually from the same list of files he uses. What I have assumed is that he concatenates the files into one big RNA string and then produces the k-mers. Those k-mers are then hashed into a Bloom filter, and the Bloom filter is inserted into a binary tree.

He says something about canonical k-mers (page 5, "Building bloom filters").
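For reference, a canonical k-mer is usually taken to be the lexicographically smaller of a k-mer and its reverse complement, so that a read and its reverse-strand copy produce the same key. The sketch below shows that idea with a toy Bloom filter built on `hashlib`. It is not the article's implementation: the class `BloomFilter`, its parameters, the tiny k=5 and the reads are assumptions for illustration only.

```python
import hashlib

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


class BloomFilter:
    """Toy Bloom filter; sizes here are illustrative, not tuned."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def add_reads(bloom, reads, k):
    """Insert canonical k-mers read by read (no concatenation)."""
    for read in reads:
        for i in range(len(read) - k + 1):
            bloom.add(canonical(read[i:i + k]))


reads = ["ACGTTGCA", "TTGACCAG"]  # made-up reads
bloom = BloomFilter(num_bits=1024, num_hashes=3)
add_reads(bloom, reads, k=5)

# "AACGT" is the reverse complement of "ACGTT", which occurs in the first read.
print(canonical("AACGT") in bloom)
```

A Bloom filter can report false positives but never false negatives, so a k-mer that was inserted is always reported as present.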

3.7 years ago by GoldenRetriever ▴ 40

score 2 · Answer 2 · 2021-03-17

2

3.7 years ago

patelk26 ▴ 320

FASTA is a text-based format for representing either nucleotide sequences or peptide sequences. In this case, the sequences are nucleic acid sequences. It is not uncommon to find that, in many organisms, a significant fraction of the genomic DNA is highly repetitive. In human DNA, by some estimates, over two-thirds of the sequence consists of repetitive elements.
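To make the format concrete, here is a minimal parsing sketch. It assumes the usual layout: a header line starting with `>` followed by one or more sequence lines. The function name `parse_fasta` and the toy records are made up.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            # A sequence may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical toy input; a real file would be read with open(path).
toy_fasta = """>read_1 hypothetical description
ACGTACGT
TTGA
>read_2
GGCC
""".splitlines()

records = list(parse_fasta(toy_fasta))
for header, sequence in records:
    print(header, sequence)
```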