Aligner logic
2.1 years ago
Lenny186 • 0

Hi,

I'm new to the genomics field and I have a few questions. As I understand it, an aligner (STAR, RSEM or TopHat) is a tool that aligns fragments from an RNA-seq experiment. Also as I understand it, these nucleotide fragments, or reads, are usually 63 bases long.


For example, according to the following command line :

STAR --runMode genomeGenerate \
     --genomeDir GRCh38.79.chrom1 \
     --genomeFastaFiles genome/Homo_sapiens.GRCh38.dna.chromosome.1.fa \
     --sjdbGTFfile gtf/Homo_sapiens.GRCh38.79.chrom1.gtf \
     --sjdbOverhang 62

The downloaded FASTA file looks like this (screenshot: fastaFile).

Q1: Each line is 60 characters long and does not look like fragments. I understand that the GTF file helps annotate exons and improves precision. But with this command, what are we aligning, and where is the reference genome?

Q2: One other question, about the BAM file (screenshot: bam), which is a compressed form of the SAM output of the aligner. Does one line represent one sample? If so, how can we be sure that a read belongs to one specific sample and not to another?

Q3: Finally, is there any useful graph of the pipeline with input/output and software used?

I will be grateful for any piece of information that could help me. Thanks a lot. Lenny

2.1 years ago
ATpoint 85k

Take a cell. It contains mRNA, that is, a very large number of RNA molecules transcribed from its genomic template. You chop all of these into pieces and put them into a sequencer. The sequencer returns reads, which represent the sequences of these chopped pieces.

An aligner maps these reads back to a reference; in the case of STAR that is commonly the genome. Since transcripts in eukaryotes are often spliced, the GTF file helps STAR know where in the genome the introns are located so that it can bridge those gaps. That is what is meant by a splice-aware aligner.

What you see in the BAM screenshot are the individual reads, that is, the sequences of the chopped transcripts shown at their location in the genome.

A sample is typically a single RNA extraction. Say you have a mouse and take out the liver, make RNA from it and sequence that: this is a sample. A sample, or rather the sequencing library (that is what you pipette into the sequencer), can be sequenced several times to get more reads. That would be a technical replicate, and these are commonly merged even before aligning, so you would cat the fastq files together to get a single BAM file.
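As a minimal sketch of that merging step, assuming two runs of the same library: the file names and the toy reads below are made up purely for illustration, and real fastq files would of course come from the sequencer.

```shell
# Hypothetical example: two sequencing runs of the same liver library.
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/liver_run1.fastq"
printf '@read2\nTTGA\n+\nIIII\n' > "$workdir/liver_run2.fastq"

# Concatenate both runs into a single fastq file for this sample.
cat "$workdir/liver_run1.fastq" "$workdir/liver_run2.fastq" \
  > "$workdir/liver_merged.fastq"

# Each fastq record spans four lines, so reads = lines / 4.
merged_lines=$(wc -l < "$workdir/liver_merged.fastq")
merged_reads=$((merged_lines / 4))
echo "reads in merged file: $merged_reads"
```

Gzip-compressed fastq files can be concatenated with cat in the same way, since concatenated gzip streams still form a valid gzip file.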

What you see in the screenshot with the many N's is the human genome, that is, the plain nucleotide content. N's are present where the sequence is not known for that particular part, which often happens for repetitive regions, centromeres and telomeres. It is common to see N's at the beginning of a chromosome. Sixty characters per line is just a convention that is easy on the eye; other FASTA files (that is the name of this format) may have more characters per line, and in principle you can have an arbitrary number of letters (ATCGN) on a single line, so nothing to worry about.

Your STAR --runMode genomeGenerate command builds an index of this genome. This is not the alignment; it is a preparation step you have to do once per genome. The alignment is then the next step, using this index to map the reads in the fastq files.
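To illustrate that the line width of a FASTA file carries no meaning, here is a small sketch with a made-up toy sequence (chrToy is not a real chromosome) written once with short lines and once on a single line. Counting the bases gives the same answer either way.

```shell
workdir=$(mktemp -d)

# The same toy sequence, wrapped at 4 characters per line and unwrapped.
printf '>chrToy\nNNNN\nACGT\nAC\n' > "$workdir/wrapped.fa"
printf '>chrToy\nNNNNACGTAC\n' > "$workdir/unwrapped.fa"

# Sum the length of every line that is not a header (headers start with ">").
count_bases() {
  awk '!/^>/ { total += length($0) } END { print total + 0 }' "$1"
}

wrapped_len=$(count_bases "$workdir/wrapped.fa")
unwrapped_len=$(count_bases "$workdir/unwrapped.fa")
echo "wrapped: $wrapped_len bases, unwrapped: $unwrapped_len bases"
```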

Does that make sense to you?

As for Q3, which "pipeline" do you refer to?


Thanks for this reply. Let's say only from sequencing to alignment; the logic as I understand it is this:

Sequencer (FASTA) -> STAR / generation of index (BAI) -> STAR / alignment (BAM)

If so, the FASTA file should not contain a full sequence like the one I showed with the multiple N's: these are not fragments? The purpose is to have these fragments in the FASTA file so that they can be aligned, with the BAM as output. Index generation is like some statistics that help match fragments faster. The BAM file is the aligned sequence, and it is the output.

For screenshot 2: each line represents a sample, so several mRNA extractions, right? Then how do we know which read belongs to which sample?

Sorry this is new to me :) and thank you!


No, it's sequencer (producing fastq files with reads), then alignment. The genome is something you download from somewhere, such as GENCODE or NCBI; you do not have to create it yourself. That is the FASTA file you need. Your actual sequencing run produces fastq files with reads to align against that existing genome.

No, each line is a read, not a sample. Everything in the same BAM is from the same sample.
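A small sketch of what "each line is a read" looks like in practice, using a toy SAM file (the uncompressed text form of a BAM). The read names, positions and the file name sample1.sam are invented for illustration; header lines start with "@" and every other line is one alignment record.

```shell
workdir=$(mktemp -d)

# Toy SAM file: one header line plus three alignment lines (tab-separated).
{
  printf '@SQ\tSN:1\tLN:248956422\n'
  printf 'readA\t0\t1\t14362\t255\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'readB\t16\t1\t14501\t255\t4M\t*\t0\t0\tTTGA\tIIII\n'
  printf 'readC\t0\t1\t15020\t255\t4M\t*\t0\t0\tGGCA\tIIII\n'
} > "$workdir/sample1.sam"

# Count the lines that are not header lines: one per alignment record.
alignment_lines=$(grep -vc '^@' "$workdir/sample1.sam")
echo "alignment records in sample1.sam: $alignment_lines"
```

Strictly speaking a line is one alignment record, so a read that maps to several places can appear on more than one line. On a real BAM file, samtools view -c gives the corresponding count.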


Oh, I was totally confused! Now it's clearer. Thank you so much :)


The .bai is an index of the .bam file. It helps software like IGV jump from place to place in the .bam file easily. That is a different thing from the index STAR needs, which is an index of the genome. STAR uses that genome index to quickly map reads to the genome.
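To keep the two kinds of index apart, here is a rough dry-run sketch of the three steps. The commands are only printed, not executed. The genome paths come from the question; sample1.fastq and the sample1_ prefix are hypothetical names, and a real run would likely need extra options (threads, compressed input and so on).

```shell
genome_dir=GRCh38.79.chrom1

# 1. Genome index for STAR: built once per genome from the FASTA and GTF.
genome_index_cmd="STAR --runMode genomeGenerate --genomeDir $genome_dir --genomeFastaFiles genome/Homo_sapiens.GRCh38.dna.chromosome.1.fa --sjdbGTFfile gtf/Homo_sapiens.GRCh38.79.chrom1.gtf --sjdbOverhang 62"

# 2. Alignment: STAR uses the genome index to map one sample's reads.
#    sample1.fastq and the sample1_ prefix are hypothetical.
align_cmd="STAR --genomeDir $genome_dir --readFilesIn sample1.fastq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix sample1_"

# 3. BAM index: samtools writes a .bai next to the sorted BAM, for IGV and similar tools.
bam_index_cmd="samtools index sample1_Aligned.sortedByCoord.out.bam"

# Print the commands instead of running them (dry run).
printf '%s\n' "$genome_index_cmd" "$align_cmd" "$bam_index_cmd"
```

Step 1 is needed once per genome, while steps 2 and 3 are repeated for every sample.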

2.1 years ago

Q1: Each line is 60 characters long and does not look like fragments. I understand that the GTF file helps annotate exons and improves precision. But with this command, what are we aligning, and where is the reference genome?

Well, yeah, every line has 60 bases. It's typical for FASTA files to have line breaks at 60-base intervals.

The reference genome is right there. The sequence at the beginning and end of chromosomes is highly repetitive, and we don't really know exactly what it is, so it is totally normal for the first several kb, or on some chromosomes several Mb, to be N's.

IGV by default does not color every base like that. Usually it only colors bases which differ from the reference. Are you positive that the file you are using in IGV is exactly the same as what you aligned to?


Thanks a lot for these clarifications! It took me some time to digest all of this. An alignment display parameter had been modified in IGV, which is why I got the colorful representation.
