Hi,
I'm new to genomics and I have a few questions. As I understand it, an aligner (STAR, RSEM, or TopHat) is a tool that aligns fragments from an RNA-seq experiment. These nucleotide fragments, or reads, are usually 63 bases long.
For example, take the following command line:
STAR --runMode genomeGenerate \
  --genomeDir GRCh38.79.chrom1 \
  --genomeFastaFiles genome/Homo_sapiens.GRCh38.dna.chromosome.1.fa \
  --sjdbGTFfile gtf/Homo_sapiens.GRCh38.79.chrom1.gtf \
  --sjdbOverhang 62
The downloaded FASTA file looks like this:
Q1: Each line is 60 characters long and does not look like fragments. I understand that the GTF file annotates exons and helps with precision. But with this command, what are we aligning, and where is the reference genome?
Q2: Another question, about the BAM file, which is the compressed form of the SAM output from the aligner: does one line represent one sample? If so, how can we be sure that a read belongs to a specific sample and not to another?
Q3: Finally, is there a useful diagram of the pipeline showing the inputs, outputs, and software used?
I would be grateful for any information that could help. Thanks a lot. Lenny
Thanks for this reply. Considering only the steps from sequencing to alignment, the logic as I understand it is:
Sequencer (FASTA) -> STAR / index generation (BAI) -> STAR / alignment (BAM)
If so, the FASTA file should not contain a full sequence like the one I showed with the many Ns, since those are not fragments? The purpose is to have the fragments in the FASTA file so that they end up aligned in the BAM output. Index generation is like a set of statistics that helps match fragments faster. The BAM file holds the aligned sequences and is the output.
For screenshot 2: each line represents a sample, so several mRNA extractions, right? Then how do we know which read belongs to which sample?
Sorry this is new to me :) and thank you!
No, it's the sequencer (producing a FASTQ file with reads), then alignment. The genome is something you download, for example from GENCODE or NCBI; you do not have to create it yourself. That is the FASTA file you need. Your actual sequencing run produces FASTQ files with reads to align against that existing genome.
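To make the FASTA/FASTQ distinction concrete, here is a small sketch using toy files (the names `chr_toy`, `read1`, `read2` and the sequences are made up for illustration, not taken from any real data). It shows that a reference FASTA is one long sequence wrapped over fixed-width lines, while a FASTQ holds one four-line record per read:

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Toy reference: ONE sequence (one ">" header) wrapped over several lines.
# A real genome FASTA wraps at a fixed width (60 in the file from the question),
# so the lines are just a layout detail, not fragments.
printf '>chr_toy\nNNNNNNNNNN\nACGTACGTAC\nGGCCTTAAGG\n' > "$workdir/genome.fa"

# Toy reads: each FASTQ record is four lines
# (@name, bases, "+", per-base quality string).
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n@read2\nGGCCTTAAGG\n+\nIIIIIIIIII\n' > "$workdir/reads.fastq"

# Count sequences in the FASTA by counting header lines.
fasta_records=$(grep -c '^>' "$workdir/genome.fa")
# Count bases by dropping headers and line breaks.
fasta_bases=$(grep -v '^>' "$workdir/genome.fa" | tr -d '\n' | wc -c | tr -d ' ')
# Count reads in the FASTQ: total lines divided by four.
fastq_lines=$(wc -l < "$workdir/reads.fastq" | tr -d ' ')
fastq_reads=$(( fastq_lines / 4 ))

echo "reference sequences: $fasta_records"
echo "reference bases:     $fasta_bases"
echo "reads:               $fastq_reads"
```

The same counting commands should work on a real chromosome FASTA and an uncompressed FASTQ, although compressed files would need to be piped through `zcat` or similar first.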
No, each line is a read, not a sample. Everything in the same BAM is from the same sample.
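As a rough illustration of "each line is a read", the sketch below builds a tiny SAM file by hand (BAM is the same content, compressed). The read names, positions, read group ID `run1`, and sample name `sample_A` are all invented for the example. The sample, when it is recorded at all, lives in the `SM` field of the `@RG` header line rather than in separate lines:

```shell
#!/bin/sh
set -eu

sam=$(mktemp)
trap 'rm -f "$sam"' EXIT

# Header lines start with "@". The @RG (read group) line can name the sample (SM).
printf '@HD\tVN:1.6\tSO:coordinate\n' > "$sam"
printf '@SQ\tSN:1\tLN:248956422\n' >> "$sam"
printf '@RG\tID:run1\tSM:sample_A\n' >> "$sam"

# Alignment lines: one per alignment of a read, 11 mandatory tab-separated
# fields followed by optional tags (here RG:Z: points back to the read group).
printf 'read1\t0\t1\t14363\t255\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tRG:Z:run1\n' >> "$sam"
printf 'read2\t16\t1\t14401\t255\t10M\t*\t0\t0\tGGCCTTAAGG\tIIIIIIIIII\tRG:Z:run1\n' >> "$sam"
printf 'read3\t0\t1\t15020\t255\t10M\t*\t0\t0\tTTGACCGATA\tIIIIIIIIII\tRG:Z:run1\n' >> "$sam"

# Every non-header line is an alignment record.
n_alignments=$(grep -vc '^@' "$sam")

# Pull the sample name out of the @RG header line.
sample=$(awk -F'\t' '/^@RG/ { for (i = 2; i <= NF; i++) if ($i ~ /^SM:/) print substr($i, 4) }' "$sam")

echo "alignment lines: $n_alignments"
echo "sample in header: $sample"
```

On a real BAM, `samtools view -c file.bam` counts alignment records and `samtools view -H file.bam` prints the header. Note that a read mapping to several places can produce more than one line, so the line count can exceed the number of reads.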
Oh, I was totally confused! Now it's clearer. Thank you so much :)
The .bai is an index of the .bam file. It helps software like IGV jump from place to place in the .bam file easily. STAR needs a separate index of the genome, which it uses to map reads to the genome quickly.
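The two indexes come from different steps, which a short dry-run script may make clearer. This is only a sketch: it prints the commands rather than running them, the genome paths are copied from the command in the question, and `sample_A.fastq` and the `sample_A_` prefix are hypothetical names. Set `DRY_RUN=0` to execute, which assumes STAR and samtools are installed and the input files exist:

```shell
#!/bin/sh
set -eu

# Print each command instead of running it unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s ' "$@"
    printf '\n'
  else
    "$@"
  fi
}

genome_dir=GRCh38.79.chrom1   # genome index directory, as in the question
reads=sample_A.fastq          # hypothetical FASTQ from the sequencer
prefix=sample_A_              # hypothetical output prefix

pipeline() {
  # 1) Genome index: built once from the downloaded FASTA and GTF.
  run STAR --runMode genomeGenerate \
    --genomeDir "$genome_dir" \
    --genomeFastaFiles genome/Homo_sapiens.GRCh38.dna.chromosome.1.fa \
    --sjdbGTFfile gtf/Homo_sapiens.GRCh38.79.chrom1.gtf \
    --sjdbOverhang 62

  # 2) Alignment: reads (FASTQ) are mapped against the genome index.
  run STAR --runMode alignReads \
    --genomeDir "$genome_dir" \
    --readFilesIn "$reads" \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix "$prefix"

  # 3) BAM index (.bai): built from the sorted BAM, used by viewers such as IGV.
  run samtools index "${prefix}Aligned.sortedByCoord.out.bam"
}

pipeline
```

Step 1 only needs to be repeated when the genome or annotation changes; steps 2 and 3 run once per sample.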