If I am in a right way in RNA-seq
2
1
Entering edit mode
5.2 years ago
zizigolu ★ 4.3k

Hi

I have been given a bunch of tumour and normal samples (fastq)

By googling I think I should do as below

Alignment by STAR

STAR --genomeDir ./STAR_hg38_Genome --readFilesIn $fastq1 $fastq2

SortSam

java -Xmx4g -jar ./picard.jar SortSam I=Aligned.out.sam O=aligned.sorted.bam SO=queryname

Bam 2 Sam

samtools view -h -o aligned.sam ligned.sorted.bam

GeneCount

htseq-count --stranded=no -q aligned.sam ./gtf > counts.txt

Please some body inspect my workflow and kindly tell me if I am doing something useless or I can improve my results by changing some steps and tools

Thank you for any help

RNA-Seq alignment • 1.8k views
ADD COMMENT
1
Entering edit mode

Hi, in case you data is still unprocessed, I would trim the reads for adapter seqeunces and low-quality-called bases using for example the tool trimmomatic.

ADD REPLY
9
Entering edit mode
5.2 years ago
Thibault D. ▴ 700

Hi,

Your work looks fine to me, however, let me point some improvements:

STAR can give you coordinate-sorted bam-compressed alignement file, which makes your picard + samtools not necessary. Look at the option --outSAMtype BAM SortedByCoordinate.

You're working on cancer data, which will contain possible higher number of SNP//Fusions. The --twopassMode Basic option in STAR can help you to rescue some odd-but-present junctions in your samples.

I also recommend featurecount in place of HTSeq-count: reasons are well given in that blog article

In any case, your pipeline will work fine.

ADD COMMENT
0
Entering edit mode

Sorry I am getting this error when trying your kindly suggestion

/var/spool/torque/mom_priv/jobs/7474856.blue101.SC: line 3: 18398 Killed                  

STAR --genomeDir ./STAR_hg38_Genome --readFilesIn ./1.fastq ./12.fastq --outSAMtype BAM SortedByCoordinate
ADD REPLY
1
Entering edit mode

This error does not come from STAR, but from your scheduler (Torque). Probably reservation issues, however I do not have enough information with your post.

ADD REPLY
0
Entering edit mode

Thank you I ran the command directly in terminal rather that submitting a job to cluster that says

Sep 05 14:22:43 ..... started STAR run
Sep 05 14:22:43 ..... loading genome
Sep 05 14:23:02 ..... started mapping
Sep 05 14:33:59 ..... started sorting BAM
Sep 05 14:34:02 ..... finished successfully

And gave me Aligned.sortedByCoord.out.bam with 4471 kb size

Do you think this is normal?

ADD REPLY
1
Entering edit mode

Look at the logging files and do not hesitate to run quality metrics tests with either samtools flagstats, or picard CollectAlignmentSummaryMetrics. Neither runtime, nor file size are reliable indicators of a "good" result.

ADD REPLY
0
Entering edit mode

This is one loge file

Started job on |    Sep 05 14:52:01
                         Started mapping on |   Sep 05 14:52:25
                                Finished on |   Sep 05 15:03:25
   Mapping speed, Million of reads per hour |   36.42

                      Number of input reads |   6677781
                  Average input read length |   148
                                UNIQUE READS:
               Uniquely mapped reads number |   302
                    Uniquely mapped reads % |   0.00%
                      Average mapped length |   131.25
                   Number of splices: Total |   2
        Number of splices: Annotated (sjdb) |   0
                   Number of splices: GT/AG |   1
                   Number of splices: GC/AG |   0
                   Number of splices: AT/AC |   0
           Number of splices: Non-canonical |   1
                  Mismatch rate per base, % |   7.12%
                     Deletion rate per base |   0.00%
                    Deletion average length |   0.00
                    Insertion rate per base |   0.00%
                   Insertion average length |   0.00
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   30095
         % of reads mapped to multiple loci |   0.45%
    Number of reads mapped to too many loci |   568
         % of reads mapped to too many loci |   0.01%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   99.42%
                 % of reads unmapped: other |   0.12%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

:(

ADD REPLY
1
Entering edit mode

% of reads unmapped: too short | 99.42%

You know where the problem is. I'll let you search what is your expected read size in STAR (index creation), then compare it to your actual read size (in your fastq file, or with FastQC).

ADD REPLY
0
Entering edit mode

Sorry now my job is running

Like the problem was that I was not assigning enough memory

Now my question please is; Does Aligned.sortedByCoord.out.bam output from the alignment is ready for featurecounts to extract the raw read counts? I mean do I still need to any sorting or conversion of the .bam file before read summarization?

ADD REPLY
0
Entering edit mode

Aligned.sortedByCoord is suitable for input into featureCounts. Note also that adding --quantMode GeneCounts to the STAR command will do just about what HTSeq-count does. You might as well do this, even if you plan to use feature Counts instead, if only as a check to make sure you know how your library is stranded.

ADD REPLY
0
Entering edit mode

Sorry I had added --outSAMtype BAM SortedByCoordinate at the end of code but again the output file named Aligned.out.sam

Is this output a .sam or .bam?

This is my code

STAR --genomeDir ./STAR_hg38_Genome --readFilesIn ./1.fastq ./2.fastq --outSAMtype BAM SortedByCoordinate
ADD REPLY
2
Entering edit mode
5.2 years ago

I would expect you to need to do some methods testing for each project.

It is especially important for you to take your time if you are new to RNA-Seq. You need to consider whatever you do as "initial" results and carefully critically assess your data before moving to publication.

I think the steps that you described are probably OK for some of the early processing steps for gene expression analysis.

However, I think there is at least one typo "ligned.sorted.bam" and I think you probably want to sort your alignment (TopHat2 alignments are already sorted by position, but I thought you needed to sort the STAR alignment).

You can use samtools sort or Picard AddOrReplaceReadGroups to sort your .bam (and add Read Groups, for the latter), although the exact command used for samtools will vary with the version.

Also, you are describing analysis for paired-end data (and that should give you somewhat different results of htseq-count versus featureCounts, if you sort by name). For this, you may already be OK. However, I think you will still want a position sorted .bam file for visualization in IGV (and other analysis).

ADD COMMENT

Login before adding your answer.

Traffic: 1549 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6