Hi-C data alignment with STAR
5.8 years ago
Dataminer ★ 2.8k

Dear community,

Does anyone have experience aligning Hi-C data with the STAR aligner? It would be very kind of you to share the syntax you used, or to have a look at mine and point out corrections (chimeric read alignments etc.).

As the two reads of each pair (R1 and R2) need to be mapped separately, I am using the following command:

STAR --genomeDir /data/genomes/ --readFilesIn /data/raw/HiREAD/HiC/T-Rep1_R2_001.fastq.gz --readFilesCommand gunzip -c --alignIntronMax 1 --alignIntronMin 2 --outFilterMultimapNmax 1 --runThreadN 8 --outFileNamePrefix T_Rep1_L2

Kindly let me know, if I am missing something here :)

bwa takes a lot of time .... a lot

Thank you


What error did you get? gunzip -c needs to be inside quotes (alternatively, you may use zcat). Also, did you check the section on chimeric alignments (Section 5) in the STAR manual? http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STAR.posix/doc/STARmanual.pdf
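
As a rough illustration of the zcat variant, here is a minimal sketch that maps each mate separately in single-end mode, reusing the flags from the question. The index path, sample file names, and the `run`/`map_mate` helpers are placeholders of mine, not something from the thread. By default it only prints the commands, so it can be checked before anything is launched.

```shell
#!/bin/sh
# Sketch only: paths and file names are placeholders, not from the thread.
GENOME_DIR="${GENOME_DIR:-/path/to/star_index}"
THREADS="${THREADS:-8}"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Map one mate in single-end mode. $1 = gzipped FASTQ, $2 = output prefix.
map_mate() {
  run STAR --genomeDir "$GENOME_DIR" \
    --readFilesIn "$1" \
    --readFilesCommand zcat \
    --alignIntronMin 2 --alignIntronMax 1 \
    --outFilterMultimapNmax 1 \
    --runThreadN "$THREADS" \
    --outFileNamePrefix "$2"
}

map_mate sample_R1.fastq.gz sample_R1_
map_mate sample_R2.fastq.gz sample_R2_
```

Run it once as is to inspect the two command lines, then with `DRY_RUN=0` once the paths point at real files.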


Hi Santosh,

This is working fine with no errors. I will have a look at the link, thanks :)


BWA takes a long time because Hi-C datasets are large. What is your definition of "long"? How many CPUs did you use (please post the full command line), and what is your hardware? There are probably things one can optimize if you share your code.


Definition: 5 days of processing on an HPC cluster, with 12 threads and 132 GB of RAM. I understand that Hi-C data is huge and will take time, but 5 days is a lot :)


How many reads are in the dataset?
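
If it helps to answer that quickly, a small sketch for counting reads in a gzipped FASTQ file follows. It assumes the usual four lines per record (no wrapped sequence lines), and `count_reads` is just a helper name chosen here.

```shell
#!/bin/sh
# Count reads in a gzipped FASTQ file.
# Assumes standard 4-line records: header, sequence, '+', qualities.
count_reads() {
  lines=$(gzip -dc "$1" | wc -l)
  echo $((lines / 4))
}

# Usage (file name is a placeholder):
#   count_reads sample_R1.fastq.gz
```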

5.8 years ago

For aligning Hi-C reads, you could try this command line with BWA (12 cores):

bwa mem -A1 -B4 -E50 -L0 -t 12 bwa_index.fa sequences_R1.fastq
bwa mem -A1 -B4 -E50 -L0 -t 12 bwa_index.fa sequences_R2.fastq
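
bwa mem writes SAM to standard output, so in practice each mate is usually piped into a BAM file. A minimal sketch of that, assuming samtools is installed; the output names `R1.bam`/`R2.bam` and the `map_mate` helper are my own placeholders. It prints the pipelines by default rather than running them.

```shell
#!/bin/sh
# Sketch only: reference and output names are placeholders.
REF="${REF:-bwa_index.fa}"
THREADS="${THREADS:-12}"

# Map one mate and store the result as BAM. $1 = FASTQ, $2 = output BAM.
# Prints the pipeline by default; set DRY_RUN=0 to execute it.
map_mate() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "bwa mem -A1 -B4 -E50 -L0 -t $THREADS $REF $1 | samtools view -b -o $2 -"
  else
    bwa mem -A1 -B4 -E50 -L0 -t "$THREADS" "$REF" "$1" \
      | samtools view -b -o "$2" -
  fi
}

map_mate sequences_R1.fastq R1.bam
map_mate sequences_R2.fastq R2.bam
```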

This is what is used in the Snakepipes pipelines: https://github.com/maxplanck-ie/snakepipes/

Maybe your BWA alignment took longer than it should because of different parameters? Snakepipes uses 15 cores for that command and usually takes less than a day to run. Even for deeply sequenced samples it took far less than 5 days.

With that command-line I don't see the point of using STAR for mapping Hi-C reads.


Hi, I was using exactly the same command: bwa mem -A1 -B4 -E50 -L0 -t 12 ref.fa file1.fastq. By the way, I am working with a polyploid plant genome.


For STAR I found this from the HIPPIE package for Hi-C data, perhaps it can help:

https://github.com/yihchii/hippie/blob/master/cmd/starMappingToBam.sh

So maybe you could grab some additional parameters from those? --outFilterMultimapNmax 1 and --alignIntronMax 1 you already had.

 --outFilterMultimapNmax 1 \
 --outFilterMismatchNoverLmax 0.04 \
 --scoreGapNoncan 0  --scoreGapGCAG 0  --scoreGapATAC 0 \
 --alignIntronMax 1 \
 --chimSegmentMin ${ChimSegMin} \
 --chimScoreJunctionNonGTAG 0 

After further research I saw that STAR, when run with several CPUs, will scramble the read order. Hi-C tools usually expect properly ordered files when building matrices, I think so that the separately mapped mate files can be paired correctly.

For that you could, for example, use ReorderSam from Picard. Of course this needs testing, and it may depend on the Hi-C suite you use afterwards to build matrices (HiCExplorer, for example, takes R1.bam and R2.bam as input to hicBuildMatrix; I believe other suites may take paired BAM files as input).
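
One caveat, as far as I know: Picard's ReorderSam reorders records to match the contig order of a reference, not by read name. A different technique that puts both mate files into the same read-name order is `samtools sort -n`, sketched below. File names and the helper functions are placeholders, and whether name-sorted input is acceptable depends on the downstream tool, so treat this as something to test rather than a recipe. Commands are printed by default.

```shell
#!/bin/sh
# Sketch only: input and output file names are placeholders.
THREADS="${THREADS:-8}"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Sort alignments by read name. $1 = input SAM/BAM, $2 = output BAM.
name_sort() {
  run samtools sort -n -@ "$THREADS" -o "$2" "$1"
}

name_sort R1.bam R1.namesorted.bam
name_sort R2.bam R2.namesorted.bam
```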
