why reads align to other coordinates?
1
0
Entering edit mode
5.8 years ago
star ▴ 350

I have downloaded a specified region as Reference Genome from UCSC (because only this region is important for me) and aligned my reads against it using Bowtie2.

But I do not know why in my .bam file there are other regions with start position of 150864, 539750 and ... .

The Ref Genome (specified coordinate) that downloaded from UCSC.

hg19_dna range=chrX:146949598-147161084 5'pad=0 3'pad=0 strand=+ repeatMasking=none

2 first line of bam file (output of bowtie2)

D00823:89:HMHHMBCX2:1:2111:4641:60710   99  hg19_dna    150864  40  48M =   150946  130 CCTGCCTCAGCCTCCTGAGTAGCTGGGACTATAGGTGCCCGCCACCAT    DDHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    AS:i:-12    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:31C15C0    YS:i:-5 YT:Z:CP
D00823:89:HMHHMBCX2:2:1116:19382:32083  99  hg19_dna    539750  42  48M =   540015  313 CCTATGGTAGCTCTGGCCAGCGGAACTTCCTGCATCCATGGAGCAATG    DDIIHIHIIHIIIIFEHIIIIIIIHIIIIIIHHHIIIHIIIIGIIII?    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:48 YS:i:0  YT:Z:CP
RNA-Seq alignment bowtie2 align tophat • 2.1k views
ADD COMMENT
2
Entering edit mode

You've downloaded the chrX reference from position 146949598 to 147161084. This is your reference sequence. Bowtie2 does not know that you cropped the reference sequence, the reference start at 1 for the software. If you want the position on the chrX you have to take the full chrX sequence or to add 146949598 to every hit

ADD REPLY
0
Entering edit mode

Thanks @ Bastien for your reply.

you mean the first position in my data is 146949598 + 150864 = 147100462

ADD REPLY
2
Entering edit mode
D00823:89:HMHHMBCX2:1:2111:4641:60710   99  hg19_dna    147100462  40  48M =   147100544  130 CCTGCCTCAGCCTCCTGAGTAGCTGGGACTATAGGTGCCCGCCACCAT    DDHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    AS:i:-12    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:31C15C0    YS:i:-5 YT:Z:CP
D00823:89:HMHHMBCX2:2:1116:19382:32083  99  hg19_dna    147489348  42  48M =   147489613  313 CCTATGGTAGCTCTGGCCAGCGGAACTTCCTGCATCCATGGAGCAATG    DDIIHIHIIHIIIIFEHIIIIIIIHIIIIIIHHHIIIHIIIIGIIII?    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:48 YS:i:0  YT:Z:CP
ADD REPLY
0
Entering edit mode

but why my last start position when I add 146949598 to it is bigger than my Ref Genome range?

my last start position is 2390138 + 146949598 = 149339736

D00823:89:HMHHMBCX2:1:1201:1895:19391   147 hg19_dna    2390138 42  48M =   2389844 -342    CTGTTTTGTCGACACTGAAAACCTGTGTATGAAGTCACCGTCATCAAA    HIIIIHIIIIIIGIHIIIGIIIIIGIIIIIIIIIIIHIIIHIIIIIDD    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:48 YS:i:0  YT:Z:CP
ADD REPLY
0
Entering edit mode

Could you please give me the length of your reference sequence ?

ADD REPLY
0
Entering edit mode

it is 4231 lines. (if I did right: cat ref.fa | wc -l)

ADD REPLY
1
Entering edit mode

I was expecting the number of nucleotide in your reference sequence... Anyway, let's assume you have a fasta format with 60 character per line, minus the header 4230 * 60 = 253 800 bases

You've downloaded a sequence that is 211 486 bases long

How did you end up with a hit around 2 000 000 ??

Could you please try to download the full chrX and restart your alignment, this will avoid the misunderstanding

ADD REPLY
0
Entering edit mode

Thanks. I have done it and the head and tail of mt output are as below:

Head:

D00823:89:HMHHMBCX2:1:2114:9942:3900    147 X   2479520 42  48M =   2479476 -92 TGACAGGTGTGAGCCACCATGACTGGCCTGCAATGACACATTTGTATT    IIIIIIIIIIHIIIIIIIIIIIIIIIIIFFGIIIIIHIIGIIIIIIDD    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:48 YS:i:-6 YT:Z:CP
D00823:89:HMHHMBCX2:2:2111:10601:89000  163 X   3042754 42  48M =   3042949 243 TTTCTTTCCTTAACCTTAGGGTCCGTTTTAGTTGCTAAAGGGGCATTT    DDIIIIIIIIIIIIIIIHIIIFH1<FHIHHIIIHIIIIIIIIHHHHHI    AS:i:-3 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:23T24  YS:i:0  YT:Z:CP

tail:

D00823:89:HMHHMBCX2:2:2105:10080:21276  147 X   155118974   42  48M =   155118837   -185    CTGCTCTTAAGTCTGTTGTTAAGTTTCATGTTATAGTGCCTCGTTAGA    IIIIIIIIIIIHHIIIIIIIHIIIIIHIIIIIIIIIHIHHHHIIIIDD    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:48 YS:i:0  YT:Z:CP
D00823:89:HMHHMBCX2:1:2104:1703:56241   99  X   155125479   42  48M =   155125618   187 CCACATAGCAGGAGGTGAGTGGCCTATGAGCATTACCACCTGAGTTCT    CDHHIIIIIGHIIIIHIIIIIGIIGIIIIIIIHHHIGIIHHIGHIIIH    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:48 YS:i:0  YT:Z:CP

Now, would you please let me know how can I crop my specific regions?

ADD REPLY
1
Entering edit mode
samtools view input.bam "ChrX:3000000-3500000" > output.bam
ADD REPLY
0
Entering edit mode

Thanks.

i run it but I got errore :

samtools view uniqu_fmr.bam "ChrX:146949598-147161084" > output.bam
[main_samview] random alignment retrieval only works for indexed BAM or CRAM files.

Then I run it: samtools index uniqu_fmr.bam

but I got this error:

samtools index: "uniqu_fmr.bam" is in a format that cannot be usefully indexed
ADD REPLY
4
Entering edit mode
5.8 years ago

If you have a sam file

samtools sort uniqu_fmr.sam > uniqu_fmr.sorted.bam
samtools index uniqu_fmr.sorted.bam
samtools view uniqu_fmr.sorted.bam "ChrX:146949598-147161084" > uniqu_fmr_ChrX_146949598_147161084.bam

If you have a bam file change the first line to :

samtools sort -o uniqu_fmr.sorted.bam uniqu_fmr.bam
ADD COMMENT
0
Entering edit mode

Thanks @ Bastein, it works!

ADD REPLY
0
Entering edit mode

Please follow-up on your threads. If an answer was helpful, upvote it, if it solved the issue, mark it as accepted.

enter image description here

ADD REPLY

Login before adding your answer.

Traffic: 2020 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6