I need to extract the FASTA reads from a specific region from a chromosome in a BAM file format.
2
0
Entering edit mode
3.8 years ago
screadore ▴ 20

Hello everyone,

So I am working with Whole Genome Sequencing (WGS) data and have generated the following files: - BAM - VCF - BED - FASTQ

I need to extract a specific region of sequence reads to be able to run a BLAST analysis on it to compare to Hg19 to see how they align.

The Chromosome I need to extract is between Chr1, position 17349050 and Chr1, position 17349253.

Can anyone point me in the right direction? I've been working several days on this and can't figure it out because I've tried to use "Samtools view", but I'm on a Windows system and don't have access to the Ubuntu shell on the app store since my company restricts it.

I have an account on Galaxy, but can't find a way to do this on there. So any input will be greatly appreciated. Thank you.

genome sequencing gene alignment next-gen • 3.9k views
ADD COMMENT
0
Entering edit mode

I'm on a Windows system and don't have access to the Ubuntu shell on the app store since my company restricts it.

something is just wrong if your company ask you to work on windows for a bioinformatics job.

ADD REPLY
0
Entering edit mode

I've been trying to get a Linux OS for awhile lol

ADD REPLY
1
Entering edit mode
3.8 years ago

that should not be to difficult using galaxy.

  1. select samtools view from the tools menu
  2. indicate that you want 'filtered/subsampled' view
  3. then open 'configure filters' menu
  4. Filter by regions
  5. select 'manually define regions'
  6. in the box below it, fill in your region : chr1:17349050-17349253
  7. Execute
ADD COMMENT
0
Entering edit mode

Thank you! I tried it before but got confused with all of the options. I just did this and its running now. Thank you again.

ADD REPLY
0
Entering edit mode

How can I visualize the FASTA nucleotides so I can run them on blast?

ADD REPLY
0
Entering edit mode

When I clicked visualize data then "Manually edit text" it spit out this:

��BCo�[]�$�U����wA&H���P �B���?�Xڻ?�{���,� k����ݙ��=s�]R$""� 9 �/;��@����ٹ�v������[-���[���� l-����׮��������T~��v�Ǚ����/��ω�P+^��ρc���L|1 :إ��� �.�e�OA�f����]�����w�\��z� ��<��]���4��̭�Q��u��ͅ�M��.�D7��s���� �c�"��V,���;��Lj������;ڙ{Lj�.�����"P-�֞��9�Z�W�I-��_y�����:&�y�����W�{�������n���K/SV�zҬ�W��:7.�u�Ѻa��r'��M܉3qL�6�IX�<���� ��85�m{�gEZ΋�0��e^��.R�(�6)��,2s]�I[T�qXWU{8�M�o��+������.��5�|��9�7��M�r�k���Λݺmm8,<��p����"a�u �/�j����:�׬�L�d�ڲ��eB���(/��L˽4G+9I���'��)ث] �5���Ӷ���"��N�tE�pp���Ȋe޴��l�O �?��O�B�sŎr����٣���k%e��J�E�l�����K�#�Zt��Ď_}��k7��?k�� ��׎�,R�n)L�IJ��T+�G�@�׸ ���6���פ������e�3e�>�M>I��w�^|�6���X���̠�v���:�϶���g�d]-5������M���n�M��& ���{E���<٘t��>���l�4Y�A�0]� 4���=�<�l��L6�!����Ȋ]���Ɓw�i��2�uQ��/��=�}iځ$)0kVy�N�f��g���'fN`f&l'����l�ɟ�>5< *A< ��ؒ?�����:����2�� �? Ugɓz}�Y|�ڪ������2n��~����� +�~�c�*)�؀�v]l��+@�d��wt!�aR�$�� Ƽ�������R�&���� �.���eڂ;[Th��@CR�FW!��:ҙZ¯��

ADD REPLY
1
Entering edit mode

BAM is a BINARY file....

ADD REPLY
1
Entering edit mode

as Pierre Lindenbaum indicated, BAM is a binary format and you thus can't simply 'view' it.

To get the fasta version of it, select the samtools fastx from the galaxy tools menu, input is your subsampled bam file and output will be a fasta file.

ADD REPLY
1
Entering edit mode

Thank you! I just did this I hope it comes out correctly. Thank you for your help I greatly appreciate it.

ADD REPLY
0
Entering edit mode

When I did this it gave me this error: [M::bam2fq_mainloop] discarded 0 singletons [M::bam2fq_mainloop] processed 0 reads.

Any chance you know how I can fix this?

ADD REPLY
1
Entering edit mode

Do you have a sub-sampled BAM file that is not empty? Did instructions provided by @lieven work?

ADD REPLY
0
Entering edit mode

I'm not sure if it's empty or not. I'm not sure of what the issue is but these are all of the details about the run. If you or anyone could tell me what I did wrong and what I need to do I would greatly appreciate it as I'm not sure of what to do. Thank you!

Dataset Information:

Name    Samtools view on data 5: filtered alignments
Created Wednesday Feb 3rd 9:54:27 2021 UTC
Filesize    5.6 KB
Dbkey   hg38
Format  bam
File contents   contents

Tool Parameters:

SAM/BAM/CRAM data set   MyFile.bam
What would you like to look at? selected_reads
filter_config   
Filter by regions   text
Filter by regions   Chr1:17349050 Chr1:17349253
Filter by readgroup no
Filter by quality   Not available.
Filter by library   Empty.
Filter by number of CIGAR bases consuming query sequence    Not available.
Require that these flags are set    Nothing selected.
Exclude reads with any of the following flags set   Nothing selected.
Exclude reads with all of the following flags set   Nothing selected.
subsample_config    
Subsample alignment fraction
Downsampling factor 1.0
Seed for random number generator    Not available.
What would you like to have reported?   retained
Produce extra dataset with dropped reads?   False
adv_output  
Collapse backward CIGAR operation   False
Output format   bam
Reference data  n

Job Information:

Command Line    empty
Tool Standard Output    empty
Tool Standard Error 
[main_samview] region "Chr1:17349050 Chr1:17349253" specifies an unknown reference name. Continue anyway.
ADD REPLY
1
Entering edit mode

I think you may have an empty file (as in no reads but just the header).

[main_samview] region "Chr1:17349050 Chr1:17349253" specifies an unknown reference name. Continue anyway.

This is where your problem lies. How did you specify your interval? Like lieven.sterck specified? chr1:17349050-17349253. The chromosome name needs to match perfectly with your BAM for this to work and no spaces.

ADD REPLY
0
Entering edit mode

I'm unsure how can I figure this out?

ADD REPLY
1
Entering edit mode

Can you select samtools view from the tools menu and choose "Just header" under "What would you like to look at" and then "output format" --> "SAM". That should be text. Post a few lines here.

ADD REPLY
0
Entering edit mode

Yes sir thank you so much for the help. Here is the output:

1.QNAME 2.FLAG  3.RNAME 4.POS   5.MAPQ  6.CIGAR 7.MRNM  8.MPOS  9.ISIZE 10.SEQ  11.QUAL 12.OPT
@HD VN:1.4  SO:coordinate                                   
@SQ SN:chrM LN:16571                                    
@SQ SN:chr1 LN:249250621                                    
@SQ SN:chr2 LN:243199373                                    
@SQ SN:chr3 LN:198022430
ADD REPLY
1
Entering edit mode

That's header. It doesn't show that you have any reads. Sorry, but troubleshooting in Galaxy is like having a hand tied behind your back.

ADD REPLY
0
Entering edit mode

I did it on my BAM file. Could that be why?

ADD REPLY
1
Entering edit mode

SAM/BAM files need to have a valid header. So it looks like you have not managed to slice the file. Can you retry using @lieven's instructions and then use this exact specification for region chr1:17349050-17349253.

That is barely 200 bp. If you get nothing broaden the region out to 500 bp or a kb and see if that works. If you get nothing it is possible that there are no reads aligned in that region.

ADD REPLY
0
Entering edit mode

All I had to do was remove the space I had in the chr1:17349050-17349253 and it worked. Thank you so much!

ADD REPLY
1
Entering edit mode

yes indeed, sorry about that (fixed it in my original post)

Moreover, while the command will technically work now (I hope/assume) , there is a fair chance you still get an empty output file because (as @genomax pointed out) the region is quite small so there might just not be any reads aligned in that specific region

ADD REPLY
0
Entering edit mode

Thank you! So my only issue now is that the output gives me this:

>A01049:11:HNKJ7DMXX:2:1426:23773:21057
CCTACCGCCTGTTCCGCTGCCACACCATCATGAACTGCGTAGACGTTTGCCCCAAACACCTCAACCCCACCCGCGC

but a lot of them. How do I tell what the position is and stuff?

ADD REPLY
0
Entering edit mode

what you got now is a fasta file of the reads aligned in the region you specified.

well, you did talk about doing a blast analysis to the Hg19 genome? Doing that will give you the regions where they align.

Coming to think of it, they will likely align in the region you specified above (go figure :) ) . What are you actually looking for? you want to see how specifically they align (== the exact alignment) ?

ADD REPLY
0
Entering edit mode

Sounds like what you want to do is get a consensus from that region? You can do that with this method: Generating consensus sequence from bam file

It may be possible to do this on galaxy. Give it a try by looking at the steps mentioned in the tutorial.

ADD REPLY
1
Entering edit mode
3.8 years ago

I need to extract a specific region of sequence reads to be able to run a BLAST analysis on it to compare to Hg19 to see how they align.

Use bcftools consensus to combine the vcf with your reference to generate what the sequence you really have is. Then you blast that consensus.

And yeah, doing bioinformatics on galaxy is going to get very old very fast.

ADD COMMENT
0
Entering edit mode

Thank you I appreciate it. I know I just need to wait to get approval for my Ubuntu terminal download for windows to use more advanced Linux based tools.

ADD REPLY

Login before adding your answer.

Traffic: 2510 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6