Question

chromosome identifier not found in FASTA file, prompting 'skipping' error, when chr identifier is there

0

Entering edit mode

19 months ago

hemr3 ▴ 10

Hi,

I have a FASTA file that looks like this:

>D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:6:1214:4434:71826
TGATCTTGGCCAAGTCACTTCCTCCTTCAGGAACATTGCAGTGG
>D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:1:2203:12900:37094
ACTTTGGTCAAGTCACTTCCTCCTTCAGGAACATTGCAGTGG
>D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:1:2213:1470:83635

Which I created through the code below. The .bam file in question is an aDNA file from a Neanderthal (http://cdna.eva.mpg.de/neandertal/Chagyrskaya/BAM/).

samtools bam2fq chr22.rh.bam > chag.22.fq
bioawk -c fastx '{print ">"$name"\n"$seq}' chag.22.fq > chag.22.fa

I also have some BED files that correspond to specific proteins which look like this:

22  20061211    20061231    
22  20061274    20061275    
22  20061301    20061302    
22  20061329    20061333    
22  20061334    20061335

But when I try to get the sequence for this protein out of the FASTA file I get this error:

bedtools getfasta -fi try.fa -bed tango2_chag.bed -fo test.fa.out

WARNING. chromosome (22) was not found in the FASTA file. Skipping

I've tried to solve this by adding the chromosome identifier '22' to both the .bam and .fasta files separately, getting a .bam that looks like this:

samtools view chr22.rh.bam | awk -v OFS='\t' '{print ">"$3":"$name}' > try.bam

>22:D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:6:1214:4434:71826   0   22  16050036    0   44M *   0   0   TGATCTTGGCCAAGTCACTTCCTCCTTCAGGAACATTGCAGTGG    ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]    XI:Z:TGATAAC    YI:Z:BCCBBEG    XJ:Z:CATCCGG    YJ:Z:CCCCCGG    FF:i:3  Z0:i:0  XT:A:R  NM:i:2  X0:i:2  X1:i:0  XM:i:2  XO:i:0  XG:i:0  MD:Z:0A2C40 XA:Z:14,-19792922,44M,2;    RG:Z:L5833
>22:D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:1:2203:12900:37094  0   22  16050038    0   42M *   0   0   ACTTTGGTCAAGTCACTTCCTCCTTCAGGAACATTGCAGTGG  ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]  XI:Z:CCTAACG    YI:Z:BB3ABGG    XJ:Z:CAGTACT    YJ:Z:BBCBBGG    FF:i:3  Z0:i:0  XT:A:R  NM:i:2  X0:i:2  X1:i:0  XM:i:2  XO:i:0  XG:i:0  MD:Z:2C4C34 XA:Z:14,-19792922,42M,2;    RG:Z:L5831

Which either produces a 'Malformed FASTA file error' or the same 'WARNING. chromosome (22) was not found in the FASTA file. Skipping' warning.

Is there any way to get the sequence out for these proteins using the Neanderthal FASTA file? I CAN produce protein FASTAs using the human FASTA file downloaded from NCBI, but not with the Neanderthal files (but this obviously does not have the data I would want in them - wrong species, despite close relatedness).

There's something wrong with the FASTA files I'm making - can anyone help solve this problem? I can't just download Neanderthal FASTA files as from what I've found, they don't exist.

fasta gene • 1.1k views

ADD COMMENT • link updated 19 months ago by Darked89 4.7k • written 19 months ago by hemr3 ▴ 10

0

Entering edit mode

If you need reads that overlap the intervals you have in your BED files you can simply use samtools view command with the region (-L) option.

ADD REPLY • link 19 months ago by GenoMax 147k

0

Entering edit mode

Hi, I need the sequences for the proteins from the FASTA file, I don't need to overlap the intervals in the protein .bed files, but thank you! Would this also work for the FASTA files?

ADD REPLY • link 19 months ago by hemr3 ▴ 10

0

Entering edit mode

Can you clarify how you are planning to get protein sequences from a DNA sequence alignment and that too of short reads from an ancient DNA sample. What is in try.fa?

ADD REPLY • link 19 months ago by GenoMax 147k

0

Entering edit mode

Hi, try.fa looks like the FASTA file at the top of the question

>D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:6:1214:4434:71826
TGATCTTGGCCAAGTCACTTCCTCCTTCAGGAACATTGCAGTGG
>D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:1:2203:12900:37094
ACTTTGGTCAAGTCACTTCCTCCTTCAGGAACATTGCAGTGG
>D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:1:2213:1470:83635

The protein sequences will be a bit spotty, but I want to essentially intersect the .bed file I have in the question to get as much of the actual sequence as I can.

I got these protein bed files via intersecting the DNA sequence alignment of the Neanderthal individual with the Gr37 human alignment. Of course, this did not work for all the proteins, as the aDNA .bed files do have gaps in them. This is just one protein that I could find. If it would be helpful (and if you want!) I could send you my pipeline?

ADD REPLY • link 19 months ago by hemr3 ▴ 10

0

Entering edit mode

Looks like you are mixing up chromosomal sequences from some human or Neanderthal assembly with the mapped to that assembly sequencing reads. The first (genomic fasta) knows nothing about any sequence variation likely present in your sequencing reads. The separate reads are of no use to look for the consensus sequence. You probably want to do something along the lines:

https://bioinformatics.stackexchange.com/questions/15168/how-to-generate-a-consensus-sequence-from-a-multi-reference-bam-file

ADD REPLY • link 19 months ago by Darked89 4.7k