Question

Problem running CheckM

0

5.4 years ago

pablo ▴ 310

Hello, I have several files containing protein FASTA sequences. Each file corresponds to a cluster of genes.

UniRef90_100.fasta
UniRef90_101.fasta
UniRef90_102.fasta
UniRef90_103.fasta
UniRef90_104.fasta
UniRef90_105.fasta
UniRef90_10.fasta
UniRef90_11.fasta
UniRef90_12.fasta
UniRef90_13.fasta

I want to determine the contamination of each cluster, so I want to run CheckM. I used checkm lineage_wf bins checkm, but it does not work. I get this error message: "checkm: error: unrecognized arguments:" followed by all my bin files.

My question is: are these files bins? Each file has the following structure:

>UniRef90_A0A1B2YXP8 - Cluster: Uncharacterized protein
MRILRNFLGLFLLTAFIFSCVDENESNADFVDTISEPTNISALVSISQDNTGLVTIIPTG
EGVVTFNVDYGDGSDISGSINPGNSTEHFYSEGTYEATIIGTALDGSTAQATVTVVVSFI
APENLVVDILTSSGSYNILVSASADYATSFEVLFGDEAGGDATPMQIGEQLSHSYELAGT
YNVTITALSGGAATTQYSEEITITDPPVFDGFSTFEDFEGEVPGNFSFGGVGNVQVVANP
DNSGINTSTSVMQCTKDQGAEVWGGMGFAVNGHINFNGNNVLRLKSYAPEVGKVVKVKLE
TSAGNVAGLTYEFDMVTTVANQWEILTYDFSGAPDLDYITAIVFYDFGNQNAGVYHFDDV
EVGIGEYIQGIENFEGDVPESFTFGGVGGVEVIPNPDPSGENITGNVLQFVKDEGAEVWG
GMGFAVDVIDFNGASQIHLKSYAPEAGKVVKVKLETSAGNVAGLTHEVDVTTTVANEWET
LIYDFTGAPDLEYVSFIVFYDFGNTVGATYRVDEIQLID
>UniRef90_A0A1B2YXU0 - Cluster: Uncharacterized protein
MKYKILFLSILILFSCNHDNEKLDAIIKEYQNHEGYNYEDYPLGNFSEEYFKAEKEFAES
LLLKLDDIDITKLDENDNISYELLSFVLNDIIAYYDFERFLNPLLSDSGFHSSLVYNVRP
MYNYEQVKNYLNKLNAIPQYVDQYLPLLRKGLEKGVSQPLVIFKGYESTYNDHITKDFES
NYFYSPFNKLPNDISEIQRDSIFVAAKNAIEKSVVPQFIRIKDFFEKEYYKKTRTTIGVS
QTPNGSEFYQNRINYYTTSESYTADEIHQIGLKEVARIKKEMIKIIDELKFKGSFEEFFK
FLRTDEQFYAKTPKELLMYARDISKRADEQLPRFFKTLPRKPYGVAPVPDAIAPKYTGGR
YVGTSKNSTDPGYYWVNTYDLKSRTLYTIPALTVHEAVPGHHLQSALNNELGDSIPRFRR
NLYLSAYGEGWGLYTEFLADEMGIYTTPYEKFGKFTYEMWRACRLVVDTGLHTKGWSKEK
AIDYMSKNTALSLHEVNTEIDRYISWPGQALSYKIGELKIRELRNKAKDQLNDKFDIREF
HEKILEYGTVTLPTLERRINNYIEKKNE

checkm contamination fasta • 4.1k views

ADD COMMENT • link updated 5.4 years ago by Asaf 10k • written 5.4 years ago by pablo ▴ 310

0

It would help to provide a link to the package this program belongs to. Have you checked the in-line help to see if that offers any assistance on what the minimal usage needs to look like?

ADD REPLY • link 5.4 years ago by GenoMax 147k

0

I checked the documentation for this package and I think I did it correctly. My files are in the right format and the command line I used looks fine.

ADD REPLY • link 5.4 years ago by pablo ▴ 310

1

5.4 years ago

vin.darb ▴ 300

According to the README:

By default, CheckM assumes genomes consist of contigs/scaffolds in nucleotide space and that the files to process end with the extension fna.

Example Usage

Assume you have putative genomes in the directory /home/donovan/bins with fa as the file extension and want to store the CheckM results in /home/donovan/checkm. To process these genomes with 8 threads, simply run:

checkm lineage_wf -t 8 -x fa /home/donovan/bins /home/donovan/checkm

Did you try specifying -x fasta?

ADD COMMENT • link 5.4 years ago by vin.darb ▴ 300
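
Since the README says CheckM assumes nucleotide contigs by default, it can help to confirm what a file actually contains before choosing options. The following is only a minimal sketch and not part of CheckM: fasta_kind is a hypothetical helper, and the 10% cut-off is an arbitrary assumption.

```shell
# Hypothetical helper: guess whether a FASTA file holds protein or
# nucleotide sequences by counting letters that are not nucleotide codes.
fasta_kind() {
    awk '
        /^>/ { next }                      # skip header lines
        {
            seq = toupper($0)
            gsub(/[^A-Z]/, "", seq)        # keep letters only
            total += length(seq)
            gsub(/[ACGTUN]/, "", seq)      # drop nucleotide codes
            other += length(seq)           # what is left looks like amino acids
        }
        END {
            if (total == 0) { print "empty"; exit }
            # Assumption: more than 10% non-nucleotide letters means protein.
            if (other / total > 0.10) print "protein"; else print "nucleotide"
        }
    ' "$1"
}
```

For example, fasta_kind UniRef90_100.fasta should report "protein" for files like the one shown in the question, which would suggest that the default nucleotide mode is not appropriate.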

0

Yes, I tried, but I always get the same result. I don't understand why.

ADD REPLY • link 5.4 years ago by pablo ▴ 310

0

I only needed to add the -g option to work on protein sequences.

ADD REPLY • link 5.4 years ago by pablo ▴ 310

score 2 · Accepted Answer · 2019-06-21

2

5.4 years ago

Asaf 10k

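If the input is protein, you should use -g (--genes), which tells CheckM that the bins contain genes as amino acids instead of nucleotide contigs.

Wait, I suspect you wrote bins/*. Is that right? Can you add the full command line?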

ADD COMMENT • link 5.4 years ago by Asaf 10k
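
To illustrate both points in this answer, here is a hedged sketch. The function names run_checkm_proteins and count_args are hypothetical, and the directory names are placeholders. It assumes the bins end in .fasta as in the question. If the shell expands bins/* before checkm sees it, every file arrives as a separate argument, which is probably what produces the "unrecognized arguments" error.

```shell
# Hypothetical wrapper: run CheckM on a directory of protein FASTA files.
# $1 = directory holding the bins (pass the directory itself, not bins/*)
# $2 = output directory for the CheckM results
run_checkm_proteins() {
    bin_dir=$1
    out_dir=$2
    # -g: bins contain genes as amino acids
    # -x fasta: extension of the files to process
    checkm lineage_wf -g -x fasta "$bin_dir" "$out_dir"
}

# Print how many arguments the shell hands over for a given pattern.
count_args() {
    echo "$#"
}
# count_args bins     reports a single argument (the directory).
# count_args bins/*   reports one argument per file in the directory.
```

Calling run_checkm_proteins bins checkm_out would then correspond to the command the original poster ended up with once -g was added.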

0

The -g option worked. Thanks.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Bin Id                  Marker lineage           # genomes   # markers   # marker sets    0     1    2    3    4    5+   Completeness   Contamination   Strain heterogeneity  
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  UniRef90_1           k__Bacteria (UID203)           5449        104            58         4     17   19   13   7    44      95.34           283.45              6.85          
  UniRef90_19          k__Bacteria (UID203)           5449        104            58         7     20   20   12   12   33      91.22           255.84             11.38          
  UniRef90_14         k__Bacteria (UID2570)           433         267           178         61   128   56   13   8    1       79.42           42.78              17.65          
  UniRef90_12          k__Bacteria (UID203)           5449         99            53         24    24   16   7    4    24      69.27           141.02             21.72          
  UniRef90_24          k__Bacteria (UID203)           5449        104            58         23    20   29   10   8    14      67.63           103.59             21.25          
  UniRef90_22          k__Bacteria (UID203)           5449         99            53         35    24   9    6    2    23      62.49

I got this kind of output. It looks bad, doesn't it?

ADD REPLY • link 5.4 years ago by pablo ▴ 310

0

Pretty bad, yeah. Each bin looks like a mix of several bacteria.

ADD REPLY • link 5.4 years ago by Asaf 10k

0

I only showed you the head of the output. I got some other lineages:

UniRef90_65      p__Proteobacteria (UID3880)        1495        261           164        188    60   13   0    0    0       27.85            5.12              38.46          
  UniRef90_71        p__Euryarchaeota (UID3)          148         188           125        132    46   10   0    0    0       26.89            4.25              20.00          
  UniRef90_73     f__Rhodobacteraceae (UID3356)        67         615           329        451   164   0    0    0    0       26.65            0.00               0.00          
  UniRef90_7         p__Euryarchaeota (UID3)          148         188           125        133    37   15   2    1    0       24.44            9.74              14.81          
  UniRef90_16          k__Bacteria (UID203)           5449        104            58         69    12   6    5    0    12      24.39           29.06              35.29          
  UniRef90_67      p__Proteobacteria (UID3880)        1495        261           164        195    61   5    0    0    0       24.39

ADD REPLY • link 5.4 years ago by pablo ▴ 310

0

These are about a quarter of a genome each. People usually require completeness > 70-80% and contamination < 20%. You might also get good bins with 0% completeness, so watch for those too.

ADD REPLY • link 5.4 years ago by Asaf 10k
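
The thresholds mentioned above can be applied to the on-screen table with a small filter. This is a rough sketch under stated assumptions: filter_bins is a hypothetical helper, the default cut-offs (70 and 20) are just the values quoted in this reply, and it relies on each complete data row splitting into exactly 15 whitespace-separated fields, as the rows pasted in this thread do.

```shell
# Hypothetical helper: list bins passing completeness/contamination cut-offs.
# $1 = file containing the CheckM table as printed on screen
# $2 = minimum completeness (default 70)
# $3 = maximum contamination (default 20)
filter_bins() {
    awk -v min_comp="${2:-70}" -v max_cont="${3:-20}" '
        # Complete data rows have 15 fields because the lineage
        # (e.g. "k__Bacteria (UID203)") splits in two; header, dashed and
        # truncated lines have a different field count and are skipped.
        NF == 15 && $(NF-2) ~ /^[0-9.]+$/ {
            comp = $(NF-2) + 0             # completeness column
            cont = $(NF-1) + 0             # contamination column
            if (comp > min_comp && cont < max_cont)
                print $1, $(NF-2), $(NF-1)
        }
    ' "$1"
}
```

With the default cut-offs, none of the rows pasted in this thread would pass. Splitting on tabs would be more robust if the table were written in tab-separated form.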

0

You meant "0% contamination", no?

ADD REPLY • link 5.4 years ago by pablo ▴ 310

0

No, 0% completeness: CheckM can't find the marker proteins it's looking for, but other than that the assembly looks good in terms of size and N50.

ADD REPLY • link 5.4 years ago by Asaf 10k

0

OK, I got it. Is there a file where the output is stored? I can't find it.

ADD REPLY • link 5.4 years ago by pablo ▴ 310
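
This question was not answered in the thread. One possible approach, offered as a sketch rather than a confirmed answer, is to ask lineage_wf to write the summary table to a file. As far as I know, lineage_wf accepts -f/--file and --tab_table, but check checkm lineage_wf -h for your version. The wrapper name run_checkm_to_file and all paths are hypothetical.

```shell
# Hypothetical wrapper: run CheckM on protein bins and save the summary table.
# $1 = directory holding the bins
# $2 = output directory for the CheckM results
# $3 = file that receives the summary table
run_checkm_to_file() {
    # --tab_table: write the table as tab-separated values
    # -f FILE: write the table to FILE instead of standard output
    checkm lineage_wf -g -x fasta --tab_table -f "$3" "$1" "$2"
}
```

For instance, run_checkm_to_file bins checkm_out checkm_summary.tsv would leave the table in checkm_summary.tsv, assuming those options behave as described.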

0

Hello, do you know how to interpret the # genomes, # markers and # marker sets columns?

ADD REPLY • link 5.1 years ago by vm.higareda ▴ 30

0

I think you should open a new question with some more background if you're still struggling.