Opscode Chef + RabbitMQ

posted Feb 25, 2012, 9:53 AM by Tyler Akins   [ updated Jun 8, 2012, 10:22 AM ]
I have the privilege of working with Opscode Chef at work, maintaining the recipes as various projects move to "the cloud" or otherwise want to script the setup of their environments.  While spinning up many new machines with knife, I very occasionally hit this problem.  After searching the internet pretty hard for the root cause, I'll sum up what I found on this one page.  Maybe it will help solve this problem for you too.

Symptoms

I was running knife and it wasn't doing anything at all.  No CPU activity and no interaction with anything else.  No error messages popped up.  No output to the console at all after hitting enter to start the command.  For all intents and purposes, the program appeared to be hung.  I didn't run strace on it, but I did try kill and kill -9 on the process, to no effect.

I really hate when things like that happen.

At this point, the server started acting funny and I couldn't spin up any instances, so I rebooted.

Try #2

So I tried to spin up the server with knife again and got some interesting error messages:

INFO: *** Chef 0.10.4 ***
INFO: Client key /etc/chef/client.pem is not present - registering
INFO: HTTP Request Returned 500 Internal Server Error: Connection refused - connect(2)
ERROR: Server returned error for http://bluemoon.fuf.me:4000/clients, retrying 1/5 in 4s
INFO: HTTP Request Returned 409 Conflict: Client already exists
INFO: HTTP Request Returned 403 Forbidden: You are not allowed to take this action.
FATAL: Stacktrace dumped to /var/chef/cache/chef-stacktrace.out
FATAL: Net::HTTPServerException: 403 "Forbidden"

Uh-oh.  Why would the server say that the client already exists?  Let's go see what the logs say.  I browsed /var/log/chef/server.log and found this one gem of a line, quoted below.  It was not at the bottom of the file; I needed to scroll up several screens of logs to find it.

merb : chef-server (api) : worker (port 4000) ~ Connection refused - connect(2) - (Bunny::ServerDownError)
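Rather than scrolling, grep can find that line directly.  This is a minimal sketch; find_bunny_errors is a made-up helper name, and the default path is simply the log location mentioned above, so adjust it if your server logs elsewhere.

```shell
#!/bin/bash

# Print every Bunny::ServerDownError line, prefixed with its line number.
# find_bunny_errors is a hypothetical helper name, not part of Chef.
# $1 is the log file; it defaults to the path used in this post.
find_bunny_errors() {
    local log_file="${1:-/var/log/chef/server.log}"
    grep -n 'Bunny::ServerDownError' "$log_file"
}
```

Calling find_bunny_errors with no arguments searches the default log; grep exits non-zero when nothing matches, so the function can also be used in an if test.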

What is this Bunny and why is it down?  Bunny is a Ruby client library for RabbitMQ, and it turns out that RabbitMQ will not start because its databases are corrupted.  Bummer.  You may ask "how can I fix such a thing?"  Well, you can't really repair the databases.  Instead, we just delete them.
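If deleting outright feels too drastic, one option is to move the directory aside under a timestamped name, which frees up the original location just the same but keeps a copy around for a post-mortem.  This is only a sketch of that variation, not what I actually ran; set_aside and the .broken- suffix are names I made up for illustration.

```shell
#!/bin/bash

# Move a directory aside under a timestamped name instead of deleting it.
# Prints the new path; succeeds quietly if the directory does not exist.
# set_aside is a hypothetical helper name, not part of RabbitMQ or Chef.
set_aside() {
    local dir="${1:-/var/lib/rabbitmq/mnesia}"
    local backup
    dir="${dir%/}"                      # drop a trailing slash, if any
    [ -d "$dir" ] || return 0
    backup="${dir}.broken-$(date +%Y%m%d%H%M%S)"
    mv -- "$dir" "$backup" && printf '%s\n' "$backup"
}
```

Run it only while RabbitMQ is stopped, and remember that the set-aside copy still takes up disk space until you remove it.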

The Fix

When you delete the RabbitMQ databases (conveniently located in /var/lib/rabbitmq/mnesia), you are not done yet.  The vhosts and users live in those databases too, so in order to set up Chef in RabbitMQ again, you need to add a vhost, username, password, and permissions.  The vhost is /chef, the username is chef, and the user should have all permissions to the vhost.  The slightly tougher part is getting the password, but it's found in /etc/chef/solr.rb as amqp_pass.  Here is a shell script I used to fix the problem.  You're welcome to use it.

#!/bin/bash

# Fix RabbitMQ by removing the databases

service rabbitmq-server stop

if [ -d /var/lib/rabbitmq/mnesia ]; then
    echo "Removing mnesia directory"
    rm -r /var/lib/rabbitmq/mnesia
fi

service rabbitmq-server start


# Add the Chef vhost, username, password, and permissions

rabbitmqctl add_vhost /chef
# The password is the quoted value on the amqp_pass line of solr.rb
PASS=$( grep ^amqp_pass /etc/chef/solr.rb | cut -d '"' -f 2 )
rabbitmqctl add_user chef "$PASS"
rabbitmqctl set_permissions -p /chef chef ".*" ".*" ".*"
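
After the script finishes, it is worth confirming that the vhost and user really exist before pointing knife at the server again.  The sketch below is one way to do that with rabbitmqctl list_vhosts and rabbitmqctl list_users; chef_rabbit_ready is a hypothetical helper name, and the matching assumes each vhost is printed on its own line and each user name is followed by whitespace, which may differ between RabbitMQ versions.

```shell
#!/bin/bash

# Succeed only if the /chef vhost and the chef user are both present.
# chef_rabbit_ready is a hypothetical helper name, not part of RabbitMQ.
# Assumes list_vhosts prints one vhost per line and list_users prints
# the user name first on each line, followed by whitespace.
chef_rabbit_ready() {
    rabbitmqctl list_vhosts | grep -x '/chef' > /dev/null &&
    rabbitmqctl list_users | grep '^chef[[:space:]]' > /dev/null
}
```

If the function succeeds, rabbitmqctl list_permissions -p /chef shows what the chef user is allowed to do on that vhost.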