Monitoring Chef runs without Chef

I, like many sysadmins, really want to monitor all the things I actually care about. Monitoring is hard in general: not because it’s hard to set up, but because it’s hard to get right. It’s really easy to monitor ALL THE THINGS and then just end up with pager fatigue. It’s all about figuring out what you need to know and when you need to know it.

So in this case I really need to know that my machines are staying in compliance with chef.

There are a few ways you can do this. The first thought I had was adding a hook into all of my runs and having them report in on failure. This is mostly because I’m always looking for another way to hack on Chef and work on my ruby. The big problems with this are:

  • What if the node is offline?
  • What if the cron doesn’t fire?
  • What if chef or ruby is so borked it can’t even fire the app?
  • What if someone disabled chef?

I need a better solution.

Knife Status

Knife status is just awesome: it has some great flags, and I generally run it far more than I should. The great part about this query-the-server approach is that it lets me know:

  1. The server is still happy and spitting out cookbooks to nodes
  2. The status of ALL of my runs from the “source of truth” for runs

Not making my chef test rely on chef

But I’m not going to shell out to knife status. I’m a damn code snob, and something about having the chef test rely on the chef client status didn’t seem right.

Instead I wrote a nagios script that I am not going to share in its entirety here because $WORK_CODE [1] (insert sad face), but I will tell you exactly how I did it.

How to python your chef, or how I stopped worrying and learned to love that I can still use python to do anything.

I’m most experienced in python, and almost all of our internal nagios checks are written in python. So this is in python.

Step one

Use pynagioscheck and pychef. Seriously. Don’t reinvent the wheel here.

Step two

Create a knife object. Have it take all your settings on initialize; then you can create methods that recreate each of the different knife commands with pychef.

You really only need status for this one. The meat of status is this here, which coderanger dropped on me in IRC:

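# Assumes `import chef`, `from datetime import datetime`, and a default chef.ChefAPI already configured.
nodes = {}  # maps machine name -> datetime of the node's last ohai run
for row in chef.Search('node', '*:*'):
    nodes[row.object['machinename']] = datetime.fromtimestamp(row.object['ohai_time'])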
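A rough sketch of what such a Knife object could look like, built around that loop. This is not the actual work script: the class layout, the constructor arguments, and the injectable `search` argument (there so the class can be exercised without a chef server) are all assumptions for illustration. The only pychef calls used are `chef.ChefAPI` and `chef.Search`, and the dictionary is keyed on Ohai's `machinename` attribute.

```python
from datetime import datetime


class Knife(object):
    """Hypothetical wrapper: one method per knife command we want to mimic."""

    def __init__(self, url, key_path, client_name, search=None):
        self.url = url
        self.key_path = key_path
        self.client_name = client_name
        # `search` is an assumption: any callable(index, query) returning rows
        # with an `.object` mapping. It defaults to a real pychef search.
        self._search = search or self._chef_search

    def _chef_search(self, index, query):
        import chef  # pychef, imported lazily so a stub search needs no install
        api = chef.ChefAPI(self.url, self.key_path, self.client_name)
        return chef.Search(index, query, api=api)

    def status(self):
        """Return {machine name: datetime of last ohai run}, like `knife status`."""
        nodes = {}
        for row in self._search('node', '*:*'):
            node = row.object
            nodes[node['machinename']] = datetime.fromtimestamp(node['ohai_time'])
        return nodes
```

Keeping the search behind one small method means every other knife command you recreate later can share the same API setup.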

Step three

Now from here I created a TimeChecker object. It takes the dictionary of { server: datetimeObj } on its init. For consistency’s sake I also init self.now = datetime.now(), so every node is compared against the same instant. Then I have a TimeChecker.runs_not_in_the_last() that just takes an int number of hours.

The magic of runs_not_in_the_last I will also share with you, because I’m proud of this damn script and want to share it with the world:

diff = timedelta(hours=hours)
return [k for k in self.runtimes.keys() if self.now - self.runtimes[k] > diff]

Bam!
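Put together, the TimeChecker described in this step might look something like the sketch below. The two lines inside runs_not_in_the_last are the ones shown above; the optional `now` argument is an assumption added here so the class can be tried out with a fixed clock.

```python
from datetime import datetime, timedelta


class TimeChecker(object):
    """Holds {server: datetime of last run} and answers 'who is stale?'."""

    def __init__(self, runtimes, now=None):
        self.runtimes = runtimes
        # One shared "now" so every node is compared against the same instant.
        # Passing `now` in is an assumption, handy for testing.
        self.now = now or datetime.now()

    def runs_not_in_the_last(self, hours):
        """Return the servers whose last run is more than `hours` hours old."""
        diff = timedelta(hours=hours)
        return [k for k in self.runtimes.keys() if self.now - self.runtimes[k] > diff]


# Example with a fixed clock: db1 last ran 30 hours ago, web1 one hour ago.
now = datetime(2015, 6, 1, 12, 0)
checker = TimeChecker(
    {'web1': now - timedelta(hours=1), 'db1': now - timedelta(hours=30)},
    now=now,
)
print(checker.runs_not_in_the_last(24))  # ['db1']
```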

Step four

Now just extend NagiosCheck with KnifeStatusCheck, set up all your options and other goods in your init, and then write your check().

In the check you make a Knife, make a TimeChecker with the status return… then all you have to do is see if you have any runs_not_in_the_last for critical and then for warning.
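The critical-then-warning decision can be sketched as a plain function, kept separate from pynagioscheck so it runs on its own. The function name `evaluate` and its return shape are assumptions for illustration; in the real check() the result would be turned into a Status. The 12 and 24 hour defaults are the thresholds mentioned later in this post.

```python
def evaluate(checker, warning_hours=12, critical_hours=24):
    """Return (status, stale_nodes), testing critical before warning.

    `checker` is anything with a runs_not_in_the_last(hours) method,
    such as the TimeChecker from step three.
    """
    stale = checker.runs_not_in_the_last(critical_hours)
    if stale:
        return 'critical', sorted(stale)
    stale = checker.runs_not_in_the_last(warning_hours)
    if stale:
        return 'warning', sorted(stale)
    return 'ok', []
```

Checking critical first matters: any node that is stale at 24 hours is also stale at 12, so testing warning first would hide the worse state.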

Gotchas and cleanup notes

USE EXCEPTIONS

Seriously, this script can and will raise them, so catch them properly and return errors. You will need to catch and handle AT LEAST:

  • URLError
  • Status
  • UsageError
  • ChefError
  • At least two of your own exceptions
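As a minimal sketch of that idea, here is one way to funnel failures into a single error message instead of a traceback. The exception names `KnifeStatusError`, `PemNotFound`, and `NoNodesReturned` and the helper `fetch_or_unknown` are hypothetical, made up for this example; only URLError comes from the standard library. In the real check you would also catch pychef's ChefError here and report the message through pynagioscheck, most likely as UNKNOWN.

```python
try:
    from urllib.error import URLError  # Python 3
except ImportError:
    from urllib2 import URLError  # Python 2


class KnifeStatusError(Exception):
    """Base class for the check's own exceptions (hypothetical names)."""


class PemNotFound(KnifeStatusError):
    """The client key is not where the check expects it."""


class NoNodesReturned(KnifeStatusError):
    """The search worked but came back empty, which is never a good sign."""


def fetch_or_unknown(fetch):
    """Call fetch(); return (nodes, None) on success or (None, error message)."""
    try:
        nodes = fetch()
        if not nodes:
            raise NoNodesReturned('chef search returned no nodes')
        return nodes, None
    except URLError as exc:
        return None, 'cannot reach chef server: %s' % exc.reason
    except KnifeStatusError as exc:
        return None, str(exc)
```

An empty result gets its own exception because "zero stale nodes out of zero nodes" would otherwise look like a healthy check.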

SSL errors

So there is no trusted_certs here. You need to either give your server a working cert, install the snake-oil cert on the nagios server as trusted, or do the dirtiest of monkey patches.

# Dirty Monkeypatch: turns off HTTPS certificate verification for the whole process
import sys
if sys.version_info >= (2, 7, 9):
    import ssl
    ssl._create_default_https_context = ssl._create_unverified_context

But before you do this think of the children!!!

Weird ass errors with join

I need to maybe open a ticket and patch pynagioscheck, but I had the weirdest bug when raising a critical. It would die in the super’s check on "".join(bt) or something of that ilk.

My workaround was to not just pass msg to the Status exception as a string, but to make msg a list, put the main message in msg[0], and put the comma-joined list of servers out of compliance in msg[1]. This means the standard error comes up on normal returns, but if you run the check with -v it will give you a list of servers out of compliance for troubleshooting or debugging. Not bad.
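Building that two-part message could look like the sketch below. The helper name `build_message` and the wording of the summary are assumptions; the point is only the shape, a short summary first and the comma-joined server list second.

```python
def build_message(stale_nodes, hours):
    """Return [summary, detail] for the Status message.

    The summary is what shows on a normal run; the detail line is the
    comma-joined list of servers for verbose output.
    """
    summary = '%d node(s) have not converged in the last %d hours' % (
        len(stale_nodes), hours)
    detail = ', '.join(sorted(stale_nodes))
    return [summary, detail]


print(build_message(['web2', 'db1'], 24))
# ['2 node(s) have not converged in the last 24 hours', 'db1, web2']
```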

Handling the pem file

Eeeeehhhh. This may be my one cop-out in the whole script. Basically I created a nagios user in chef with an insane, never-to-be-used-again, and promptly lost password, and put the nagios.pem file alongside the check script. Then I let the script optionally take a pem name, and it just checks that the pem file is alongside the check script. I was considering letting you specify a pem file somewhere on the server or in the nagios user’s home directory, but decided to punt on that and take the simplest route.
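A small sketch of the "pem lives next to the script" lookup, assuming a helper called `find_pem` (a made-up name) that is handed the path of the check script, typically `__file__`:

```python
import os


def find_pem(script_path, pem_name='nagios.pem'):
    """Return the path of `pem_name` next to the check script, or raise."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    pem = os.path.join(script_dir, pem_name)
    if not os.path.isfile(pem):
        # Fail loudly here so the check can report a clear error
        # instead of dying later inside the chef client code.
        raise IOError('pem file not found: %s' % pem)
    return pem
```

Inside the check this would be called as `find_pem(__file__, opts_pem_name)`, where `opts_pem_name` stands in for whatever the optional pem-name option is called.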

Don’t destroy your nagios server

Seriously. Did you see this code? It runs a search on all nodes and then returns an attribute for every node to your nagios server. This is not the world’s fastest check script.

Unless you dedicate some serious power to the solr service on your chef server, you should make sure to run this check at most once every ten minutes. I normally check only once an hour and then follow up with 10-minute checks on failure. My servers only converge every four hours, so an “out of compliance” warning for me is at the 12-hour mark and critical at 24 hours [2].


  1. I don’t yet have any clearance to post or share anything I write for, while, at, or around work. The company owns all that, but we are currently working on getting to the point where we can share some stuff. Especially things not so related to our IP like infrastructure code, cookbooks, checks, etc.

  2. The reason I picked these numbers is that I don’t want to know the FIRST time a converge fails. I use the omnibus_updater in my runs (pinned version in attributes, of course) so a failed run can be normal. Plus, if I am deploying something that important, I am going to spot check runs and verify everything gets run with knife ssh. Mostly I just want to know if a machine is out of the loop for more than a day, because that’s a node that needs to get shot.


An Open Year

It's been about a year since my last post, which was mostly me being frustrated with Chef as a beginner. Now I spend most of my day writing cookbooks and recipes. In fact I am even helping the Lead Dev at work learn Chef, and I got back from the Chef conference. There I met a lot of amazing people and even offered to help maintain BSD support in chef.

This post isn't about that so much. It's mostly about a behavior I noticed I picked up. When I worked for Stephens Media I spent a lot of my energy trying to contribute: posts, open source, pull requests, etc. Then when I moved to Slickdeals.net my time was really sucked up. I drifted from working on Pelican and stopped doing as many pull requests. At some point I set up a personally hosted Stash instance. Then I locked that stash instance off behind a login. Then I started writing in my private confluence instead of here. Now, I noticed, all my projects these days are All Rights Reserved... hmph.

I don't know exactly what triggered this sharephobia, but it needs to stop. I almost think it's some weird greed involving my personal time and effort, but if I were greedy wouldn't I want people fixing up my code for me? Is there some revolutionary private research in all this that makes me more valuable? I think showing off my abilities and progress makes me more valuable.

I'm currently working on pulling all my code out of my stash and putting it onto github, with a much better BSD license. I'm remembering what the subtitle of my blog really means.

I've spent a lot of time studying Ruby since I finished my DBA course. There are still a lot of areas where Chef could use improvements and I plan to do a lot about it. We are going to make BSD a first class citizen with Chef and hopefully many of its tools and cookbooks too. [1]

Remember when I used to post monthly? Hahahaha. I don't want to use this as a journal (I already have one of those), but I wanted to give a bigger picture life update since I am updating pages and testing my jenkins build trigger with github ;p


  1. I have always preferred UNIX to Linux. My first sysadmin job was as a Solaris admin, a job I did for a long time. With the advent of systemd I've gone back to my love in the form of BSD.


Chef Frustrations

I've spent the last week working on implementing chef. The experience is frustrating to say the least. Instead of whining I wanted to take the time to write out some of my pain points and hopefully offer some constructive fixes to what I see as the wall in the learning curve.

Now, to be clear up front: most of my problems aren't with Chef, Ruby, or most of the core product; they're with implementing it. To be more precise, I think the failure REALLY is documentation.

Anti-pattern One: Getting Started (into a corner)

Also known as the "Just enough to be dangerous but not useful" anti-pattern

I really liked the new Learn Chef. I have to give them a ton of credit for all the work, but underneath all the new splash and presentation it's still the exact same old Chef 101 it was two years ago: it teaches you the barest of all basics and then drops you off at docs.opscode.com.

I know that most would feel that statement isn't fair, since it teaches you all about the design and system behind how chef works, and that it does; but it still feels like not enough to be useful and here is why.

Anti-pattern Two: We Have no Patterns...

Learn Chef teaches you how chef works but not really how to use it at any level of scale; there is no real world usage taught anywhere. It teaches you to set up a Chef Enterprise server, re-invent the wheel with a homemade apache or ntp cookbook, and push it all to a vm, but you would rarely do this in practice, right?

When you leave Chef's documentation you learn about many very important Chef patterns:

  • wrapper cookbooks
  • berkshelf way
  • one repo per cookbook vs monolithic repo
  • application cookbooks
  • service cookbooks

Why doesn't chef teach us these? Is this something we save for consultants to teach us at thousands of dollars an hour? Is it that Chef wants to avoid teaching patterns in order to remain as flexible as possible [1]?

It's not just chef either. Go to http://berkshelf.com and tell me how to use this tool, assuming you've never done so before. If I was trying to remember a few commands or learn a new trick on top of something, this tool's docs would be great, but they're missing the meat of what this tool is designed for and how to use it. A lot of chef's tools are treated this way.

Anti-pattern Three: ...So please learn everyone else's anti-patterns

This is my biggest frustration, OPD; Other People's Docs. As someone who has been working in Systems for 10+ years I have lived and learned so much from everyone else's blogs, which is why I feel the need to blog all my own lessons and information.

I feel that chef relies too much on OPD though, especially because chef is such a fast moving target. It's amazing how many of the chef users I talk to use it in some odd, bizarre, and/or generally 'not correct' way. It's usually because they learned a bad habit from a predecessor or found a bug in a long ago version and found some OPD that convinced them that "oh no you have to run everything chef-solo with your own special bootstraps, that is the ONE TRUE WAY™". I'm not saying that pattern doesn't work, but I doubt it's the best way for many infrastructures [2].

I plan on documenting plenty of chef-like things myself; in fact I plan on posting as much of my own OPD as possible. But with how fast chef evolves as a product, and with the large variance of methods for different environments, I really hope people take everything with a grain of salt and read the date on the post when considering my advice.

Here is a great example: around 2014-07 I went into #chef and asked about some methods for setting things up and was linked to this blog, which is treated like a de facto example of how to do things. But read all those updates... and then notice how it's using a lot of deprecated methods. I was linked to an article that could be titled "How to develop some really bad habits, but learn important things while you are at it." It's not Mischa's fault; it doesn't seem like he is a docs writer for Chef. Honestly I feel the best thing that could be done is for this document to be updated to the latest methodologies and tacked on to the end of Learn Chef as "One good method to get your environment up and going".

As a chef user, do you even know about chef-dk? You probably should take a break from what you are doing, read this and then do this. Seriously, don't you feel much better? This also should be on the end of the Learn Chef guide. Hell, this should probably be the first half of the Learn Chef guide.

I get that maybe they don't want to declare a "chef way" to do things... but at least give us some better hints.

Next Actions

Just to recap:

  • I believe chef's biggest weakness is documentation, which creates a wall in the learning curve that you hit right after "I can now build and deploy a test apache on a linode" and before "I can build and deploy this in a staging environment"
  • I think there should be a Learn Chef 200 series that goes over:
    • Using a wrapper cookbook, and the different types of abstraction you often see with these.
    • Teaching everything chef-dk adds: bootstrapping, running tests, and automated integration testing.
    • Highlighting several useful patterns for cookbook development.
    • Using more of chef's tools, e.g. ohai
  • If chef is going to rely on the community for docs maybe it should create a way where they can contribute to the main docbase just like they do code.
  • go here, have your life changed
  • If you are in the Las Vegas, NV area come hang out at #lvdevops on freenode and tell me how I make you feel
  • I'm going to spend another week or two trying different ways to structure my cookbooks and see what works.

  1. I believe this is a horrible anti-pattern in documentation. If you believe your power is flexibility then you should highlight that, but still outline some predominant patterns for your top two or three use cases.

  2. I know it's not the best way because they are deprecating chef-solo for chef-zero, which is good, but it's a great example of the speed at which Chef is changing.
