Forklifting Chef Server

From There to Here and Here to There

Due to the requirements of my client I have a need to move our chef server into a completely different hosting environment. This little journey started when I asked that we actually try to perform a restore from our chef backups. When that didn't work, one of my associates threw a bunch of ruby code at the problem. The solution was a little kludgy and didn't entirely do what I wanted, so I hit the boards and tried to figure out a better way to do this. After some missteps I came across the knife-backup gem and all was well… for now.

First Steps

Restore, they said, it'll be simple they said… yeah right. To be clear, we follow the OpsCode guidelines here: https://docs.chef.io/server_backup_restore.html. You basically tar up a bunch of files and take a database dump. The problem we had, like most people, is that while we back up the server, we hardly ever (read: never) perform a restore. It turns out that when we actually performed the restore, nothing worked all that well. After a few hours we were able to trace the issues to hostnames and key files. While this method will work in a crunch, I started to wonder if there was a better way. In the world of cloud and amazon things are much more disposable, and in essence I wanted to figure out which is easier: standing up a new chef server or going through my highly manual chef backup and restore.

Finding of a Gem

I decided to see if anyone had automated the backup and restore process in chef based on the OpsCode guidelines. After some googling, I decided that a full backup and restore wasn't what I really wanted. For what I wanted, and for how we work in general, it's more useful to be able to import a set of cookbooks and all their versions. The "all their versions" part is important: we make heavy use of environments and often have several versions of a cookbook in use at the same time. Enter knife-backup.

Really, Really You Need That Ruby

This is where I sometimes wonder what goes through a developer's mind. Chef is bundled with specific versions of ruby that do not change. Chef 11, a very common version of chef in use, is pegged at ruby 1.9. In the chef versions I've worked with, chef relies mostly on the embedded ruby installed with chef; you need to go out of your way to use the system version of ruby. As a result, for better or worse, we've come to rely on the chef-installed ruby rather than the system interpreter for all chef-related tasks. In fact, to keep our images clean we don't even install a system ruby and use only the embedded version. This is why, when I attempted to install knife-backup, the gem installed but couldn't run because of a ruby version mismatch.

Trials and Tribulations in Ruby

Nothing is without difficulties: using knife-backup requires the ability to run two versions of ruby at once. I recommend that you do not install two system versions of ruby. Unless your linux distribution has a mature way of handling multiple interpreters (gentoo), you should keep the second interpreter in user space. On Redhat systems I recommend letting chef use its embedded ruby and using RVM to manage the ruby needed for knife-backup.

Installing rvm

According to the RVM website it's as easy as a curl command:

curl -sSL https://get.rvm.io | bash -s stable

However, in my world nothing is ever that straightforward. I ran the command and found I can't connect to the Internet. All the environments we work in are protected by proxy servers. The proxy servers filter all requests so that only particular user agents work. This means curl, the primary method of installing RVM, does not work out of the box for me. I was able to get it going by modifying the .curlrc file on the target machine in the following way:

user-agent="only useragent our proxies accept"

Success! I now have RVM installed. Now on to installing ruby. RVM does two things. First, it manages symlinks and environment variables in userspace so you can pick which RVM-installed ruby interpreter you want. Second, it builds an installation of ruby from source. To kick this off, execute rvm install <ruby version> followed by rvm use <ruby version>. Specifically we want ruby 2.1, and RVM is going to build it from the ruby source. To compile the interpreter, RVM will attempt to install development tools, i.e. gcc, so make sure you have RHN access so you can install all the necessary rpms. Executing rvm install 2.1 takes a while as things compile. Lastly, simply execute rvm use 2.1, then run ruby -v to verify the version and we are ready to go.
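
For reference, the whole sequence on the restricted box looks roughly like this (a sketch; it assumes RVM is already sourced into your shell):

rvm install 2.1   # builds ruby 2.1 from source; needs gcc and friends, hence the RHN access
rvm use 2.1       # switch this shell to the RVM-managed interpreter
ruby -v           # verify we are no longer on chef's embedded ruby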

Nothing is Ever Simple

Just as we could not reach the Internet with curl, the method by which ruby reaches out to the Internet to install gems is equally blocked. My next effort was getting ruby to dance the same user-agent dance that curl did. Now, I must admit I could not find an easy way to change the ruby user agent, although I know it is possible. Instead, I installed RVM on a machine that had unfettered Internet access, installed the version of ruby I needed, then did a gem install knife-backup. After the install, I located all the gem dependencies for knife-backup and from that list performed a gem download of all the needed gems. Once I had all the gem archives I tarred them up, transported them over to the Internet-restricted box, and did a local gem install. This allowed me to get knife-backup installed even though I could not get past the proxy server.
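
A rough sketch of that shuffle (the dependency list comes from gem dependency knife-backup; the host name is made up):

# On the machine with unfettered Internet access
gem install knife-backup
gem dependency knife-backup            # note every gem it pulled in
mkdir gems && cd gems
gem fetch knife-backup                 # repeat gem fetch for each dependency noted above
tar cvzf knife-backup-gems.tar.gz *.gem
scp knife-backup-gems.tar.gz restrictedbox:

# On the Internet-restricted box
tar xvzf knife-backup-gems.tar.gz
gem install --local ./*.gem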

Still, Nothing is Ever Simple

RVM has taken care of ruby, and a little bit of tar magic got the gem installed. After installing knife-backup on machines that could talk to the source and destination chef servers, I was ready to go. I executed the backup command:

knife backup export cookbooks environment nodes databags -D tmpbackupdir
tar cvjf backup.tbz2 tmpbackupdir
scp backup.tbz2 serverthatcantalktonewchefserver:

Once the backup archive was transferred over I untarred it and ran the restore command:

knife backup restore -D tmpbackupdir

And it failed… Another requirement of the knife-backup gem is a chef client >= 12.x. Chef 12.x has a new requirement for cookbooks: they must all have the name attribute defined in the metadata.rb file. The cookbooks come from a chef 11 server, where the chef 11 clients do no checking for this. When the chef 12 client attempted the import, it failed a client side check and died. Fortunately, the layout of the backup archive is <cookbookname>-version. With a little bit of sed and find I was able to create a name attribute for all the cookbooks on the fly with something similar to the following:

find . -maxdepth 1 -type d ! -name . -exec bash -c 'echo "name \"$(basename "$1" | sed -e "s/-[0-9.]*$//")\"" >> "$1/metadata.rb"' _ {} \;

With the name attribute now set I ran the restore command once again. During execution I noticed a curious message about merging cookbooks, but thought nothing of it at first. After the restore I logged into the chef UI and started to take inventory of the versions and contents of the cookbooks, and that's where I understood the merging message.

The problem displayed itself as follows: I was supposed to see every version of every cookbook that existed on the chef server of origin, as well as the databags, nodes, and environments. While the databags, nodes, and environments transferred fine, several cookbooks were missing versions. This was bad, considering one of the chief requirements was to transfer all cookbooks and all of their versions.

I started to page through the gem source code to see why exactly this was occurring. After adding some debug statements so I could trace execution, I started to understand the issue. It had to do with the chef cookbook path. If you want a cookbook upload to behave as expected, chef can only have one version of that cookbook in its cookbook path. The restore script was putting the base directory containing all the versions of all the cookbooks in the chef path. When asked to upload a particular version of a cookbook, knife would see all the versions in its cookbook path, pick the highest, and just upload that one several times.

The solution was to patch the gem. What I needed to do was make a temp directory and point the cookbook path there. From there I would symlink each version of the cookbook into the temp directory and let the restore function perform a knife cookbook upload. The patch is below:

--- backup_restore.rb.orig   2015-01-23 12:31:20.289221254 -0500
+++ backup_restore.rb      2015-01-14 15:59:19.780221191 -0500
@@ -159,11 +159,14 @@
     def cookbooks
       ui.info "=== Restoring cookbooks ==="
       cookbooks = Dir.glob(File.join(config[:backup_dir], "cookbooks", '*'))
+      #Make tmp dir
       cookbooks.each do |cb|
+        FileUtils.rm_rf(config[:backup_dir] + "/tmp")
+        Dir.mkdir config[:backup_dir] + "/tmp"
         full_cb = File.expand_path(cb)
         cb_name = File.basename(cb)
         cookbook = cb_name.reverse.split('-',2).last.reverse
-        full_path = File.join(File.dirname(full_cb), cookbook)
+        full_path = File.join(File.dirname(config[:backup_dir] + "/tmp"), cookbook) 
         begin
           File.symlink(full_cb, full_path)
@@ -172,12 +175,15 @@
           cbu.name_args = [ cookbook ]
           cbu.config[:cookbook_path] = File.dirname(full_path)
           ui.info "Restoring cookbook #{cbu.name_args}"
+          ui.info "TMP=" + config[:backup_dir] + "/tmp"
           cbu.run
         rescue Net::HTTPServerException => e
           handle_error 'cookbook', cb_name, e
         ensure
           File.unlink(full_path)
         end
+      ui.info "Deleting TMP"
+      FileUtils.rm_rf(config[:backup_dir] + "/tmp")
       end
     end

After applying the patch, all versions of all cookbooks appeared correctly. Once this process was ironed out I had a fairly easy way to sync the servers moving forward, and a backup and restore strategy that works without having to know how to mess with SSL certs.

There is one caveat to all this: my major requirement was to move the cookbooks, not the nodes. However, accounting for existing nodes is very easy and requires two steps. First, when knife backup export is executed, use the node option to get the node information and the certs. Second, point each node at the new chef server, as sketched below.
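
Pointing a node at the new server is just an edit to its client.rb plus dropping in the new validation key; a sketch with placeholder values:

# /etc/chef/client.rb on each node (server URL is a placeholder)
chef_server_url        "https://newchefserver.example.com"
validation_client_name "chef-validator"
# also copy the new server's validation.pem into /etc/chef before the next chef-client run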

Conclusion

The team I work with is going to develop this method further because it makes the cookbooks the most valuable piece of information, not the chef server itself. In the world of disposable computing this is the way it should be, allowing for rapid expansion and recovery.

Defending Against SSH Brute Force Attacks

Just Trying to Host a Website

So here I am trying to host a personal website, having figured out a little bit about amazon in 2010. After a month or two of poking around and figuring out how to get the AMI I want running, everything looks fine. I can now self-host all the pictures and videos of cats I'm willing to pay for in S3 buckets. At pennies per gigabyte this is a lot of cat video and I am very pleased.

Little Website, Little Website Let Me In

Like all good, or at least paranoid, admins I regularly troll all the logs on the systems I have running. I see the "normal" strange web requests on my apache server, but I keep that pretty up to date so I'm not concerned. After looking around for a while I see failed SSH logon attempts all over the place from ip addresses I don't recognize. The big bad wolf is at my door. After consulting with a colleague at work I learn about fail2ban. This is a unix daemon that watches for events in logs and then, via iptables, bans the ip address that causes certain log entries for a set amount of time. Fail2ban also emails me when this happens so I can keep track. SSH is the only service I have issues with since the instance is locked down. I don't like to white list ip addresses for a cloud VM: I don't have a VPN into amazon set up, and the ip address I administer from often changes thanks to my ISP and the administration scripts I run from my mobile phone. Fail2ban seems like the perfect solution. I use it to protect my AWS servers and my machines at home which have limited external access.
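
A minimal SSH jail looks something like the following (a sketch; the log path, action names, and email address are assumptions for my setup, and the bantime matches the three hour cool down mentioned below):

[ssh-iptables]
enabled  = true
filter   = sshd
action   = iptables[name=SSH, port=ssh, protocol=tcp]
           sendmail-whois[name=SSH, dest=me@example.com]
logpath  = /var/log/secure
maxretry = 5
bantime  = 10800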

Nothing Lasts Forever

The fail2ban solution worked for over three years. Occasionally I would get a persistent brute force attack, but on average I was banning about 6-7 ip addresses per day and they wouldn't be back after the fail2ban cool down period of 3 hours. Then one day in 2014 I started banning hundreds of different ip addresses per day. My inbox quickly went to 999+ unread messages.

In short, it looks like bot nets with many IPs are attacking the average person, trying to brute force a password or using one of the many SSH daemon exploits out there to gain access. As a knee jerk reaction to stem the flow and give me time to think, I give in and white list the ip addresses I log in through: the amazon web console and my home ISP router. I hate this solution, mainly because it seems like lazy thinking and I HATE MAINTAINING WHITE LISTS.

What Next (Knock Knock)

What concerns me most is the lack of access to my home servers. I'm a software consultant and access to my home network has thrown many a project a lifeline. It's almost impossible to know your external ip address from a client's site the night before if you've never been there. I remembered a patch set that was submitted to gentoo linux a few years ago. It was an experimental change to the SSH daemon that allowed something called port knocking. The SSH server would appear to be down to the casual observer. When you wanted to connect you would reach out to your server on predefined ports in a predefined order. This communication, called knocking, is one way: you send the packets, and it appears they are discarded. However, if you used the right combination and order to knock, ssh would then start listening on port 22 and you could log in.

The method mentioned above was a specialized case for SSH. In the intervening years a more general solution was created called knockd. This daemon will enable port knocking for any listening daemon.

The Setup

First make sure you have console access to the machine, or in the case of AWS, don't save the rules until you are sure that they work so a reboot can get you back in. The easiest way for me to set this up was to SSH into the machine I wanted to configure. I installed knockd unconfigured and went through the pre-setup checklist. This included setting up iptables to deny all incoming connections while allowing established connections to be maintained. If your rules are set correctly then your existing ssh connection should still be active, but no new ssh connections can be established. For example:

iptables -I INPUT -p tcp -s 0/0 --dport ssh -m state --state ESTABLISHED -j ACCEPT
iptables -A INPUT -s 0/0 -j DROP

If you are better at iptables foo than I am then you can set the default policy to deny instead of using the deny all rule; it's neater and I'll switch to it as soon as I clean up some strangeness in my existing iptables setup. With the firewall in place, my /etc/knockd.conf looks like this:

[options]
 logfile = /var/log/knockd.log
[openSSH]
 sequence = 7324,4566,7446,4324
 seq_timeout = 5
 command = /sbin/iptables -I INPUT 1 -s %IP% -p tcp --dport 22 -j ACCEPT
 tcpflags = syn
 cmd_timeout = 10
 stop_command = /sbin/iptables -D INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
[closeSSH]
 sequence = 9999,3333,7123,6467
 seq_timeout = 5
 command = /sbin/iptables -D INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
 tcpflags = syn

The above will open the ssh port to a specific ip address when you hit ports 7324, 4566, 7446, and 4324 in order, and hard close it when you hit ports 9999, 3333, 7123, and 6467. My only suggestion here is that you pick ports > 1024 and < 65K. After this, start the knockd daemon.

Testing It

Tail the knockd log and now try to open up the ssh daemon port using nc as follows:

nc -z myserver.com 7324 4566 7446 4324
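
If your version of nc does not take a list of ports in order, a small loop does the same knock (the 1 second timeout per port is an assumption):

for p in 7324 4566 7446 4324; do nc -z -w 1 myserver.com $p; done
ssh myserver.com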

You should see the following in the log for the open:

[2014-12-28 15:57] myclientip: openSSH: Stage 1
[2014-12-28 15:57] myclientip: openSSH: Stage 2
[2014-12-28 15:57] myclientip: openSSH: Stage 3
[2014-12-28 15:57] myclientip: openSSH: Stage 4
[2014-12-28 15:57] myclientip: openSSH: OPEN SESAME
[2014-12-28 15:57] openSSH: running command: /sbin/iptables -I INPUT 1 -s myclientip -p tcp --dport 22 -j ACCEPT

You should see the following in the log for a manual close or a timeout:

[2014-12-28 15:46] myclientip: openSSH: command timeout
[2014-12-28 15:46] openSSH: running command: /sbin/iptables -D INPUT -s myclientip -p tcp --dport 22 -j ACCEPT

Make sure you see the timeout close the connection, or check iptables every once in a while to ensure you aren't leaving past IPs open due to a misconfiguration.

Conclusion

I keep fail2ban running just in case, but the number of strange ssh login attempts has dropped to 0. The technique, for now, is very effective. Eventually I'm sure I'll need a new defense, but I'm hoping for the next three or four years I'm good.


DevOps/CI and the T word

New Sheriff in Town

Having hopped on the CI/CD train early, most of my work started out as introducing a client to CI/CD and standing up an initial implementation. Later, as DevOps and CI/CD matured, it became not uncommon to have to take over a failed or barely limping CI/CD implementation. The T word the title refers to is TRANSITION. This article discusses some of the pitfalls and gotchas of inheriting someone's CI/CD infrastructure. Why must we deal with some other person's mistakes? Typically, after having spent X dollars, where X may be large, it may be politically infeasible to ditch the current bad way of doing things. For this scenario I recommend taking a more adaptive approach to resolving the situation. So congratulations, you've won a new CI/CD contract, you are the new sheriff in town, and you are stuck with a mess. What do you do now?

The Handoff

Unless things have gone really badly there will be a handoff from the old DevOps team. This handoff usually lasts days or weeks. Depending on the size and complexity of the implementation, you need to ask your questions carefully since this is usually not enough time to glean everything from the incumbents. Here I will give two pieces of advice: do not criticize the outgoing team's implementation in any way, and prioritize learning the path of how the application gets from code to production. The rationale behind the first item is simple: you need the information the outgoing team has, and criticizing their implementation makes them less likely to willingly give you the information you need. Remember, the shortcomings may not be their fault; the previous team may have been limited on time or resources, or any number of other factors, and opening old wounds will not encourage them to share what they know with you.

Path to Production

Ferreting out this information is important for many reasons, and it will have to be done no matter what. Having the old team guide you through the process is much easier than reverse engineering it. What exactly is the path to production? It is different for every organization, but the fundamentals are the same. Source code, through some process, is transferred to a production machine. In very immature organizations this can be as simple as a file copy from the developer's machine. In more advanced CI implementations there may be gates such as static analysis, automated testing, and binary packaging. Whatever the process is, make sure the outgoing team walks you through it step by step before they go. This is the perfect opportunity to create a wiki to document the process.

Owning the Process, Warts and All

The first step to fixing a problem is understanding it. You have been brought in to correct current CI/CD deficiencies, but at this point you only know them second hand. After you get as much information as possible from the outgoing team and learn the path to production, it is time to put all your wikis and notes to the test. What has worked for me in the past is working with project management to have you or your team introduce a low impact change, or story in the agile parlance, into the production environment. In the ideal scenario you have the expertise of the previous team standing at the ready, but you make every attempt to introduce the change by yourselves. After walking through a small release, and possibly making a few mistakes, two things will happen. First, you will have gained the confidence to keep the current process moving. Not stopping forward progress can be very important for client credibility, as well as for reducing the overall cost of your introduction to the project. Second, you will learn first hand what all the flaws are.

Manual Steps, One of the Warts

True CI/CD is supposed to be highly automated. Unless you are Netflix or Google it is highly unlikely there is true full automation. Usually there is some minor but critical set of tasks, not completely obvious, that are manual. One of the advantages of owning the process is exposing these little warts as quickly as possible. But owning the process will not expose all of them. Since it only focuses on the path to production, many other IT processes are missed: upgrading the platform application stack, OS upgrades, hardware replacement. While you should focus on the path to production first, see if the outgoing team can address these items in some way before they leave. At best there is a script they forgot to tell you about originally; at worst you have a roadmap for future work.

Changing the Tires at 90mph

By this point you or your team should have a minimal idea of how the old CI/CD infrastructure works. The next step is incremental improvement. This approach fits very well with the agile development philosophy. Internally it is time to decide what to change first, hopefully for the better. My advice is to pick several items that are visible to both management and the CI/CD consumers (developers and testers), and to start small. This is where maintaining the CI/CD pipeline sets us apart from normal developers: just because our services run in dev doesn't mean they have dev SLAs. Even though there may be major pain points in the CI infrastructure, picking a low risk item and doing it right is more beneficial than picking a high risk, high reward change and having a troubled roll out. This technique is similar to Owning the Process, except now you are dealing with the internals of the CI infrastructure, not just rolling code to production.

Conclusions

Since your time and interactions with the outgoing team may be limited, be choosy about the questions you ask during transition. First, figure out how code gets to production. Second, figure out how to own that process. Third, find all the manual steps and understand why they are manual. Finally, when you understand what the system is doing, try to effect change for the better by starting small.

Backup – Getting data into Glacier

Backups

Off site backups are an often talked about and rarely done well item for small to medium enterprises. This is usually due to the cost of an offsite storage facility, the complexity of backup software, and operational costs. However, offsite backups are critical to keep an enterprise running if an unfortunate event hits the physical computing facility. Replacement hardware is often just a trip to BestBuy or the Apple store or insert-your-favorite-chain-here, but without your data what good is it? AWS offers a low cost solution to this problem with only a little bit of java and shell script knowledge. This post is split into two parts, getting the data in and getting the data out. Here we will deal with getting your data safely into glacier.

Glacier – What it is and is not

Glacier is a cheap service specifically for bulk uploads and downloads where timeliness is not of the essence. Inventories of your vaults are taken every 24 hours, and it can take hours to get an archive back. To save yourself grief, do not think of glacier as disk space, but rather as a robot that can get you your data from a warehouse. For disk space, S3 and EBS are what you want in the amazon world.

Decide what to back up

This part is usually unnecessarily difficult. The end product is a list of directories that will be fed to a shell script. You do not need everything. Typically you just need user data, transient data, and OS configs. Think of whatever the delta is from a stock OS image to what you have now… back that delta up. If you want to back up a database, make sure you are using a dump and not relying on the file system to be consistent.
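
For example, for MySQL something like the following dumped into the staging area is much safer than tarring up the live data directory (the $BACKUPDIR staging variable and paths here are assumptions borrowed from the script below):

# Dump all databases into the staging area instead of copying /var/lib/mysql
mysqldump --all-databases --single-transaction | xz -9 > "$BACKUPDIR/mysql-all-$(date +%m%d%Y).sql.xz"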

Stage/Chunk it

Here is a script that can do it:
 
#!/bin/bash
# Number of parallel workers = CPU count + 1
CPUS=$(cat /proc/cpuinfo | grep processor | wc -l)
DATE=$(date +%m%d%Y-%H%M%S)
RETENTION=2
BACKUP_PREFIX="/home/nfs_export/backups"
BACKUPDIR="$BACKUP_PREFIX/$HOSTNAME/inflight/$DATE"
dirs=(/root /sbin /usr /var /opt /lib32 /lib64 /etc /bin /boot)

[ -d "$BACKUP_PREFIX/$HOSTNAME/complete" ] || mkdir -p "$BACKUP_PREFIX/$HOSTNAME/complete"
mkdir -p "$BACKUPDIR"
mount /boot
let CPUS++

# Chunk size: total RAM in bytes, divided across the workers, halved for headroom
SPLITSIZE=$(cat /proc/meminfo | grep MemTotal: | sed -e 's/MemTotal:[^0-9]\+\([0-9]\+\).*/\1/g')
SPLITSIZE=$((SPLITSIZE*1024))
SPLITSIZE=$((SPLITSIZE/CPUS))
SPLITSIZE=$((SPLITSIZE/2))

# Tar each directory and split the stream into chunks in the staging area
for dir in "${dirs[@]}"
do
  TMP=$(echo "$dir" | sed -e 's/\//_/g')
  echo "(cd $BACKUPDIR; tar cvfp - $dir | split -b $SPLITSIZE - backup_$TMP)"
  (cd "$BACKUPDIR"; tar cvfp - "$dir" | split -b "$SPLITSIZE" - "backup_$TMP")
done

# Compress all the chunks in parallel, one xz per worker
echo "(cd $BACKUPDIR; find . -type f | xargs -n 1 -P $CPUS xz -9 -e)"
(cd "$BACKUPDIR"; find . -type f | xargs -n 1 -P "$CPUS" xz -9 -e)
(cd "$BACKUP_PREFIX/$HOSTNAME/inflight"; mv "$DATE" ../complete)
umount /boot

# Expire completed backups beyond the retention count, newest first
i=0
for completedir in $(cd "$BACKUP_PREFIX/$HOSTNAME/complete"; ls -c)
do
  echo "$completedir $i $RETENTION"
  if [ "$i" -gt "$RETENTION" ] ; then
    echo "(cd $BACKUP_PREFIX/$HOSTNAME/complete; rm -rf $completedir)"
    (cd "$BACKUP_PREFIX/$HOSTNAME/complete"; rm -rf "$completedir")
  fi
  let i++
done

This script makes a lot of decisions for you; all you really need to decide is where you are staging your data and which directories you are backing up. It will create a multi part archive that is pretty efficient to produce on the hardware it runs on.

Speaking of staging, you will need formatted disk space to hold the archive while it is being sent up to AWS. Typically you want to be able to hold a week's worth of backups on the partition. Why a week? This is breathing room to solve any backup issues without losing continuity of your backups.

Possibly encrypt it

Take a look at the shell script. If you are extra paranoid, use gpg to encrypt each archive piece AFTER it has been compressed. Due to the nature of encryption, you will negate the ability of the compression algorithm to work if you encrypt ahead of time.
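
A sketch of that extra step, run against the staging directory after the xz pass (the passphrase file path is an assumption; gpg2 may additionally need --pinentry-mode loopback):

# Symmetric-encrypt every compressed chunk in parallel
(cd "$BACKUPDIR"; find . -type f -name 'backup_*' -print0 | \
  xargs -0 -n 1 -P "$CPUS" gpg --batch --passphrase-file /root/.backup_passphrase -c)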

Get it into Glacier

First we need to create a vault to put our stuff in. Log into your AWS console and go to the glacier management page. Select the create a new vault option and remember the name of the vault. Then go to IAM and create a backup user, remembering to record the access and secret keys. Now we are ready to start.

This is a custom java program snippet based on the AWS sample code:

for (String s : args) {
    try {
        ArchiveTransferManager atm = new ArchiveTransferManager(client, credentials);
        UploadResult result = atm.upload(vaultName, "my archive " + (new Date()), new File(s));
        System.out.println(s + " Archive ID: " + result.getArchiveId());
    } catch (Exception e) {
        System.err.println("Failed to upload " + s + ": " + e.getMessage());
    }
}

The complete code and pom file to build it are included in the git repository. The pom should compile this program into a single jar that can be executed with java -jar myjar.jar. First we need to configure the program. Create a properties file named AwsCredentials.properties in the directory you will be running the java command from, containing your secret key, access key, vault name, and aws endpoint. It should look like this:

 
 #Insert your AWS Credentials
 secretKey=***mysecretkey***
 accessKey=***myaccesskey***
 vaultname=pictures
#for the endpoint select your region
 endpointname=https://glacier.us-east-1.amazonaws.com/



Lastly feed the java program a list of the archive files, perhaps like this:

cd stage; ls *| xargs java -jar glacier-0.0.1-SNAPSHOT.one-jar.jar



This will get all the files in the staging directory up to your vault. In 24 hours you will see an inventory report with the size and number of uploads. Remember to save the output; the archive IDs are used to retrieve the data later. Some amount of internal bookkeeping will need to be done to keep the data organized. Amazon provides a safe place for the data, not an easy way to index or find things.
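
The laziest bookkeeping that works is simply teeing the upload output into a dated log, something like:

cd stage; ls * | xargs java -jar glacier-0.0.1-SNAPSHOT.one-jar.jar | tee -a ~/glacier-archive-ids-$(date +%m%d%Y).log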

To be Continued….

Next post… getting it all back after said Armageddon.

Code

All code is available, excuse the bad SSL cert:

https://coveros.secureci.com/git/glenn_glacier.git

DevOps, Opensource, CI, and the Cloud

CI is a Reality and not Just for Google and Facebook

I've worked for many recognizable names and they all have time and energy invested in a software process that only shows results once a quarter. Hundreds of thousands or millions of dollars, and you can only see the results once every 3 months… maybe. Each time, the releases have been painful, long, and fraught with errors and frustration. To alleviate the pain of deployment, Continuous Integration and Continuous Delivery are the answer. Whenever the idea of a faster release schedule is introduced, usually by a business actor, most IT organizations push back. Typically the reasons are "What we did worked last time so we don't want to change", "The tools are too expensive", and "We are not google or facebook". As I will demonstrate, the second argument is no longer valid. Using a set of free open source tools, Continuous Integration and Continuous Delivery are easily achievable with some time and effort.

Example

The linked presentation and video example are a demonstration of open source CI.

MS PowerPoint (DevOps_Opensource_CI_and_the_Cloud.pptx)

LibreOffice Impress (DevOps_Opensource_CI_and_the_Cloud.odp)

Youtube (https://www.youtube.com/watch?v=gIxCcJAl86M)

MP4 (https://s3.amazonaws.com/aws_wordpress/Coveros+Puppet+and+CI-voiceover.mp4)

The Scenario

In order to backup my statement I needed to prove it out with a non-trivial example. For the purposes of this demonstration I chose to show changes in a drupal site. The work flow could be characterized as follows:

  1. Developer creates tests
  2. Developer creates code
  3. Developer commits code and tests
  4. CI Framework detects changes
  5. CI Framework stands up empty VMs and databases
  6. CI Framework pushes down data, code, and configuration to the blank VMs
  7. CI Framework runs automated tests on the site and returns results

This process is completely hands off from the developer commit to the test results. No ops group needs to be contacted, no changes need to be filed, no e-mails need to be sent to the QA group alerting them to changes. The drupal instance is multi-box, with a separate DB and webserver; the example is not just an out of the box apache web server with the "it works" page.

The Tool Set

The software you use will vary from implementation to implementation; the software listed in this section is just what was needed for the demo. One size does not fit all, however, with the enormous number of open source projects available there are basic tools for almost any need. Jenkins and Puppet are really the centerpieces in this particular demonstration. Jenkins is the control point: it executes the jobs and coordinates everything. Puppet is the CM database holding the information necessary to configure a VM and each application on each VM. PERL and sed are used for text manipulation, taking the output from the EC2 API or the puppet REST interface and turning it into a usable format for Jenkins. EC2 itself is the infrastructure and elastic capacity for testing.
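
To give a flavor of that glue, here is a sketch of the kind of one-liner involved, using the modern aws CLI in place of the old EC2 API tools and a made-up instance tag:

# Pull the public IP of the freshly provisioned test VM and hand it to the next Jenkins build step
IP=$(aws ec2 describe-instances \
      --filters "Name=tag:Name,Values=ci-drupal-web" "Name=instance-state-name,Values=running" \
      --query "Reservations[].Instances[].PublicIpAddress" --output text)
echo "TEST_HOST=$IP" > jenkins.properties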

Path to CI and DevOps

Two words: "small steps". Depending on the development organization you are working in or with, trying to do this all at once is risky. It is easier to lay out a road map of implementing each individual technology and then integrating them. Start with OS packaging. Even if your deployment process is manual, developers and ops people can usually latch on to this idea. The two most challenging pieces are typically automated testing and hands off deployments. One requires a change in the way the testing organization works, and the other may require a change to the policy and personnel of the operations group. However, as more and more CI pieces get implemented, organizationally these changes will make more sense. In the end, the inertia built up by increased efficiency and better releases may be the best leverage.

Conclusion

The three arguments against CI are typically cost, "what we did worked last time", and "while it works for other organizations it can't work for us". From the demo you can see that CI can work and can be implemented with open source tools, negating the cost argument. While you have to invest in some (not a lot of) engineering effort, the cost of the tools themselves should be close to zero. Secondly, if your release cycle is more than 2-3 weeks from written idea to implementation, your process does not work. There may be various IT heroes who expend monumental effort to make the product function, but your development process does not work in a healthy sense, negating the "what we do works" argument. Lastly is the argument that since "we" don't have the resources of google or facebook we can't possibly adopt a technology this complicated. The demo above was done by one person in about a week's worth of effort. The tools are available and they use standard technologies. All the pieces exist and are well tested, so you don't need an engineering staff the size of google to implement CI, refuting some of the most common arguments against CI and DevOps.

broadcom-sta and Gentoo

I have a dell E6520 and I have become quite attached to my wireless. Recently I upgraded my kernel to the 3.6 series and things went almost flawlessly, except for the wireless driver. The wl.ko kernel module barfed in kernel land and did very very bad things. I was seeing something similar to this in the logs:


general protection fault: 0000 [#1] PREEMPT SMP
Modules linked in: nls_cp437 vfat fat usb_storage uas bbswitch(O) uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev media tg3 libphy mei i2c_i801 lib80211_crypt_tkip wl(PO) cfg80211 lib80211 microcode acer_wmi sparse_keymap rfkill mxm_wmi pcspkr wmi ghash_clmulni_intel cryptd kvm_intel snd_hda_codec_hdmi snd_hda_codec_realtek kvm coretemp iTCO_wdt iTCO_vendor_support crc32c_intel snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_page_alloc snd_timer snd serio_raw sdhci_pci sdhci mmc_core soundcore lpc_ich psmouse joydev evdev battery ac acpi_cpufreq mperf processor ext4 crc16 jbd2 mbcache hid_generic hid_logitech_dj usbhid hid sr_mod sd_mod cdrom ahci libahci libata ehci_hcd scsi_mod usbcore usb_common i915 video button i2c_algo_bit drm_kms_helper drm i2c_core intel_agp
intel_gtt [last unloaded: nvidia]
NetworkManager[1733]: wpa_supplicant stopped
NetworkManager[1733]: (wlan0): supplicant interface state: inactive -> down
NetworkManager[1733]: (wlan0): device state change: disconnected -> unavailable (reason 'supplicant-failed') [30 20 10]
NetworkManager[1733]: (wlan0): deactivating device (reason 'supplicant-failed') [10]
systemd[1]: wpa_supplicant.service: main process exited, code=killed, status=11
kernel: CPU 0
kernel: Pid: 1737, comm: wpa_supplicant Tainted: P O 3.6.2-1-ck #1 Acer Aspire 5750G/JE50_HR
kernel: RIP: 0010:[] [] wl_cfg80211_scan+0x8c/0x480 [wl]
kernel: RSP: 0018:ffff880159eb5978 EFLAGS: 00010202
kernel: RAX: ffffffffa085f290 RBX: ffff8801580cd200 RCX: ffff8801580cd200
kernel: RDX: ffff8801580cd200 RSI: ffff88013c912000 RDI: ffff8801580cd200
kernel: RBP: ffff880159eb59b8 R08: 00000000000162c0 R09: 000000000000007c
kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0084161c00000001
kernel: R13: ffff88013c912000 R14: ffff88013c912000 R15: 0000000000000000
kernel: FS: 00007f37e3427700(0000) GS:ffff88015fa00000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000001bdb6e8 CR3: 000000013c839000 CR4: 00000000000407f0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
kernel: Process wpa_supplicant (pid: 1737, threadinfo ffff880159eb4000, task ffff88015823a080)
kernel: Stack:
kernel: ffff880159eb5a18 ffffffffa03fd5d8 ffff880159eb59c8 ffff880159eb5a38
kernel: ffff8801580cd000 0000000000000001 ffff88013c912000 0000000000000000
kernel: ffff880159eb5a18 ffffffffa04022f5 000000000000007c 0000000000000004
kernel: Call Trace:
kernel: [] ? nl80211_pre_doit+0x318/0x3f0 [cfg80211]
kernel: [] nl80211_trigger_scan+0x485/0x610 [cfg80211]
kernel: [] genl_rcv_msg+0x298/0x2d0
kernel: [] ? genl_rcv+0x40/0x40
kernel: [] netlink_rcv_skb+0xa1/0xb0
kernel: [] genl_rcv+0x25/0x40
kernel: [] netlink_unicast+0x19d/0x220
kernel: [] netlink_sendmsg+0x30a/0x390
kernel: [] sock_sendmsg+0xda/0xf0
kernel: [] ? find_get_page+0x60/0x90
kernel: [] ? filemap_fault+0x87/0x440
kernel: [] __sys_sendmsg+0x371/0x380
kernel: [] ? handle_mm_fault+0x249/0x310
kernel: [] ? do_page_fault+0x2c4/0x580
kernel: [] ? restore_i387_xstate+0x1af/0x260
kernel: [] sys_sendmsg+0x49/0x90
kernel: [] system_call_fastpath+0x1a/0x1f
kernel: Code: 8b 6d e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 66 90 48 8b 86 48 02 00 00 48 85 c0 0f 84 6c 03 00 00 4c 8b 20 4d 85 e4 0f 84 2f 03 00 00 8b 84 24 a8 0a 00 00 4d 8b b4 24 48 06 00 00 a8 02 75 60 49
kernel: RIP [] wl_cfg80211_scan+0x8c/0x480 [wl]
kernel: RSP
kernel: ---[ end trace 62b60f7a71b18301 ]---

The solution I implemented was taken from the gentoo bugtracker and forums. Below is the reference information:
http://forums.gentoo.org/viewtopic-t-939648-start-0.html
https://bugs.gentoo.org/show_bug.cgi?id=437898
and the solution here:
https://437898.bugs.gentoo.org/attachment.cgi?id=326502

At a high level I created a local overlay, created my own broadcom-sta ebuild and re-emerged the driver.

This is how I implemented the path step by step:

  • vi /etc/make.conf

add:
PORTDIR_OVERLAY="
/var/lib/localoverlay
$PORTDIR_OVERLAY
"

  • mkdir -p /var/lib/localoverlay/net-wireless/broadcom-sta
  • cd /var/lib/localoverlay/net-wireless/broadcom-sta
  • cp /usr/portage/net-wireless/broadcom-sta/broadcom-sta-5.100.82.112-r2.ebuild /var/lib/localoverlay/net-wireless/broadcom-sta/broadcom-sta-5.100.82.112-r3.ebuild
  • vi /var/lib/localoverlay/net-wireless/broadcom-sta/broadcom-sta-5.100.82.112-r3.ebuild
  • I manually added the patch at https://437898.bugs.gentoo.org/attachment.cgi?id=326502.
  • emerge -v broadcom-sta

– done

Enjoy!

Attack of the Zombie ssh client

Passively sitting and watching my logs I notice the following repeated thousands of times:

Dec 4 08:12:59 apache sshd[15823]: SSH: Server;Ltype: Version;Remote: <someip>-34052;Protocol: 2.0;Client: libssh-0.1
Dec 4 08:12:59 apache sshd[15823]: SSH: Server;Ltype: Kex;Remote: <someip>-34052;Enc: aes128-cbc;MAC: hmac-sha1;Comp: none [preauth]
Dec 4 08:12:59 apache sshd[15823]: SSH: Server;Ltype: Authname;Remote: <someip>-34052;Name: root [preauth]
Dec 4 08:12:59 apache sshd[15823]: Received disconnect from <someip>: 11: Bye Bye [preauth]

Seeing a thousand of anything in my log file besides cron is disconcerting since I run fail2ban. After some research I found the following article: http://taint.org/2008/05/16/165301a.html

According to the article this is insidious because the attack doesn't log a failure. It's trying to break the host ssh key, so it aborts mid transaction. Rather than subject myself to this, I figured I could add a fail2ban rule and block further attempts. In my /etc/fail2ban/filter.d/sshd.conf file I added the following line to the failregex:

^%(__prefix_line)sReceived disconnect from <HOST>: 11: Bye Bye \[preauth\]\s*$

It's not perfect but it does what I want. The down side is that if you log out legitimately more times than your fail2ban tolerance allows within the watch period, you will be banned. I'm ok with that limitation. One more attack down… ugh.


Linux 3.6 and EC2

EC2 is xen based. For the most part people are using ancient kernels that have been back patched all to wazoo, 2.6.18 and 2.6.28 being really popular. With my laptop sitting in the 3.5 series I was hoping to get my cloud VMs somewhere in that range as well. I know that xen went native in the kernel in the 3 series, so it is possible.

I lined up my caffeine and started to prepare for a long series of failed boots as I tinkered with my kernel settings, moving my 2.6.28 config up into the 3.x series of kernels. Then, out of sheer whimsy, I decided to see what the gentoo overlays had in store. Let's see if someone else can do the heavy lifting. Sure enough, the sabayon-distro overlay is just what the doctor ordered. In it there is an ebuild for the 3.6 kernel with the appropriate config.

Since I mint my own EC2 images from scratch, I have a chroot environment on a server at home to build said image. Before you embark, this blog post assumes a large body of work has already been done. Specifically, that you are familiar with Gentoo, that you know how layman works in Gentoo, and lastly that you know how to gen up an EC2 gentoo image from thin air.


chroot <path to my ec2 image> /bin/bash
layman -a sabayon-distro
emerge -av ec2-sources
eselect kernel list
eselect kernel set <number next to linux-3.6.0-ec2>
cd /usr/src/linux
#this is a hack to make sure genkernel gets the right config
cp .config /usr/share/genkernel/arch/x86_64/kernel-config
genkernel --makeopts=-j<number of cpus +1> all
#in /boot kernel-genkernel-x86_64-3.6.0-ec2 initramfs-genkernel-x86_64-3.6.0-ec2 should now exist

There are two paths to follow now: you are either upgrading an existing system, or you are creating a new ami from scratch and uploading it. First I will cover the upgrade scenario.

Upgrade

Here is the tricky part: if you are upgrading from 2.6.x to the 3.x series of kernels, the device names for hard drives change. You have two options, go into the inner workings of udev and make drives show up as /dev/sd[a-z]+[0-9]+, or modify grub and fstab accordingly. I went with the latter. First, back up your instance's system drive by snapshotting the current running volume. This way you can get back to the original VM by creating a new AMI. Next, I copied the kernels up to my EC2 instance from my chroot work environment and placed them in /boot on the EC2 machine. Then I needed to move up the kernel modules.

In my chroot environment:

cd /lib/modules
tar cvf modules.tar 3.6.0-ec2
scp modules.tar root@myec2instance:/lib/modules

SSH to the EC2 instance:

cd /lib/modules
tar xvf modules.tar

I then changed all the /dev/sda[0-9] entries in /etc/fstab to /dev/xvda1 and made the same change to /boot/grub/menu.lst:

kernel /boot/kernel-genkernel-x86_64-3.6.0-ec2 root=/dev/sda1 rootfstype=ext3

changed to:

kernel /boot/kernel-genkernel-x86_64-3.6.0-ec2 root=/dev/xvda1 rootfstype=ext3
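
The fstab side of the same change can be scripted; a sketch, assuming a single root entry on /dev/sda1:

cp /etc/fstab /etc/fstab.bak
sed -i 's|^/dev/sda1|/dev/xvda1|' /etc/fstab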

Next you reboot.  If all goes well delete your backup snapshot.

New AMI

If you want a new AMI, just make the changes to /etc/fstab and /boot/grub/menu.lst in your chroot environment and follow the procedure for beaming it all up there from one of the EC2 custom image guides mentioned above.

In Closing

Not only do you have a way to upgrade to 3.6, but now you can make continual kernel tweaks to your EC2 instance in a relatively safe manner.


Random Enough Passwords

OK, as a side duty to many of the roles I fill, I wind up installing and administering countless small apps, vms, and physical machines. I don't want a system I created to be hacked because 1234 was a secure enough password. One side effect of this is that I must now use a password manager and back it up. God help me if I lose that file. Additionally, this has caused my internal entropy for generating passwords to drop to 0. In other words, I'm tired of thinking of random passwords. Thanks to an article here: http://blog.colovirt.com/2009/01/07/linux-generating-strong-passwords-using-randomurandom/ , now I don't have to. To sum it up:


#!/bin/bash
PASSLEN=5
cat /dev/random| tr -dc 'a-zA-Z0-9-_!@#$%^&*()_+{}|:<>?='|fold -w $PASSLEN| head -n 4| grep -i '[!@#$%^&*()_+{}|:<>?=]'

This script will produce passwords of length 5 and give you 4 different passwords to choose from. It will generate a really random set of passwords, but you must generate entropy by using your system. If this is too slow for you, use /dev/urandom instead, since /dev/random blocks until things are random enough. This is not statistically perfect, so don't use it for anything requiring true randomness. If you don't know what /dev/random or /dev/urandom are, this is not the post for you.

-Glenn

svn L&P testing in 10 lines or less

I needed a quantifiable test that can measure svn performance during a checkout. This script takes 2 arguments: the number of checkouts and the parallelism. For example, if I want to run 100 checkouts 2 at a time, ./load.sh 100 2, or 100 checkouts 50 at a time, ./load.sh 100 50.


#!/bin/bash
i=0;
url="http://mysvnrepo"
while [ $i -lt $1 ] ; do mkdir $i; let i=$i+1; done
DATE=`date +%m%d%y%H%M%S`;
find -type d ! -name . -maxdepth 1 2> /dev/null | sed "s/\.\///g" | xargs -I'{}' -P$2 time -o {}/time.dat svn co $url {}
find -iname time.dat -exec cat {} >> total_$1_$2_$DATE.dat \;
cat total_$1_$2_$DATE.dat | grep -v swaps | sed "s/user /\t/g" | sed "s/system /\t/g" | sed "s/elapsed.*//g" | sort -n > res_$1_$2_$DATE.csv

i=0;
while [ $i -lt $1 ] ; do rm -rf $i; let i=$i+1; done


The results are recorded in a file named with both test parameters and the date. A little bit of sed magic and you can create a csv which will make pretty graphs in excel or libre office calc. Enjoy.
-Glenn
