Problems Only Tyler Has

Coworkers of mine have told me that I run into some of the weirdest problems they have ever heard of.  They also suggested that I put them online and blog about them so people can find solutions to these problems ... if anyone else in the world even has them.  Let's see if anyone else really has software-related issues like mine.

gpg --recv-keys not working with CentOS 6 and hkps

posted Feb 26, 2016, 7:32 AM by Tyler Akins

I have a need at work to import gpg keys automatically from keyservers.  To make sure I trust the keys, I fully intended to use hkps keyservers.  To that end I found these:
  • hkps://keyserver.ubuntu.com
  • hkps://hkps.pool.sks-keyservers.net (custom CA cert)
  • hkps://zimmermann.mayfirst.org (custom CA cert)
So far so good.  I installed the two custom CA certs and can use curl to hit https://SERVER/pks/lookup?search=0x8F3B8C432F4393BD to make sure each server works.  For those who do not know, hkps is the "HTTP Keyserver Protocol" served over TLS, as defined in this proposal.  That means I can hit the URLs directly with any HTTPS client, which thankfully soon became quite necessary.
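For the two servers with custom CA certs, the curl check just needs a --cacert pointing at wherever you saved their certificate (the path below is only an example):

curl --cacert /etc/pki/tls/certs/sks-keyservers-ca.pem "https://hkps.pool.sks-keyservers.net/pks/lookup?search=0x8F3B8C432F4393BD&op=get"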

Time to get the GPG command working.  Here's what works on all of the systems I tried EXCEPT CentOS 6.  Debugging options (--verbose --keyserver-options debug) were added to get additional diagnostic information.

gpg --verbose --status-fd 1 --keyserver hkps://keyserver.ubuntu.com --keyserver-options debug --recv-key 8F3B8C432F4393BD

That typically works like a charm.  On CentOS 6, though, I get this error regardless of which server I use.

gpg: requesting key 2F4393BD from hkps server keyserver.ubuntu.com
gpgkeys: curl version = libcurl/7.19.7 NSS/3.19.1 Basic ECC zlib/1.2.3 libidn/1.18 libssh2/1.4.2
* About to connect() to keyserver.ubuntu.com port 443 (#0)
*   Trying 91.189.90.55... * connected
* Connected to keyserver.ubuntu.com (91.189.90.55) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: none
  CApath: none
* Certificate is signed by an untrusted issuer: 'CN=DigiCert SHA2 Secure Server CA,O=DigiCert Inc,C=US'
* NSS error -8172
* Closing connection #0
* Peer certificate cannot be authenticated with known CA certificates
gpgkeys: HTTP fetch error 60: Peer certificate cannot be authenticated with known CA certificates
gpg: no valid OpenPGP data found.
[GNUPG:] NODATA 1
gpg: Total number processed: 0
[GNUPG:] IMPORT_RES 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Wait a sec, that looks like a CA cert problem.  Yet I can curl the URL directly from CentOS 6, and I also used openssl to make the HTTP request manually.  Both work every time; only gpg fails.  The debug output above hints at why: gpg's curl helper reports "CAfile: none", so it apparently never loads the system CA bundle that the standalone tools use.  I did not dig deeper into the root cause because I was much more interested in having this fixed, and CentOS 6 is already quite ancient.  So, here's what I did as a workaround.

curl https://keyserver.ubuntu.com/pks/lookup?search=0x8F3B8C432F4393BD\&op=get | gpg --import --status-fd 1

And you can see here that it works perfectly fine:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3230  100  3230    0     0  18737      0 --:--:-- --:--:-- --:--:--  175k
gpg: /home/centos/.gnupg/trustdb.gpg: trustdb created
gpg: key 2F4393BD: public key "Tyler Akins <fidian@rumkin.com>" imported
[GNUPG:] IMPORTED 8F3B8C432F4393BD Tyler Akins <fidian@rumkin.com>
[GNUPG:] IMPORT_OK 1 164090D5B9551478BE7F25588F3B8C432F4393BD
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)
[GNUPG:] IMPORT_RES 1 0 1 1 0 0 0 0 0 0 0 0 0 0

Problem solved.
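If you need this fallback on more than one machine, a tiny wrapper keeps the call sites clean.  This is just a sketch using the key and server from this post; add curl's --cacert option if you point it at one of the servers with a custom CA.

#!/bin/bash
# Try gpg's keyserver support first; if the key still is not in the keyring
# afterward (as happens on CentOS 6), fall back to fetching it over HTTPS.
KEY="${1:-8F3B8C432F4393BD}"
SERVER="keyserver.ubuntu.com"

gpg --keyserver "hkps://$SERVER" --recv-key "$KEY" || true
if ! gpg --list-keys "$KEY" > /dev/null 2>&1; then
    curl -sSf "https://$SERVER/pks/lookup?search=0x$KEY&op=get" | gpg --import
fi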

dhclient not honoring prepend config

posted Nov 2, 2015, 5:52 AM by Tyler Akins   [ updated Nov 2, 2015, 5:54 AM ]

I have a CentOS 6.7 image running in AWS.  It reads /etc/dhcp/dhclient.conf - here is mine:

timeout 300;
retry 60;
prepend domain-name-servers 127.0.0.1;
prepend domain-search "node.consul";

Please note that I fixed the first two lines!  The stock version of this file did not have semicolons at the end of the timeout or retry lines.

The idea is that I would prefer to use the local dnsmasq before falling back to other domain name servers.  Sounds like a typical use case, right?  I can use "dhclient -r ; dhclient" to release and renew the DHCP lease and I see the entry in my /etc/resolv.conf exactly as I would expect.

nameserver 127.0.0.1
# other nameservers listed later in the file.

I believe that this works just fine and so I go ahead and reboot the box.  Just to be sure, I double check the /etc/resolv.conf file and find out, to my horror ... the line is missing!

Where did it go?  It totally worked before!  Running "dhclient -r ; dhclient" again puts the line back.  What's the deal?

It turns out that the version of dhclient that's installed (version 4.1.1-49.P1.el6.centos) is not properly setting $new_domain_name_servers for the REBOOT reason.  On reboot, dhclient will talk DHCP to a server and get a new lease.  It fires off /sbin/dhclient-script with $reason set to REBOOT.  When I use "dhclient -r ; dhclient" it does almost the same thing, but $reason is set to BOUND.

The strange thing is that the environment variables are different for those two calls.  For REBOOT the $new_domain_name_servers does not list 127.0.0.1 and for BOUND it does list 127.0.0.1.  It should always have 127.0.0.1 because we have the "prepend domain-name-servers 127.0.0.1" config set.

I tried taking a peek at the source code but did not invest enough time to determine the cause for this issue.  I mostly gave up for these reasons:
  1. It wasn't obvious when I browsed the source code.  Investigating would take hours of work and I would likely have to add many debug lines and trace the execution manually.
  2. I would have to update dhclient.  This change wouldn't really roll out to older machines, which is exactly where I need this fix.
  3. Newer machines are switching away from dhclient.  Instead, they use NetworkManager or other alternatives.
  4. There's a quick and easy workaround to make this act as expected.
Let's talk about that workaround.  On CentOS, the /sbin/dhclient-script script is built to be extended.  It looks for /etc/dhcp/dhclient-enter-hooks and, if that file exists and is flagged as executable, sources it, which means the hook can modify variables like $new_domain_name_servers directly.  So, omit the "prepend domain-name-servers 127.0.0.1" line in /etc/dhcp/dhclient.conf and instead create /etc/dhcp/dhclient-enter-hooks with the content below.

#!/bin/sh
# Prepend 127.0.0.1 to the list of name servers if new servers are being set.
if [ -n "$new_domain_name_servers" ]; then
    new_domain_name_servers="127.0.0.1 $new_domain_name_servers"
fi

A simple "chmod 0755 /etc/dhcp/dhclient-enter-hooks" and you're done.  This will always prepend 127.0.0.1 to your list of domain name servers.  The same method can work for all sorts of properties that dhclient is having difficulty honoring.

Problem solved.

It is possible that this isn't a "Problem Only Tyler Has".  Here are a few people who may have hit the same issue as me, or a related one.  They didn't solve it the same way I did, and I didn't investigate their problems far enough to determine whether they were really experiencing the same thing I was.

IE8 <div> Height Changing

posted Oct 2, 2012, 1:46 PM by Tyler Akins   [ updated Oct 2, 2012, 2:48 PM ]

This was a problem that stumped me for quite some time.  I'm working to create a pagination plugin where you have a single parent <div class="results"> that contains several tile <div class="tile"> elements.  Basically, the structure looks a little like this:

<div class="results" style="overflow: hidden; position: relative">
    <div class="resultsWrapper">
        <div class="tile">Result # 1</div>
        <div class="tile">Result # 2</div>
        ...
        <div class="tile">Result # 100</div>
    </div>
</div>

I add some styles to div.results to make it only show a few tiles at a time.  Because the tiles can have a variable height, I use jQuery to calculate this:

// Error detection and bounds checking removed for clarity
var page = 3;  // zero-based indexing
var perPage = 5;
var children = $('div.results').children().children();  // Get the tiles
// Use .eq() (not .get()) so we keep the jQuery wrapper and .position() exists
var firstChildTop = Math.floor(children.eq(0).position().top);
var firstVisibleTop = Math.floor(children.eq(page * perPage).position().top);
// Top of the first tile on the next page == bottom of the last visible tile
var lastVisibleBottom = Math.floor(children.eq((page + 1) * perPage).position().top);
// Shrink the window to this page and slide the wrapper up with a negative margin
$('div.results').animate({ height: lastVisibleBottom - firstVisibleTop });
$('div.results').children().animate({ marginTop: firstChildTop - firstVisibleTop });

Remember, this is just an example to help illustrate what I am trying to do.  You'll need quite a bit more code to make a working pager plugin for jQuery.  Anyway, this makes it appear as though there's a sliding series of div.tile elements moving to the "page" you are on.  With the "overflow: hidden" and the negative margin, div.results acts like a little window showing just a portion of the larger div.resultsWrapper that slides around to reveal what we need.

Except in IE8.  IE9 also misbehaves the same way when rendering in IE8 mode, but only sometimes.

The problem boils down to the heights of the elements.  When IE8 slides the div.resultsWrapper up, the div.tile elements forget their heights.  It's crazy, but you could have some JavaScript like this to show the heights:

var h = 'Heights: ';
$('div.tile').each(function () {
    h += ' ' + $(this).height();
});
console.log(h);

You'll see output like this when at the top of the list:

Heights:  212 197 197 202 212 207 ...

Now use a little jQuery magic to scroll down by setting a negative margin-top CSS property on div.resultsWrapper.  Let's say you scrolled down so just a little of the bottom of the fourth element is shown.  Move your mouse over the div.results element.  Now, run that JavaScript again that shows the heights.  I was seeing this:

Heights:  47 47 47 768 212 207 ...

The height of the first three shrunk to just the padding I had on div.tile and the fourth tile strangely sucked up most (but not exactly all) of the height that was missing.  You can move back to the top and the content is messed up until you mouse over div.results.  I set a global breakpoint and no JavaScript runs when I mouse over div.results, yet that's still when the heights changed.  After much trial and error, I found that the contents of the tiles were to blame.  Here's closer to what my tiles looked like, and I bet you'll start to get a feel for where the problem lies.

<div class="tile">
    <div class="productImage" style="float: left"><img src="..."></div>
    <div class="productDescription" style="float: left">This is result #1</div>
    <div class="clear" style="clear: both"></div>
</div>

My divs used "float: left" to position them inside the div.tile element properly.  The static layout works well everywhere and looks great even in IE8 and IE7 (I have no need to go lower); only the sliding behavior chokes, and only in IE8.  It must do something when the div.tile elements are above the visible area and it just doesn't keep them loaded or positioned properly.  This feels a lot like another type of "peekaboo bug" that has plagued IE with floats ever since they were introduced in that browser.

The fix:  Do not use float.  Yep, I tried several variations, but nothing ever worked with dynamically sizing content and floats.  In the end "float: left" was replaced with "display: inline-block" and it again looks perfect in all browsers.

WCF and gzip compression

posted Sep 25, 2012, 3:13 PM by Tyler Akins

I was helping to diagnose a problem where web requests to a WCF service were being troublesome.  The service always enabled compression on the output stream, whether or not the client asked for it.  Normally that is not a problem.  We were using PHP to make SOAP calls and tied that to PHP's curl library because we had some special requirements regarding request and response headers.

PHP's SOAP library (when fetching via the curl module) was saying that there was no response or that there were problems decompressing the stream.  Wget did not work.  The curl command-line tool worked.  Using a sniffer on the network showed me that data was coming across the wire.  When that data was written to disk, gzip would not decompress it but zcat would.

Everything worked like a charm when compression was disabled, but it was absolutely necessary that the compression was enabled and forced on in our production environment.

We analyzed the responses from the server more carefully and found that most of each response was random-ish looking data (as expected for compressed output), but perhaps the last third was NULL bytes or (even worse) XML from some other SOAP request.  It looked like we were leaking memory contents.  Very undesirable.

We obtained the source code at about the time I noticed that all response lengths were powers of 2: 256 bytes, 512 bytes, 1k, 2k, 4k, 8k.  Clearly we were sending back an entire allocated buffer, not just the compressed bytes.  Here's the affected code -- you may notice it looks a lot like many other copies of this code on the web.

//Helper method to compress an array of bytes
static ArraySegment<byte> CompressBuffer(ArraySegment<byte> buffer, BufferManager bufferManager, int messageOffset)
{
    MemoryStream memoryStream = new MemoryStream();
    memoryStream.Write(buffer.Array, 0, messageOffset);

    using (GZipStream gzStream = new GZipStream(memoryStream, CompressionMode.Compress, true))
    {
        gzStream.Write(buffer.Array, messageOffset, buffer.Count);
    }

    byte[] compressedBytes = memoryStream.ToArray();
    byte[] bufferedBytes = bufferManager.TakeBuffer(compressedBytes.Length);

    Array.Copy(compressedBytes, 0, bufferedBytes, 0, compressedBytes.Length);

    bufferManager.ReturnBuffer(buffer.Array);
    ArraySegment<byte> byteArray = new ArraySegment<byte>(bufferedBytes, messageOffset, bufferedBytes.Length - messageOffset);
    return byteArray;
}

This actually comes from one version of an example that Microsoft produced.  In our case we thought the culprit was Ionic.Zlib, but the above code uses System.IO.Compression.GZipStream, so it isn't related to the compression library; that part works like a charm.  What's broken about this code is the byteArray and how many bytes end up in it: BufferManager.TakeBuffer may return a buffer larger than requested (hence the power-of-2 sizes), so the segment must be built from the compressed length, not the buffer length.  That last line should instead look like this:

ArraySegment<byte> byteArray = new ArraySegment<byte>(bufferedBytes, messageOffset, compressedBytes.Length);

Once you make this change, your HTTP responses should no longer be exactly equal to powers of 2.  You can double-check this by looking for the Content-Length headers when you sniff the traffic or use some tool that will show you the full response headers.
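For example, curl can request a compressed response and dump the headers; if Content-Length keeps landing on exact powers of 2, you're probably still shipping the whole buffer (the URL is just a placeholder):

curl -s -o /dev/null -D - -H "Accept-Encoding: gzip" https://your-service.example/Service.svc | grep -i "^Content-Length"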

I hope that others can spread this knowledge to the various forums where people have problems with this.  I believe it is also the reason Chrome has issues with compressed data from some C# services; I found forum postings mentioning that Chrome is extra picky about compressed data and that responses from services doing things like this would not work in it.

Chef Upgrade Issue

posted Jun 18, 2012, 6:10 AM by Tyler Akins   [ updated Jun 18, 2012, 6:13 AM ]

Once again, I had a problem with Opscode Chef, but for a very understandable reason.  First, while trying to spin up an instance, I see messages like this at the end.

ERROR: Server returned error for http://ec2-50-17-230-193.compute-1.amazonaws.com:4000/cookbooks/phpunit/0.9.1/files/1ac61a28fa057aeb34ca4e5071e9c96c, retrying 2/5 in 8s
ERROR: Server returned error for http://ec2-50-17-230-193.compute-1.amazonaws.com:4000/cookbooks/phpunit/0.9.1/files/1ac61a28fa057aeb34ca4e5071e9c96c, retrying 3/5 in 16s
ERROR: Server returned error for http://ec2-50-17-230-193.compute-1.amazonaws.com:4000/cookbooks/phpunit/0.9.1/files/1ac61a28fa057aeb34ca4e5071e9c96c, retrying 4/5 in 29s
ERROR: Server returned error for http://ec2-50-17-230-193.compute-1.amazonaws.com:4000/cookbooks/phpunit/0.9.1/files/1ac61a28fa057aeb34ca4e5071e9c96c, retrying 5/5 in 53s
ERROR: Running exception handlers
FATAL: Saving node information to /var/chef/cache/failed-run-data.json
ERROR: Exception handlers complete
FATAL: Stacktrace dumped to /var/chef/cache/chef-stacktrace.out
FATAL: Net::HTTPFatalError: 500 "Internal Server Error"

I then peeked into /var/log/chef/server.log and saw some messages. Not quite helpful, but perhaps a clue?

merb : chef-server (api) : worker (port 4000) ~ Params: {"cookbook_version"=>"1.0.0", "action"=>"show_file", "cookbook_name"=>"apache2", "checksum"=>"ab1792e9de7461ddf4861e461c0c8a24", "controller"=>"cookbooks"}
merb : chef-server (api) : worker (port 4000) ~ undefined method `file_location' for # - (NoMethodError)

The file_location was set correctly, so now I am stumped.  I restarted the chef server, rebuilt the database as I mentioned in a previous blog post, and uploaded everything to the server again.  No luck with any of those.  The failure point wasn't always on the same cookbook; it seemed to hop around to a different one each time.

So, now I check versions of the packages that are installed.

19:32 utilities:/tmp$ sudo dpkg -l | grep chef
ii  chef                              0.10.8-2                                   A systems integration framework, built to br
ii  chef-expander                     0.10.4-1                                   A systems integration framework, built to br
ii  chef-server                       0.10.4-1                                   A meta-gem to install all server components
ii  chef-server-api                   0.10.4-1                                   A systems integration framework, built to br
ii  chef-server-webui                 0.10.4-1                                   A systems integration framework, built to br
ii  chef-solr                         0.10.4-1                                   Manages search indexes of Chef node attribut

You'll see that one package is at 0.10.8 and the rest are all 0.10.4. Could that be it? Reinstalling the chef package didn't force upgrades of the others, so I just manually used apt-get to upgrade the other chef packages and it started to work again.
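For reference, the manual upgrade amounted to something like this; apt-get treats installing an already-installed package as an upgrade to the newest available version, and the package names come straight from the dpkg listing above.

sudo apt-get update
sudo apt-get install chef-expander chef-server chef-server-api chef-server-webui chef-solr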

Problem solved.

Backups and Recovery

posted Jun 17, 2012, 4:40 PM by Tyler Akins

Years ago, I had the task of creating a very large network share.  I decided to build a Linux box with six 1.5 TB drives in a RAID.  At the time, it was a hefty cost.  So, when we were planning this whole thing out, it was decided that there would really be no backup, since getting tapes and building a secondary machine were both cost prohibitive.  Yes, it was a risk, but an acceptable one.  To counter the inevitability of a future failure, we decided to use software RAID 5 on Linux.  That way, as long as we could get five of the six drives up and could build or borrow another Linux box, we should be able to recover the data.  So, no actual backup strategy, but at least some fault tolerance was built into the design.  I also reasoned that I could get the data off with a minor amount of additional hardware purchased at the time of failure.

And yesterday the machine failed.

Now, I don't have another Linux box with six SATA ports on it, so I made a trip to Microcenter and purchased some handy SATA to USB devices in order to get five drives running.  That way I could run in degraded mode and mount the filesystem as read-only so I could get the data off the drives.  I discovered that one of the things I picked was actually IDE to USB, and so I made trip #2 to Microcenter.  After that, I was wiring things together and one of the enclosures failed to work.  Trip #3.  At least they're really nice at the returns counter.

I plug the drives into a USB hub, then plug the hub and the additional destination drives into my laptop.  I'm recovering at a mere 20 MB/s, so it will take a long time, but at least the drives weren't full when I started.

So, here I am, pondering the things that went well and the things that were terrible about this strategy, and I have to say that I am quite pleased with how everything is panning out.  I figure that I should give you an overview of the various pieces that were considered while building the system and how well things worked for me during this time of failure.  It might keep my mind off the fact that I'm now recovering my RAID on a hodgepodge of cabling, I've got my kids looking at the flashing lights, I'm pretty sure one of the enclosures has touchy wiring that makes it motion sensitive, and there's a thunderstorm coming.  I wish I had plugged all this into a UPS before I started.

Plan for Failure - Backups and Recovery

I knew that I'd be building a custom system that had more space on it than what was on all of the servers, NAS devices and desktops (combined) at my current place of business.  When this would fail, how would I get data off the machine?  Have a backup plan.  Mine was really to get the information again through a very long and painful process because I could not afford to double my costs.

To mitigate the chance of loss, I did decide that I'd always be able to afford one more drive to be used by the RAID for the "R" part (redundant).  I'd need at least two drives to fail for me to lose the data.

Stagger Hard Drive Purchases

When you purchase the drives for your devices, you want to get them from different batches.  This is because hard drives manufactured at the same time tend to break at about the same time.  I didn't do this either due to time constraints, but you should do what you can.

When It Fails, What Then?

Alerts were set up to monitor the drives and let me know immediately if the data was at risk.  I'd just go out and buy a new hard drive and add it to the RAID to recover.  Not a big deal... as long as the other five drives stayed running.

If my machine died, part of the recovery plan was to go out and purchase USB adapters for the drives.  At the time, those were a little expensive, but they have since come down greatly in price.  I also figured that USB 3 might be everywhere by the time a failure happened, so I could get improved recovery speeds.

Avoid Proprietary Lock-In

One big thing to avoid is setting up a hardware-based RAID array.  Yes, they offload the RAID work to some other device, but benchmarks show that it isn't very expensive computationally to use a software based RAID.  Another advantage of using a software RAID is that you can use multiple channels on the board to fetch and store information instead of passing everything through a single controller.  Lastly, you avoid proprietary RAID formats.  This last topic is a huge hurdle.

When you use a hardware RAID card, I strongly suggest you buy no fewer than two at the exact same time and confirm that they have the same firmware on them.  I've experienced and heard of people having issues recovering a RAID when they use newer cards, different models and even with minor firmware changes.  If your one controller dies, you will need a backup controller that can get the data off the RAID, otherwise you've got a lot of useless disks.

Now, compare these problems to software RAID.  If I keep a CD of the distribution I used to make the RAID, I'll be able to install it again and recover.  Plus, it is usually forward compatible with future versions of that software.  Years ago I used mdadm to set up the RAID and today I used the current mdadm version to recover the data from the drives.  No hassle at all.
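For what it's worth, reassembling a degraded software RAID 5 read-only looks roughly like this; the device names below are examples, not the ones I used.

# Assemble the array from the five surviving members; --run starts it even though it is degraded.
mdadm --assemble --run /dev/md0 /dev/sd[b-f]1
# Mount read-only so nothing writes to the filesystem while copying the data off.
mkdir -p /mnt/recovery
mount -o ro /dev/md0 /mnt/recovery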

Power Problems

Since you are investing all this time and energy in making a bulletproof system, you probably want to put it on a UPS to help your hardware last longer.  The local power grid goes through brownouts, power outages, spikes and has lots of noise from adjacent buildings, blenders, fluorescent lights and other computers.  A UPS stops that and conditions the power so your hardware doesn't get beaten up nearly as much.  I have a feeling that something like that fried the big computer so that it can only stay on for two minutes at a time, which is why I'm trying to recover this data with my laptop.

Test Your Backup Plan

I've worked at places where the backup job appeared to be running for months, but never actually wrote data to the disk.  We were able to recover some of the data painfully (RAID failure there as well), but it also taught us to try to restore files from our backups every now and then.  Acrobats test that their net will hold their weight before they blindly trust their lives.  Your data is depending on you; test your "safety net" backups before you rely on them.

Summary

Keep an eye on the current safety of your systems.  Set up monitoring to ensure the health of your system is consistently good.  Backups are good, redundancy is good.  Plan for failure and test your failure plans when you can.

Thankfully my drives were not full, otherwise I'd be spending about 110 hours recovering them.  As it is, I only have perhaps another 12 hours.  The hardest part is that I'm juggling data onto drives that are significantly smaller, but I would much rather have my data than try to regenerate it!

Diablo 3 on Ubuntu Linux

posted May 18, 2012, 11:31 AM by Tyler Akins   [ updated Jun 8, 2012, 10:19 AM ]

I sank an obsessive number of hours into Diablo and Diablo 2.  Now Diablo 3 is newly released and I cracked under the pressure.  I don't run Windows - I use Linux.  Ubuntu 12.04 Precise Pangolin, to be ... precise.  I also have an interesting set of criteria for whatever solution I find.
  • I must not compile software.  I totally can do it, but I simply don't want to.
  • I want to use packages so when the upstream puts out a new version, I'm not left in the dust.
  • I want to be able to double-click on an icon when I'm done and Diablo 3 should launch.
First off, we're going to have to use wine to run the Windows version of Diablo 3.  The version of wine in the repository won't work, so we need to add a custom PPA.  Another PPA will upgrade the video drivers for me.

sudo add-apt-repository ppa:cheako/packages4diabloiii
sudo add-apt-repository ppa:oibaf/graphics-drivers
sudo apt-get update

Now install the updated packages.  I also installed S3TC texture compression, which may be illegal where you are.

sudo apt-get upgrade
sudo apt-get install libtxc-dxtn0

Lastly, we'll need to tweak things a bit when we run wine.  First, go download the installer.  You can just double-click on it and it will install Diablo 3 and start downloading the gigs of data.  Once the download finishes, or at least gets to a place where it will let you play the game, stop it.  Now edit the link to Diablo 3: run "gedit" and open Desktop/Diablo III.desktop.  Inside, you will see a line that starts with "Exec".  Add the force_s3tc_enable=true setting shown below to force the use of S3TC.  Keep in mind that the whole thing is one really long line.

Exec=env WINEPREFIX="/home/fidian/.wine" force_s3tc_enable=true wine C:\\\\windows\\\\command\\\\start.exe /Unix /home/fidian/.wine/dosdevices/c:/users/Public/Desktop/Diablo\\ III.lnk

Almost done.  Now we just need to disable some security.  You have two options: run a command as root whenever you want to run Diablo 3, or you can put it in your /etc/rc.local file and have it run automatically at boot.

# Here is the command if you want to run it manually
# Just run this once in a terminal
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope

# If you want to edit /etc/rc.local (as root), add this line above "exit 0"
# Edit the file with this command:  gksudo gedit /etc/rc.local
echo 0 > /proc/sys/kernel/yama/ptrace_scope
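Another option, if your Ubuntu keeps sysctl defaults in /etc/sysctl.d (recent releases ship a 10-ptrace.conf there with this very setting), is a drop-in file instead of rc.local.  The file name below is just my own choice:

echo "kernel.yama.ptrace_scope = 0" | sudo tee /etc/sysctl.d/60-diablo3-ptrace.conf
sudo sysctl -p /etc/sysctl.d/60-diablo3-ptrace.conf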


And now you can, perhaps, play.  I can't, because the framerate is exceedingly slow for me, but maybe that's just one last hurdle to clear before the game is playable.

Renaming Windows Network Adapter for VirtualBox

posted Mar 20, 2012, 7:25 PM by Tyler Akins   [ updated Jun 8, 2012, 10:23 AM ]

The problem with VirtualBox is that on Mac it names its virtual host-only adapters "vboxnet0" and the like.  On Windows they are called "VirtualBox Host-Only Ethernet Adapter", maybe with "#2" added at the end.  Normally this really is not a problem, but it is if you are working in a Macintosh-dominated environment and they have been using Opscode Chef,  Vagrant and VirtualBox to bundle up development environments into boxes.  These virtual machines may be scripted to enable specific networking configurations, such as making a host-only virtual ethernet adapter available to the virtual machine so your VMs can easily network to just each other.  The problem is that the default host-only adapter name changes based on your OS, so now my configuration that's stored in the box from Vagrant is expecting an adapter named "vboxnet0" and mine isn't called that at all.  Starting the VM in VirtualBox will cause problems and then Vagrant will think the install failed.

You'd think it would be as easy as just going to the network settings in the Windows control panel and then right-clicking the adapter and hitting "Rename".  No, it's unfortunately not nearly that simple.

Contributing Factors

This is a slightly more painful process because of the following wrinkles:
  1. This problem is a little more complex because this virtual ethernet adapter can be created automatically by Vagrant.  It's really a nice setup on the Mac, but because of the weird naming it can be irritating on Windows.
  2. When Vagrant kills a machine, it can optionally clean up network adapters that are no longer used.  This means you could lose your network adapter even if you do manage to call it "vboxnet0" on your Windows machine.
  3. The box files that Vagrant produces have the machine settings saved in them.  This includes the amount of RAM, the hard drive image files (see Shrinking VM Disk Images if you want yours smaller), and the networking configuration.  Vagrant can download the box during installation, so I don't want a manual step for modifying this box file before installation.
  4. The standard advice I'd find on VirtualBox forums and other places would be to always manually go in and check / change the network settings.  My goal is a fully automated solution, which means everything gets scripted.
  5. I could use different configurations and double the number of images I'd need to maintain in order to support Windows, but that's just making the problem bigger and more unmanageable.
  6. Windows XP's security model for registry changes is different from Windows 7's.  More on this later.
Most of the problems can be eliminated by being able to rename a network adapter right after Vagrant creates one.  It could be added safely to the scripts that spin up machines, and people from around the world would rejoice.  Well, at least a couple might hum in a happy way.

The Solution

It turns out that the name of the network adapter, as seen by VirtualBox, is secreted away in the registry.  Use regedit to browse to HKLM\SYSTEM\CurrentControlSet\Enum\Root\NET and pick one of the keys listed there.  If its Service value says "VBoxNetAdp", then you are in luck.  If there is a FriendlyName value, just change it to "vboxnet0".  If not, make a FriendlyName string value and set it to "vboxnet0".  Reboot or restart all of your VirtualBox software and you should now see the renamed network adapter.

Unfortunately, this is where we hit a snag.  On Windows XP you may need administrator privileges to set this value.  On Windows 7 you need to use the "SYSTEM" account (not the administrator account) or else you will get the wrath of the "access denied" alert.  Don't fret, I've got you covered.

Manual Process

  1. Run VirtualBox and make a virtual host-only network adapter
  2. Tie this virtual adapter to a new virtual machine
    • This is an optional step and is useful so Vagrant doesn't delete the network adapter
  3. Run regedit as administrator
  4. Browse to HKLM\SYSTEM\CurrentControlSet\Enum\Root\NET
    • This could be a specific control set if you are administering another person's account
  5. Look at the keys under here for one with a Service value of "VBoxNetAdp"
  6. If there is not a FriendlyName value, right-click in the right pane and add a new string value named FriendlyName set to "vboxnet0"; otherwise double-click the existing FriendlyName value and change it to "vboxnet0"
    • If you get an "access denied" message, grant Administrator permission to modify the key by right-clicking on the key, select Permissions -> Advanced -> Owner and grant full control to Administrators.  Apply and try to add or change the value again in regedit.
  7. "VBoxManage list hostonlyifs" from the command line should now list your new value.  If not, double-check that the FriendlyName is properly set.  Then try rebooting the machine.
Fantastic.  It's now named vboxnet0.  You could use this to rename the network adapter to anything you like if vboxnet0 doesn't tickle your fancy.

Automatic Process

If you are in a situation like the one I was in and you need to get this deployed to many machines, you will want to write a little script.  There are two key parts to the script - scanning and escalating.  The scanning part is pretty straightforward.  This is not real code, just in case you were wondering.

for each key in HKLM\SYSTEM\CurrentControlSet\Enum\Root\NET as key
    if value of (key + "\Service") == "VBoxNetAdp"
        if value of (key + "\FriendlyName") == "vboxnet0"
            return SUCCESS // It's already named correctly
        end
    end
end

for each key in HKLM\SYSTEM\CurrentControlSet\Enum\Root\NET as key
    if value of (key + "\Service") == "VBoxNetAdp"
        if can set value of (key + "\FriendlyName") to "vboxnet0" then
            return SUCCESS // Renamed it
        else
            return FAILURE // Could not rename - maybe escalate?
        end
    end
end

return FAILURE // No VirtualBox adapters detected

What this will do is first scan all net adapters for a VirtualBox network adapter.  If it finds one with the name "vboxnet0" it will exit since we don't need to do any work.  Failing that, it will scan again to find the first VirtualBox network adapter and attempt to rename it to vboxnet0.  This will return either success or failure.  If no VirtualBox network adapters were found, this script fails.

Next up, escalating privileges.  You can either write a real program, or use something like PsExec to run a command-line tool with the privileges you need.

Attempt to rename as regular user

if rename_script_result == FAILURE
    Attempt to rename as Administrator

    if rename_script_result == FAILURE
        Attempt to rename as SYSTEM

        if rename_script_result == FAILURE
            return FAILURE // Could not do it
        end
    end
end

I once wrote some JavaScript to do this and executed it with cscript in Windows, though I believe this could be done better as an application that could prompt for Administrator privileges and properly drop down to SYSTEM instead of relying on PsExec trickery.  It also turns out that you can run "cmd" as Administrator, but people have a hard time running cscript as administrator in a command shell, and there is no way that I found to run a command-line tool as SYSTEM without PsExec.  I've tried my best to recreate that script from memory and attached it below.  I haven't tested it much, so I'd appreciate feedback if there's a shortcoming.

While trying to get my solution to work, I found perhaps a half dozen ways that UAC didn't work with regard to batch files and windows scripting host.  I guess that there were enough skript kiddiez out there using these tools that Microsoft needed to clamp down on the interaction between the shell and programs. I can't blame them, but it is sure hard to pop open a UAC prompt on Windows 7 from a command line; I certainly didn't find a good way.

Disabling Hyperthreading

posted Mar 12, 2012, 4:28 PM by Tyler Akins   [ updated Apr 12, 2013, 11:46 AM ]

I never thought I would find what I feel is a really bad problem with the Linux scheduler, but it's hard to argue with my results.  I have an Acer Aspire One netbook with a 1.5 GHz Intel Atom N550 inside.  It is a dual-core CPU with hyperthreading enabled.

At first I thought I was crazy or that something was fundamentally broken with my recent Ubuntu install on this fine machine.  I had been used to an HP Mini 110 with a dual-core 1 GHz AMD processor, and I expected better performance from this one.  Instead, I found that my programs seemed to frequently hang, crawl slowly, or only sporadically run well.  Very odd behavior.  I found, through use of my Mad Google Skillz, that it could be due to hyperthreading on the processor.  You see, a hyperthread isn't a real processor core.  It's more like two threads sharing parts of the same processing unit.  While one is doing an addition, the other could use the otherwise idle multiplication circuitry.  If they both want to use bits of the CPU that overlap, then one just has to wait.  In my case, that starved process waited and waited and waited.  It looked like Linux thought each hyperthread was another core and treated it as though it could safely and quickly run threads on any of the available cores.  Thus, lots of jobs were running on the first core and few were running on the second.  The ones sharing a real core all got stalled.

Manually Disabling Hyperthreading

I found that the Ubuntu kernel, as well as RedHat and others, compile in an option to disable use of a CPU on the fly.  Fantastic!  Running two commands as root will kill the hyperthreading on my machine.

echo 0 > /sys/devices/system/node/node0/cpu1/online
echo 0 > /sys/devices/system/node/node0/cpu3/online

Your CPU numbers may not match mine, so I don't suggest you use the above.
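To see which logical CPUs are hyperthread siblings on your own machine, peek at the sibling lists first; this is the same sysfs file the script further down relies on.

grep . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list

Each line shows a logical CPU's sibling group, so anything sharing a group with a lower-numbered CPU is a hyperthread.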

Now you might be wondering how you can get this to happen automatically on machines when they start up.  You can edit /etc/rc.local and add those two lines above the 'exit' line, but if the machine changes and you now don't have hyperthreading, then maybe you disabled two processors in your quad core machine.  Yikes!  Since programs are supposed to detect things like this and do work for you, I scoured the internet and tried to find a way to detect if a CPU is a hyperthreading CPU or not.  I didn't come up with anything at all.

But that didn't stop me.

Automatically Disabling Hyperthreading

I wrote this script to detect all of the hyperthread siblings that show up as extra processors and take them offline.  You can use it on any machine, whether or not it has hyperthreading and no matter how many processors and cores it actually has.  Let me know if this works well for you or what could be changed to make everyone happier.

Update 2013-03-22:  Linux 2.6.x uses a comma as a separator, so changed cut -d '-' -f 1 to a sed command.

#!/bin/bash

# Keep the first sibling of each physical core online; every other sibling is
# a hyperthread and will be taken offline below.
# Be careful to not skip the space at the beginning nor the end
CPUS_TO_SKIP=" $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | sed 's/[^0-9].*//' | sort | uniq | tr "\r\n" "  ") "

for CPU_PATH in /sys/devices/system/cpu/cpu[0-9]*; do
    # Pull the CPU number out of the sysfs path
    CPU="$(echo $CPU_PATH | tr -cd "0-9")"
    echo "$CPUS_TO_SKIP" | grep " $CPU " > /dev/null
    if [ $? -ne 0 ]; then
        # Not a "first sibling", so take this logical CPU offline
        echo 0 > $CPU_PATH/online
    fi
done

With the above script saved safely on my hard drive and /etc/rc.local running it, I automatically disable hyperthreading just after boot... and if my machine image gets cloned to another netbook that doesn't have hyperthreading, then no CPUs get disabled.  The best of both worlds.

Opscode Chef + RabbitMQ

posted Feb 25, 2012, 9:53 AM by Tyler Akins   [ updated Jun 8, 2012, 10:22 AM ]

I have the privilege of working with Opscode Chef at work, maintaining the recipes as various projects move to "the cloud" or otherwise want to script the setup of their environments.  While spinning up many new machines with knife, very rarely I will hit this problem.  After trying pretty hard to track down the root cause on the internet, I'll sum up what I found on this one page.  Maybe it will help solve this problem for you too?

Symptoms

I was running knife and it wasn't doing anything at all.  No CPU activity and no interaction with anything else.  No error messages popped up.  No output to the console at all after hitting enter to start the command.  For all intents and purposes, the program appeared to be hung.  I didn't run strace on it, but I did attempt to use kill and kill -9 on the process, to no effect.

I really hate when things like that happen.

At this point, the server started acting funny and I couldn't spin up any instances, so I rebooted.

Try #2

So I tried to spin up the server with knife again and got some interesting error messages:

INFO: *** Chef 0.10.4 ***
INFO: Client key /etc/chef/client.pem is not present - registering
INFO: HTTP Request Returned 500 Internal Server Error: Connection refused - connect(2)
ERROR: Server returned error for http://bluemoon.fuf.me:4000/clients, retrying 1/5 in 4s
INFO: HTTP Request Returned 409 Conflict: Client already exists
INFO: HTTP Request Returned 403 Forbidden: You are not allowed to take this action.
FATAL: Stacktrace dumped to /var/chef/cache/chef-stacktrace.out
FATAL: Net::HTTPServerException: 403 "Forbidden"

Uh-oh.  Why would the server say that the client already exists?  Let's go see what the logs say.  I browsed /var/log/chef/server.log and found this one gem of a line, quoted below.  It was not at the bottom of the file; I needed to scroll up several screens of logs in order to see this problem.

merb : chef-server (api) : worker (port 4000) ~ Connection refused - connect(2) - (Bunny::ServerDownError)

What is this Bunny and why is it down?  It turns out that RabbitMQ will not start because it has corrupted databases.  Bummer.  You may ask, "how can I fix such a thing?"  Well, you can't really repair the databases.  Instead, we just delete them.

The Fix

When you delete the RabbitMQ databases (conveniently located in /var/lib/rabbitmq/mnesia), you are not done yet.  In order to set up Chef in RabbitMQ, you need to add a vhost, username, password, and permissions.  The vhost is /chef, username is chef, and the user should have all permissions to the vhost.  The slightly tougher part is getting the password, but it's found in /etc/chef/solr.rb as amqp_pass.  Here is a shell script I used to fix the problem.  You're welcome to use it.

#!/bin/bash

# Fix RabbitMQ by removing the databases

service rabbitmq-server stop

if [ -d /var/lib/rabbitmq/mnesia ]; then
    echo "Removing mnesia directory"
    rm -r /var/lib/rabbitmq/mnesia
fi

service rabbitmq-server start


# Add the Chef vhost, username, password, and permissions

rabbitmqctl add_vhost /chef
PASS=$( grep ^amqp_pass /etc/chef/solr.rb | cut -d '"' -f 2 );
rabbitmqctl add_user chef $PASS
rabbitmqctl set_permissions -p /chef chef ".*" ".*" ".*"
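Afterwards, a quick sanity check with standard rabbitmqctl queries should show the /chef vhost, the chef user, and its permissions:

rabbitmqctl list_vhosts
rabbitmqctl list_users
rabbitmqctl list_permissions -p /chef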
