How I Do Encrypted, Mirrored ZFS Root on Linux

Update: No more keyfiles!

I am done with Solaris. A quick look through this blog should be enough to see how much I like Solaris. So when I say “I’m done,” I want to be perfectly clear as to why: Oracle. As long as Oracle continues to keep Solaris’ development and code under wraps, I cannot feel comfortable using it or advocating for it, and that includes at work, where we are paying customers. I stuck with it up until now, waiting for a better alternative to come about, and now that ZFS is stable on Linux, I’m out.

I’ve returned to my first love, Gentoo. Well, more specifically, I’ve landed in Funtoo, a variant of Gentoo. I learned almost everything I know about Linux on Gentoo, and being back in its ecosystem feels like coming back home. Funtoo, in particular, addresses a lot of the annoyances that made me leave Gentoo in the first place by offering more stable packages of core software like the kernel and GCC. Its Git-based Portage tree is also a very nice addition. But it was Funtoo’s ZFS documentation and community that really got my attention.

The Funtoo ZFS install guide is a nice starting point, but my requirements were a bit beyond the scope of the document. I wanted:

  • redundancy handled by ZFS (that is, not by another layer like md),
  • encryption using a passphrase, not a keyfile,
  • and to be prompted for the passphrase once, not for each encrypted device.

My solution is depicted below:

ZFS Root Diagram

A small block device is encrypted using a passphrase. The randomly initialized contents of that device are then in turn used as a keyfile for unlocking the devices that make up the mirrored ZFS rpool. Not pictured is Dracut, the initramfs that takes care of assembling the md RAID devices, unlocking the encrypted devices, and mounting the ZFS root at boot time.

Here is a rough guide for doing it yourself:

  1. Partition the disks.
    Without going into all the commands, use gdisk to make your first disk look something like this:

    # gdisk -l /dev/sda
    ...
    Number  Start (sector)    End (sector)  Size       Code  Name
       1            2048         1026047   500.0 MiB   FD00  Linux RAID
       2         1026048         1091583   32.0 MiB    EF02  BIOS boot partition
       3         1091584         1099775   4.0 MiB     FD00  Linux RAID
       4         1099776       781422734   372.1 GiB   8300  Linux filesystem
    

    Then copy the partition table to the second disk:

    # sgdisk --backup=/tmp/table /dev/sda
    # sgdisk --load-backup=/tmp/table /dev/sdb
    # sgdisk --randomize-guids /dev/sdb
    

    If your system uses EFI rather than BIOS, you won’t need a BIOS boot partition, so adjust your partition numbers accordingly.

  2. Create the md RAID devices for /boot and the keyfile.
    # mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    # mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
    
  3. Set up the crypt devices.
    First, the keyfile:

    # cryptsetup -c aes-xts-plain64 luksFormat /dev/md1
    # cryptsetup luksOpen /dev/md1 keyfile
    # dd if=/dev/urandom of=/dev/mapper/keyfile
    

    Then, the ZFS vdevs:

    # cryptsetup -c aes-xts-plain64 luksFormat /dev/sda4 /dev/mapper/keyfile
    # cryptsetup -c aes-xts-plain64 luksFormat /dev/sdb4 /dev/mapper/keyfile
    # cryptsetup -d /dev/mapper/keyfile luksOpen /dev/sda4 rpool-crypt0
    # cryptsetup -d /dev/mapper/keyfile luksOpen /dev/sdb4 rpool-crypt1
    
  4. Format and mount everything up.
    # zpool create -O compression=on -m none -R /mnt/funtoo rpool mirror rpool-crypt0 rpool-crypt1
    # zfs create rpool/ROOT
    # zfs create -o mountpoint=/ rpool/ROOT/funtoo
    # zpool set bootfs=rpool/ROOT/funtoo rpool
    # zfs create -o mountpoint=/home rpool/home
    # zfs create -o volblocksize=4K -V 2G rpool/swap
    # mkswap -f /dev/zvol/rpool/swap
    # mkfs.ext2 /dev/md0
    # mkdir /mnt/funtoo/boot && mount /dev/md0 /mnt/funtoo/boot
    

    Now you can chroot and install Funtoo as you normally would.
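
At this point it doesn’t hurt to sanity-check that the mirror really is built on the two crypt devices; zpool status should show something roughly like this (output abbreviated):

# zpool status rpool
  pool: rpool
 state: ONLINE
config:

        NAME              STATE     READ WRITE CKSUM
        rpool             ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            rpool-crypt0  ONLINE       0     0     0
            rpool-crypt1  ONLINE       0     0     0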

When it comes time to finish the installation and set up the Dracut initramfs, there are a number of things that need to be in place. First, the ZFS package must be installed with the Dracut module. The current ebuild strips it out for some reason. I have a bug report open to fix that.

Second, /etc/mdadm.conf must be populated so that Dracut knows how to reassemble the md RAID devices. That can be done with the command mdadm --detail --scan > /etc/mdadm.conf.
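
The resulting file just needs the ARRAY lines for both devices and will look something like this (the names and UUIDs below are made up for illustration):

ARRAY /dev/md0 metadata=1.2 name=funtoo:0 UUID=0a1b2c3d:4e5f6071:8293a4b5:c6d7e8f9
ARRAY /dev/md1 metadata=1.2 name=funtoo:1 UUID=1a2b3c4d:5e6f7081:92a3b4c5:d6e7f809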

Third, /etc/crypttab must be created so that Dracut knows how to unlock the encrypted devices:

keyfile /dev/md1 none luks
rpool-crypt0 /dev/sda4 /dev/mapper/keyfile luks
rpool-crypt1 /dev/sdb4 /dev/mapper/keyfile luks

Finally, you must tell Dracut about the encrypted devices required for boot. Create a file, /etc/dracut.conf.d/devices.conf containing:

add_device="/dev/md1 /dev/sda4 /dev/sdb4"

Once all that is done, you can build the initramfs using the command dracut --hostonly. To tell Dracut to use the ZFS root, add the kernel boot parameter root=zfs. The actual filesystem it chooses to mount is determined from the zpool’s bootfs property, which was set above.
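
As a rough sketch (the image path, kernel version, and bootloader syntax are placeholders; adjust them for your own setup):

# dracut --hostonly /boot/initramfs-<version>.img <version>

The corresponding kernel line in the bootloader (GRUB 2 shown here) then just needs the extra parameter:

linux   /vmlinuz-<version> root=zfs
initrd  /initramfs-<version>.img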

And that’s it!

Now, I actually go a little further and use a set of Puppet modules to do the whole thing for me; thanks to Puppet, I can practically install a whole system from scratch with one command.

I also have a script that runs after boot to close the keyfile device. You’ve got to protect that thing.

# cat /etc/local.d/keyfile.start
#!/bin/sh
/sbin/cryptsetup luksClose keyfile

I think one criticism that could be leveled against this setup is that all the data on the ZFS pool gets encrypted and decrypted twice, because the redundancy sits at a layer above the crypt layer. The alternative would be to invert the stack: encrypt a single md RAID device and build the ZFS pool on top of it. Unfortunately, that comes at the cost of ZFS’s self-healing capabilities. Until encryption support comes to ZFS directly, that’s the trade-off we have to make. In practice, though, the double encryption in this setup doesn’t have a noticeable performance impact.

UPDATE

I should mention that I’ve learned Dracut is much smarter than I would have guessed: it will let you enter a passphrase once and then try it on all of the encrypted devices. This eliminates the need for the keyfile in my case, so I’ve updated all of my systems to simply use the same passphrase on all of the encrypted devices. I have found it to be a simpler and more reliable setup.

Customizing the OpenStack Keystone Authentication Backend

OpenStack Login

For those of you unfamiliar with OpenStack, it is a collection of many independent pieces of cloud software, and they all tie into Keystone for user authentication and authorization. Keystone uses a MySQL database backend by default, and has some support for LDAP out-of-the-box. But what if you want to have it authenticate against some other service? Fortunately, the Keystone developers have already created a way to do that fairly easily; however, they haven’t documented it yet. Here’s how I did it:

  1. Grab the Keystone source from GitHub and checkout a stable branch:
    % git clone git://github.com/openstack/keystone.git
    % cd keystone
    % git checkout stable/grizzly
    
  2. Since we still want to use the MySQL backend for user authorization, we will extend the default identity driver, keystone.identity.backends.sql.Identity, and simply override the password checking function. Create a new file called keystone/identity/backends/custom.py containing:
    from __future__ import absolute_import
    import pam
    from . import sql

    class Identity(sql.Identity):
        def _check_password(self, password, user_ref):
            username = user_ref.get('name')
           
            if (username in ['admin', 'nova', 'swift']):
                return super(Identity, self)._check_password(password, user_ref)
           
            return pam.authenticate(username, password)

    In this snippet, we check the username and password against PAM, but that can be anything you want (Kerberos, Active Directory, LDAP, a flat file, etc.). If the username is one of the OpenStack service accounts, then the code uses the normal Keystone logic and checks it against the MySQL database.

  3. Build and install the code:
    % python setup.py build
    % sudo python setup.py install
    
  4. Configure Keystone to use the custom identity driver. In /etc/keystone/keystone.conf add or change the following section:
    [identity]
    driver = keystone.identity.backends.custom.Identity
  5. Start Keystone (keystone-all) and test, then save the changes to the Keystone source:
    % git add keystone/identity/backends/custom.py
    % git commit -m "Created custom identity driver" -a
    

And that’s it. In reality, I would probably fork the Keystone repository on GitHub and create a new branch for this work (git checkout -b customauth stable/grizzly), but that’s not really necessary. Actually, you could probably even get away with not recompiling Keystone. Just put the custom class somewhere in Keystone’s PYTHONPATH. But I’m not a Python expert, so maybe that wouldn’t work. Either way, I like having everything together, and Git makes it brainless to maintain customizations to large projects.
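
If you want a quick way to exercise the new driver once Keystone is back up, requesting a token directly works well. This assumes the v2.0 API on the default port, and the tenant, username, and password below are placeholders:

% curl -s -H 'Content-Type: application/json' \
    -d '{"auth": {"tenantName": "demo", "passwordCredentials": {"username": "jdoe", "password": "secret"}}}' \
    http://localhost:5000/v2.0/tokens | python -m json.tool

A user who authenticates through PAM should get back a token just like the service accounts do through MySQL.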

Benchmarking Duplog

In my last post I introduced Duplog, a small tool that basically forms the glue between rsyslog, RabbitMQ, Redis, and Splunk to enable a highly available, redundant syslog service that deduplicates messages along the way. In order for Duplog to serve the needs of a production enterprise, it will need to perform well. I fully expect the deduplication process to take a toll, and in order to find out how much, I devised a simple benchmark.

Duplog Benchmark Diagram

On one side, fake syslog generators pipe pairs of duplicate messages into the system as fast as they can. On the other, a process reads the messages out as fast as it can. Both sides report the rate of message ingestion and extraction.

The Details

  • 3 KVM virtual machines on the same host each with the same specs:
    • 1 x Intel Core i5 2400 @ 3.10 GHz
    • 1 GB memory
    • Virtio paravirtualized NIC
  • OS: Ubuntu Server 12.04.1 LTS 64-bit
  • RabbitMQ: 2.7.1
  • Redis: 2.2.12
  • Java: OpenJDK 6b27

The Results

(My testing wasn’t very scientific so take all of this with a grain of salt.)

Duplog Benchmark

The initial results were a little underwhelming, pushing only about 750 messages per second through the system. I originally expected hashing or the communication with the Redis server to be the major bottleneck, but each of those processes was sitting comfortably on the CPU at about 50% and 20% usage, respectively. It turned out that the RabbitMQ message brokers were the source of the slow performance.

I began trying many different settings for RabbitMQ, starting by disabling the disk-backed queue, which made little difference. In fact, the developers have basically said as much: “In the case of a persistent message in a durable queue, yes, it will also go to disk, but that’s done in an asynchronous manner and is buffered heavily.”

So then I changed the prefetch setting. Rather than fetching and acknowledging one message at a time, with a round trip over the network for each, the consumers can buffer a configurable number of messages. It is possible to calculate the optimum prefetch count, but without detailed analytics handy, I just picked a prefetch size of 100.

That setting made a huge difference, as you can see in the histogram. Without spending so much time talking to RabbitMQ, consumers were free to spend more time calculating hashes, and that left the RabbitMQ message brokers more free to consume messages from rsyslog.

Another suggestion on the internet was to batch message acknowledgements. That added another modest gain in performance.

Finally, I tried enabling an unlimited prefetch count size. It is clear that caching as many messages as possible does improve performance, but it comes at the cost of fairness and adaptability. Luckily, neither of those characteristics is important for this application, so I’ve left it that way, along with re-enabling queue durability, whose performance hit, I think, is a fair trade-off for message persistence. I also reconfigured acknowledgements to fire every second rather than every 50 messages. Not only does that guarantee that successfully processed messages will get acknowledged sooner or later, it spaces out ACKs even more under normal operation, which boosted the performance yet again to around 6,000 messages per second.

So is 6,000 messages per second any good? Well, if you just throw a bunch of UDP datagrams at rsyslog (installed on the same servers as above), it turns out that it can take in about 25,000 messages per second without doing any processing. It is definitely reasonable to expect that the additional overhead of queueing and hashing in Duplog will have a significant impact. It is also important to note that these numbers are sustained while RabbitMQ is being written to and read from simultaneously. If you slow down or stop message production, Duplog is able to burst closer to 10,000 messages per second. The queueing component makes the whole system fairly tolerant of sudden spikes in logging.

For another perspective, suppose each syslog message averages 128 bytes (a reasonable, if small, estimate), then 6,000 messages per second works out to 66 GB per day. For comparison, I’ve calculated that all of the enterprise Unix systems in my group at UMD produce only 3 GB per day.
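
Spelled out, the arithmetic is simply:

6,000 msg/s × 128 bytes/msg × 86,400 s/day = 66,355,200,000 bytes ≈ 66 GB/day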

So as it stands now I think that Duplog performs well enough to work in many environments. I do expect to see better numbers on better hardware. I also think that there is plenty of optimization that can still be done. And in the worst case, I see no reason why this couldn’t be trivially scaled out: divide messages among multiple RabbitMQ queues to better take advantage of SMP. This testing definitely leaves me feeling optimistic.

Overengineering Syslog: Redundancy, High Availability, Deduplication, and Splunk

I am working on a new Splunk deployment at work, and as part of that project, I have to build a centralized syslog server. The server will collect logs from all of our systems and a forwarder will pass them along to Splunk to be indexed. That alone would be easy enough, but I think that logs are too important to leave to just one syslog server. Sending copies of the log data to two destinations may allow you to sustain outages in half of the log infrastructure while still getting up-to-the-minute logs in Splunk. I think duplicating log messages at the source is a fundamental aspect of a highly available, redundant syslog service when using the traditional UDP protocol.
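
On each client, that duplication is only a couple of lines in rsyslog.conf (the collector hostnames are placeholders):

# Send every message to both collectors over classic UDP syslog.
*.*  @loghost1.example.com
*.*  @loghost2.example.com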

That said, you don’t want to have Splunk index all of that data twice. That’ll cost you in licenses. But you also don’t want to just pick a copy of the logs to index—how would you know if the copy you pick is true and complete? Maybe the other copy is more complete. Or maybe both copies are incomplete in some way (for example, if routers were dropping some of those unreliable syslog datagrams). I think the best you can do is to take both copies of the log data, merge them together somehow, remove the duplicate messages, and hope that, between the two copies, you’re getting the complete picture.

I initially rejected the idea of syslog deduplication thinking it to be too complicated and fragile, but the more I looked into it, the more possible it seemed. When I came across Beetle, a highly available, deduplicating message queue, I knew it would be doable.

Beetle Architecture

Beetle itself wouldn’t work for what I had in mind (it will deduplicate redundant messages from single sources; I want to deduplicate messages across streams from multiple sources), but I could take its component pieces and build my own system. I started hacking on some code a couple of days ago to get messages from rsyslog to RabbitMQ and then from RabbitMQ to some other process which could handle deduplication. It quickly turned into a working prototype that I’ve been calling Duplog. Duplog looks like this:

Duplog Architecture

At its core, Duplog sits and reads messages out of redundant RabbitMQ queues, hashes them, and uses two constant-time Redis operations to deduplicate them. RabbitMQ makes the whole process fairly fault tolerant and was a great discovery for me (I can imagine many potential use cases for it besides this). Redis is a very flexible key-value store that I’ve configured to act as a least-recently-used cache. I can throw hashes at it all day and let it worry about expiring them.
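
The LRU configuration amounts to a couple of lines in redis.conf (the memory cap here is arbitrary):

# Cap memory usage and evict the least-recently-used keys when the cap is hit.
maxmemory 256mb
maxmemory-policy allkeys-lru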

One important design consideration for me was the ability to maintain duplicate messages within a single stream. Imagine you have a high-traffic web server. That server may be logging many identical HTTP requests at the same time. Those duplicates are important to capture in Splunk for reporting. My deduplication algorithm maintains them.

Looking at the architecture again, you will see that almost everything is redundant. You can take down almost any piece and still maintain seamless operation without dealing with failovers. The one exception is Redis. While it does have some high availability capability, it relies on failover which I don’t like. Instead, I’ve written Duplog to degrade gracefully. When it can’t connect to Redis, it will allow duplicate messages to pass through. A few duplicate messages isn’t the end of the world.

Feel free to play around with the code, but know that it is definitely a prototype. For now I’ve been working with the RabbitMQ and Redis default settings, but there is probably a lot that should be tuned, particularly timeouts, to make the whole system more solid. I also intend to do some benchmarking of the whole stack (that’ll probably be the next post), but initial tests on very modest hardware indicate that it will handle thousands of messages per second without breaking a sweat.

Protecting Puppet with Kerberos

Puppet uses bidirectional SSL to protect its client-server communication. All of the participants in a Puppet system must have valid, signed certificates and keys to talk to one another. This prevents agents from talking to rogue masters and it prevents nodes from spoofing one another. It also allows the master and agents to establish secure communication channels to prevent eavesdropping. Puppet comes with a built-in certificate authority (CA) to make the management of all the keys, certs, and signing requests fairly easy.

But what if you already have a large, established Kerberos infrastructure? You’re probably already generating and managing keys for all of your trusted hosts. Wouldn’t it be great to leverage your existing infrastructure and established processes instead of duplicating that effort with another authentication system?

Enter kx509. kx509 is a method for generating a short-lived X.509 (SSL) certificate from a valid Kerberos ticket. Effectively, a client can submit its Kerberos ticket to a trusted Kerberized CA (KCA), which then copies the principal name into the subject field of a new X.509 certificate and signs it with its own certificate. There’s really no trickery to it: if you trust the Kerberos ticket, and you trust the KCA, then you can trust the certificate generated by it. Sounds great, except that there is virtually no documentation on kx509, and even when you do get it running, there are a couple of issues that prevent it from working with Puppet out-of-the-box.

I wanted to figure out how to get it working with as few changes as possible. To do this, I set up my own clean Kerberos and Puppet environment in a couple of VMs (a client and a server). I am documenting the whole process here for my own benefit, but maybe it will be useful to others.

The Setup

  • 2 virtual machines:
    • server.example.com (192.168.100.2)
    • client.example.com (192.168.100.3)
  • OS: Ubuntu Server 12.04.1 LTS 64-bit
  • Kerberos: Heimdal 1.5.2
  • Puppet: 2.7.11

I also set up a DNS server containing entries for the two hosts. Kerberos and Puppet are a lot easier to work with when they can use DNS, and it will be required for the kx509 stuff which we’ll see later.

I chose Heimdal because it has a built-in KCA (and that’s what we run at work).


Mounting Encrypted ZFS Datasets at Boot

When ZFS encryption was released in Solaris 11 Express, I went out and bought four 2 TB drives and moved all of my data to a fresh, fully-encrypted zpool. I don’t keep a lot of sensitive data, but it brings me peace of mind to know that, in the event of theft or worse, my data is secure.

I chose to protect the data keys using a passphrase as opposed to using a raw key on disk. In my opinion, the only safe key is one that’s inside your head (though the US v. Fricosu case has me reevaluating that). The downside is that Solaris will ignore passphrase-encrypted datasets at boot.

The thing is, I run several services that depend on the data stored in my encrypted ZFS datasets. When Solaris doesn’t mount those filesystems at boot, those services fail to start or come up in very weird states that I must recover from manually. I would rather pause the boot process to wait for me to supply the passphrase so those services come up properly. Fortunately this is possible with SMF!

All of the services I am concerned about depend, in one way or another, on the svc:/system/filesystem/local:default service, which is responsible for mounting all of the filesystems. That service, in turn, depends on the single-user milestone. So I just need to inject my own service, one that fails when it doesn’t have the keys, between the single-user milestone and the system/filesystem/local service. That failure will pause the boot process until it is cleared.

I wrote a simple manifest that expresses the dependencies between single-user and system/filesystem/local:

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">

<service_bundle type='manifest' name='nest'>

<service
   name='system/filesystem/nest'
   type='service'
   version='1'>

    <create_default_instance enabled='true' />
    <single_instance />

    <dependency
       name='single-user'
       grouping='require_all'
       restart_on='none'
       type='service'>
        <service_fmri value='svc:/milestone/single-user' />
    </dependency>

    <dependent
       name='nest-local'
       grouping='require_all'
       restart_on='none'>
        <service_fmri value='svc:/system/filesystem/local' />
    </dependent>

    <exec_method
       type='method'
       name='start'
       exec='/lib/svc/method/nest start'
       timeout_seconds='60' />

    <exec_method
       type='method'
       name='stop'
       exec=':true'
       timeout_seconds='60' />

    <property_group name='startd' type='framework'>
        <propval name='duration' type='astring' value='transient' />
    </property_group>

    <stability value='Unstable' />

    <template>
        <common_name>
            <loctext xml:lang='C'>Load key for 'nest' zpool</loctext>
        </common_name>
    </template>
</service>

</service_bundle>

and a script at /lib/svc/method/nest that gets called by SMF:

#!/sbin/sh

. /lib/svc/share/smf_include.sh

case "$1" in
    'start')
        if [ $(zfs get -H -o value keystatus nest) != "available" ]; then
            echo "Run '/usr/sbin/zfs key -lr nest && /usr/sbin/svcadm clear $SMF_FMRI'" | smf_console
            exit $SMF_EXIT_ERR_FATAL
        fi
        ;;

    *)
        echo "Usage: $0 start"
        exit $SMF_EXIT_ERR_CONFIG
        ;;
esac

exit $SMF_EXIT_OK

The script checks whether the keys are available, and if not, prints a helpful hint to the console. The whole thing looks something like this at boot:

SunOS Release 5.11 Version 11.0 64-bit
Copyright (c) 1983, 2011, Oracle and/or its affiliates. All rights reserved.
Hostname: falcon

Run '/usr/sbin/zfs key -lr nest && /usr/sbin/svcadm clear svc:/system/filesystem/nest:default'
May 30 14:31:06 svc.startd[11]: svc:/system/filesystem/nest:default: Method "/lib/svc/method/nest start" failed with exit status 95.
May 30 14:31:06 svc.startd[11]: system/filesystem/nest:default failed fatally: transitioned to maintenance (see 'svcs -xv' for details)

falcon console login: jlee
Password: 
falcon% sudo -s
falcon# /usr/sbin/zfs key -lr nest && /usr/sbin/svcadm clear svc:/system/filesystem/nest:default
Enter passphrase for 'nest': 
falcon#

When I get to the console shell, I can just copy and paste the command printed by the script. Once the service failure is cleared, SMF continues the boot process normally and all of my other services come up exactly as I’d expect.

No, it’s not very pretty, but given how infrequently I reboot, I’d rather put up with a little manual intervention during the boot process than have to clean up after services that come up without the correct dependencies. And with my new homemade LOM, it’s not too much trouble to run commands at the console, even remotely.

My Homemade LOM

In one of the final classes of my CS master’s program, Embedded Computing, we were required to complete a semester project of our choosing involving embedded systems. Like in previous semester projects, I wanted to do something that I would actually be able to use after the class ended.

This time around I chose to build a lights-out manager for my Sun Ultra 24 server (which this blog is hosted on). With a LOM I can control the system’s power and access its serial console so I will be able to perform OS updates remotely, among other things.

Since I already owned one and I didn’t want to spend a lot of money, I chose to develop the project on top of the Arduino platform. I like the size of the Arduino, and the availability of different shields to minimize soldering. It’s also able to be powered by USB, which is perfect because the Ultra 24 has an internal USB port that always supplies power.

To make things a little more challenging for myself (because the Arduino is pretty easy on its own), I chose to use an external hardware UART to communicate with the Ultra 24’s serial port: specifically, the Maxim MAX3110E SPI UART and RS-232 transceiver. Great little chip.

For communication with the outside world, I bought an Ethernet shield from Freetronics. It’s compatible with the official Arduino Ethernet shield, but includes a fix to allow the network module to work with other SPI devices (such as my UART) on the same bus. I started to implement the network UI using Telnet, but after realizing I would have to translate the serial console data from VT100 to NVT, I switched to Rlogin, which is like Telnet, but assumes like-to-like terminal types.

Lastly, for controlling the system’s power, I figured out how to tap into the Ultra 24’s power LED and switch. Using the LED, I can check whether the system is on or off, and using the switch circuit and a transistor, I can power the system on and off. I managed to do this without affecting the operation of the front panel buttons/LEDs.

I’ll spare you all of the implementation details (if you’re interested, you can read my report). Suffice it to say, the thing works as well as I could have imagined. Here is a screenshot of me using the serial console on my workstation:

From my research, the serial and power motherboard headers are the same on most modern Intel systems, so this LOM should work on more than just an Ultra 24. If you want to build one of your own, my code is available on GitHub and the hardware schematic is in the report I linked to above.

The Solaris 11 Experience So Far

I have a system (a zone on which this blog is hosted) that has been running the same installation of Solaris since 11/11/2009, starting with OpenSolaris 2009.06. In the time since, it has seen every public build of OpenSolaris, then OpenIndiana, and finally Solaris 11 Express. Now, exactly two years later, I’ve updated it to Solaris 11 11/11, and I’d like to share my experience so far.

The update itself did not go smoothly. I was sitting at Solaris 11 Express SRU 8 and thought, like every update I’ve done in the past, that I could just run pkg image-update. Silly me, because when I did and then rebooted, the kernel panicked. No big deal, that’s what boot environments are for. I reverted to the previous boot environment and found some helpful documentation that told me to do exactly what I just did. It turns out that there is no way to update to SRU 13 using the support repositories because they already contain the Solaris 11 11/11 packages, and pkg tries to pull some of them in. And there is no way to update just pkg because the ips-consolidation prevents it, and trying to update the ips-consolidation pulls the entire package which breaks everything just the same. In short, Oracle bungled it. The only way to update to SRU 13 that I could see was to download the SRU 13 repository ISO from My Oracle Support and set up a local repository. Once I was on SRU 13, I could continue with the update to the 11/11 release. But there were more surprises in store for me.

First, it looks like pkg decided to start enforcing consistent attributes on files shared by multiple packages. Fine, I can understand that. As a result, I had to remove a lot of my custom packages (mostly from spec-files-extra) which I’ll have to rebuild. Second, pkg decided it doesn’t like the opensolaris.org packages anymore so I had to uninstall OpenOffice.org. Also fair enough.

Happily, after that, the updates got applied successfully and the system rebooted into the 11/11 release. Next came the zone updates. When I did the normal zoneadm -z foo detach && zoneadm -z foo attach -u deal, I was told I had to convert my zones to a new ZFS structure which more closely matches the global zone. The script /usr/lib/brand/shared/dsconvert actually worked flawlessly and the updated zones came up fine.

Unfortunately I couldn’t SSH into my zones because my DNS server didn’t know where they were. It seems that with the updated networking framework, DHCP doesn’t request a hostname anymore. (/etc/default/dhcpagent still says inet <hostname> can be put in /etc/hostname.<if> to request the hostname.) I found that you can create an addr object that requests a hostname with ipadm create-addr -T dhcp -h <hostname> <addrobj>, but NWAM pretty much won’t let you create or modify anything with ipadm, and there were no options for requesting hostnames with nwamcfg. As a result, I had to disable NWAM (netadm enable -p ncp DefaultFixed) and then I could set up the interface with ipadm. Why doesn’t Solaris request hostnames by default? Not very “cloud-like” if you ask me.
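
Putting those pieces together, the switch to fixed networking boils down to something like this (net0 and the hostname are examples; create-ip is only needed if the IP interface doesn’t already exist):

# netadm enable -p ncp DefaultFixed
# ipadm create-ip net0
# ipadm create-addr -T dhcp -h myhostname net0/v4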

I have to say, I’m impressed by the way global zones and non-global zones are linked in the new release. Zone updates were an obvious shortcoming of previous releases. We’ll see how well it works when Solaris 11 Update 1 comes out.

What else…I lost my ability to pfexec to root. Oracle removed the “Primary Administrator” profile for security reasons so I had to install sudo. Not a big deal, I just wish they had said something a little louder about it.

Also, whatever update to pkg happened, it wiped out my repositories under /var/pkg. I had to restore them from a snapshot. Bad Oracle!

I’m also a little confused about some of the changes to the way networking settings are stored. For example, when I first booted the global zone, I found that my NFSv4 domain name was reset by NWAM. I set it to what it should be with sharectl set -p nfsmapid_domain=thestaticvoid.com nfs, but is that going to be overwritten again by NWAM? Also, the name resolver settings are now stored in the svc:/network/dns/client:default service, and according to the documentation, DHCP will set the service properties properly, but I have yet to see this work.

And the last problem I’ll mention is that the update removed my virtual consoles. I had to install the virtual-console package to restore them.

Overall, I’m happy that I was at least able to update to the latest release. Oracle could have cut off any update path from OpenSolaris. However, the update should have been a lot smoother. It doesn’t speak well of future updates when I can’t even update from one supported release (SRU 8) to another. I also wish Oracle were more open about upcoming changes (as in, having more preview releases or, dare I say it, opening development the way OpenSolaris was). Even to me, a long time pre-Solaris 11 user, the changes to zones and networking are huge in this release, and I would rather have not been so surprised by them.

Automounting NFSv4 over SSH

For the past couple of years, I’ve used SSHFS to access my fileserver remotely (mostly from work). It’s always been pretty slow and it isn’t very stable on Solaris, so I’ve switched to NFSv4 over SSH. My biggest hangup with NFS was how to secure it over the internet. Its Kerberos support is completely overkill for my needs, and I never really wanted to deal with the complications of scripting the setup of an SSH tunnel, either. It all seemed so fragile.

Then I discovered autossh which does all the work of setting up and maintaining the tunnel for me. I coupled that with an executable autofs map to automatically start the tunnel just before trying to mount a share, like:

#!/bin/bash

export AUTOSSH_PIDFILE=/var/run/falcon-tunnel.pid
export AUTOSSH_GATETIME=0
export AUTOSSH_DEBUG=1

if [ -f $AUTOSSH_PIDFILE ]; then
    kill -HUP $(cat $AUTOSSH_PIDFILE)
else
    autossh -f -M 0 -o ServerAliveInterval=5 -NL 2050:localhost:2049 jlee@falcon
fi

echo "-fstype=nfs4,port=2050 localhost:/nest/$1"

Using an executable autofs map allows me to avoid reconciling the differences between service managers like SMF and Upstart, offering a consistent way to start the tunnel exactly when it’s needed on both Solaris and Linux. When you ‘cd’ into a directory managed by autofs, autossh is started or woken up, then the share is mounted over the tunnel. If there is a network interruption or change (from wired to wireless, for example), ssh will disconnect after 15 seconds of inactivity and autossh will restart it. NFS is smart enough to resume its operation when the tunnel is reestablished.
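
For completeness, wiring an executable map into the automounter is a single line in the master map. Assuming the script above is saved as /etc/auto_nest and marked executable, the entry in Solaris’ /etc/auto_master (or /etc/auto.master on Linux) might look like:

/nest   /etc/auto_nest

The automounter then runs the script with the requested directory name as its argument and mounts whatever the script echoes back.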

autossh has built-in support for heartbeat monitoring, but I’ve found SSH’s built-in ServerAliveInterval feature to be more reliable.

With this setup I have very simple, robust, and secure remote access to my fileserver.

Persistent Search Domains With NWAM and DHCP

What I Want

I want to be able to refer to systems on both my home and work networks by their hostnames rather than their fully-qualified domain names, so, ‘prey’ instead of ‘prey.thestaticvoid.com’ and ‘acad2’ instead of ‘acad2.es.gwu.edu’.

NWAM Settings

The Problem

I would typically set my home and work domains as the search setting in /etc/resolv.conf. Unfortunately, either NWAM or the Solaris DHCP client (I haven’t decided which) overwrites resolv.conf on every new connection. DHCP on Linux does the same thing, but I can configure it by editing dhclient.conf (or whatever is being used these days, it’s been a while. I think I just set my domains in the NetworkManager GUI and forget about it).

The Solaris DHCP client configuration is not nearly as flexible, and neither is NWAM, which gives you the choice of having resolv.conf populated entirely by the DHCP server or entirely by you, but not a mix of both. I do like having the nameservers set by the DHCP server, so supplying a manual configuration is not an option.

What I Tried

The first thing I tried was setting the LOCALDOMAIN environmental variable in /etc/profile. From the resolv.conf man page:

You can override the search keyword of the system
resolv.conf file on a per-process basis by setting the
environment variable LOCALDOMAIN to a space-separated list
of search domains.

I thought, great, a way to manage domain search settings without worrying about what’s doing what to resolv.conf. It didn’t work as advertised:

% LOCALDOMAIN=thestaticvoid.com ping prey
ping: unknown host prey
% s touch /etc/resolv.conf
% LOCALDOMAIN=thestaticvoid.com ping prey
prey is alive
% LOCALDOMAIN=thestaticvoid.com ping prey
ping: unknown host prey

Next, I considered adding an NWAM Network Modifier to set my search string in resolv.conf after a new connection is established. This worked reasonably well, but it didn’t handle switching from one network to another, for example from wireless to wired. The only NWAM events that can trigger a script on a connection change fire before DHCP messes up resolv.conf.

Finally, in the course of my testing, I discovered that the svc:/network/dns/client service was restarting with every network connection change. I looked into its manifest and saw that it was designed to wait for changes to resolv.conf:

<!--
 Wait for potential DHCP modification of resolv.conf.
-->
<dependency
   name='net'
   grouping='require_all'
   restart_on='none'
   type='service'>
    <service_fmri value='svc:/network/service' />
</dependency>

So I could write another service, one that depends on dns/client and restarts whenever it does, and have the last word about what goes into my configuration file!

My Solution

I wrote a service, svc:/network/dns/resolv-conf, with the following manifest:

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">

<service_bundle type="manifest" name="dns-resolv-conf">
    <service name="network/dns/resolv-conf"
        type="service"
        version="1">
        <create_default_instance enabled="false" />
        <single_instance />

        <dependency name="dns-client"
            grouping="require_all"
            restart_on="restart"
            type="service">
            <service_fmri value="svc:/network/dns/client" />
        </dependency>

        <dependent name="resolv-conf"
            grouping="optional_all"
            restart_on="restart">
            <service_fmri value="svc:/milestone/name-services" />
        </dependent>

        <exec_method type="method"
            name="start"
            exec="/lib/svc/method/dns-resolv-conf start"
            timeout_seconds="60" />

        <exec_method type="method"
            name="stop"
            exec="/lib/svc/method/dns-resolv-conf stop"
            timeout_seconds="60" />

        <property_group name="options" type="application">
            <propval name="search" type="astring" value="" />
        </property_group>

        <property_group name="startd" type="framework">
            <propval name="duration" type="astring" value="transient" />
        </property_group>

        <stability value="Unstable" />

        <template>
            <common_name>
                <loctext xml:lang="C">resolv.conf Settings</loctext>
            </common_name>
            <documentation>
                <manpage title="resolv.conf" section="4"
                    manpath="/usr/share/man" />
            </documentation>
        </template>
    </service>
</service_bundle>

which calls the script /lib/svc/method/dns-resolv-conf, containing:

#!/sbin/sh

. /lib/svc/share/smf_include.sh

search=$(svcprop -p options/search $SMF_FMRI)

case "$1" in
    "start")
        # Don't do anything if search option not provided.
        [ "$search" == '""' ] && exit $SMF_EXIT_OK

        # Reverse the lines because we either want to:
        #   add the search line after the *last* domain line or
        #   add it to the very top of the file if there is no domain line
        tac /etc/resolv.conf | grep -v "^search" | gawk '
            /^domain/ {
                if (!isset) {
                    print "search", $2, search
                    isset=1
                }
            }

            END {
                if (!isset) {
                    print "search", search
                }
            }

            1
        ' search="$search" | tac > /etc/resolv.conf.new && mv -f /etc/resolv.conf.new /etc/resolv.conf
        ;;

    "stop")
        # Just get rid of any search lines, I guess.
        grep -v "^search" /etc/resolv.conf > /etc/resolv.conf.new && mv -f /etc/resolv.conf.new /etc/resolv.conf
        ;;

    *)
        echo "Usage: $0 { start | stop }"
        exit $SMF_EXIT_ERR_CONFIG
esac

exit $SMF_EXIT_OK

So now I can set my search options like:

% svccfg -s resolv-conf setprop 'options/search="thestaticvoid.com es.gwu.edu"'
% svcadm refresh resolv-conf
% svcadm enable resolv-conf
% cat /etc/resolv.conf
domain  iss.gwu.edu
search iss.gwu.edu thestaticvoid.com es.gwu.edu
nameserver  161.253.152.50
nameserver  128.164.141.12

Problem solved! Or at least worked around in the least hacky way I can!