The Solaris 11 Experience So Far

I have a system (a zone on which this blog is hosted) that has been running the same installation of Solaris since 11/11/2009, starting with OpenSolaris 2009.06. In the time since, it has seen every public build of OpenSolaris, then OpenIndiana, and finally Solaris 11 Express. Now, exactly two years later, I’ve updated it to Solaris 11 11/11, and I’d like to share my experience so far.

The update itself did not go smoothly. I was sitting at Solaris 11 Express SRU 8 and thought, like every update I’ve done in the past, that I could just run pkg image-update. Silly me, because when I did and then rebooted, the kernel panicked. No big deal, that’s what boot environments are for. I reverted to the previous boot environment and found some helpful documentation that told me to do exactly what I just did. It turns out that there is no way to update to SRU 13 using the support repositories because they already contain the Solaris 11 11/11 packages, and pkg tries to pull some of them in. And there is no way to update just pkg because the ips-consolidation prevents it, and trying to update the ips-consolidation pulls the entire package which breaks everything just the same. In short, Oracle bungled it. The only way to update to SRU 13 that I could see was to download the SRU 13 repository ISO from My Oracle Support and set up a local repository. Once I was on SRU 13, I could continue with the update to the 11/11 release. But there were more surprises in store for me.

First, it looks like pkg decided to start enforcing consistent attributes on files shared by multiple packages. Fine, I can understand that. As a result, I had to remove a lot of my custom packages (mostly from spec-files-extra) which I’ll have to rebuild. Second, pkg decided it doesn’t like the opensolaris.org packages anymore so I had to uninstall OpenOffice.org. Also fair enough.

Happily, after that, the updates got applied successfully and the system rebooted into the 11/11 release. Next came the zone updates. When I did the normal zoneadm -z foo detach && zoneadm -z foo attach -u deal, I was told I had to convert my zones to a new ZFS structure which more closely matches the global zone. The script /usr/lib/brand/shared/dsconvert actually worked flawlessly and the updated zones came up fine.

Unfortunately I couldn’t SSH into my zones because my DNS server didn’t know where they were. It seems that with the updated networking framework, DHCP doesn’t request a hostname anymore. (/etc/default/dhcpagent still says inet <hostname> can be put in /etc/hostname.<if> to request the hostname.) I found that you can create an addr object that requests a hostname with ipadm create-addr -T dhcp -h <hostname> <addrobj>, but NWAM pretty much won’t let you create or modify anything with ipadm, and there were no options for requesting hostnames with nwamcfg. As a result, I had to disable NWAM (netadm enable -p ncp DefaultFixed) and then I could set up the interface with ipadm. Why doesn’t Solaris request hostnames by default? Not very “cloud-like” if you ask me.

I have to say, I’m impressed by the way global zones and non-global zones are linked in the new release. Zone updates were an obvious shortcoming of previous releases. We’ll see how well it works when Solaris 11 Update 1 comes out.

What else…I lost my ability to pfexec to root. Oracle removed the “Primary Administrator” profile for security reasons so I had to install sudo. Not a big deal, I just wish they had said something a little louder about it.

Also, whatever update to pkg happened, it wiped out my repositories under /var/pkg. I had to restore them from a snapshot. Bad Oracle!

I’m also a little confused about some of the changes to the way networking settings are stored. For example, when I first booted the global zone, I found that my NFSv4 domain name was reset by NWAM. I set it to what it should be with sharectl set -p nfsmapid_domain=thestaticvoid.com nfs, but is that going to be overwritten again by NWAM? Also, the name resolver settings are now stored in the svc:/network/dns/client:default service, and according to the documentation, DHCP will set the service properties properly, but I have yet to see this work.

And the last problem I’ll mention is that the update removed my virtual consoles. I had to install the virtual-console package to restore them.

Overall, I’m happy that I was at least able to update to the latest release. Oracle could have cut off any update path from OpenSolaris. However, the update should have been a lot smoother. It doesn’t speak well of future updates when I can’t even update from one supported release (SRU 8) to another. I also wish Oracle were more open about upcoming changes (as in, having more preview releases or, dare I say it, opening development the way OpenSolaris was). Even to me, a long time pre-Solaris 11 user, the changes to zones and networking are huge in this release, and I would rather have not been so surprised by them.

Persistent Search Domains With NWAM and DHCP

What I Want

I want to be able to refer to systems on both my home and work networks by their hostnames rather than their fully-qualified domain names, so, ‘prey’ instead of ‘prey.thestaticvoid.com’ and ‘acad2’ instead of ‘acad2.es.gwu.edu’.

NWAM Settings

The Problem

I would typically set my home and work domains as the search setting in /etc/resolv.conf. Unfortunately, either NWAM or the Solaris DHCP client (I haven’t decided which) overwrites resolv.conf on every new connection. DHCP on Linux does the same thing, but I can configure it by editing dhclient.conf (or whatever is being used these days, it’s been a while. I think I just set my domains in the NetworkManager GUI and forget about it).

The Solaris DHCP client configuration is not nearly as flexible, and neither is NWAM which gives you the option of replacing resolv.conf with information supplied by the DHCP server, or provided by you, but not a mix of both. I do like having the nameservers set by the DHCP server, so supplying a manual configuration is not an option.

What I Tried

The first thing I tried was setting the LOCALDOMAIN environmental variable in /etc/profile. From the resolv.conf man page:

You can override the search keyword of the system
resolv.conf file on a per-process basis by setting the
environment variable LOCALDOMAIN to a space-separated list
of search domains.

I thought, great, a way to manage domain search settings without worrying about what’s doing what to resolv.conf. It didn’t work as advertised:

% LOCALDOMAIN=thestaticvoid.com ping prey
ping: unknown host prey
% s touch /etc/resolv.conf
% LOCALDOMAIN=thestaticvoid.com ping prey
prey is alive
% LOCALDOMAIN=thestaticvoid.com ping prey
ping: unknown host prey

Next, I considered adding an NWAM Network Modifier to set my search string in resolv.conf after a new connection is established. This worked reasonably well, but didn’t handle the case when you switch from one network to another, for example, from wireless to wired. The only events in NWAM that can trigger a script when the network connection changes happens before DHCP messes up resolv.conf.

Finally, in the course of my testing, I discovered that the svc:/network/dns/client service was restarting with every network connection change. I looked into its manifest and saw that it was designed to wait for changes to resolv.conf:

<!--
 Wait for potential DHCP modification of resolv.conf.
-->
<dependency
   name='net'
   grouping='require_all'
   restart_on='none'
   type='service'>
    <service_fmri value='svc:/network/service' />
</dependency>

So I could write another service which depends on dns/client and restarts whenever dns/client does and I would have the last word about what goes into my configuration file!

My Solution

I wrote a service, svc:/network/dns/resolv-conf, with the following manifest:

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">

<service_bundle type="manifest" name="dns-resolv-conf">
    <service name="network/dns/resolv-conf"
        type="service"
        version="1">
        <create_default_instance enabled="false" />
        <single_instance />

        <dependency name="dns-client"
            grouping="require_all"
            restart_on="restart"
            type="service">
            <service_fmri value="svc:/network/dns/client" />
        </dependency>

        <dependent name="resolv-conf"
            grouping="optional_all"
            restart_on="restart">
            <service_fmri value="svc:/milestone/name-services" />
        </dependent>

        <exec_method type="method"
            name="start"
            exec="/lib/svc/method/dns-resolv-conf start"
            timeout_seconds="60" />

        <exec_method type="method"
            name="stop"
            exec="/lib/svc/method/dns-resolv-conf stop"
            timeout_seconds="60" />

        <property_group name="options" type="application">
            <propval name="search" type="astring" value="" />
        </property_group>

        <property_group name="startd" type="framework">
            <propval name="duration" type="astring" value="transient" />
        </property_group>

        <stability value="Unstable" />

        <template>
            <common_name>
                <loctext xml:lang="C">resolv.conf Settings</loctext>
            </common_name>
            <documentation>
                <manpage title="resolv.conf" section="4"
                    manpath="/usr/share/man" />
            </documentation>
        </template>
    </service>
</service_bundle>

which calls the script, /lib/svc/method/dns-resolv-conf containing:

#!/sbin/sh

. /lib/svc/share/smf_include.sh

search=$(svcprop -p options/search $SMF_FMRI)

case "$1" in
    "start")
        # Don't do anything if search option not provided.
        [ "$search" == '""' ] && exit $SMF_EXIT_OK

        # Reverse the lines because we either want to:
        #   add the search line after the *last* domain line or
        #   add it to the very top of the file if there is no domain line
        tac /etc/resolv.conf | grep -v "^search" | gawk '
            /^domain/ {
                if (!isset) {
                    print "search", $2, search
                    isset=1
                }
            }

            END {
                if (!isset) {
                    print "search", search
                }
            }

            1
        '
search="$search" | tac > /etc/resolv.conf.new && mv -f /etc/resolv.conf.new /etc/resolv.conf
        ;;

    "stop")
        # Just get rid of any search lines, I guess.
        grep -v "^search" /etc/resolv.conf > /etc/resolv.conf.new && mv -f /etc/resolv.conf.new /etc/resolv.conf
        ;;

    *)
        echo "Usage: $0 { start | stop }"
        exit $SMF_EXIT_ERR_CONFIG
esac

exit $SMF_EXIT_OK

So now I can set my search options like:

% svccfg -s resolv-conf setprop 'options/search="thestaticvoid.com es.gwu.edu"'
% svcadm refresh resolv-conf
% svcadm enable resolv-conf
% cat /etc/resolv.conf
domain  iss.gwu.edu
search iss.gwu.edu thestaticvoid.com es.gwu.edu
nameserver  161.253.152.50
nameserver  128.164.141.12

Problem solved! Or at least worked-around in the least hacky way I can!

Fun With vpnc

I recently got a new laptop at work and I decided to put OpenSolaris on it. This meant I had to setup vpnc in order to access the server networks and wireless here. I installed my vpnc package, copied the profile from my Ubuntu workstation, and started it up. It connected, but no packets flowed. I didn’t have time to investigate, so I decided to work on it some more at home.

The strange thing is that it connected from home with the very same profile and everything worked fine. I immediately suspected something was wrong with the routing tables, like maybe some of the routes installed by vpnc-script were conflicting with the routes necessary to talk to the VPN concentrator. I endlessly compared the routing tables between work and home and my working Ubuntu workstation, removing routes, adding routes, and manually constructing the routing table until I was positive it could not be that.

Everything I pinged worked. I could ping the concentrator. I could ping the gateway. I could ping the tunnel device. I could ping the physical interface—or so I thought.

As I was preparing to write a message to the vpnc-devel mailing list requesting help, I did some pings to post the output in the email. I ran

$ ping <concentrator ip>
<concentrator ip> is alive

which looked good, but I wanted the full ping output, so I ran

$ ping -s <concentrator ip>
PING <concentrator ip>: 56 data bytes
^C
----<concentrator ip> PING Statistics----
4 packets transmitted, 1 packets received, 75% packet loss
round-trip (ms)  min/avg/max/stddev = 9223372036854776.000/0.000/0.000/-NaN

For some reason, only the first ping was getting through. The rest were getting hung up somewhere. The really strange thing was that I saw the same behavior on the local physical interface:

$ ifconfig bge0
bge0: flags=1004843 mtu 1500 index 3
        inet 161.253.143.151 netmask ffffff00 broadcast 161.253.143.255
$ ping -s 161.253.143.151
PING 161.253.143.151: 56 data bytes
^C
----161.253.143.151 PING Statistics----
5 packets transmitted, 1 packets received, 80% packet loss
round-trip (ms)  min/avg/max/stddev = 9223372036854776.000/0.000/0.000/-NaN

I have never seen a situation where you couldn’t even ping a local physical interface! I checked and double checked that IPFilter wasn’t running. Finally I started a packet capture of the physical interface to see what was happening to my pings:

# snoop -d bge0 icmp
Using device bge0 (promiscuous mode)
161.253.143.151 -> <concentrator ip> ICMP Destination unreachable (Bad protocol 50)
161.253.143.151 -> <concentrator ip> ICMP Destination unreachable (Bad protocol 50)
161.253.143.151 -> <concentrator ip> ICMP Destination unreachable (Bad protocol 50)
^C

That’s when by chance I saw messages being sent to the VPN concentrator saying “bad protocol 50.” IP protocol 50 represents “ESP”, commonly used for IPsec. Apparently Solaris eats these packets. Haven’t figured out why.

I remembered seeing something in the vpnc manpage about ESP packets:

--natt-mode <natt/none/force-natt/cisco-udp>

      Which NAT-Traversal Method to use:
      o    natt -- NAT-T as defined in RFC3947
      o    none -- disable use of any NAT-T method
      o    force-natt -- always use NAT-T encapsulation  even
           without presence of a NAT device (useful if the OS
           captures all ESP traffic)
      o    cisco-udp -- Cisco proprietary UDP  encapsulation,
           commonly over Port 10000

I enabled force-natt mode, which encapsulates the ESP packet in a UDP packet, normally to get past NAT, and it started working! In retrospect, I should have been able to figure that out much easier. First, it pretty much says it on the vpnc homepage: “Solaris (7 works, 9 only with –natt-mode forced).” I didn’t even notice that. Second, I should have realized that I was behind a NAT at home and not at work, so they would be using a different NAT-traversal mode by default. Oh well, it was a good diagnostic exercise, hence the post to share the experience.

In other vpnc related news, I’ve ported Kazuyoshi’s patch to the open_tun and solaris_close_tun functions of OpenVPN to the tun_open and tun_close functions of vpnc. His sets up the tunnel interface a little bit differently and adds TAP support. It solves the random problems vpnc had with bringing up the tunnel interface such as:

# ifconfig tun0
tun0: flags=10010008d0<POINTOPOINT,RUNNING,NOARP,MULTICAST,IPv4,FIXEDMTU> mtu 1412 index 8
        inet 128.164.xxx.yy --> 128.164.xxx.yy netmask ffffffff 
        ether f:ea:1:ff:ff:ff
# ifconfig tun0 up
ifconfig: setifflags: SIOCSLIFFLAGS: tun0: no such interface
# dmesg | grep tun0
Jul 23 14:56:05 swan ip: [ID 728316 kern.error] tun0: DL_BIND_REQ failed: DL_OUTSTATE

The changes are in the latest vpnc package available from my package repository.