Wednesday, October 31, 2012

Finding the majority element in a stream of numbers

Some time ago I came across the following question.
As input a finite stream stream of numbers is provided. Define an algorithm to find the majority element of the input. The algorithm need not provide a sensible result if no majority element exists. You may assume a transdichotomous memory model.
There are a few definitions which may not be immediately clear:
Stream
A possibly infinite set of data which may not be reused in either the forward or backward direction without explicitly storing it.
Majority element
An element in a set which occurs more than half the time.
Transdichotomous
The integer size is equal to the word size of memory. One does not need to worry about storing partial pieces of integers in separate memory units.
Unfortunately this answer isn't of my own invention, but it is interesting and succinct.

The algorithm (click to view) Using 3 registers the accumulator, the guess and the current element (next):
  1. Initialize accumulator to 0
  2. Accept the next element of the stream and place it into next. If there are no more elements go to step #7.
  3. If accumulator is 0 place next into guess and increment accumulator.
  4. Else if guess matches next increment accumulator
  5. Else decrement accumulator
  6. Go to step 2
  7. Return the value in guess as the result

An interesting property of this algorithm is that it can be implemented in $O(n)$ time even on a single tape Turing Machine.

Tuesday, October 30, 2012

Cneonction: closed HTTP header

When you make a request to certain websites you may find an unusual header that looks a little strange:

[8000 eitan@radar ~ ]%curl -I http://www.imdb.com/ 2>/dev/null|grep close
Cneonction: close
[8001 eitan@radar ~ ]%curl -I http://maps.apple.com/ 2>/dev/null|grep close
Cneonction: close

This isn't a typo though. Some load balancers that sit between the web server and end user want to implement HTTP keep-alive without modifying the back end web server. The load balancer therefore has to add "Connection: Keep-Alive" to the HTTP header and also has to elide the "Connection: close" from the real webserver. However, if it completely removes the line the load balancer (acting as a TCP proxy) would have to stall before forwarding the complete text in order to recompute the TCP checksum. This increases latency on packet delivery.

Instead, the proxy uses a hack to keep the checksum unchanged. The TCP checksum of a packet is the 1s complement summation of all the 16 bit words (the final word might be right padded with zeros).[1] By manipulating the ordering, but not the content of the header the proxy can avoid changing the TCP checksum except by the fixed amount that the "Connection: Keep-Alive" adds (2061).

In particular:

>>>sum(ord(i) for i in "Connection") - sum(ord(i) for i in "Cneonction")

0

This reordering also keeps the packet size the same.

  1. RFC793
Edit 2012-10-31: Make the RFC a link and remove pointless "2>&1"
Thanks abbe for the inspiration! Thanks wxs for the proofreading.

Tuesday, October 9, 2012

Reduced Entropy in rand() and random()

TL;DR: Don't rely on undefined behavior, even when you think it should work.

I recently reported a minor issue to the FreeBSD security team.

The libc random functions had code1,2 designed to run when /dev/random is not available. This can easily occur in a chroot or jail environment.

if (!done) {
        struct timeval tv;
        unsigned long junk;

        gettimeofday(&tv, NULL);
        srandom((getpid() << 16) ^ tv.tv_sec ^ tv.tv_usec ^ junk);
        return;
}

This code is designed provide a minimal amount of entropy in the "failure" case. Unfortunately, it doesn't even provide the entropy it claims to. This is a minor issue because getpid, getimeday, and a single long variable don't provide a lot of entropy in the first place: (only $log_2{sizeof(long)}$ bits).

The point of the junk value is to add entropy by using uninitialized memory. This relies on the compiler being "stupid" enough not optimize it away.

Unfortunately clang and newer versions of gcc are smart enough to use the undefined behavior in undesired ways.

clang 3 removes any computation which relies on the undefined behavior and so produces the following object code:

 af0:   e8 5f fc ff ff          callq  754 <gettimeofday@plt>
 af5:   e8 7a fc ff ff          callq  774 <getpid@plt>
 afa:   e8 65 fc ff ff          callq  764 <srandom@plt>

Note that the junk variable is entirely unused and that the xor operation between gettimeofday and getpid is non-existent.

gcc 4.6 4 outputs:

 ce8:   e8 03 fa ff ff          callq  6f0 <gettimeofday@plt>
 ced:   e8 4e fa ff ff          callq  740 <getpid@plt>
 cf2:   48 8b 7c 24 08          mov    0x8(%rsp),%rdi
 cf7:   48 33 3c 24             xor    (%rsp),%rdi
 cfb:   c1 e0 10                shl    $0x10,%eax
 cfe:   48 98                   cltq
 d00:   48 31 c7                xor    %rax,%rdi
 d03:   e8 28 fa ff ff          callq  730 <srandom@plt>
Note that in this case the junk value appears to be (%rsp) which isn't all that random.

gcc 4.25 produces the following code

with the junk variable
 d9f:   e8 18 fa ff ff          callq  7bc <gettimeofday@plt>
 da4:   e8 43 fa ff ff          callq  7ec <getpid@plt>
 da9:   48 8b 3c 24             mov    (%rsp),%rdi
 dad:   48 33 7c 24 08          xor    0x8(%rsp),%rdi
 db2:   c1 e0 10                shl    $0x10,%eax
 db5:   48 98                   cltq
 db7:   48 31 c7                xor    %rax,%rdi
 dba:   48 31 df                xor    %rbx,%rdi
 dbd:   e8 1a fa ff ff          callq  <srandom@plt>
and without:
 d9f:   e8 18 fa ff ff          callq  7bc <gettimeofday@plt>
 da4:   e8 43 fa ff ff          callq  7ec <getpid@plt>
 da9:   48 8b 3c 24             mov    (%rsp),%rdi
 dad:   48 33 7c 24 08          xor    0x8(%rsp),%rdi
 db2:   c1 e0 10                shl    $0x10,%eax
 db5:   48 98                   cltq
 db7:   48 31 c7                xor    %rax,%rdi
 dba:   e8 1d fa ff ff          callq  7dc <srandom@plt>

The base version of gcc isn't vulnerable. However, with the upcoming switch to the clang compiler the FreeBSD does become vulnerable.

The first proposed fix was to add the volatile type qualifier to the junk variable. While this seemed to fix the code generation issue I didn't believe this to be a valid fix as the behavior is still undefined6 (I misread text of the standard). Additionally, the value is likely have be very predictable. A preference was expressed to remove the junk variable as using it may leak a small amount of stack data.

I proposed the simple and obvious fix of removing the use of the junk variable7

In a brief survey of other libraries I noticed similar issues. I will attempt to notify the vendors

It should be obvious, but undefined behavior is undefined and can't be relied on to ever to give a sensible result.

  1. random.c r165903 (line 316)
  2. rand.c r241046 (line 131)
  3. FreeBSD clang version 3.1 (branches/release_31 156863) 20120523 compiled with -O2 or -O3
  4. gcc46 (FreeBSD Ports Collection) 4.6.4 20120608 (prerelease) compiled with -O2
  5. gcc (GCC) 4.2.1 20070831 patched [FreeBSD] compiled with -O3
  6. sections 5.1.2.2.3 and 6.7.2.4 of ISO9899
  7. svn commit r241373
Edit 2010-10-10: update the paragraph referring to undefined behavior of volatile.

Monday, October 8, 2012

NFS Mount Network Booting VirtualBox

In order to test modifications to the FreeBSD kernel, I would like to boot a diskless virtual machine. The goal is to be able to quickly test changed I make without needing to recreate the VM each time, or run the installation procedure inside of the VM.

You can elect to put the root anywhere you want. For this example, I will be using /home/vm as the root of the virtual machine.

There are a few aspects to this:

  • The filesystem that will be booted.
  • VirtualBox setup
  • Virtual machine setup
  • DHCP server
  • Host setup

Filesystem Installation

  1. Create the distribution:
    cd /usr/src 
    make installworld installkernel distribution DESTDIR=/home/vm
    
  2. Create the /conf directory used for diskless boot. See /etc/rc.initdiskless for details.
    # cd /home/vm/
    # mkdir -p conf/base/etc
    # cd conf/base/etc
    # cat diskless_remount 
    /etc
    # cat fstab 
    md    /tmp    mfs    -s=30m,rw    0 0
    md    /var    mfs    -s=30m,rw    0 0
    # cat md_size 
    256m
    

VirtualBox Setup

Create a "host-only" interface with DHCP disabled. To do this:

  1. Open up the preferences screen and go the "network" tab.
    screenshot of the preferences screen
  2. Create a new host-only network (the green plus on the right).
    screenshot of the network preferences screen
  3. Disable the DHCP Server.
    screenshot of the host only network DHCP screen
  4. Select an IP Address in the range your VM will have.
    screenshot of the host only network preferences screen
  5. Note that the DHCP server will continue to run until killed or the machine rebooted:
    ## Note that you may kill too much here. Be careful.
    # pgrep -fl DHCP
    # pkill -15 DHCP
    

Create the Virtual Machine

  1. Create a new virtual machine. Make sure to select "FreeBSD - 64 bit". Note that you should create this machine without any disk (and ignore the warning at the end).
    filled in machine name and type screenshot
  2. Open up the virtual machine's settings
    default settings screenshot
  3. Select only the "Network" option under boot order (System->Motherboard->Boot Order). Also check "Hardware clock in UTC time."
    changed setting screen screenshot
  4. Under network select the Host Only Adapter created before. Under advanced options, change the adapter type to "PCNet-PCI II." Also make a note of the mac address of the virtual machine. It will be needed later.
    changed network setting screen screenshot

DHCP Server

For simplicity I opted to use a really simple DHCP server dnsmasq:
  1. Install dnsmasq
    $ make -C /usr/ports/dns/dnsmasq install clean
    
  2. Configure dnsmasq to server as a tftp and DHCP server for the virtual machine interface. Modify /usr/local/etc/dnsmasq.conf (I've bolded the parts that are configuration specific):
    interface=vboxnet0
    dhcp-range=172.16.100.100,172.16.100.200,48h[1]
    #Note that this mac address is the mac address of the vm noted earlier.
    dhcp-host=01:02:03:04:05:06,vm-testing,172.16.100.100,set:diskless
    dhcp-boot=tag:diskless,/home/vm/boot/pxeboot
    dhcp-option=tag:diskless,option:root-path,"/home/vm/"
    enable-tftp
    
  3. Add this to /etc/dhclient.conf so that you can reference your virtual machine by name:
    precede domain-name-servers 127.0.0.1;
    

Host Setup

  1. Add the following to /etc/rc.conf:
    # Required for VirtualBox
    devfs_system_ruleset="system" 
    
    # Required for diskless boot
    dnsmasq_enable="YES"
    
    # NFS Exports
    rpcbind_enable="YES"
    nfs_server_enable="YES"
    mountd_enable="YES"
    mountd_flags="-r"
    nfs_client_enable="YES"
    weak_mountd_authentication="YES"
    
  2. Add the directory to /etc/exports:
    /home/vm -ro -alldirs -network 172.16.0.0 -maproot=root
    

Now you are all set. Just boot the virtual machine from VirtualBox and watch it go!


  1. Note that the 172.16.0.0/20 network is RFC 1918 private address space.

Wednesday, October 3, 2012

#!/bin/bash considered harmful

When one writes a shell script there are a variety of shebang lines that could be used:

  • #!/bin/sh
  • #!/usr/bin/env bash
  • #!/bin/bash

or one of many other options.

Of these only the first two are possibly correct.

Using #!/bin/bash is wrong because:

  • Sometimes bash isn't installed.
  • If it is installed, it may not be in /bin
  • If it is in /bin, the user may have decided to set PATH to use a different installation of bash. Using an absolute path like this overrides the user's choices.
  • bash shouldn't be used for scripts intended for portability

If you have bash specific code use #!/usr/bin/env bash. If you want more portable code try using Debian's checkbashism to find instances of non-POSIX compliant shell scripting.

Monday, October 1, 2012

Some git-svn notes

When working with a subversion repository I often miss the use git features. However, it is possible for git to speak the subversion protocol:

Cloning the initial repository

$git svn clone svn://svn.freebsd.org/base/head


Resuming an interrupted git svn clone

It is really annoying when you start a git svn clone process overnight and come back to find that it stopped in the middle. Luckily, there is a really simple way to recover - without spending hours to redownload what you already have.

$git svn fetch
$git svn rebase


Committing the final patch

There are a lot of workflows for this, but I prefer the cherry-pick approach:

$git checkout master
$git svn rebase
$git cherry-pick commit-id
$git svn dcommit