Eitan Adler's thoughts

Impossible Bugs

2018-10-11T14:06:00.000-07:00

Many bugs are confusing. Others are annoying. Some are just impossible. This is a list of those bugs:

MRI disabled every iOS device in facility
We can't send mail more than 500 miles
OpenOffice.org can't print on Tuesday (see comment 28)
I can't log in when I stand up. (and another similar story)
A story about "magic"
Print this file, your printer will jam
gcj crashes in April and December, but only if you speak German in Austria
Processor 5 doesn't work if you're standing too close
A car that is allergic to vanilla ice cream
Some employees change the monitor's resolution without touching it.
The computer is filled with bees
My chair turns off my monitor (via tweet)

I'm sure there are more. Let me know!
Updated 2019-01-22: added additional bug
Updated 2020-01-07: added chair turning off monitor

Good Defaults for Technical Decisions

2018-06-01T15:28:00.000-07:00

In my experience as a software engineer I've found a few "rules of thumb" for technical decisions. None of these are hard requirements or things that can never be false. However these are good guidelines if you don't have a reason to make a different decision. Unlike most engineering decisions which first present the constraints and then try and find a solution within them, this attempts to document decisions one should make if you didn't have any constraints in the first place.
It's possible you'll disagree with me on some of these, and I'd like to understand why. That said, I'm not interested in specific projects where these are a bad idea, but in understanding why these shouldn't be the default.

Be explicit about your requirements. Don't automatically detect features, dependencies, or environment related issues. It is easier to change this later to make things more "magical" than go back and figure out what you need.
Namespaces are good, even if you think you'll only ever need one. They make future modification, versioning, etc. easier.
Errors ought to error. Warnings ought to error or not exist. It is generally unhelpful to have noise in your output that you do nothing with. If a warning isn't meaningful, disable it.
Keep scope local and private. Prefer hiding functions and information from the outside unless you have decided to make this an API.
Name your first version "v1" and label it as such. During rewrites, migrations, or related issues prefer versions rather than names such as "next" or "new". There will likely be many "nexts" and only one v2.
Structured is better than unstructured. Similar to the point about explicitness: it is easier to go from structured to unstructured than vice versa.
Fixed is better than editable. Don't let people change things unless there is a reason to. This also applies to code (immutable is better than mutable).
Don't rely on people not making mistakes. Even if you have perfect people, they might be tired, have something in their eye, misremember a fact, or otherwise be operating at a sub-optimal state.
Name the same things the same and different things differently. Use, and accept, the same formats for the same thing at all layers of the stack. As a counterexample, ruby outputs missing gems as name-version but gem(1) expects name:version.

This is a work-in-progress document and I'll try to update it over time.

Some rules for designing libc style APIs

2018-02-17T01:19:00.001-08:00

Identifiers should not have vowels; they are expensive and difficult to type.
An identifier must not be longer than 8 characters. The only exceptions are functions intended for standardization like sched_ss_init_budget.
Functions must not be reentrant. Relying on internal state means you can avoid allocating memory.
Functions should take at least two parameters. The second parameter should be a "flags" parameter which causes the function to do entirely different things.
Flags should be passed as macros with unspecified values. These macros must not have reasonable values.
Error handling must be done in one of two ways. The choice must not be consistent with other functions in the library:
1. The real return value should be stored in an "out" parameter. The return value must only determine if an error has occurred or not.
2. If an error occurs, the return value must be undefined. The return value can't be safely used without checking for errors using a separate function (e.g., fgets).
The error code should be in errno, requiring the setting of `errno = 0` beforehand and checking after an error occurs. However, the return value should be a value legally allowed to be in errno, so that initial attempts to use the function appear to work.
If the function returns a string, it must do so by modifying a memory location given as a parameter. Whether or not the string is terminated with a null must be determined solely based on the length of the output, a user supplied parameter, and choice of compiler.

Thanks to jp, okdana for the inspiration and review; thanks to gonzo, arbrock for review.

Papers We Read

2016-11-27T13:29:00.003-08:00

Some months ago I started a reading group at my workplace focussed on distributed systems. The goal of the group was to be an informal meeting to discuss a mixture of high impact, historical, and modern papers.

This is the list of papers we read:

Lamport Time Clocks
Spanner: Google’s Globally-Distributed Database
The Chubby Lock Service for Loosely-Coupled Distributed Systems
A note on distributed computing
The Byzantine Generals Problem
Your computer is already a distributed system. Why isn't your OS?
How Complex Systems Fail
Fast and Message-Efficient Global Snapshot Algorithms for Large-Scale Distributed Systems
Automatic Management of Partitioned, Replicated Search Services
Simple Testing Can Prevent Most Critical Failures
Dynamo: Amazon’s Highly Available Key-value Store
Wait-free coordination for Internet-scale systems
Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services
SEDA: An Architecture for Well-Conditioned, Scalable Internet Services
Kafka: a Distributed Messaging System for Log Processing
DistributedLog: A high performance replicated log service
The Log: What every software engineer should know about real-time data's unifying abstraction
Social Hash: an Assignment Framework for Optimizing Distributed Systems Operations on Social Networks
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
MapReduce: Simplified Data Processing on Large Clusters
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
MillWheel: Fault-Tolerant Stream Processing at Internet Scale
Snowflake - Unique ID Generation. “No two snowflakes are alike.”
The Hadoop Distributed File System
Gorilla: A Fast, Scalable, In-Memory Time Series Database
Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial
Meltdown
Spectre Attacks: Exploiting Speculative Execution
Communicating Sequential Processes
The Tail at Scale
Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web
Dapper, A Large Scale Distributed Systems Tracing Infrastructure
The many faces of consistency
SDPaxos: Building efficient semi-decentralized geo-replicated state machines
Dataflow Model
How to read a paper
Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network
Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook
Harvest, Yield, and Scalable Tolerant Systems
Lineage stash: fault tolerance off the critical path.
F1 Query: Declarative Querying at Scale
The Architectural Implications of Facebook’s DNN-based Personalized Recommendation
EFLOPS: Algorithm and System Co-design for a High Performance Distributed Training Platform

Other papers mentioned but not discussed:

Updated 2018-05-31: added additional papers
Updated 2018-06-20: added additional papers. Linkified a few more
Updated 2018-10-06: added additional papers. Linkified a few more
Updated 2019-07-10: added additional papers. Linkified a few more
Updated 2022-04-05: added additional papers. Linkified a few more

Blogging My Way Through CLRS Section 4.1

2015-06-15T21:13:00.000-07:00

After another long break of not writing up any CLRS answers here is section 4.1.

Question 4.1-1:
What does $\textit{Find-Maximum-Subarray}$ return when all elements of $A$ are negative?

The procedure would return the single element of maximum value. This is expected since the maximum subarray must contain at least one element. This can be seen by noting that $\textit{Find-Max-Crossing-Subarray}$ will always return the array of solely the midpoint and that $\textit{Find-Maximum-Subarray}$ always finds the maximum of $\{\textit{leftsum}, \textit{rightsum}, \textit{crosssum}\}$.

Question 4.1-2:
Write pseudocode for the brute-force method of solving the max-subarray problem. Your solution should run in $\Theta(n^2)$ time.

max_i = nil
max_j = nil
max_sum = -∞
for i in 0..len(A):
    cur_sum = 0
    for j in i..len(A):
        cur_sum += A[j]
        if cur_sum > max_sum:
            max_sum = cur_sum
            max_i = i
            max_j = j
return (max_i, max_j, max_sum)
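As a runnable sketch of the brute-force pseudocode, here is one possible Python translation. The function name `brute_force_max_subarray` is my own choice for illustration, and the returned indices are inclusive and zero-based.

```python
import math


def brute_force_max_subarray(a):
    """Return (start, end, total) of a maximum subarray, inclusive indices."""
    best_i = best_j = None
    best_sum = -math.inf
    for i in range(len(a)):
        cur_sum = 0
        for j in range(i, len(a)):
            # After this addition cur_sum is the sum of a[i..j].
            cur_sum += a[j]
            if cur_sum > best_sum:
                best_i, best_j, best_sum = i, j, cur_sum
    return best_i, best_j, best_sum
```

The two nested loops visit every (i, j) pair once and reuse the running sum, which is what keeps this at quadratic rather than cubic time.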

Question 4.1-3:
Implement both the brute-force and recursive algorithms for the maximum-subarray problem on your own computer. What problem size $n_0$ gives the crossover point at which the recursive algorithm beats the brute-force algorithm? Then, change the base case of the recursive algorithm to use the brute-force algorithm whenever the problem size is less than $n_0$. Does that change the crossover point?

This question is specific to the implementation and to the computer on which it is run. I will therefore be skipping it in this writeup. It is worthwhile to note that changing the implementation to use the brute force method for values less than $n_0$ is very likely to change $n_0$.

Question 4.1-4:
Suppose we change the definition of the maximum-subarray problem to allow the result to be an empty subarray, where the sum of the values of an empty subarray is 0. How would you change any of the algorithms that do not allow empty subarrays to permit an empty subarray to be the result?

For the brute force algorithm it would be rather trivial to add a check: if the returned max_sum is < 0, return the empty array.

For the recursive divide and conquer algorithm it is sufficient to just change the $\textit{Find-Max-Crossing-Subarray}$ in a manner similar to the brute force method. If $\textit{Find-Max-Crossing-Subarray}$ returns the correct value, then $\textit{Find-Maximum-Subarray}$ will do the correct thing.

Question 4.1-5:
Develop a nonrecursive linear-time algorithm for the maximum-subarray problem.^[1]

If one knows a previous answer to the max-subarray problem for a given prefix of the array, then any new element falls into one of only two cases: being part of the maximum subarray or not being part of the maximum subarray. It is easier to explain with pseudocode:

max_start = 0
max_end = 0
cur_start = 0
max_sum = A[0]
max_with_j = A[0]
for j in 1..len(A):
    # If j is in a maximum-subarray, either j is going to be the maximum on its own, or it will add to the current max
    max_with_j = max(A[j], max_with_j + A[j])
    # Remember the starting value if j is the sole element of a new subarray
    if max_with_j == A[j]:
        cur_start = j
    # Determine if j is the end of a maximum-subarray
    if max_with_j >= max_sum:
        max_sum = max_with_j
        max_start = cur_start
        max_end = j
return (max_start, max_end, max_sum)
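The linear-time idea can also be sketched in runnable Python. This is only one way to write it: the function name `linear_max_subarray` and the variable `cur_start` (the start of the current candidate run) are names I picked for illustration, and the input is assumed to be a non-empty list.

```python
def linear_max_subarray(a):
    """Return (start, end, total) of a maximum subarray of a non-empty list."""
    best_start = best_end = cur_start = 0
    best_sum = cur_sum = a[0]
    for j in range(1, len(a)):
        if cur_sum + a[j] < a[j]:
            # Extending the previous run would only hurt: restart at j.
            cur_sum = a[j]
            cur_start = j
        else:
            cur_sum += a[j]
        if cur_sum > best_sum:
            # The run ending at j is the best seen so far.
            best_sum = cur_sum
            best_start, best_end = cur_start, j
    return best_start, best_end, best_sum
```

Each element is examined once and only a constant number of variables are kept, so the running time is linear in the length of the array.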

[1] The question provides some hints as to the solution of the problem.

FreeBSD SMB Client under OSX Host

2015-03-29T21:23:00.001-07:00

I recently purchased a new Macbook Pro and wanted to get a FreeBSD Virtual Machine set up in order to continue doing development work on it. Unfortunately, FreeBSD as a guest does not support native folder sharing so I decided to try using a Samba mount.

I decided to set up my VM to have two network interfaces: a NATed interface for internet access and a host-only interface for access to SMB and ssh.

The NAT networking configuration looks like:

NetworkName:    FreeBSDNatNetwork
IP:             10.0.2.1
Network:        10.0.2.0/24
IPv6 Enabled:   Yes
IPv6 Prefix:
DHCP Enabled:   Yes
Enabled:        Yes
Port-forwarding (ipv4)
        SSH IPv4:tcp:[]:5022:[10.0.2.4]:22
Port-forwarding (ipv6)
        FreeBSD ssh:tcp:[]:6022:[fd17:625c:f037:2:a00:27ff:fefc:9dab]:22
loopback mappings (ipv4)

The Host-Only networking configuration looks like:

Name:            vboxnet0
GUID:            786f6276-656e-4074-8000-0a0027000000
DHCP:            Disabled
IPAddress:       192.168.56.1
NetworkMask:     255.255.255.0
IPV6Address:     
IPV6NetworkMaskPrefixLength: 0
HardwareAddress: 0a:00:27:00:00:00
MediumType:      Ethernet
Status:          Up
VBoxNetworkName: HostInterfaceNetworking-vboxnet0

The FreeBSD configuration looks like this: The OSX sharing configuration looks like:

Unfortunately, when attempting to actually mount the SMB filesystem with:

mount_smbfs -I 192.168.56.1 //eax@192.168.56.1/shared_vbox

I get the error:

mount_smbfs: can't get server address: syserr = Operation timed out

I tried installing the package net/samba36 and found that I needed the --signing=off flag to let it work:

It seems based on this setup and research that FreeBSD can not natively mount an OSX samba share. It might be possible to use sysutils/fusefs-smbnetfs. Other people have recommended NFS or sshfs.

Two Factor Authentication for SSH (with Google Authenticator)

2013-11-03T21:43:00.000-08:00

Two factor authentication is a method of ensuring that a user has a physical device in addition to their password when logging in to some service. This works by using a time (or counter) based code which is generated by the device and checked by the host machine. Google provides a service which allows one to use their phone as the physical device using a simple app.

This service can be easily configured and greatly increases the security of your host.
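To make the "time (or counter) based code" concrete, here is a minimal sketch of the standard HOTP (RFC 4226) and TOTP (RFC 6238) constructions, which as far as I know are what the Authenticator app computes. The function names and the 30 second step default are illustrative assumptions; this is not the PAM module's actual code.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Counter-based one-time code (HOTP, RFC 4226)."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte picks an offset.
    offset = mac[-1] & 0x0F
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def totp(secret: bytes, now: float, step: int = 30) -> str:
    """Time-based code (TOTP, RFC 6238): HOTP over the current time step."""
    return hotp(secret, int(now // step))
```

Both the phone and the host hold the same secret, so the host can compute `totp(secret, time.time())` itself and compare it with what the user typed. The counter variant needs both sides to agree on the counter instead of on the clock.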

Installing Dependencies

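There is only one: the Google-Authenticator software itself. Use whichever command matches your system (pkg on current FreeBSD, pkg_add on older FreeBSD releases, apt-get on Debian-based Linux):
```
# pkg install pam_google_authenticator
# pkg_add -r pam_google_authenticator
# apt-get install libpam-google-authenticator
```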

User configuration

Each user must run "google-authenticator" once prior to being able to login with ssh. This will be followed by a series of yes/no prompts which are fairly self-explanatory. Note that the alternative to time-based is to use a counter. It is easy to lose track of which number you are at so most people prefer time-based.

$ google-authenticator
Do you want authentication tokens to be time-based (y/n)
...

Make sure to save the URL or secret key generated here as it will be required later.

Host Configuration

To enable use of Authenticator the host must be set up to use PAM which must be configured to prompt for Authenticator.

Edit the file /etc/pam.d/sshd and add the following in the "auth" section prior to pam_unix:
auth requisite pam_google_authenticator.so
Edit /etc/ssh/sshd_config and uncomment
ChallengeResponseAuthentication yes

Reload ssh config

Finally, the ssh server needs to reload its configuration:
```
# service sshd reload
```

Configure the device

Follow the instructions provided by Google to install the authentication app and set up the phone.

That is it. Try logging into your machine from a remote machine now.

Thanks bcallah for proof-reading this post.

Pre-Interview NDAs Are Bad

2013-04-28T18:37:00.000-07:00

I get quite a few emails from business folk asking me to interview with them or forward their request to other coders I know. Given the volume it isn't feasible to respond affirmatively to all these requests.

If you want to get a coder's attention there are a lot of things you could do, but there is one thing you shouldn't do: require them to sign an NDA before you interview them.

From the candidate's point of view:

There are a lot more ideas than qualified candidates.
It's unlikely your idea is original. It doesn't mean anyone else is working on it, just that someone else probably thought of it.
Let's say the candidate was working on a similar, if not identical, project. If the candidate doesn't continue with you, they now have to consult a lawyer to make sure you can't sue them for a project they were working on before.
NDAs are hard legal documents and shouldn't be signed without consulting a lawyer. Does the candidate really want to find a lawyer before interviewing with you?
An NDA puts the entire obligation on the candidate. What does the candidate get from you?

From a company founder's point of view:

Everyone talks about the companies they interview with to someone. Do you want to be that strange company which made them sign an NDA? It can harm your reputation easily.
NDAs do not stop leaks. They serve to create liability when a leak occurs. Do you want to be the company that sues people that interview with them?

There are some exceptions; for example government and security jobs may require security clearance and an NDA. For those jobs it is possible to determine if a coder is qualified and a good fit without disclosing confidential company secrets.

Correctly Verifying an Email Address

2012-12-21T14:23:00.000-08:00

Some services that accept email addresses want to ensure that these email addresses are valid.

There are multiple aspects to an email being valid:

The address is syntactically valid.
An SMTP server accepts mail for the address.
A human being reads mail at the address.
The address belongs to the person submitting it.

How does one verify an email address? I'll start with the wrong solutions and build up the correct one.

Possibility #0 - The Regular Expression

Discussions on a correct regular expression to parse email addresses are endless. They are almost always wrong. Even really basic pattern matching such as *@*.* is wrong: it will reject the valid email address n@ai.^[5]

Even a fully correct regular expression does not tell you if the mailbox is valid or reachable.

This scores 0/4 on the validity checking scale.
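A quick way to see the problem is to try the glob-style pattern from above against a few inputs. This sketch uses Python's `fnmatch` module; the helper name `looks_like_email` is hypothetical.

```python
from fnmatch import fnmatchcase

# The naive glob-style pattern discussed above.
NAIVE_PATTERN = "*@*.*"


def looks_like_email(address: str) -> bool:
    """Return True if the address matches the naive pattern."""
    return fnmatchcase(address, NAIVE_PATTERN)
```

With this check `looks_like_email("n@ai")` is false even though the address is valid, while the meaningless string `"@."` passes.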

Possibility #1 - The VRFY Command

The oldest mechanism for verifying an email address is the VRFY mechanism in RFC821 section 4.1.1:

VERIFY (VRFY) This command asks the receiver to confirm that the argument identifies a user. If it is a user name, the full name of the user (if known) and the fully specified mailbox are returned.

However this isn't sufficient. Most SMTP servers disable this feature for security and anti-spam reasons. This feature could be used to enumerate every username on the server to perform more targeted password guessing attacks:

Both SMTP VRFY and EXPN provide means for a potential spammer to test whether the addresses on his list are valid (VRFY)... Therefore, the MTA SHOULD control who is is allowed to issue these commands. This may be "on/off" or it may use access lists similar to those mentioned previously.

This feature wasn't guaranteed to be useful at the time the RFC was written:^[1]

The VRFY and EXPN commands are not included in the minimum implementation (Section 4.5.1), and are not required to work across relays when they are implemented.

Finally, even if VRFY was fully implemented there is no guarantee that a human being reads the mail sent to that particular mailbox.

All of this makes VRFY useless as a validity checking mechanism so it scores 1/4 on the validity checking scale.

Possibility #2 - Sending a Probe Message

With this method you connect to a mail server and pretend to send a real mail message, but cut off before sending the message content. This is wrong for the following reasons:

A system administrator that disabled VRFY has a policy of not allowing for the testing for email addresses. Therefore the ability to test the email address by sending a probe should be considered a bug and must not be used.

The system might be set up to detect signs of a probe, such as cutting off early, and may rate limit or block the sender.

In addition, the SMTP server may be temporarily down or the mailbox temporarily unavailable, but this method provides no resilience against failure. This is especially true if this mechanism is attempting to provide real-time feedback to the user after submitting a form.

This scores 1/4 on the validity checking scale.

Possibility #3 - Sending a Confirmation Mail

If one cares about if a human is reading the mailbox the simplest way to do so is send a confirmation mail. In the email include a link to a website (or set a special reply address) with some indication of what is being confirmed. For example, to confirm "user@example.com" is valid the link might be http://example.com/verify?email=user@example.com or http://example.com/verify?account=12345^[2].

This method is resilient against temporary failures and forwarders. Temporary failures could be retried like a normal SMTP conversation.

This way it is unlikely that a non-human will trigger the verification email^[3]. While this approach solves some of the concerns, it suffers from a fatal flaw:

It isn't secure. It is usually trivial to guess the ID number, email account, or other identifier. An attacker could sign up with someone else's email account and then go to the verification page for that user's account. It might be tempting to use a random ID but randomness implementations are usually not secure.

This scores 3/4 on the validity checking scale

Possibility #4 - Sending a Confirmation Mail + HMAC

The correct solution is to send a confirmation, but include a MAC of the identifier in the verification mechanism (reply, or url) as well. A MAC is a construction used to authenticate a message by combining a secret key and the message contents. One family of constructions, HMAC, is a particularly good choice. This way the url might become http://example.com/verify?email=user@example.com&mac=74e6f7298a9c2d168935f58c001bad88^[4]

Remember that the HMAC is a specific construction, not a naive hash. It would be wise to use a framework native function such as PHP's hash_hmac. Failing to include a secret into the construction would make the MAC trivially defeated by brute force.

This scores 4/4 on the validity checking scale
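As a rough sketch of the confirmation link with a MAC, here is how it could look with Python's standard `hmac` module. The secret, the helper names, and the choice of SHA-256 (rather than the MD5 used for the short example URL above) are all assumptions made for illustration.

```python
import hashlib
import hmac
from urllib.parse import urlencode

# Hypothetical server-side secret; in practice load it from configuration.
SECRET_KEY = b"replace-with-a-long-random-secret"


def sign(email: str) -> str:
    """Compute the HMAC of the identifier under the server's secret key."""
    return hmac.new(SECRET_KEY, email.encode("utf-8"), hashlib.sha256).hexdigest()


def verification_url(email: str) -> str:
    """Build the link that goes into the confirmation mail."""
    query = urlencode({"email": email, "mac": sign(email)})
    return "http://example.com/verify?" + query


def is_valid(email: str, mac: str) -> bool:
    """Check a submitted (email, mac) pair from the verification page."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(sign(email), mac)
```

Because the attacker does not know the secret key, they cannot produce a valid `mac` for an address whose mailbox they cannot read, even though the email itself is easy to guess.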

Closing Notes

Getting email validation right is doable, but not as trivial as many of the existing solutions make it seem.

[1] Note that RFC1123 more specifically spells out that VRFY MUST be implemented but MAY be disabled.

[2] This is not my luggage password.

[3] It is still possible for an auto-reply bot to trigger reply based verification schemes. Bots that click every link in received email are uncommon.

[4] This is HMAC-MD5. It isn't insecure as collisions aren't important for HMAC. I chose it because it is short.

[5] n@ai is an in-use email address by a person named Ian:

%dig +short ai MX
10 mail.offshore.ai.

Thank you to bd for proofreading and reviewing this blog post.

Don't Use Timing Functions for Profiling

2012-11-21T20:54:00.000-08:00

One common technique for profiling programs is to use the gettimeofday system call (with code that looks something like this):

Example (incorrect) code that uses gettimeofday

#include <sys/time.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
void function(void)
{
  struct timeval before;
  struct timeval after;
  gettimeofday(&before, NULL);
  codetoprofile();
  gettimeofday(&after, NULL); 
  time_t delta = after.tv_sec - before.tv_sec;
  printf("%ld\n",delta);
}

However, using gettimeofday(2) or time(3) or any function designed to get a time of day to obtain profiling information is wrong for many reasons:

Time can go backwards. In a virtualized environment this can happen quite often. In non-virtualized environments this can happen due to time zones. Even passing CLOCK_MONOTONIC to clock_gettime(2) doesn't help as it can go backwards during a leap second expansion.
Time can change drastically for no reason. Systems with NTP enabled periodically sync their time with a time source. This can cause the system time to change by minutes, hours, or even days!
These functions measure Wall Clock time. Time spent on entirely unrelated processes is going to be included in the profiling data!
Even if you have disabled everything else on the system^[1] the delta computed above includes both user time and system time. If your algorithm is very fast but the kernel has a slow implementation of some system call you won't learn much.
gettimeofday relies on the cpu clock which may differ across cores resulting in time skew.

So what should be used instead?

There isn't a good, portable, function to obtain profiling information. However there are options for those not tied to a particular system (or those willing to maintain multiple implementations for different systems).

The getrusage(2) system call is one option for profiling data. This provides different fields for user time (ru_utime) and system time (ru_stime) at a relatively high level of precision and accuracy.
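For a feel of what getrusage reports, here is a small sketch using Python's `resource` module, which wraps the same system call on Unix-like systems. The helper names `cpu_times` and `profile` are hypothetical.

```python
import resource


def cpu_times():
    """Return (user_seconds, system_seconds) consumed by this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_utime, usage.ru_stime


def profile(func):
    """Run func() and return the user and system CPU time it consumed."""
    user_before, sys_before = cpu_times()
    func()
    user_after, sys_after = cpu_times()
    return user_after - user_before, sys_after - sys_before
```

Unlike a wall clock delta, the two numbers are reported separately and only count CPU time charged to this process, so time spent in unrelated processes is not included.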

Using DTrace's profiling provider also seems to be a decent choice although I have limited experience with it.

Finally, using APIs meant to access hardware specific features such as FreeBSD's hwpmc is likely to provide the best results at the cost of being the least portable. Linux has similar features such as oprofile and perf. Using dedicated profilers such as Intel's VTune^[2] may also be worthwhile.

[1] Including networking, background process swapping, cron, etc.
[2] A FreeBSD version is available.

update 2012-11-26: Include note about clock skew across cores.
Update 2013-02-13: Update and fix a massive error I had w.r.t. clock(3)

Finding the majority element in a stream of numbers

2012-10-31T08:41:00.001-07:00

Some time ago I came across the following question.

As input a finite stream of numbers is provided. Define an algorithm to find the majority element of the input. The algorithm need not provide a sensible result if no majority element exists. You may assume a transdichotomous memory model.

There are a few definitions which may not be immediately clear:

Stream: A possibly infinite set of data which may not be reused in either the forward or backward direction without explicitly storing it.
Majority element: An element in a set which occurs more than half the time.
Transdichotomous: The integer size is equal to the word size of memory. One does not need to worry about storing partial pieces of integers in separate memory units.

Unfortunately this answer isn't of my own invention, but it is interesting and succinct.

The algorithm

Using 3 registers the accumulator, the guess and the current element (next):

1. Initialize accumulator to 0
2. Accept the next element of the stream and place it into next. If there are no more elements go to step #7.
3. If accumulator is 0 place next into guess and increment accumulator.
4. Else if guess matches next increment accumulator
5. Else decrement accumulator
6. Go to step 2
7. Return the value in guess as the result

An interesting property of this algorithm is that it can be implemented in $O(n)$ time even on a single tape Turing Machine.
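The steps translate almost line for line into Python. This is a sketch with a function name of my own choosing; the approach is commonly known as the Boyer-Moore majority vote algorithm.

```python
def majority_element(stream):
    """Return the majority element of an iterable, if one exists.

    The result is meaningless when no element occurs more than half the time.
    """
    accumulator = 0
    guess = None
    for nxt in stream:
        if accumulator == 0:
            # No current candidate: adopt the next element as the guess.
            guess = nxt
            accumulator = 1
        elif guess == nxt:
            accumulator += 1
        else:
            accumulator -= 1
    return guess
```

The function consumes each element exactly once and never stores the stream, so it also works on a generator, for example `majority_element(iter([3, 1, 3, 2, 3]))`.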

Cneonction: closed HTTP header

2012-10-30T12:08:00.002-07:00

When you make a request to certain websites you may find an unusual header that looks a little strange:

[8000 eitan@radar ~ ]%curl -I http://www.imdb.com/ 2>/dev/null|grep close
Cneonction: close
[8001 eitan@radar ~ ]%curl -I http://maps.apple.com/ 2>/dev/null|grep close
Cneonction: close

This isn't a typo though. Some load balancers that sit between the web server and end user want to implement HTTP keep-alive without modifying the back end web server. The load balancer therefore has to add "Connection: Keep-Alive" to the HTTP header and also has to elide the "Connection: close" from the real webserver. However, if it completely removes the line the load balancer (acting as a TCP proxy) would have to stall before forwarding the complete text in order to recompute the TCP checksum. This increases latency on packet delivery.

Instead, the proxy uses a hack to keep the checksum unchanged. The TCP checksum of a packet is the 1s complement summation of all the 16 bit words (the final word might be right padded with zeros).^[1] By manipulating the ordering, but not the content of the header the proxy can avoid changing the TCP checksum except by the fixed amount that the "Connection: Keep-Alive" adds (2061).

In particular:

>>>sum(ord(i) for i in "Connection") - sum(ord(i) for i in "Cneonction")

0

This reordering also keeps the packet size the same.
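The byte sum is only part of the story, since the checksum works on 16 bit words. Here is a small sketch that computes the 1s complement sum over 16 bit words and compares the two header lines at both possible byte alignments. The function name and the exact header bytes are illustrative assumptions.

```python
def ones_complement_sum(data: bytes) -> int:
    """1s complement sum of big-endian 16 bit words, zero padded on the right."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        # Fold any carry out of the low 16 bits back in (end-around carry).
        total = (total & 0xFFFF) + (total >> 16)
    return total


original = b"Connection: close\r\n"
rewritten = b"Cneonction: close\r\n"

# The header may start on an even or an odd byte offset within the segment.
same_when_aligned = ones_complement_sum(original) == ones_complement_sum(rewritten)
same_when_shifted = (
    ones_complement_sum(b"\x00" + original) == ones_complement_sum(b"\x00" + rewritten)
)
```

Both comparisons come out equal because the shuffle only moves letters between positions of the same parity: "n" and "e" trade places on even offsets, and "o" and "n" trade places on odd offsets.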

[1] RFC793

Edit 2012-10-31: Make the RFC a link and remove pointless "2>&1"
Thanks abbe for the inspiration! Thanks wxs for the proofreading.

Reduced Entropy in rand() and random()

2012-10-09T07:49:00.003-07:00

TL;DR: Don't rely on undefined behavior, even when you think it should work.

I recently reported a minor issue to the FreeBSD security team.

The libc random functions had code^1,2 designed to run when /dev/random is not available. This can easily occur in a chroot or jail environment.

if (!done) {
        struct timeval tv;
        unsigned long junk;

        gettimeofday(&tv, NULL);
        srandom((getpid() << 16) ^ tv.tv_sec ^ tv.tv_usec ^ junk);
        return;
}

This code is designed to provide a minimal amount of entropy in the "failure" case. Unfortunately, it doesn't even provide the entropy it claims to. This is a minor issue because getpid, gettimeofday, and a single long variable don't provide a lot of entropy in the first place: (only $log_2{sizeof(long)}$ bits).

The point of the junk value is to add entropy by using uninitialized memory. This relies on the compiler being "stupid" enough not to optimize it away.

Unfortunately clang and newer versions of gcc are smart enough to use the undefined behavior in undesired ways.

clang ³ removes any computation which relies on the undefined behavior and so produces the following object code:


 af0:   e8 5f fc ff ff          callq  754 <gettimeofday@plt>
 af5:   e8 7a fc ff ff          callq  774 <getpid@plt>
 afa:   e8 65 fc ff ff          callq  764 <srandom@plt>

Note that the junk variable is entirely unused and that the xor operation between gettimeofday and getpid is non-existent.

gcc 4.6 ⁴ outputs:


 ce8:   e8 03 fa ff ff          callq  6f0 <gettimeofday@plt>
 ced:   e8 4e fa ff ff          callq  740 <getpid@plt>
 cf2:   48 8b 7c 24 08          mov    0x8(%rsp),%rdi
 cf7:   48 33 3c 24             xor    (%rsp),%rdi
 cfb:   c1 e0 10                shl    $0x10,%eax
 cfe:   48 98                   cltq
 d00:   48 31 c7                xor    %rax,%rdi
 d03:   e8 28 fa ff ff          callq  730 <srandom@plt>

Note that in this case the junk value appears to be (%rsp) which isn't all that random.

gcc 4.2⁵ produces the following code

with the junk variable


 d9f:   e8 18 fa ff ff          callq  7bc <gettimeofday@plt>
 da4:   e8 43 fa ff ff          callq  7ec <getpid@plt>
 da9:   48 8b 3c 24             mov    (%rsp),%rdi
 dad:   48 33 7c 24 08          xor    0x8(%rsp),%rdi
 db2:   c1 e0 10                shl    $0x10,%eax
 db5:   48 98                   cltq
 db7:   48 31 c7                xor    %rax,%rdi
 dba:   48 31 df                xor    %rbx,%rdi
 dbd:   e8 1a fa ff ff          callq  <srandom@plt>

and without:


 d9f:   e8 18 fa ff ff          callq  7bc <gettimeofday@plt>
 da4:   e8 43 fa ff ff          callq  7ec <getpid@plt>
 da9:   48 8b 3c 24             mov    (%rsp),%rdi
 dad:   48 33 7c 24 08          xor    0x8(%rsp),%rdi
 db2:   c1 e0 10                shl    $0x10,%eax
 db5:   48 98                   cltq
 db7:   48 31 c7                xor    %rax,%rdi
 dba:   e8 1d fa ff ff          callq  7dc <srandom@plt>

The base version of gcc isn't vulnerable. However, with the upcoming switch to the clang compiler FreeBSD does become vulnerable.

The first proposed fix was to add the volatile type qualifier to the junk variable. ~~While this seemed to fix the code generation issue I didn't believe this to be a valid fix as the behavior is still undefined⁶~~ (I misread text of the standard). Additionally, the value is likely to be very predictable. A preference was expressed to remove the junk variable as using it may leak a small amount of stack data.

I proposed the simple and obvious fix of removing the use of the junk variable⁷.

In a brief survey of other libraries I noticed similar issues. I will attempt to notify the vendors.

It should be obvious, but undefined behavior is undefined and can't be relied on to ever give a sensible result.

random.c r165903 (line 316)

rand.c r241046 (line 131)

FreeBSD clang version 3.1 (branches/release_31 156863) 20120523 compiled with -O2 or -O3

gcc46 (FreeBSD Ports Collection) 4.6.4 20120608 (prerelease) compiled with -O2

gcc (GCC) 4.2.1 20070831 patched [FreeBSD] compiled with -O3

sections 5.1.2.2.3 and 6.7.2.4 of ISO9899

svn commit r241373

Edit 2010-10-10: update the paragraph referring to undefined behavior of volatile.

NFS Mount Network Booting VirtualBox

2012-10-08T19:16:00.002-07:00

In order to test modifications to the FreeBSD kernel, I would like to boot a diskless virtual machine. The goal is to be able to quickly test changes I make without needing to recreate the VM each time, or run the installation procedure inside of the VM.

You can elect to put the root anywhere you want. For this example, I will be using /home/vm as the root of the virtual machine.

There are a few aspects to this:

The filesystem that will be booted.
VirtualBox setup
Virtual machine setup
DHCP server
Host setup

Filesystem Installation

Create the distribution:

cd /usr/src 
make installworld installkernel distribution DESTDIR=/home/vm

Create the /conf directory used for diskless boot. See /etc/rc.initdiskless for details.

# cd /home/vm/
# mkdir -p conf/base/etc
# cd conf/base/etc
# cat diskless_remount 
/etc
# cat fstab 
md    /tmp    mfs    -s=30m,rw    0 0
md    /var    mfs    -s=30m,rw    0 0
# cat md_size 
256m

VirtualBox Setup

Create a "host-only" interface with DHCP disabled. To do this:

Open up the preferences screen and go to the "network" tab.
Create a new host-only network (the green plus on the right).
Disable the DHCP Server.
Select an IP Address in the range your VM will have.

Note that the DHCP server will continue to run until killed or the machine rebooted:

## Note that you may kill too much here. Be careful.
# pgrep -fl DHCP
# pkill -15 DHCP

Create the Virtual Machine

Create a new virtual machine. Make sure to select "FreeBSD - 64 bit". Note that you should create this machine without any disk (and ignore the warning at the end).
Open up the virtual machine's settings
Select only the "Network" option under boot order (System->Motherboard->Boot Order). Also check "Hardware clock in UTC time."
Under network select the Host Only Adapter created before. Under advanced options, change the adapter type to "PCNet-PCI II." Also make a note of the mac address of the virtual machine. It will be needed later.

DHCP Server

For simplicity I opted to use a really simple DHCP server, dnsmasq:

Install dnsmasq

$ make -C /usr/ports/dns/dnsmasq install clean

Configure dnsmasq to serve as a tftp and DHCP server for the virtual machine interface. Modify /usr/local/etc/dnsmasq.conf (I've bolded the parts that are configuration specific):

interface=vboxnet0
dhcp-range=172.16.100.100,172.16.100.200,48h^[1]
#Note that this mac address is the mac address of the vm noted earlier.
dhcp-host=01:02:03:04:05:06,vm-testing,172.16.100.100,set:diskless
dhcp-boot=tag:diskless,/home/vm/boot/pxeboot
dhcp-option=tag:diskless,option:root-path,"/home/vm/"
enable-tftp

Add this to /etc/dhclient.conf so that you can reference your virtual machine by name:
```
precede domain-name-servers 127.0.0.1;
```

Host Setup

Add the following to /etc/rc.conf:

# Required for VirtualBox
devfs_system_ruleset="system" 

# Required for diskless boot
dnsmasq_enable="YES"

# NFS Exports
rpcbind_enable="YES"
nfs_server_enable="YES"
mountd_enable="YES"
mountd_flags="-r"
nfs_client_enable="YES"
weak_mountd_authentication="YES"

Add the directory to /etc/exports:

/home/vm -ro -alldirs -network 172.16.0.0 -maproot=root

Now you are all set. Just boot the virtual machine from VirtualBox and watch it go!

Note that the 172.16.0.0/12 network is RFC 1918 private address space.

#!/bin/bash considered harmful

2012-10-03T10:08:00.000-07:00

When one writes a shell script there are a variety of shebang lines that could be used:

#!/bin/sh
#!/usr/bin/env bash
#!/bin/bash

or one of many other options.

Of these only the first two are possibly correct.

Using #!/bin/bash is wrong because:

Sometimes bash isn't installed.
If it is installed, it may not be in /bin
If it is in /bin, the user may have decided to set PATH to use a different installation of bash. Using an absolute path like this overrides the user's choices.
bash shouldn't be used for scripts intended for portability

If you have bash specific code use #!/usr/bin/env bash. If you want more portable code try using Debian's checkbashisms to find instances of non-POSIX compliant shell scripting.

Some git-svn notes

2012-10-01T07:45:00.002-07:00

When working with a subversion repository I often miss the use of git features. However, it is possible for git to speak the subversion protocol:

Cloning the initial repository


$ git svn clone svn://svn.freebsd.org/base/head

Resuming an interrupted git svn clone

It is really annoying when you start a git svn clone process overnight and come back to find that it stopped in the middle. Luckily, there is a really simple way to recover - without spending hours to redownload what you already have.


$ git svn fetch

$ git svn rebase

Committing the final patch

There are a lot of workflows for this, but I prefer the cherry-pick approach:


$ git checkout master

$ git svn rebase

$ git cherry-pick commit-id

$ git svn dcommit

Finding the min and max in 1.5n comparisons

2012-09-03T13:06:00.000-07:00

A friend of mine recently gave me the following problem:

Given an unsorted set of numbers find the minimum and maximum of the set in at most $1.5n$ comparisons.

My answer involves splitting the list up pairwise so that each of the two searches only has to look at half of the set.

Go through the list and compare every even index to its immediate right (odd) index. Swap within each pair if needed so that the smaller element sits at the even index. This step takes $\dfrac{1}{2}n$ comparisons.
Find the minimum over the even indices and the maximum over the odd indices using the typical algorithm. Each scan covers $\dfrac{1}{2}n$ elements, so this step takes about $n$ comparisons in total.

Note that the pair comparison and the min/max comparisons could all be done in a single pass.
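
The single-pass variant could be sketched in Python roughly as follows. This is only a sketch: the function name `min_max` and the explicit comparison counter are my own additions for illustration, and the input is assumed to be a non-empty list.

```python
def min_max(values):
    """Return (minimum, maximum, comparisons) for a non-empty sequence."""
    n = len(values)
    if n == 0:
        raise ValueError("min_max() needs at least one value")
    comparisons = 0
    if n % 2 == 1:
        # Odd length: seed both results with the first element.
        lo = hi = values[0]
        start = 1
    else:
        # Even length: seed with the first pair (one comparison).
        comparisons += 1
        if values[0] < values[1]:
            lo, hi = values[0], values[1]
        else:
            lo, hi = values[1], values[0]
        start = 2
    for i in range(start, n, 2):
        a, b = values[i], values[i + 1]
        # One comparison orders the pair...
        comparisons += 1
        if a < b:
            small, large = a, b
        else:
            small, large = b, a
        # ...then the smaller value can only change the minimum and
        # the larger value can only change the maximum.
        comparisons += 2
        if small < lo:
            lo = small
        if large > hi:
            hi = large
    return lo, hi, comparisons


print(min_max([3, 1, 4, 1, 5, 9, 2, 6]))
```

Each pair after the first costs three comparisons for two elements, which is where the $1.5n$ bound comes from.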

Is there a better way?

Blogging my way through CLRS Section 11.1 (edition 2)

2012-05-07T20:47:00.001-07:00

I've taken a brief break from blogging about my Cormen readings but I decided to write up the answers to chapter 11. Note that the chapters and question numbers may not match up because I'm using an older edition of the book.

Question 11.1-1:
Suppose that a dynamic set $S$ is represented by a direct address table $T$ of length $m$. Describe a procedure that finds the maximum element of $S$. What is the worst case performance of your procedure?

If the addresses are sorted by key: start at the end of the direct address table and scan downward until a non-empty slot is found. This is the maximum. If they are not sorted:

1. Initialize $max$ to $-\infty$
2. Starting at the first address in the table, scan forward until a used slot is found. If you reach the end goto #5
3. Compare key to $max$. If it is greater assign it to $max$
4. Goto #2, resuming the scan at the next address
5. Return $max$

The performance of this algorithm is $\Theta(m)$. In the first case a slightly tighter bound of $\Theta(m - max)$ holds.
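
As a rough illustration of the first case, here is a small Python sketch. It assumes the table is a plain list where slot $k$ holds the element with key $k$ and empty slots hold `None`; the name `table_maximum` is made up for this example.

```python
def table_maximum(table):
    """Return the largest key stored in a direct-address table, or None.

    `table` is a list of length m where slot k holds the element with
    key k, or None if no such element is stored.
    """
    # Keys are the slot indices, so scanning from the top finds the
    # maximum at the first occupied slot.
    for key in range(len(table) - 1, -1, -1):
        if table[key] is not None:
            return key
    return None


print(table_maximum([None, "a", None, "b", None]))
```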

Question 11.1-2:
Describe how to use a bit vector to represent a dynamic set of distinct elements with no satellite data. Dictionary operations should run in $O(1)$ time.

Initialize a bit vector of length $|U|$ to all $0$s. When storing key $k$ set the $k$th bit and when deleting key $k$ set the $k$th bit to zero. This is $O(1)$ even in a non-transdichotomous model though it may be slower.
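
A minimal Python sketch of such a bit vector might look like this. The class name `BitVectorSet` and its method names are assumptions of mine, and keys are assumed to be integers in the range $0$ to $|U| - 1$.

```python
class BitVectorSet:
    """Dynamic set of integers in range(universe_size), one bit per key."""

    def __init__(self, universe_size):
        self.size = universe_size
        # One byte stores eight keys; every bit starts at 0.
        self.bits = bytearray((universe_size + 7) // 8)

    def insert(self, k):
        self.bits[k // 8] |= 1 << (k % 8)

    def delete(self, k):
        # Mask with 0xFF so the value stays a valid byte.
        self.bits[k // 8] &= ~(1 << (k % 8)) & 0xFF

    def search(self, k):
        return bool(self.bits[k // 8] & (1 << (k % 8)))


s = BitVectorSet(100)
s.insert(42)
print(s.search(42), s.search(43))
```

Each operation touches exactly one byte, so all three run in constant time.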

Question 11.1-3:
Suggest how to implement a direct address table in which the keys of stored elements do not need to be distinct and the elements can have satellite data. All three dictionary operations must take $O(1)$ time.

Each element in the table should be a pointer to the head of a linked list containing the satellite data. NIL can be used for non-existent items.

Question 11.1-4:
We wish to implement a dictionary by using direct addressing on a large array. At the start the array entries may contain garbage, and initializing the entire array is impractical because of its size. Describe a scheme for implementing a direct address dictionary on the array. Dictionary operations should take $O(1)$ time. Using an additional stack with size proportional to the number of stored keys is permitted.

On insert the array address is inserted into a stack. The array element is then initialized to the value of the location in the stack.

On search the array element value is checked to see if it is pointing into the stack. If it is, the value in the stack is checked to see if it is pointing back to the array.^[1]

On delete, the array element can be set to a value not pointing into the stack but this isn't required. If the element points to the top of the stack, it is simply popped off. If it is pointing to the middle of the stack, the top element and the key element are swapped and then the pop is performed. In addition the array element which the top element was pointing to must be modified to point to the new location.
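
To make the scheme concrete, here is a hedged Python sketch. Python has no truly uninitialized memory, so a list of arbitrary junk values passed in as `garbage` stands in for the huge array; the class name `SparseDirectTable` and the `[key, value]` stack entries are my own choices for illustration.

```python
class SparseDirectTable:
    """Direct-address dictionary over an array that is never cleared."""

    def __init__(self, garbage):
        # `garbage` plays the role of the huge uninitialized array: its
        # contents are arbitrary and are never trusted on their own.
        self.index = garbage
        self.stack = []  # entries are [key, value] pairs

    def _valid(self, key):
        pos = self.index[key]
        # The slot must point into the stack, and the stack entry must
        # point back at this key; garbage fails one of these checks.
        return (isinstance(pos, int) and 0 <= pos < len(self.stack)
                and self.stack[pos][0] == key)

    def insert(self, key, value):
        if self._valid(key):
            self.stack[self.index[key]][1] = value
        else:
            self.index[key] = len(self.stack)
            self.stack.append([key, value])

    def search(self, key):
        return self.stack[self.index[key]][1] if self._valid(key) else None

    def delete(self, key):
        if not self._valid(key):
            return
        pos = self.index[key]
        last = self.stack.pop()
        if pos < len(self.stack):
            # The deleted entry was in the middle: move the old top
            # into its place and repoint that key's slot.
            self.stack[pos] = last
            self.index[last[0]] = pos


table = SparseDirectTable([3, 0, 1, 7, 0, 2, 5, 9])
table.insert(4, "d")
print(table.search(4), table.search(1))
```

Note that slot 1 of the junk array happens to hold 0, which points into the stack once one key is stored; the back-pointer check is what rejects it.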

Question 11.2-1:
Suppose we use a hash function $h$ to hash $n$ distinct keys into an array $T$ of length $m$. Assuming simple uniform hashing what is the expected number of collisions?

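Counting a collision as a pair of distinct keys that hash to the same slot: since each key is equally likely to hash to any slot, any given pair collides with probability $1/m$. There are $\binom{n}{2} = n(n-1)/2$ pairs, so by linearity of expectation we would expect $\frac{n(n-1)}{2m}$ collisions.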

Question 11.2-2:
Demonstrate the insertion of the keys: $5, 28, 19, 15, 20, 33, 12, 17, 10$ into a hash table with 9 slots and $h(k) = k \mod{9}$^[2]

hash	values
1	28 -> 19 -> 10
2	20
3	12
5	5
6	15 -> 33
8	17
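
The insertions can be replayed with a few lines of Python. This is a sketch in which each chain is a Python list and new keys are appended to the tail of their chain (CLRS itself inserts at the head); the function name `chained_insert` is an assumption for the example.

```python
def chained_insert(keys, slots):
    """Hash keys with h(k) = k mod slots, resolving collisions by chaining."""
    table = {}
    for k in keys:
        # Create the chain on first use, then append the key to it.
        table.setdefault(k % slots, []).append(k)
    return table


table = chained_insert([5, 28, 19, 15, 20, 33, 12, 17, 10], 9)
for slot in sorted(table):
    print(slot, " -> ".join(str(k) for k in table[slot]))
```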

Question 11.2-3:
If the keys were stored in sorted order how is the running time for successful searches, unsuccessful searches, insertions, and deletions affected under the assumption of simple uniform hashing?

Successful and unsuccessful searches are largely unaffected although small gains can be achieved if the search bails out early once it finds a key later in the sort order than the one being searched for.

Insertions are the most affected operation. The time is changed from $\Theta(1)$ to $O(n/m)$

Deletions are unaffected. If the list was doubly linked the time remains $O(1)$. If it was singly linked the time remains $O(1 + \alpha)$

Question 11.2-4:
Suggest how storage for elements can be allocated and deallocated within the hash table by linking all unused slots into a free list. Assume one slot can store a flag and either one element or two pointers. All dictionary operations should run in $O(1)$ expected time.

Initialize all the values to a singly linked free list (flag set to false) with a head and tail pointer. On insert, use the memory pointed to by the head pointer and set the flag to true for the new element and increment the head pointer by one. On delete, set the flag to false and insert the newly freed memory at the tail of the linked list.

Question 11.2-5:
Show that if $|U| > nm$ with $m$ the number of slots, there is a subset of $U$ of size $n$ consisting of keys that all hash to the same slot, so that the worst case searching time for hashing with chaining is $\Theta(n)$

Suppose all $|U|$ keys are inserted into the hash table. In the optimal case of perfectly uniform hashing each of the $m$ slots holds $|U|/m \gt n$ items. Removing the assumption of uniform hashing allows some chains to become shorter at the expense of other chains becoming longer. Either way, since $|U| \gt nm$, at least one slot must hold at least $n$ items by the pigeonhole principle. Choosing $n$ keys from that slot gives a set whose searches all walk the same chain, so the worst case search time is $\Theta(n)$.

Question 11.3-1:
Suppose we wish to search a linked list of length $n$, where every element contains a key $k$ along with a hash value $h(k)$. Each key is a long character string. How might we take advantage of the hash values when searching the list for an element of a given key?

You can use $h(k)$ to create a bloom filter of strings in the linked list. This is a $\Theta(1)$ check to determine if it is possible that a string appears in the linked list.

Additionally, you can create a hash table of pointers to elements in the linked list with that hash value. This way you only check a subset of the linked list. Alternatively, since each element already stores its hash value, one can compare the hash of the search key to the hash of each item and only do the long string comparison if the hashes match.
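
The last idea, comparing the cheap hash before the expensive string, might be sketched like this in Python. A plain list of `(hash, key)` tuples stands in for the linked list, and the function name `find_by_key` is hypothetical.

```python
def find_by_key(items, key, hash_fn=hash):
    """Return the position of `key` in a list of (hash, key) pairs, or None."""
    target = hash_fn(key)
    for position, (stored_hash, stored_key) in enumerate(items):
        # The integer comparison is cheap; the long string comparison
        # only runs when the hashes already agree.
        if stored_hash == target and stored_key == key:
            return position
    return None


words = ["alpha" * 100, "beta" * 100, "gamma" * 100]
items = [(hash(w), w) for w in words]
print(find_by_key(items, "beta" * 100))
```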

Question 11.3-2:
Suppose that a string of length $r$ is hashed into $m$ slots by treating it as a radix-128 number and then using the division method. The number $m$ is easily represented as a 32 bit word but the string of $r$ character treated as a radix-128 number takes many words. How can we apply the division method to compute the hash of the character string without using more than a constant number of words outside of the string itself?

Instead of treating the word as a radix-128 number some form of combination could be used. For example you may add the values of each character together modulo 128.
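
A different approach from the one suggested above keeps the radix-128 interpretation and reduces modulo $m$ after every character (Horner's rule), so the running value never grows beyond one word. The sketch below assumes ASCII input, so every character value is below 128, and the function name `radix128_hash` is made up for the example.

```python
def radix128_hash(text, m):
    """Division-method hash of an ASCII string read as a radix-128 number."""
    h = 0
    for ch in text:
        # (h * 128 + digit) mod m keeps h below m at every step, so
        # only a constant number of words is ever needed.
        h = (h * 128 + ord(ch)) % m
    return h


print(radix128_hash("hello world", 701))
```

This gives the same result as building the full radix-128 number and taking it modulo $m$, because reduction modulo $m$ commutes with addition and multiplication.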

Question 11.3-4:
Consider a hash table of size $m = 1000$ and a corresponding hash function $h(k) = \lfloor m (k A \mod{1})\rfloor$ for $ A = \frac{\sqrt{5} - 1}{2}$ Compute the locations to which the keys 61, 62, 63, 64, 65 are mapped.

key	hash
61	700
62	318
63	936
64	554
65	172
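
The table can be reproduced with a short Python sketch of the multiplication method. The function name `mult_hash` is an assumption, and ordinary floating point is precise enough for keys this small.

```python
import math

A = (math.sqrt(5) - 1) / 2


def mult_hash(k, m=1000):
    """Multiplication-method hash: floor(m * frac(k * A))."""
    # (k * A) % 1 keeps only the fractional part of the product.
    return math.floor(m * ((k * A) % 1))


for k in range(61, 66):
    print(k, mult_hash(k))
```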

This is required because it is possible that the random garbage in the array points to the stack by random chance
unused slots not shown

Blogging my way through CLRS section 3.1 [part 5]

2011-07-10T10:32:00.006-07:00

Part 4 here.
I wrote an entire blog post explaining the answers to 2.3 but Blogger decided to eat it. I don't want to redo those answers so here is 3.1:
From now on I will title my posts with the section number as well to help Google.

Question 3.1-1: Let $f(n)$ and $g(n)$ be asymptotically non-negative functions. Using the basic definition of $\Theta$-notation, prove that $\max(f(n) , g(n)) \in \Theta(f(n) + g(n))$.

CLRS defines $\Theta$ as $\Theta(g(n))= \{ f(n) :$ there exist positive constants $c_1, c_2$, and $n_0$ such that $0 \leq c_1g(n) \leq f(n) \leq c_2g(n)$ for all $n \geq n_0\}$. Essentially we must prove that there exist some $c_1$ and $c_2$ such that $c_1 \times (f(n) + g(n)) \leq \max(f(n), g(n)) \leq c_2 \times (f(n) + g(n))$. There are a variety of ways to do this but I will choose the easiest way I could think of. We know that $\max(f(n), g(n)) \leq f(n) + g(n)$ (as $f(n)$ and $g(n)$ must both be non-negative) and we further know that $f(n) + g(n)$ can't be more than twice $\max(f(n), g(n))$. What we have then are the following inequalities: $$\max(f(n), g(n)) \leq c_2 \times (f(n) + g(n))$$ and $$c_1 \times (f(n) + g(n)) \leq \max(f(n), g(n))$$ The first holds with $c_2 = 1$ and, since $f(n) + g(n) \leq 2 \times \max(f(n), g(n))$, the second holds with $c_1 = \frac {1} {2}$.

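Question 3.1-2: Show for any real constants $a$ and $b$ where $b \gt 0$ that $(n+a)^b \in \Theta(n^b)$

Because $a$ is a constant and the definition of $\Theta$ only has to hold after some $n_0$, adding $a$ to $n$ does not affect the asymptotic behavior. Concretely, for all $n \geq \max(1, 2|a|)$ we have $\frac{1}{2}n \leq n + a \leq 2n$, and since $b \gt 0$ raising each side to the power $b$ gives $\left(\frac{1}{2}\right)^b n^b \leq (n+a)^b \leq 2^b n^b$, so $(n+a)^b \in \Theta(n^b)$.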

Question 3.1-3: Explain why the statement "The running time of $A$ is at least $O(n^2)$," is meaningless.

I'm a little uncertain of this answer but I think this is what CLRS is getting at: when we say a function $f(n)$ has a running time of $O(g(n))$ what we really mean is that $f(n)$ has an asymptotic upper bound of $g(n)$. This means that $f(n) \leq c \times g(n)$ for some constant $c$ after some $n_0$. To say a function has a running time of "at least" $O(g(n))$ puts a lower bound on something that is itself only an upper bound, so the statement tells us nothing about the running time.

Question 3.1-4: Is $2^{n+1} = O(2^n)$? Is $2^{2n} = O(2^n)$?

$2^{n+1} = 2 \times 2^n$, which means that $2^{n+1} \leq c_1 \times 2^n$ for $c_1 = 2$ and all $n$, so we have our answer that $2^{n+1} \in O(2^n)$. Alternatively we could say that the two functions only differ by a constant coefficient and therefore the answer is yes.

There is no constant $c$ such that $2^{2n} \leq c \times 2^n$ for all large $n$, because $2^{2n} = 2^n \times 2^n$ and $2^n$ grows without bound. Therefore $2^{2n} \notin O(2^n)$.

Question 3.1-5: Prove that for any two functions $f(n)$ and $g(n)$, we have $f(n) \in \Theta(g(n)) \iff f(n) \in O(g(n)) \And f(n) \in \Omega(g(n))$

This is an "if and only if" problem so we must prove this in two parts:

Firstly, if $f(n) \in O(g(n))$ then there exist some $c_1$ and $n_1$ such that $f(n) \leq c_1 \times g(n)$ for all $n \geq n_1$. Further if $f(n) \in \Omega(g(n))$ then there exist some $c_2$ and $n_2$ such that $f(n) \geq c_2 \times g(n)$ for all $n \geq n_2$.

If we combine the above two statements (which come from the definitions of $\Omega$ and $O$) then we know that there exist some $c_1$, $c_2$, and $n_0 = \max(n_1, n_2)$ such that $c_2g(n) \leq f(n) \leq c_1g(n)$ for all $n \geq n_0$, which is the definition of $\Theta$.

We could do the same thing backward for the other direction: If $f(n) \in \Theta(g(n))$ then we could split the above inequality and show that each of the individual statements is true.

Question 3.1-6: Prove that the running time of an algorithm is $\Theta(g(n)) \iff$ its worst-case running time is $O(g(n))$ and its best case running time is $\Omega(g(n))$.

I'm going to try for an intuitive proof here instead of a mathematical one. If the running time is asymptotically bounded above in the worst case by a certain function and asymptotically bounded below in the best case by the same function, then every running time is tightly bound by that function: $f(n)$ never goes below some constant times $g(n)$ and never goes above some constant times $g(n)$. This is what we get from the above definition of $\Theta(g(n))$. A mathematical proof follows from question 3.1-5.

Question 3.1-7: Prove that $o(g(n)) \cap \omega(g(n)) = \varnothing$

little o and little omega are defined as follows: \[o(g(n)) = \{ f(n) : \forall c > 0 \ \exists n_0 \text{ such that } 0 \leq f(n) \leq c \times g(n) \ \forall n \gt n_0 \}\] and \[\omega(g(n)) = \{ f(n) : \forall c > 0 \ \exists n_0 \text{ such that } 0 \leq c \times g(n) \leq f(n) \ \forall n \gt n_0 \}\]

In other words

$$f(n) \in o(g(n)) \iff \lim_{n \to \infty} \frac {f(n)} {g(n)} = 0$$ and $$f(n) \in \omega(g(n)) \iff \lim_{n \to \infty} \frac {f(n)} {g(n)} = \infty$$

It is obvious that these can not be true at the same time. This would require that $0 = \infty$

Blogging my way through CLRS [part 4]

2011-06-29T17:43:00.005-07:00

Part 3 here. This set is a bit easier than last time.

Question 2.2-1:Express the function $$\frac{n^3}{1000} - 100n^2 - 100n + 3$$ in terms of $\Theta$ notation

A function $g(x)$ is said to be in the set of all functions $\Theta(f(x))$ if and only if $g(x)$ is also in the set of all functions $\Omega(f(x))$ and in the set of all functions $O(f(x))$.
Symbolically: $$g(x) \in \Theta(f(x)) \iff g(x) \in O(f(x)) \And g(x) \in \Omega(f(x))$$

A function $g(x)$ is in the set of all functions $O(f(x))$ if and only if there is some positive constant $C$ such that, after some constant $c$, it is always true that $g(x) \leq Cf(x)$

A function $g(x)$ is in the set of all functions $\Omega(f(x))$ if and only if there is some positive constant $C$ such that, after some constant $c$, it is always true that $g(x) \geq Cf(x)$

With our function we could choose practically any function to satisfy either one of these conditions. However we need to satisfy both of them. One thing that makes this easier is that it only has to be true after some constant number. This allows us to throw away the "trivial" parts that are eventually overwhelmed by the faster growing terms. We therefore are only left with $n^3$, so the answer is $\Theta(n^3)$.

Question 2.2-2: Consider sorting n numbers stored in an array A by first finding the smallest element and exchanging it with the element in A[1], then find the second smallest element and exchange it with A[2], and continue this for the first n-1 elements of A. Write the pseudocode for this algorithm, which is known as Selection Sort. What loop invariant does this algorithm maintain? Why does it need to run only for the first n-1 elements and not for all n? Give the best case and worst case running times in $\Theta$ notation

This question is asking us to analyze selection sort in a variety of ways. I will start with writing out the pseudocode:
for $j \leftarrow 1$ to $n-1$
   min $\leftarrow$ j
   for $i \leftarrow j+1$ to $n$
     if A[i] < A[min] then min $\leftarrow$ i
   if min $\neq$ j then swap A[min] and A[j]
A loop invariant that this algorithm maintains is that at the end of each iteration the subarray A[1] to A[j] is sorted and every element in it is less than or equal to every element in the subarray A[j+1] to A[n]. I do not believe a stronger loop invariant is provable.
The algorithm only needs to run until n-1 because of the second part of the loop invariant. When $j = n-1$ we know that every element after A[j], which is just A[n], is not less than any previous element. Therefore no check has to be done.
In the best case (an already sorted array) and in the worst case (a reverse sorted array) the running time is the same: $\Theta(n^2)$
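
For comparison, here is a Python sketch of selection sort. It uses 0-based indices instead of the 1-based indices in the question, and the function name `selection_sort` is my own choice.

```python
def selection_sort(a):
    """Sort the list `a` in place into nondecreasing order and return it."""
    n = len(a)
    # The last element needs no pass of its own: once the first n-1
    # positions are filled, it is already the largest.
    for j in range(n - 1):
        smallest = j
        for i in range(j + 1, n):
            if a[i] < a[smallest]:
                smallest = i
        if smallest != j:
            a[j], a[smallest] = a[smallest], a[j]
    return a


print(selection_sort([5, 2, 4, 6, 1, 3]))
```

The inner loop always scans the whole unsorted remainder, whatever the input order, which is why the best and worst cases are both quadratic.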

Question 2.2-3: Consider linear search again. How many elements of the input sequence need to be checked on average, assuming that the element being searched for is equally likely to be any element in the array? How about in the worst case? What are the average-case and worst-case running times of linear search in $\Theta$ notation?

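The best case for a linear search algorithm is when the searched-for element is in the first location. In the worst case all $n$ locations must be searched. In the average case about $\frac{n}{2}$ locations (precisely $\frac{n+1}{2}$) have to be searched. Both the average-case and worst-case running times are therefore $\Theta(n)$.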

Question 2.2-4: How can we modify almost any algorithm to have a good best-case running time?

I have no idea what this question is asking for. I guess checking for the optimal case (as in a pre-sorted array for a sorting algorithm) and then skipping the rest of the procedure might work.

Blogging my way through CLRS [3/?]

2011-06-26T09:55:00.005-07:00

part 2 here
According to Wikipedia Introduction to Algorithms is also known as CLRS which is shorter (and more fair to the other authors) so I'll use that name from now on.

Question 2.1-1 asks me to redraw a previous diagram, but with different numbers. I am not going to post that here.

Question 2.1-2 Rewrite the insertion sort procedure to sort into nonincreasing instead of nondecreasing order: Here is the pseudocode of the nonincreasing version of insertion sort:
for j $ \leftarrow 2$ to length[A]
  do key$ \leftarrow A[j]$
     $\rhd$ Insert A[j] into sorted sequence A[1..j-1]
    $ i \leftarrow j - 1$
    while $i \gt 0$ AND $A[i] \lt key$
      do $A[i+1] \leftarrow A[i]$
         $i \leftarrow i - 1$
    $A[i+1] \leftarrow key$

Now we prove that this loop correctly terminates with a nonincreasing array to about the same level of formality as the book proved the original.
Initialization: At the first iteration, when $j=2$ the subarray A[1..j-1] is trivially sorted (as it has only one element).
Maintenance: In order to prove maintenance we need to show that the inner loop correctly terminates with an array with "space" for the correct element. As CLRS did not prove this property, I will also skip this proof.
Termination: this loop terminates when j > length[A], that is, when $j = length[A]+1$. Since we have "proven" (to some level) the maintenance of the loop invariant (that at each point during the loop the subarray [1..j-1] is sorted) we could substitute length[A]+1 for $j$ which becomes [1..length[A]] or the entire array.
This shows that the loop terminates with a correctly sorted array.
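
A Python sketch of the same nonincreasing insertion sort, using 0-based indices; the function name `insertion_sort_desc` is hypothetical.

```python
def insertion_sort_desc(a):
    """Sort the list `a` in place into nonincreasing order and return it."""
    for j in range(1, len(a)):
        key = a[j]
        # Shift every smaller element of the sorted prefix a[0..j-1]
        # one place to the right to open a gap for key.
        i = j - 1
        while i >= 0 and a[i] < key:
            a[i + 1] = a[i]
            i -= 1
        a[i + 1] = key
    return a


print(insertion_sort_desc([5, 2, 4, 6, 1, 3]))
```

The only change from the nondecreasing version is the direction of the comparison in the `while` condition.
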
Question 2.1-3:
Input: A sequence of $n$ numbers $A = \{a_1,a_2,...,a_n\}$ and a value $v$.
Output: An index i such that $v = A[i]$ or a special value $\varnothing$ (NIL) if $v \notin A$
Write the pseudocode for Linear Search, which scans through the sequence looking for $v$. Using a loop invariant, prove that your algorithm is correct.
The first part, writing the pseudocode, seems fairly easy:
$r \leftarrow \varnothing$
for $j \leftarrow 1$ to length[A]
   if $v = A[j]$ $\rhd$ optionally check that $r = \varnothing$
      $r \leftarrow j$
return $r$

The second part, proving that this is correct is harder than before because we don't have a trivially true initialization of our loop invariant.
Initialization: $j = 1\ \And\ r = \varnothing$ at the start of our loop. At this point there are no elements prior to A[j] and we have yet to find $v$ in A. As such our invariant (that r will contain the correct value) is true.
Maintenance: At every point in the loop the subarray A[1..j] has either contained $v$ in which case it has been assigned to $r$ or has not contained $v$ in which case $r$ remains $\varnothing$. This means that loop invariant holds for every subarray A[1..j].
Termination: At the end of the loop $j = $ length[A]. We know from our maintenance that $r$ is correct for every subarray A[1..j] so at termination $r$ contains the correct value
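
In Python the search might be sketched as below. The sketch uses 0-based indices, returns `None` in place of NIL, and includes the optional check so that the first match is the one reported; the function name `linear_search` is my own.

```python
def linear_search(a, v):
    """Return the first index i with a[i] == v, or None if v is absent."""
    result = None
    for j, item in enumerate(a):
        # Invariant: result is the first index of v within a[0..j-1],
        # or None if v does not occur there.
        if item == v and result is None:
            result = j
    return result


print(linear_search([3, 1, 4, 1, 5], 1))
```
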
Question 2.1-4 Consider the problem of adding two $l$-bit binary integers, stored in two $l$-element arrays $A$ and $B$. The sum of the two integers should be stored in binary form in an $(l+1)$-element array $C$. State the problem formally and write pseudocode for adding the integers.
Stating the problem formally looks something like:
Input: Two $l$-bit integers $A$ and $B$ stored as arrays of length $l$ with the most significant bit stored last
Output: An $(l+1)$-bit integer $C$ stored as an array of length $l+1$ with the most significant bit stored last
Here is the pseudocode:
$\rhd$ X is an $(l+1)$-element array of bits initialized to all zeros in order to store the carry
for j $\leftarrow$ 1 to $l$
   $copyC \leftarrow A[j] \oplus B[j]$
   $C[j] \leftarrow copyC \oplus X[j]$
   $X[j+1] \leftarrow (A[j] \And B[j]) \lor (copyC \And X[j])$
$C[l+1] \leftarrow X[l+1]$
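
A Python sketch of the same ripple-carry idea follows. It keeps the convention of storing the most significant bit last, replaces the carry array with a single `carry` variable, and uses the made-up name `add_binary`.

```python
def add_binary(a, b):
    """Add two equal-length bit lists stored least significant bit first."""
    if len(a) != len(b):
        raise ValueError("both inputs must have the same number of bits")
    c = []
    carry = 0
    for x, y in zip(a, b):
        partial = x ^ y
        c.append(partial ^ carry)
        # A carry leaves this position if both inputs are 1, or if
        # exactly one input is 1 and a carry came in.
        carry = (x & y) | (partial & carry)
    c.append(carry)
    return c


# 3 + 5 = 8, written least significant bit first.
print(add_binary([1, 1, 0], [1, 0, 1]))
```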

Blogging my way through Cormen [2/?]

2011-06-23T12:38:00.003-07:00

As I said in part 1 I am reading a book on algorithms and have decided to blog my way through. My primary goal in doing so is to improve my writing skills. A secondary goal is to force myself to actually answer the questions.

1.2-2 Suppose we are comparing implementations of insertion sort and merge sort on the same machine. For inputs of size n, insertion sort runs in $8n^2$ steps, while merge sort runs in $64n \lg n$ steps. For which values of $n$ does insertion sort beat merge sort?

The question is essentially asking for which values of $n$ is $8n^{2} \lt 64n \lg n$. We can solve this question by first factoring out an $8n$ and we get $n \lt 8 \lg n$. Unfortunately this problem is not solvable using elementary operations. Luckily we are being asked for an integer solution (as computers operate in discrete steps) and we could use the underutilized guess-and-check method.

$n$	$8 \lg n$
14	30.46
41	42.86
43	43.41
44	43.675

So there we have it: given this data we would prefer insertion sort whenever we have between 2 and 43 items. At 44 items $8 \lg n$ drops below $n$ and merge sort takes over.
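
The guessing can also be handed to a few lines of Python. This sketch simply tests every $n$ below an arbitrary cutoff of 1000 that I picked for illustration; the function name `insertion_beats_merge` is likewise an assumption.

```python
import math


def insertion_beats_merge(n):
    """True when 8n^2 steps is strictly fewer than 64 n lg n steps."""
    return 8 * n**2 < 64 * n * math.log2(n)


winners = [n for n in range(1, 1000) if insertion_beats_merge(n)]
print(winners[0], winners[-1])
```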

1.2-3 What is the smallest value of n such that an algorithm whose running time is $100n^2$ runs faster than an algorithm whose running time is $2^n$ on the same machine.

This question is asking us to find the smallest positive integer $n$ that satisfies $100n^{2} \lt 2^n$. This could be solved by doing the math, by looking at a plot of the curves, or using the above method again. $$2^{14} = 16384$$ $$(100 \times 14^{2}) = 19600$$ $$2^{15} = 32768$$ $$(100 \times 15^{2}) = 22500$$ At $n = 14$ the exponential is still the smaller of the two, while at $n = 15$ it is the larger, so the smallest value is $n = 15$.
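
The same search written as a Python sketch, starting from $n = 1$ because the question is about positive integers; the function name `smallest_n` is hypothetical.

```python
def smallest_n():
    """Smallest positive integer n with 100 * n^2 < 2^n."""
    n = 1
    # Keep going while the quadratic algorithm is not yet faster.
    while 100 * n**2 >= 2**n:
        n += 1
    return n


print(smallest_n())
```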

Thank you JT and JM for giving me the idea to go through the book, and for looking at my posts before I publish them.
Updated 2012-09-05: I had a brain lapse the day I originally published this and accidentally used the natural logarithm instead of the base 2 log for question 1.2-2. How I ever managed to do that I will not know, but I've fixed it.

Cormen on Algorithms: Blogging my way through [1/?]

2011-06-21T16:02:00.006-07:00

Two of my good friends recently started reading Introduction to Algorithms by Thomas H. Cormen, et al. Being unable to resist peer pressure I decided to follow and read along.

I plan on blogging my way through the chapters writing my answers to the questions as I go through the book. Like most of my plans they don't always work out, but one could try.

Here it goes!

1.1-1: Give a real-world example in which each of the following computational problems appears: (a)Sorting, (b) Determining the best order for multiplying matrices, (c) finding the convex hull of a set of points.

Sorting - Sorting comes up in virtually every algorithm one could think of. Everything from optimizing monetary investments to efficient compression algorithms has to sort data at some point or another. A harder question might be: Name one non-trivial algorithm that doesn't require sorting.
Multiplying Matrices - graphics and scientific problems frequently require matrix operations.
Convex Hull - Collision detection for use in games, modeling biological systems, or other related work could make use of this

1.1-2: Other than speed what other measures of efficiency might one use in a real-world setting?

It is possible to optimize for (and against) every limited resource. For example minimizing the amount of memory usage is important for embedded applications (and desktop ones too). Reducing total disk I/O is important to increase the longevity of hard drives. On a less technical note optimizing for monetary cost or man hours expended is important too.

1.1-3: Select a data structure you have seen previously and discuss its strengths and limitations

One of the most interesting data structures I know is the Bloom Filter. It is a probabilistic data structure that can determine if an element is NOT in a set but can't determine definitively if an element is in a set. It works by hashing each element in a set to a fixed size bit array. It then ORs the hash into the filter (which starts at all zeros). One can test to see if an element is in a set by generating the hash and testing to see if every bit set to 1 in the queried element is set to 1 in the filter. If it is then you have some degree of confidence that the element is in the set. Any negative means that what you are querying for has not been added.
While most probabilistic structures have certain properties in common, bloom filters have a number of interesting pros and cons.

A negative result is definitive - if a query returns that an element has not been added then one knows this to be 100% true.
Since hashes are fixed size the amount of memory a Bloom Filter uses is known and bounded.
Bloom filters can quickly become useless with large amounts of data. It is possible that every bit will be set to 1 which effectively makes the query a NOP.
It is impossible to remove data from a bloom filter. One can't just set all the bits of the hash to a zero because that might be removing other elements as well.
Without a second set of data there is no way to deterministically list all elements (unlike other probabilistic data structures such as Skip Lists).
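
To make the description concrete, here is a small Python sketch of a Bloom filter. The sizes (256 bits, 3 hash functions), the use of salted SHA-256 as the hash family, and the names `BloomFilter`, `add` and `might_contain` are all choices of mine for illustration, not part of any particular library.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a fixed-width bit array

    def _mask(self, item):
        """Bit pattern for `item`: one bit per salted hash."""
        mask = 0
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            position = int.from_bytes(digest[:8], "big") % self.num_bits
            mask |= 1 << position
        return mask

    def add(self, item):
        # OR the element's bits into the filter.
        self.bits |= self._mask(item)

    def might_contain(self, item):
        # True only if every bit of the element is already set; a
        # False answer is definitive, a True answer is not.
        mask = self._mask(item)
        return (self.bits & mask) == mask


bloom = BloomFilter()
bloom.add("apple")
print(bloom.might_contain("apple"))
```

There is no `remove` method, for the reason given in the list above: clearing an element's bits could also clear bits that other elements depend on.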

1.1-4: How are the shortest path and traveling salesmen problems similar? How are they different?

The shortest path problem is: Given a weighted (undirected) graph $G$, a start vertex $V_0$ and an end vertex $V_e$, find a path between $V_0$ and $V_e$ such that the sum of the weights is minimized. This could be expanded to: Given a weighted graph $G$, find a path between every pair of vertices such that the sum of the weights for each path is minimized.
Traveling salesman is defined as: Given a weighted, undirected graph $G$ and a start vertex $V_0$, find a path starting and ending at $V_0$ such that it passes through every other vertex exactly once and the sum of the weights is minimized.

The traveling salesman problem might make use of the shortest path problem repeatedly in order to come up with the correct solution.

1.1-5: Come up with a real-world problem in which only the best solution will do. Then come up with a problem in which a solution that is "approximately" the best will do?

There are very few problems where one needs the objectively optimal solution. Mathematical questions are the only problems I could think of that need that level of accuracy. Virtually every problem needs a good enough solution. Some examples include finding a fast route for packets on the internet or locating a piece of data in a database.

update 2011-06-30: modified text of answers 1.1-3 and 1.1-5 to be more clear.

Repeating characters in multiple languages

2011-02-13T12:42:00.004-08:00

A friend of mine asked me how to repeat a string a specified number of times. There are a few times when one wants to do this when programming. Here is the "repeating operator" in various languages. I tried to use an operator when possible - but in certain cases I used a function. In all cases I repeat a string followed by a newline.

The BSDs

for i in $(jot 5);
do echo -n "Hi";
done;
echo "";

Output:
HiHiHiHiHi

Most Linux distributions

for i in $(seq 1 1 5);
do echo -n "Hi";
done;
echo "";

Output:
HiHiHiHiHi

Perl

print "-" x 10;
print "\n"

Output:
----------

Python

"ab" * 10

Output:
'abababababababababab'

R

paste(rep("Hi",5), collapse='')
Output:
[1] "HiHiHiHiHi"

Ruby

print "-" * 10;
print "\n"

Output:
----------

Tcl

string repeat "Hi" 5
Output:
HiHiHiHiHi

ZSH

repeat 5 printf 'abc';
echo "";

Output:
abcabcabcabcabc
update 5/30/11: Thanks to Hans I found out that jot is not POSIX. Also fixed formatting.

The Usefulness of the X-Do-Not-Track Header

2011-02-11T20:59:00.006-08:00

Do-Not-Track ^[0] is a recent proposal by the FTC ^[1] to deal with the problem of users being “tracked” by advertisers. It consists of adding a new HTTP header^[2] to page requests that indicates that the user is “opting out” of being “tracked”.
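
As a rough illustration of how little the header itself does, here is a sketch of a server-side check. The header names come from footnote [2]; the function name and the idea that a server would look for all three spellings are assumptions made for the example, not part of the proposal.

```python
# Header names taken from footnote [2]; which of them a client actually sends is an assumption.
DO_NOT_TRACK_HEADERS = ("x-dnt", "x-do-not-track", "x-behavioral-ad-opt-out")


def opted_out(headers):
    """Return True if any do-not-track style header is present with the value "1"."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return any(normalized.get(name) == "1" for name in DO_NOT_TRACK_HEADERS)


print(opted_out({"Host": "example.com", "X-DNT": "1"}))  # True
print(opted_out({"Host": "example.com"}))                # False
```

The check only tells the server what the user asked for. What the server then does with that request is left entirely to the server.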

The proposal is backed by a number of major players, including Mozilla ^[3] , the Electronic Frontier Foundation ^[4] , Wladimir Palant (the maintainer of AdBlockPlus)^[5] , and Giorgio Maone (the author of NoScript) ^[6].

Is this a good idea? Does it solve any existing problems?

One important factor to consider is that everyone has a different understanding of the concept of “tracking”. If a user has the header set but logs in to a service is there a difference? What if the user closes the browser in between sessions? Can the service remember who logged on last? Can a bank track a user’s visits for security purposes? What about a quiz website tracking participation to prevent cheating? And these are the simple questions. The definition of the word ‘tracking’ is not officially established.

Google claims it anonymizes IP addresses ^[7] but the “anonymization” only involved clearing the last octet of the user’s IP address.^[8] Is that considered tracking? Who decides? You? Google? The government?
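
To see what clearing the last octet amounts to, here is a minimal sketch using Python's standard ipaddress module. The function name and the sample addresses are hypothetical, and this is not a claim about how Google actually implemented it.

```python
import ipaddress


def clear_last_octet(address):
    """Zero the final octet of an IPv4 address (hypothetical helper for illustration)."""
    # strict=False lets ip_network() mask off the host bits instead of raising an error.
    network = ipaddress.ip_network(f"{address}/24", strict=False)
    return str(network.network_address)


print(clear_last_octet("203.0.113.77"))   # 203.0.113.0
print(clear_last_octet("203.0.113.200"))  # 203.0.113.0
```

At most 256 addresses collapse to the same value, so the stored address still narrows a visitor down to a small group.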

Even if we came to a shared definition of what it means to “track”, how can one prove if tracking is done or not?

Let’s imagine that the US government enacts a law requiring websites to follow this header based on this elusive definition of “tracking”. What about servers outside the US? How would their activity be handled? What about a foreign user accessing a US-based website? The reverse? What if different jurisdictions arrived at two mutually exclusive definitions of “tracking”?

Furthermore, what if websites began to deny service to users that used the X-Do-Not-Track header? Browsers would be forced to remove the header in order to browse the web - effectively nullifying the header’s original purpose.

Arvind Narayanan ^[9] says that “Examining ad blocking allows us to predict how publishers, ... assuming DNT is implemented as a browser plug-in, ad blocking and DNT would be equivalent ... ad blocking would result in a far greater decline in revenue than merely preventing behavioral ads. We should therefore expect that DNT will be at least as well tolerated by websites as ad blocking.” This analysis assumes that the header will be in a plugin or optional setting. If every browser implements this header by default, as they should to attract more users, a much larger percentage of people will be opting out than with ad-blockers today.

What if the law disallowed differing service for those with or without the header? What would be the point? It would make sense to simply disallow “tracking” for all websites, which would make the header moot. Of course, this idea is subject to the same questions as asked above.

Instead of focusing on silly request-based ideas for websites, browser vendors should be working on fixing the privacy holes that have already been found. Some examples include Firefox’s fix for the CSS history leak, Internet Explorer’s anti-tracking features ^[10]^[11] and related instances.

What if browser vendors considered shipping their browsers with mini versions of ad-preventing software like AdblockPlus, NoScript^[12] , and RequestPolicy^[11] that blocked major third-party advertisers such as DoubleClick? Of course this could become a cat and mouse game - and it may not be a good idea at all - but it would be more effective than the do-not-track header. Other options include appeasing advertisers with targeted user advertising and behavior analysis that doesn’t violate user privacy. For examples see the footnote ^[13].

Quite simply, what we need is increased client-side awareness of the privacy implications of various features and some form of control given to users over what data they transmit across the Internet about themselves.

[0] http://donottrack.us/
[1] http://www.ftc.gov/os/2010/12/101201privacyreport.pdf
[2] Originally the header was “X-Behavioral-Ad-Opt-Out: 1 X-Do-Not-Track: 1” but the current version is now “X-DNT: 1” to save bandwidth
[3] https://wiki.mozilla.org/Privacy/Jan2011_DoNotTrack_FAQ
[4] https://www.eff.org/deeplinks/2011/01/mozilla-leads-the-way-on-do-not-track
[5] https://adblockplus.org/forum/viewtopic.php?t=6492
[6] http://hackademix.net/2010/12/28/x-do-not-track-support-in-noscript/
[7] http://searchengineland.com/anonymizing-googles-server-log-data-hows-it-going-15036
[8] http://news.cnet.com/8301-13739_3-10038963-46.html
[9] http://33bits.org/2010/09/20/do-not-track-explained/
[10] http://blogs.msdn.com/b/ie/archive/2010/12/07/ie9-and-privacy-introducing-tracking-protection-v8.aspx
[11] http://blogs.msdn.com/b/ie/archive/2011/01/25/update-effectively-protecting-consumers-from-online-tracking.aspx
[12] These particular addons “break” websites by default, but they can be configured in such a way to limit the damage they cause.
[13] See http://crypto.stanford.edu/adnostic/ Profiling and targeting take place in the browser. The ad network is unaware of the user’s interests

Thank you very much to JT for the sane editing and thoughts provided.