The $300 Raspi Hadoop Cluster



"If you want to break into Hadoop, it helps to build your own cluster at home"

That was advice given to me at a meetup, but I wasn't thrilled about laying out a bunch of money for old hardware that takes up lots of space. Then one day I stumbled upon a webpage about a cluster built out of Raspberry Pi boards, the little $40 computer that is roughly a processor from the 90s (think Pentium II-ish) with peripherals from the 00s.

This seemed like a fun way to get the learning on the cheap, plus it was kinda ridiculous in a cool way. And no one had built one mounted to the network switch, so I could fire up the soldering iron and do something original.

The order came to about $300, all from Amazon. Make sure you get the correct monitor cable and read through the next section to see if you also want to order 4 micro-usb cables.

______________Components______________

#  Item                                           Quantity  Unit Price ($)
1  Raspberry Pi Board (model B)                   4         37.60
2  16GB Class 10 SD Card                          4          9.47
3  Cat-6 Patch Cable 5-pack                       1          7.99
4  8x1 Gigabit Network Switch                     1         41.39
5  Switching Power Supply (12v 6A laptop supply)  1          8.01
6  DC 12v to 5v Converter                         2          8.00
7  Nylon Standoff Hardware                        1         11.00
8  Raspberry Pi Case                              1          7.99
9  DVI to HDMI Cable                              1          7.99
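
As a quick sanity check on the "about $300" figure, the quantities and unit prices in the table can be totaled with a few lines of shell. This is only a sketch: it ignores tax and shipping, which aren't listed here, and the function name is made up for illustration.

```shell
#!/bin/sh
# Quantity and unit price for each row of the components table.
parts="4 37.60
4 9.47
1 7.99
1 41.39
1 8.01
2 8.00
1 11.00
1 7.99
1 7.99"

# Multiply quantity by unit price on each line and sum the results.
order_total() {
    printf '%s\n' "$parts" | awk '{ total += $1 * $2 } END { printf "%.2f\n", total }'
}

order_total
```

That comes to 288.65 for the listed parts, which leaves a little room for the optional micro-usb cables.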


Hardware Notes
  1. To avoid a kludge, I used two 12v-to-5v dc-to-dc converters to run the Pis off the Netgear supply.
  2. The beefier 6A laptop switching supply ordered for the Netgear used the same connector and tip polarity, a nice surprise.
  3. The holes in the Netgear top case were a real pain to drill; that's some tough metal.
  4. The Raspberry Pi designers were chowderheads for not having four corner standoff holes, but at least it was less drilling.
  5. I also needed to drill two holes in one of the lucite covers. I simply laid the other cover on top and used it as a guide.
  6. There was a big ole heat sink in the middle of the Netgear board, so I had to attach the dc converters with velcro tape (so professional).
  7. I planned to wire up to the boards' 5v directly, but there was no way to get on the correct side of the fuse, so I changed to hacked-up micro-usb cables at the last minute. But feel free to solder or clip to the header pins.
  8. The local Fry's Electronics had some cheap micro-usb cables, but Monoprice is also your friend.
  9. I was seriously tempted to just buy a 5-port switch and have a single stack of four boards, but 2 by 2 on an 8x1 seemed like a nicer form factor, not to mention potential for future expansion. Probably a good call considering the heat sink mentioned above.


Assembled Unit



Single Board Mac Interfacing

OK, let's start moving towards software. I used the latest Raspbian release for my OS and it's been fine. The standard Raspberry Pi site can direct you on writing the OS to an SD card better than I could. The first stop after removing a Pi from the box is to power up with an SD card, USB mouse, USB keyboard and a monitor hooked up via the DVI-to-HDMI cable. If all goes well, you'll see a big brash raspberry taking over the monitor. Play around for a while and have fun, but don't expect a lot of speed.

This is all impressive, and personally, staring at the pinkish-purplish raspberry does seem to reduce stress, but eventually we want to run headless. This means the Pi board will be networked with another computer and run through an SSH session.

I'm going to switch from Raspberry Pi to Humble Pie for a minute and present you with iToby's webpage on the same subject. It's well written and it's what I used (thanks a million, Toby!). From now on, my stuff will mostly be helper notes to his material:

iToby RasPi Cluster

This may only help some of you, but it took me a while to figure out a good way of displaying and managing a RasPi graphical SSH session on a 2011 MacBook Air. After crawling around the net, I came up with these steps:

  1. Hit F3 for Mission Control
  2. Mouseover to top-right corner and click + on popup to create new workspace
  3. Click or control-rightArrow to go there
  4. Press Command-Space, then enter X11 in the Spotlight search bar (newer Macs use XQuartz, but I haven't tried it)
  5. In shell, enter: ssh -X pi@nodeA     #depending what your names are
  6. lxsession& # ignore the messages
  7. If a desktop with a big goofy raspberry comes up, slide the X11 window to the bottom mostly out of sight
  8. When Hadoop is going later, click Internet > Netsurf Web Browser to monitor cluster
  9. The browser will work even if the usernames are different. I logged in as "pi" because the user "hadoop" can't run graphically for some reason
  10. To return to the main Mac screen, press Command-h to hide the Pi window, then control-leftArrow
  11. If it works, you can ditch the monitor, keyboard and mouse
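
Steps 5 and 6 above can be folded into one small shell function once you're tired of typing them. This is a rough sketch rather than something from my setup: the function name and the DRY_RUN switch are made up, and it assumes lxsession can be started directly as the remote command over an X-forwarded connection.

```shell
#!/bin/sh
# Start the Pi desktop over an X-forwarded SSH session.
# Usage: pi_desktop NODE [USER]   (USER defaults to "pi")
pi_desktop() {
    node="$1"
    user="${2:-pi}"
    if [ -z "$node" ]; then
        echo "usage: pi_desktop NODE [USER]" >&2
        return 1
    fi
    # DRY_RUN=1 prints the command instead of connecting.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ssh -X $user@$node lxsession"
    else
        ssh -X "$user@$node" lxsession
    fi
}

# Preview the command for nodeA without opening a connection.
DRY_RUN=1 pi_desktop nodeA
```

Run it from the X11 shell so the forwarded desktop has somewhere to land.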


X11 Headless Setup



Single Node Setup

For the /etc/network/interfaces file, the fields can be found from:

Java 7 HotSpot from Oracle is now included with Raspbian and it performs better than OpenJDK, so it's worth the hassle to get going. See the Problems and Fixes section below for specifics.


Rear View



Multi-Node Setup

Just a bunch of small matters here:

  1. Before you create an image, delete the log files. They'll only confuse things after cloning
  2. Also enter all the cluster's IP addresses into /etc/hosts. Doing it piecemeal later is error-prone
  3. Picking higher addresses is better because they're less common. I chose 192.168.2.24X
  4. I used the masters and slaves files plus password-less SSH on all nodes to keep things easy, even though that's not the most secure way
  5. Here's how to archive an SD card to an image and then flash clones from it (Mac commands).  DOUBLE-CHECK THE DEVICE NAMES!
	#Archiving SD card
	diskutil list   # make sure card appears as disk1
	sudo dd if=/dev/rdisk1 bs=1m | gzip > ~/Desktop/pi.gz

	#Flashing SD card
	diskutil list   # make sure card appears as disk1
	diskutil unmountDisk /dev/disk1
	gzip -dc ~/Desktop/pi.gz | sudo dd of=/dev/rdisk1 bs=1m
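
For item 2 in the list above, the /etc/hosts entries can be generated rather than typed by hand. This is a minimal sketch under loud assumptions: the node names nodeA through nodeD and the final digits 1 through 4 are placeholders I picked for illustration, so substitute your own names and addresses.

```shell
#!/bin/sh
# Print /etc/hosts lines for the cluster.
# Assumptions: four nodes named nodeA..nodeD on 192.168.2.241..244.
cluster_hosts() {
    prefix="192.168.2.24"
    i=1
    for name in nodeA nodeB nodeC nodeD; do
        printf '%s%d\t%s\n' "$prefix" "$i" "$name"
        i=$((i + 1))
    done
}

# Show the lines first and check them by eye.
cluster_hosts

# To append them on the Pi before imaging (needs root):
#   cluster_hosts | sudo tee -a /etc/hosts
```

Doing this once on the card you are about to archive means every clone starts with the same host table.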

        


Nodes in Browser



Problems and Fixes I Encountered

1. ARM hardware not supported message when starting Hadoop
Cause: Must run in client mode only, not server. Discussed here (right before the comments): raspberrypicloud
Fix: I made the following changes to the /usr/local/hadoop/bin/hadoop script (not sure it's ideal, but it works):

        elif [ "$COMMAND" = "datanode" ] ; then
                CLASS='org.apache.hadoop.hdfs.server.datanode.DataNode'
                #HADOOP_OPTS=${HADOOP_OPTS/-server/}
                #HADOOP_OPTS=${HADOOP_OPTS/}
                if [ "$starting_secure_dn" = "true" ]; then
                        HADOOP_OPTS="-jvm server $HADOOP_OPTS  $HADOOP_DATANODE_OPTS"
                else
                        # original line; -server is dropped below so the JVM runs in client mode
                        #HADOOP_OPTS="$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS"
                        HADOOP_OPTS="$HADOOP_OPTS $HADOOP_DATANODE_OPTS"
                fi
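
The commented-out HADOOP_OPTS=${HADOOP_OPTS/-server/} line in that excerpt is a bash substitution that deletes the first "-server" from the variable. Here is a portable sketch of the same idea using sed, in case you'd rather strip the flag than edit the line that adds it. The function name and the sample option string are made up for illustration.

```shell
#!/bin/sh
# Remove the first "-server" flag from an option string.
strip_server_flag() {
    printf '%s\n' "$1" | sed 's/-server//'
}

# Hypothetical option string, just to show the effect.
opts="-Xmx256m -server -Dcom.sun.management.jmxremote"
strip_server_flag "$opts"
```

The flag's surrounding spaces are left in place, which is harmless on a java command line.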
        

2. Cannot create /fs/hadoop/tmp/dfs/data directory
Cause: wrong ownership on the /fs/hadoop directory
Fix:

             cd /fs
             sudo chown john:hadoop hadoop 

3. Datanode Daemon Shutting Down Shortly After Starting
Cause: VERSION IDs differing between namenode and datanode

             cat /fs/hadoop/tmp/dfs/name/current/VERSION
             cat /fs/hadoop/tmp/dfs/data/current/VERSION 
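
Rather than eyeballing the two files, the IDs can be pulled out and compared. This is a sketch that assumes the ID in question is the namespaceID=... line that Hadoop 1.x writes into these VERSION files; the function names are made up.

```shell
#!/bin/sh
# Print the namespaceID recorded in a Hadoop VERSION file.
namespace_id() {
    sed -n 's/^namespaceID=//p' "$1"
}

# Compare the IDs in a namenode and a datanode VERSION file.
# Returns non-zero when they differ.
check_namespace_ids() {
    name_id=$(namespace_id "$1")
    data_id=$(namespace_id "$2")
    if [ "$name_id" = "$data_id" ]; then
        echo "match: $name_id"
    else
        echo "MISMATCH: namenode=$name_id datanode=$data_id"
        return 1
    fi
}

# On a node that holds both directories:
#   check_namespace_ids /fs/hadoop/tmp/dfs/name/current/VERSION \
#                       /fs/hadoop/tmp/dfs/data/current/VERSION
```

On a worker that only has the data directory, copy the master's VERSION file over first and point the first argument at the copy.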

Fix: I blow everything away and reformat HDFS; see the Serverfault post if you want something less extreme

             sudo rm -r /fs/hadoop/tmp  # all nodes
             hadoop namenode -format   # master only
       
Final Thoughts

So how does it perform? Well, I loaded it up with slightly over 300 Mbytes of text data and ran wordcount across 3 non-overclocked worker nodes. It took 23 minutes. The same data on my MacBook Air (SSD) in pseudo-distributed mode took 1 minute 19 seconds. So it definitely won't break any speed records. Even jps takes 11 seconds. But remember, you're doing this for the learning.
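
To put those timings side by side, here's a small sketch that works out the slowdown and the rough throughput of each run. It rounds the data size down to 300 MB, so treat the throughput figures as approximate; the function names are made up.

```shell
#!/bin/sh
# Timings from the wordcount runs above, converted to seconds.
pi_secs=$((23 * 60))        # 23 minutes on the Pi cluster
mac_secs=$((1 * 60 + 19))   # 1 minute 19 seconds on the MacBook Air
data_mb=300                 # "slightly over 300 Mbytes", rounded down

# How many times slower the Pi cluster was than the Mac.
slowdown() {
    awk -v pi="$pi_secs" -v mac="$mac_secs" 'BEGIN { printf "%.1f\n", pi / mac }'
}

# MB per second for a run that took the given number of seconds.
throughput() {
    awk -v mb="$data_mb" -v s="$1" 'BEGIN { printf "%.2f\n", mb / s }'
}

echo "Pi cluster is $(slowdown)x slower"
echo "Pi:  $(throughput "$pi_secs") MB/s"
echo "Mac: $(throughput "$mac_secs") MB/s"
```

That works out to about 17.5 times slower: roughly 0.22 MB/s on the cluster against 3.80 MB/s on the laptop.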

My favorite thing: unplugging one of the nodes and watching replication in action on the browser.

I also did this project while studying for the Cloudera Hadoop Administrator Certification (88%, woo-hoo), and the two reinforced each other well.


Front shot
