Importing Postgres Data Into Hadoop HDFS
Recently I’ve been getting a crash course in big data. I ran into an issue where the analytics that used a certain table on out Postgres database was so large it would not run. Like to the point where it would grind a 8gb of memory server to a near halt on a single query. While I had gone through several optimizations, it still was not performant enough at the scale needed. In fact it caused havoc with our host due to IO issues. We’re talking 50 million rows or more that need to be analyzed in a scalable way.
Enter Hadoop
I’m sure Hadoop needs no introduction for most, but the great benefits is a scalable way to take massive amounts of data and answer the questions you have.
How do you get existing data into Hadoop?
I had a ton of data that I needed to get into HDFS that was on a Postgres server. I needed a way to dump the data into a useable format so that I could use it properly with Hadoop. Luckily there is an Apache project called Sqoop (yes Sqoop Hadoop is just an amazing combination of words).
Running Sqoop Jobs
Sqoop requires that you have Java and Hadoop installed on the machine that is going to run the job. The great news is that it doesn’t have to be on the same machine as your server. You can run it from any machine, but you typically will run it from a Hadoop cluster. That means it doesn’t matter where your server lives or who if your host doesn’t allow you to install your own packages. You also don’t have to dump the data to the machine that the Sqoop job will be running on. In my case I wanted all my data in S3 as a super portable replacement for HDFS.
Unfotunately, it took a lot of trial and error to get it working. Make sure that you install the proper JDBC driver by putting it on your java classpath or put it in the sqoop directory (usually in /usr/bin/sqoop).
Here’s a local example using Postgres:
sqoop import --connect jdbc:postgresql://localhost:5432/mydb \
--table my_table \
--username myusername \
--password mypassword \
--direct
This will dump the table my_table on a local postgres server. The —direct part is a speedup that’s unique to Postgres and MySql connectors for sqoop.
To dump data directly to S3:
sqoop import --username myusername --password mypassword --table my_table --target-dir s3n://MYS3APIKEY:MYS3SECRETKEY@bucketname/folder/ --hive-drop-import-delims --escaped-by "\\"
Some notes to help your blood pressure:
- Make sure your bucket doesn’t have an underscore in it. For some reason I wan’t able to get it to work if there was an underscore in the bucket.
- Make sure you use trailing slashes. This should help your blood pressure.
- —direct doesn’t work with —escaped-by
- If you have freeform text fields in your data and there is \n it will fuck up your Hadoop job if you try to read new lines of data by splitting on \n. Use the —hive-drop-import-delims to parse those out. I don’t know why it works, this has nothing to do with Hive, it just cleans up your output.
- —escaped-by should always be used if you have freeform text in your data. This will help you parse the data when you run a job because you can ignore important characters (such as a comma) if they are escaped, letting you split data into fields more easily.
Appending Data
What happens if we need to get updated data off of our database? HDFS doesn’t really do an “update” to existing data, but we can append all the new rows added past a certain point. To do that we can use the —incremental argument.
sqoop import --connect jdbc:postgresql://localhost:5432/mydb --username myusername --password mypassword --table my_table --incremental append --check-column id --hive-drop-import-delims --escaped-by "\\"
Append Data To S3 Using Sqoop
I wanted to keep my data independent of the cluster that would run the job. This gives you flexibility to run your jobs wherever, on a cluster or on something like Elastic MapReduce. You have the freedom to build and destroy clusters on a whim because there is no persistant data to lose.
There’s no point in appending if you don’t know the last thing you appended (which row was last to go in the database). Sqoop has a “job” command that will store the last id (or date) of the last items dumped. It then queries only the rows after that point. Very cool!
Unfortunately (again), I couldn’t get this to work by dumping directly to S3 using the —incremental argument so I had to use a workaround. First run the command then use distcp to copy the data to S3. For some reason, incremental appends using S3 as the —warehouse-dir does not work even when specifying a -fs argument.
Hopefully this saves you as much time as I spent figuring it out.
First save the job:
sqoop job --create myjob -- import --connect jdbc:postgresql://myaddress:12345/dbname --table mytable --hive-drop-import-delims --escaped-by "\\" --username myusername --password mypassword --incremental append --check-column id --split-by id
Then run it:
sqoop job --exec myjob
Then copy it in a parallelized way to S3:
hadoop distcp mytable s3n://MY_S3_KEY:MY_S3_SECRET@mybucket/folder/
Boom, data on S3 with no dependency on your cluster living another day! I should note that you will want to use a centralized metastore for saving the last rows run by append. See the sqoop docs for more info.
Django model creation shortcut with kwargs
Let’s say you’ve got this model in Django that you need to create that has a lot of fields. You’ve already mapped out the data, but you think you need to write in each argument in the Model.objects.create method. Thankfully, these methods allow **kwargs to be passed in so you can do this:
# Remember that kwargs is just a dictionary that's mapping
# arguments to values (in this case model fields to values)
MyModel.objects.create(**some_dict)
# Or if we are updating it
MyModel.objects.update(**some_dict)
# Conveniently we could do something like this,
# assuming your field names are mapped exactly to the model names.
# Sometimes you don't want a model form so this works.
MyModel.objects.create(**request.POST)
# Oh, but I have a foreign key! No prob.
MyModel.objects.create(
fk_field = some_object,
**some_dict
)
Django Gotcha: Duplicate models when using get_or_create
When you’re saving some object to the db, there’s a gotcha with DateTimeFields that drove me nuts. It took me way too long to figure this out, but if you have a DateTimeField with the option auto_now_add, it will supersede a datetime object in a Model.objects.get_or_create call. This means that if you are tying to prevent duplicates based on date (say you’re saving some data from an API that has a timestamp) you will still get duplicates. It’s in the docs, but not the implications when using get_or_create. Check out this example:
# Given this simple model
class Foo(models.Model):
name = models.CharField(max_length=100)
date_added = models.DateTimeField(auto_now_add=True)
# This will always be true, even if an instance
# with this name and today's date already exists
bar, created = Foo.objects.get_or_create(
name = 'Alex',
date_added = some_datetime_obj
)
print created
# >> True
# The problem is, auto_now_add does some stuff that
# makes it uneditable, and fucks up my expectations
# when using it with get_or_create
# Here's the solution
class Foo(models.Model):
name = models.CharField(max_length=100)
date_added = models.DateTimeField(default=datetime.today())
bar, created = Foo.objects.get_or_create(
name = 'Alex',
date_added = some_datetime_obj
)
print created
# >> False
Responsive HTML5 Canvas and Processing.js
You know what sucks about html5? You have to specify the canvas element with exact width and height. Out of the box that means none of our fancy responsive/liquid grids will be able to get along well with the canvas. What’s worse is that when you start using something like Processing for graphics/designs, it too demands a specified width and height. When dealing with patterns, if you don’t it’s not fitting in the right dimensions it’s just going to look dumb and at that point just use a background image.
Fuck that. Here’s what I want to do:
- Use a Processing sketch (using processing.js) to create a dynamically created background pattern
- Use a liquid layout (in this example a modified Bootstrap responsive grid) that fits the user’s window size exactly
- When the browser is resized the background pattern should fit precisely in the dimensions of it’s container

Javascript Jquery to the Rescue
This is how we are going to handle it with jquery. When the page loads we are going to resize all of the canvas elements on the page to the width and height of the window (or any percentage based width/height element). We will set a global variable for the width and height which will be used by the Processing sketch when it draws itself on the target canvas. We will either redraw an existing sketch or create a new one if it doesn’t exist.
# CoffeeScript
# Assuming all the dependancies are already loaded
resizeSketches = ->
canvas = $ 'canvas'
# We're using backticks here so that these get set globally
`jswidth = $(window).width()`
`jsheight = $(window).height()`
canvas.attr 'width', jswidth
canvas.attr 'height', jsheight
sketch = Processing.getInstanceById 'id_of_the_canvas_el'
if sketch == undefined
# We need the html element and the sketch to run on it
# I'm using a compiled processing js sketch called triangles
new Proocessing $("#id_of_the_canvas_el")[0], triangles
else
# This is a function I wrote on the sketch to redraw itself
# It's different than the builtin processing redraw method
sketch.restart()
# Somewhere in you code
$(window).on 'resize', resizeSketches
Call this when the page loads and it’s also triggered on the resize event. Lastly we need to update our Processing sketch to set it’s height and width based on the global javascript variables we specified.
Processing.js makes it super easy to share values between your other js code and the processing code. This is probably the most understated feature of processing.js, the ability to mash it up with javascript. In your sketch simply change the setup function by adding the following:
$p.size(jswidth, jsheight);
Easy! Now all we need is a method to restart the sketch which will wipe the existing one and draw it again. Something like…
function restart() {
$p.background(255);
// Reset the size of any other shapes that
// depend on the canvas dimensions then
$p.loop();
Now when the page is loaded and every time it’s resized, the canvas containing a processing sketch will set it’s dimensions and redraw itself. Now go forth and do not fear the fixed dimensions of the html5 canvas.
Another reason why I love Python
I didn’t get a chance to go see all of the amazing galleries in Brooklyn participating in #gobrooklynart this weekend. Browsing their list of artists revealed a lot of great pictures of their respective work. Naturally I wrote a script to parse through all of their artist pages to grab over 5000 pictures of awesomeness. I’m going to put the links I grabbed in a simple infinite scroll to digest the whole thing over this week.
Here’s the quick and dirty script I wrote in about 20 minutes:
# Get all the artists and all the links from Go! Brookly Art
# and put them in a text file for future use
from BeautifulSoup import BeautifulSoup
import requests
import json
pages = range(1, 115)
storage = open("gobrooklynart.txt", 'a')
for page_number in pages:
url = "https://www.gobrooklynart.org/explore/artists?page=" + str(page_number) + "&neighborhoods=&media=Painting%2CPhotography%2CPerformance%2CVideo%2FFilm%2FSound%2CSculpture%2CPrint+Making%2FBook+Arts%2CIllustration%2CMixed+Media%2CTextile+Arts%2CDrawing%2CInstallation%2CDesign%2CFashion%2CCraft&accessibility=&order_by=&keyword="
site = requests.get(url)
html = BeautifulSoup(site.content)
artists = list(set(html.find(id="search_results").findAll('a')))
for artist in artists:
artist_url = "https://www.gobrooklynart.org" + artist['href']
artist_page = requests.get(artist_url).content
artist_html = BeautifulSoup(artist_page)
data = {}
data['name'] = artist_html.find('h2', {'class':'display-name'}).text
images = artist_html.find(id='profile-photos').findAll('img')
data['pictures'] = [i['src'].replace("thumb", "standard") for i in images]
try:
data['homepage'] = artist_html.find(id='studio-website').findAll('a')[0]['href']
except Exception, e:
print e
data['homepage'] = None
print "Got artist %s" % data['name']
storage.write(json.dumps(data) + ", ")
print "On to page next page."
Now I have a txt document with a simple list of artists and pictures of their work. I actually forgot to put the extra “[ ” and “]”, but whatever you get the gist of it. Hooray!
Drawing Triangles and Learning Processing part 1
I’ve spent about 20 hours now just drawing triangles using Processing. Processing is a programming language for drawing things with your computer. Recently, I’ve become more engrossed in the visual side of programming and exploring more of the intersection of the visual arts and technology.
As with learning anything, the best way for me is to actually do something with it and figure it out as I stumble along. I have a concept for a triangle pattern that inspired me and I want to recreate it in Processing so I can make it interactive (more on that in a future post).
Example (pardon the brutal colors):

How do you describe this pattern?
What I realized about visual programming is that I spend much more time thinking about how to explain the problem rather than figuring out how to code it. How do we explain this pattern? It’s a bunch of triangles, but if you just draw the same triangle over and over it won’t get that nice diamond pattern you see. You could break it down by line and you’ll find it’s two patterns that reflect each other. Or you could say that it has zig zags. After much pondering and help from a friend, we came up with an abstraction for explaining. Not as triangles, but as a square.

Imagine a two dimensional grid with a 3 x 3 dot grid. Each of the rows in the list of numbers represents a triangle. This is actually the pattern we see repeating that creates that diamond effect. I can now abstract the x,y coordinates with zeroes and ones that we will translate to real x and y points on the canvas.
There’s probably a mathematical formula here that could do this too, but that’s beyond my attention span. Instead we have a perfectly good abstraction of a single block that makes up an entire pattern. We can then replicate it on our canvas and achieve our pattern.
The Code
void setup()
{
size(800, 600);
frameRate(1);
noStroke();
}
boolean over = false; // If mouse over
void randomFill()
{
fill(color( random(20, 255),
random(0, 255),
random(100, 255)));
}
void draw_triangle( int dimension, int x1_coord,
int y1_coord, int x2_coord, int y2_coord,
int x3_coord, int y3_coord)
{
// Set the points of the triangle
int x1 = dimension * x1_coord;
int y1 = dimension * y1_coord;
int x2 = dimension * x2_coord;
int y2 = dimension * y2_coord;
int x3 = dimension * x3_coord;
int y3 = dimension * y3_coord;
randomFill();
triangle(x1, y1, x2, y2, x3, y3);
}
// Draw the 8 triangle pattern
void pattern(int x, int y, int dimension)
{
pushMatrix();
translate(x, y);
// Look familiar? It's our triangle point abstraction
draw_triangle(dimension, 0, 0, 0, 1, 1, 0);
draw_triangle(dimension, 0, 1, 1, 0, 1, 1);
draw_triangle(dimension, 1, 0, 1, 1, 2, 1);
draw_triangle(dimension, 1, 0, 2, 1, 2, 0);
draw_triangle(dimension, 0, 1, 0, 2, 1, 2);
draw_triangle(dimension, 0, 1, 1, 1, 1, 2);
draw_triangle(dimension, 1, 1, 1, 2, 2, 1);
draw_triangle(dimension, 1, 2, 2, 1, 2, 2);
popMatrix();
}
void draw()
{
background(100);
fill( 0, 121, 184 );
int dimension = 100;
for (int i = 0; i < 800; i = i+200) {
pattern(0, i, dimension);
pattern(200, i, dimension);
pattern(400, i, dimension);
pattern(600, i, dimension);
}
}
This is just an early sketch to prove we can make this pattern using the square abstraction (it can be cleaned up a good amount too). It’s pretty simple when you break a pattern into larger chunks. We basically had to figure it out once, then replicate it a bunch of times to fill the canvas. Eventually, I’ll add add interactivity and a way to randomly generate the filling of each triangle. I’m also going to do some math to fit the number of columns and rows and the size of the triangles based on the canvas size. I plan on porting this over to processing.js for use on a web page. Cool stuff!
Introducing open-source, project based mentorship
Spoiler:
Project based mentorship is the future. I built http://mentorship.p2pu.org to connect people who want to learn (to code, design, whatever) or mentor by working on really cool projects.
A Promise
I’m a self-taught coder and I’m proud of it. I now make a living doing what I love: building cool shit. I would not have gotten to this point as fast as I did without the help of amazing mentors. In January I made a promise to help anyone who came to me looking for a mentor to help them learn to code. I’ve met some amazing people as a result and have tried my best to help make good on that promise.
I haven’t done enough. So I’m doing more because I believe in this.
I teamed up with P2PU, a non-profit, helping people learn from their peers online. I shared my story with them, what I believe in, and they gave me an amazing opportunity to team up with them in Berlin, where I built the P2PU Mentorship platform.
Project Based Mentorships is the Future of Learning
Unfortunately, just matching people with mentors based on what they are learning doesn’t work. I know this because I did it. By hand. Interviewing everyone, matching people together, sending intro emails, etc. It’s scary for most people. You agree to work for an indefinite amount of time with someone you don’t know.
A better way to do mentorships:
- Create a project, something cool you want to work on
- Collaborate with people looking to learn or mentor
- Work on the project and learn along the way
Here’s an example:
- I want to learn to build interactive websites
- I create a project on P2PU Mentorship to build a cool site I’ve been dying to make
- Mentors with the skills I want to learn join my project and help guide me
- I work on the project and learn by doing with awesome mentors
- I’m working on something really cool, but I need help
- I create a project on P2PU Mentorship to work with people who want to learn and I will help guide them kind of like an academic advisor
- Learners join my project and we work on it together
Projects are the perfect opportunity to make mentorship happen online.
- You’ll know what your getting into (you join project you’re interested in)
- They are have a logical end (when the project is finished)
- Learners get to produce something they are proud of, work with experienced people, learn new skills
- Mentors benefit from more people working on their project, fresh ideas, and, of course, the joy of helping others
If you believe in online learning and helping others please join
If this sort of thing gets you excited, please join and help me pay it forward. Start a cool project and help make a difference.
Also if you are looking to learn Python and web development, the mentorship site itself is a project you can join led by yours truly. Check it out on github.
A simple real-time chat server using Clojure and Aleph
I’ve been exploring some asynchronous servers and have recently been enamored by Clojure, a modern lisp with brilliant built in concurrency primitives. In the Clojure world, your only real choice for an async server is Aleph. It’s built on top of an event-driven abstraction library called Lamina (from the same authors) which makes for a handy non-blocking server ready for realtime web apps powered by websockets, UDP, or TCP.
For this example I was mainly interested in a simple websockets web app with url routes using Compojure, an un-opinionated web framework in Clojure. The premise is simple; a public chatroom around a url (i.e /chat/hello has it’s own room and messages added there don’t show up on /chat/another and visa versa). I borrowed some of the code and patterns from another aleph example app you also might want to check out.
You can get the full working(ish) source here: https://github.com/alexkehayias/clojure-aleph-chat
(ns core.main
(:use lamina.core
aleph.http
compojure.core
(hiccup core page))
; This sets the correct file type for js includes used by hiccup
(ring.middleware resource file-info)
core.views
core.templates)
(:require [compojure.route :as route])
(:gen-class))
(defn page []
"HTML page rendered using Hiccup. Includes the js we need for websockets."
(html5
[:head
(include-css "/static/stylesheets/master.css")]
[:body
[:div.container
[:div.row
[:div.columns.twelve
[:p [:h1#headline "Chat"]]
[:form
[:input#message {:type "text"}]
[:input.nice.large.blue.button {:type "submit"}]]
[:div#messages]]]]
(include-js "http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js")
(include-js "/static/javascripts/web_socket.js")
(include-js "/static/javascripts/app.js")]))
(defn sync-app [request]
"Rendered response of the chat page"
{:status 200
:headers {"content-type" "text/html"}
:body (page)})
(def wrapped-sync-app
"Wraps the response with static files"
(-> sync-app
(wrap-resource "public")
(wrap-file-info)))
(defn chat-init [ch]
"Initialize a new chat channel"
(receive-all ch #(println "message: " %)))
(defn chat-handler [ch room]
"Relays messages into a chat room. If it doesn't
exist create a new channel"
(let [chat (named-channel room chat-init)]
(siphon chat ch)
(siphon ch chat)))
(defn chat [ch request]
"View handler that handles a chat room. If it's not
a websocket request then return a rendered html response."
(let [params (:route-params request)
room (:room params)]
(if (:websocket request)
(chat-handler ch room)
(enqueue ch (wrapped-sync-app request)))))
(defroutes app-routes
"Routes requests to their handler function. Captures dynamic variables."
(GET ["/chat/:room", :room #"[a-zA-Z]+"] {}
(wrap-aleph-handler chat))
(GET ["/"] {} "Hello world!")
;;Route our public resources like css and js to the static url
(route/resources "/static")
;;Any url without a route handler will be served this response
(route/not-found "Page not found"))
(defn -main [& args]
"Main thread for the server which starts an async server with
all the routes we specified and is websocket ready."
(start-http-server (wrap-ring-handler app-routes)
So what’s happening here?
We’re taking a request to the URI /chat/<room> and creating a channel which other visitors to /chat/<room> subscribe to via a websocket. When a message is received to that URI it goes to the chat-handler, realizes it’s a websocket request, and adds the message to the named-channel that has that chatroom name. The message is pushed to all the subscribers of that channel (everyone on the page /chat/<room>) and the message is rendered on the page using javascript. There is only one channel that is persisted here that everyone is sharing.
All we need is some javascript to set up the websocket and handle messages. This is in the directory resources/public/javascripts:
// Note: this was generated from coffeescript and ƒ stands for function (lol emacs)
(ƒ() {
$(ƒ() {
window.socket = new WebSocket(window.location.href.replace("http://", "ws://"));
socket.onopen = ƒ() {
return console.log("socket opened");
};
socket.onmessage = ƒ(msg) {
return $("#messages").append("<p>" + msg.data + "</p>");
};
return $("form").on("submit", ƒ(e) {
e.preventDefault();
socket.send($("#message").val());
return $("#message").val("");
});
});
}).call(this);
There you have it, a simple, no-frills public chat room based on the page you’re on. This was just a test, but you can see how easy it is to set up a non-blocking async server with a little help from Aleph. Combined with the concurrency of Clojure itself, you can have an async, concurrent, functional programming love fest. Boners.
My pointless quest for the perfect web stack
Recently I’ve been on the blasphemes quest to find my ideal web stack.
What I Want
Modern web applications are pushing the envelope of the experiences that can be made over the internets. I use Python/Django/Postgres/Backbone/SASS at the moment and that’s all well and good. But what I want now is a concurrent language and a non-blocking web server to serve an API. Why? Because most of my web apps now are mostly client-side using lots of javascript and backbone. I want concurrent because I want speed for powerful algorithms I will theoretically write. I want an asnychronous web server because I’m tired of the Python -> Rabbitmq -> Celery dance I need to do to offload tasks into a queue.
I realize this list of stuff I want is ridiculous since I don’t actually know what I want to build with this theoretical stack. So keep that in mind that this is a pointless exercise, but one that I’ve been thinking about nonetheless. It’s also extremely narrow minded and not inclusive of all the stuff out there.
WINNER:
Not necessarily what I use now, but will use next.
Clojure/Aleph -> Mongo -> Backbone -> Handlebars -> SASS
BACKEND LANGUAGE:
1) Clojure
- Concurrent
- functional
- lispy
- fast
- Still new, smaller community
- Huge learning curve for me
2) Python
- A caveman could write it
- Tons of libraries
- Proven
- GIL sucks :(
3) Javascript
- CoffeeScript makes it awesome
- V8 makes it super fast
- Possibility of one language for front and back end
- single threaded
DB:
1) Mongo
- Schema-less
- easy to query
- restful api (not sure how useful this is out of the box)
- possible difficulties in scaling (anecdotal, but frequent)
2) Postgres/MySql
- Relational
- reliable
- scalable
- migrations can be a pain
WEB FRAMEWORK/SERVER:
1) Aleph (clojure)
- The only real choice for non-blocking server in clojure
- Websocket support
- Works with Noir web framework
- Socket.io not supported
2) Flask (python)
- Super minimal
- Simple to set up
- It’s python
3) Express (node/javascript)
- Lightweight
- Readable
- Large community and resources
- Non-blocking
- Easy Socket.io support
4) Django (python)
- Batteries included
- So many libraries makes rapid development easy
- Get’s in your way after awhile (restrictive)
- Synchronous unless you use Celery as a external task queue
- Overkill for the kinds of web apps I write (Backbone, heavy client side)
FRONT END FRAMEWORK:
1) Backbone
- Makes front-end feel like backend “mvc”
- Namespacing keeps views separate
- Non-prescriptive, use it how you want
- Sometimes confusing because there is no right way to use it
TEMPLATING:
1) Handlebars
- Separation of view logic (as much as possible)
- Helpers let’s you write your own template tags
- Multiple options for using templates
2) Moustache
- Lacks helpers
CSS:
1) SASS
- It’s just a thousand times better than writing cross-browser pure css
And there you have it. My ideal stack that is incredibly flexible and powerful to create awesome shit I haven’t thought of yet.

