Posts Tagged ‘ Hadoop

RDBMS In The Social Networks Age

Lorenzo Alberton gave an excellent talk at the London 2010 PHP conference on recursive SQL. The presentation was an explanation, with very clear examples, of how to use recursive common table expressions to solve graph theory problems. Lorenzo’s slides are available on slideshare.

We’re surrounded by recommendation engines, Amazon’s ‘people who bought this also bought this’ and Facebook’s ‘friends of friends’ are two of the examples used. Lorenzo showed that these are graph theory problems – which SQL is well-suited to solve. Indeed finding anybody who is your friend becomes a trivial SQL statement. What’s more complex is finding anyone who is a friend of a friend of a friend… Common table expressions allow you to express the problem simply and elegantly. For example, consider an organisation chart. Everybody on the organisation chart reports to a manager with the exception of the CEO. These relationships can be described in a table with the structure below.

CREATE TABLE orgchart (


boss VARCHAR(10) REFERENCES orgchart(emp),

salary DECIMAL (10,2)


The SQL below demonstrates a  simple recursive common table expression to step through the organisation chart.



FROM orgchart

WHERE boss is NULL



FROM orgchart AS E1

JOIN hierarchy AS E2

ON E1.boss=E2.emp



FROM hierarchy

ORDER By emp;

Lorenzo, of course, takes the example much further showing how to calculate levels with the organisation, reporting lines and so on.

Recursive SQL is an elegant method of solving problems which can be expressed in graph theory; working in a search agency many of our queries identify which websites are linked to from which websites. This approach could help refine a lot of our code. Recursion can be used to solve any computable problem – indeed in functional languages, such as Haskell or LISP even the standard programming loops (for, while, etc) are written in recursive form.

There are, though, a few drawbacks. The first is that although common table expressions were defined in the ANSI SQL-99 standard and have been implemented in DB2, Oracle, MS-SQL server and PostgreSQL they are not available in MySQL. We use MySQL.

The main issue with recursion is stack overflow and, in general, overhead of managing the stack. While functional languages are written to minimise this overhead, it seems unlikely that most database vendors will have been able to achieve the same success with an obscure part of SQL.

Recursion over a relatively large dataset, say a few billion rows, is likely to present a problem. Alternatively using something like Hadoop and map/reduce to break the problem down into, essentially, lots of little problems will be much more efficient but lacks the elegance of the SQL solution. With a large dataset, efficiency is the difference between a query completing in a reasonable time and the query never finishing. Indeed, computing ‘friends of friends’ is one of the applications Facebook uses Hadoop for.