Nearest neighbours using SQL

This explains how I used nearest neighbours via SQL to find similar artists given parameters of a specific artist.

The assumption is that we have a large file with a label (Artist) and a characteristic (Genre), we’ll import all the data to a database for further processing. To keep things simple we’ll use SQLite. The best free UI tool I know of for managing SQLite file is TkSQLite which is available for all platforms.

First Create the table schema for importing the Artist/Genre pairs as follows via Database->Create Table…

Use File->Import->Text File… to import all the artist-genre association files that were generated in the previous step (I’d suggest aggregating them all into one file instead of repeatedly importing each file). Use the following settings to correctly import the CSV file to the table:

Note the above may take some time due to the number of rows involved.

So that queries won’t take forever, add an index to the artist and the genre fields (via Database->Create Index…), for example:

Starting to look at the data

Now that we have all the Artist-Genre associations that were submitted in our database, we can view the genre distribution for each artist. The following shows the query that generates this information for the artist Tom Jones:

SELECT artist,genre,COUNT(genre) AS gcount
FROM genre_assoc
WHERE artist='Tom Jones'
GROUP BY artist, genre
ORDER BY artist, gcount DESC;

and the result is:

To further process this data, it would be best to create this data for all the artists:

CREATE TABLE genre_distributions AS
SELECT artist, genre, COUNT(genre) AS gcount
FROM genre_assoc
GROUP BY artist, genre
ORDER BY artist, gcount DESC;

Remember modify the schema and set gcount to be an INTEGER and also add indices for all the fields.

Now checking out the genre distribution for CD information submitted for the artist Sting returns the following:

One obvious point that can be noted from the above result is that the data should be cleaned if we want to get higher quality results. “Rock”, “rock” and “Rock & Roll” could all be put under one genre. The cleaning process can be partly automated either by scripts or tools made for such tasks, such as OpenRefine (previously Google Refine).

Visualizing correlations and similarities between different Artists

Now let’s see if based only on this data we can make something which is similar (at least in concept if not in quality) to what has been created in liveplasma discovery engine

For this, we’ll want to normalise the absolute number of associations of an artist with a genre so that we can compare two artists which have a similar genre combination but don’t have the same popularity and therefore don’t have a similar number of submissions.
Since I don’t like long nested SQL queries, we’ll achieve this by creating another table that contains the total number of genre submissions for each artist.

CREATE TABLE submissions_count AS
SELECT artist, COUNT(artist) AS total
FROM genre_assoc
GROUP BY artist

Again, after the table is created, change the type of the “total” field to INTEGER and add indices to all the fields.

Now to create the table with the normalized associations of genre types for each artist:

CREATE TABLE normalized_genre_distributions AS
SELECT T1.artist, T1.genre, T1.gcount*1.0/T2.total AS weight
FROM genre_distributions AS T1, submissions_count AS T2
WHERE T1.artist=T2.artist;

Again, change the type of the “weight” field (to FLOAT this time) and add indices to the artist and genre fields.

Now if we search to find Sting’s genre fingerprint, we get:

Now that we have a normalised genre distribution for each artist, we can start asking how similar two artists are to each other based on the genres they were associated with. Interested? On to part 3

Visualizing correlations and similarities between different Artists

So let’s see what we can already do with the data we have so far.

For one, we can easily try and assess how two artists are similar by the common genres they are associated with and the proportion of votes they received for those genres.

For example, to see to what degree Sting and Bruce Springsteen are similar in this respect, we just run this simple query on the normalised genre distribution table:

SELECT NGD1.artist,
       NGD2.artist,
       NGD1.genre,
       NGD1.weight AS weight_Sting,
       NGD2.weight AS weight_Bruce
FROM normalized_genre_distributions NGD1,
     normalized_genre_distributions NGD2
WHERE NGD1.artist='Sting' AND
      NGD2.artist='Bruce Springsteen' AND
      NGD1.genre = NGD2.genre;

The above will find the corresponding weights for all the genres that Sting and Bruce Springsteen have in common. And the result is:

We can see that both artists were most associated with Rock music (Bruce more than Sting), and that Sting received more submissions than Bruce that associated him with Pop and Jazz.

Ok, so how do we turn this result into a single “similarity score” between two artists so that we can draw a map of artists which are “closer” and those which are “further away” than the style of a given artist?

Machine learning techniques show that there are quite a few ways of going about this. The one I’ll use here can be easily implemented only in SQL by devising a score for the “distance” between two artists and requesting to see all the artists that are closest to a given artist.

So how to we calculate the distance between two artists? We use the same method used to calculate the distance between two points:

If we have two points on the plane, say a=(1,4) and b=(4, 8), then the distance between a and b is

(this is a good opportunity to thank both Pythagoras and OpenOffice math
formulas)

The same applies when measuring the distance between two points in three dimensions, so if we have two points floating happily in space at coordinates a=(1,3,7) and b=(9,2,10) then the distance between these two points is:

This also applies when measuring the distance between two points in higher dimensions (which we mere mortals can’t visualise in spatial terms). So if there are k dimensions, the distance between point a and b will be:

So all we need to do in order to find the distance between two artists a and b is treat the different genres as different dimensions of a point and calculate the distance between those coordinates (genre weights) that define point a (artist a) and those that define point b (artist b)

One problem though: There are about 35,000 different genres in the database so basically we need to find the distance between two points in a space that has 35,000 dimensions, and on top of that, do it using a simple and fast SQL query.

Well, if you look at the problem closely, you’ll notice that the calculations can be simplified. While the genre space may have 35,000 dimensions, for any two given artists the weight for the vast majority of these dimensions (genres) will be zero, so the distance between the two artists is simply the square root of the squared difference between the weights of the genres which are common to both plus the squared difference between the weights of the genres they do not share with each other.

The SQL query to get the squared difference between the weights of the genres which are common to both artists is simple. However an SQL query to find the squared difference between the weights of genres two artists are associated with but do not share with each other can get a bit hairy (especially with SQLite, which doesn’t support full outer joins), so we need some more simplifying.

To keep things SQL friendly, we’ll first create a helper table that contains the squared weights of all genres associated with an artist:

CREATE TABLE square_weights AS
SELECT artist, SUM(weight*weight) AS sqr_sum_gweights FROM normalized_genre_distributions
GROUP BY artist;

This means that for any two given artists we can now retrieve the square weights of both their common and non-shared genres (don’t forget to add indices to both fields).

Now suppose two artists have k genres in common. If we query the sum of the square weights of their respective genres using the above table, this will include the sum of squared weights of the genres they do not have in common (which is what we are after) but will also include the squared sum of the k common genres (which we don’t want) so we need to subtract that term.

For those who like to see it in mathematical notation (not me, but I aim to please):

Suppose there are two artists, artist a and artist b where:
ai is the weight of genre i for artist a
bi is the weight of genre i for artist b

If artist a is associated with m genres and artist b is associated with n genres and the first k genres are present for both artist a and artist b then the distance between artist a and artist b is

If you expand the last term and combine the summations that run from 1 to k, the above distance calculation simplifies to:

Now we have all we need to write the simplest SQL query that will give us the set of artists that are nearest to a given artist – for example, lets see the results for the the 20 artists closest to Tori Amos:

SELECT NGD1.artist,
       NGD2.artist,
       SQRT(SW1.sqr_sum_gweights+SW2.sqr_sum_gweights-SUM(2*NGD1.weight*NGD2.weight)) AS distance
FROM square_weights SW1,
     square_weights SW2,
     normalized_genre_distributions NGD1,
     normalized_genre_distributions NGD2
WHERE SW1.artist=NGD1.artist AND
      SW2.artist=NGD2.artist AND
      NGD1.artist="Tori Amos" AND
      NGD1.genre=NGD2.genre
GROUP BY NGD2.artist
ORDER BY distance
LIMIT 20

And the result is:

Now you might agree or disagree with the above result, but don’t blame the messenger. Take into account that a bunch of coarse subjective genre submissions might not be the best basis on which to build a similarity database. As an example of the inaccuracies involved, check the result below where apparently some artists are more similar to “Beatles” than “The Beatles” and “THE BEATLES” …

Still, using results from this method I was happy to discover artists I wasn’t aware of that resemble the style of other artists I like and have broadened my range of musical exploration and enjoyment.