# Data Mining on FreeDB – Part 3

### Visualizing correlations and similarities between different Artists ∞

So let’s see what we can already do with the data we have so far.

For one, we can easily try and assess how two artists are similar by the common genres they are associated with and the proportion of votes they received for those genres.

For example, to see to what degree Sting and Bruce Springsteen are similar in this respect, we just run this simple query on the normalised genre distribution table:

```SELECT NGD1.artist,
NGD2.artist,
NGD1.genre,
NGD1.weight AS weight_Sting,
NGD2.weight AS weight_Bruce
FROM normalized_genre_distributions NGD1,
normalized_genre_distributions NGD2
WHERE NGD1.artist='Sting' AND
NGD2.artist='Bruce Springsteen' AND
NGD1.genre = NGD2.genre;```

The above will find the corresponding weights for all the genres that Sting and Bruce Springsteen have in common. And the result is:

We can see that both artists were most associated with Rock music (Bruce more than Sting), and that Sting received more submissions than Bruce that associated him with Pop and Jazz.

Ok, so how do we turn this result into a single “similarity score” between two artists so that we can draw a map of artists which are “closer” and those which are “further away” than the style of a given artist?

Machine learning techniques show that there are quite a few ways of going about this. The one I’ll use here can be easily implemented only in SQL by devising a score for the “distance” between two artists and requesting to see all the artists that are closest to a given artist.

So how to we calculate the distance between two artists? We use the same method used to calculate the distance between two points:

If we have two points on the plane, say a=(1,4) and b=(4, 8), then the distance between a and b is

(this is a good opportunity to thank both Pythagoras and OpenOffice math
formulas)

The same applies when measuring the distance between two points in three dimensions, so if we have two points floating happily in space at coordinates a=(1,3,7) and b=(9,2,10) then the distance between these two points is:

This also applies when measuring the distance between two points in higher dimensions (which we mere mortals can’t visualise in spatial terms). So if there are k dimensions, the distance between point a and b will be:

So all we need to do in order to find the distance between two artists a and b is treat the different genres as different dimensions of a point and calculate the distance between those coordinates (genre weights) that define point a (artist a) and those that define point b (artist b)

One problem though: There are about 35,000 different genres in the database so basically we need to find the distance between two points in a space that has 35,000 dimensions, and on top of that, do it using a simple and fast SQL query.

Well, if you look at the problem closely, you’ll notice that the calculations can be simplified. While the genre space may have 35,000 dimensions, for any two given artists the weight for the vast majority of these dimensions (genres) will be zero, so the distance between the two artists is simply the square root of the squared difference between the weights of the genres which are common to both plus the squared difference between the weights of the genres they do not share with each other.

The SQL query to get the squared difference between the weights of the genres which are common to both artists is simple. However an SQL query to find the squared difference between the weights of genres two artists are associated with but do not share with each other can get a bit hairy (especially with SQLite, which doesn’t support full outer joins), so we need some more simplifying.

To keep things SQL friendly, we’ll first create a helper table that contains the squared weights of all genres associated with an artist:

```CREATE TABLE square_weights AS
SELECT artist, SUM(weight*weight) AS sqr_sum_gweights FROM normalized_genre_distributions
GROUP BY artist;```

This means that for any two given artists we can now retrieve the square weights of both their common and non-shared genres (don’t forget to add indices to both fields).

Now suppose two artists have k genres in common. If we query the sum of the square weights of their respective genres using the above table, this will include the sum of squared weights of the genres they do not have in common (which is what we are after) but will also include the squared sum of the k common genres (which we don’t want) so we need to subtract that term.

For those who like to see it in mathematical notation (not me, but I aim to please):

Suppose there are two artists, artist a and artist b where:
ai is the weight of genre i for artist a
bi is the weight of genre i for artist b

If artist a is associated with m genres and artist b is associated with n genres and the first k genres are present for both artist a and artist b then the distance between artist a and artist b is

If you expand the last term and combine the summations that run from 1 to k, the above distance calculation simplifies to:

Now we have all we need to write the simplest SQL query that will give us the set of artists that are nearest to a given artist – for example, lets see the results for the the 20 artists closest to Tori Amos:

```SELECT NGD1.artist,
NGD2.artist,
SQRT(SW1.sqr_sum_gweights+SW2.sqr_sum_gweights-SUM(2*NGD1.weight*NGD2.weight)) AS distance
FROM square_weights SW1,
square_weights SW2,
normalized_genre_distributions NGD1,
normalized_genre_distributions NGD2
WHERE SW1.artist=NGD1.artist AND
SW2.artist=NGD2.artist AND
NGD1.artist="Tori Amos" AND
NGD1.genre=NGD2.genre
GROUP BY NGD2.artist
ORDER BY distance
LIMIT 20```

And the result is:

Now you might agree or disagree with the above result, but don’t blame the messenger. Take into account that a bunch of coarse subjective genre submissions might not be the best basis on which to build a similarity database. As an example of the inaccuracies involved, check the result below where apparently some artists are more similar to “Beatles” than “The Beatles” and “THE BEATLES” …

Still, using results from this method I was happy to discover artists I wasn’t aware of that resemble the style of other artists I like and have broadened my range of musical exploration and enjoyment.

The next part will be about how to make a browsable music map based on the above techniques. This will bring us to techniques for optimising result retrieval speed, JavaScript graphic engines and other fun stuff.