Data Mining on FreeDB – Part 1

The freedb database is a very large collection of CD descriptions (mainly music CDs) submitted by many people over the years. The history of freedb is covered in its Wikipedia article.

Downloading the database

The entire database can be downloaded from the freedb site.

Preparing the data

To prepare the data for processing after downloading the database, you first need to uncompress and extract it. For example:

tar xjvf freedb-complete-20141101.tar.bz2

Extraction will take a long time, since every entry is stored as a separate file and there are many entries.

The files will be extracted to the following directories:

data
blues
classical
country
folk
jazz
misc
newage
reggae
rock
soundtrack

Note that there are many more genres than the above division implies. Each file has a field named DGENRE in which the submitter enters their impression of the genre of the CD. The directories seem to be a coarse division according to what that field contains.
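To get a feel for how varied the free-text DGENRE values are, they can be tallied. The following is a minimal sketch, not part of the original scripts: the function name tally_genres and the sample entries are made up for illustration, and folding the genre to lower case is an assumption about what counts as "the same" genre.

```perl
use strict;
use warnings;

# Tally the free-text DGENRE values found in a list of entry texts.
# Each argument is assumed to hold the full text of one freedb entry file.
sub tally_genres {
    my @entries = @_;
    my %count;
    for my $text (@entries) {
        # Capture the value after "DGENRE=", trimmed; empty values do not match.
        if ( $text =~ /^DGENRE=[ \t]*(\S.*?)[ \t\r]*$/m ) {
            $count{ lc $1 }++;    # assumption: genre names are case-insensitive
        }
    }
    return %count;
}

# Hypothetical sample entries, reduced to the two fields of interest
my %genres = tally_genres(
    "DTITLE=A / B\nDGENRE=Country\n",
    "DTITLE=C / D\nDGENRE=Alt-Country\n",
    "DTITLE=E / F\nDGENRE=country\n",
);
printf "%s: %d\n", $_, $genres{$_} for sort keys %genres;
```

In a real run, each entry text would come from reading one file in a genre directory.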

Next, if you want to do some data mining on this endless sea of files, my recommendation is to first aggregate them into one big file. There are many ways to go about this – I wrote a simple Perl script, but it might be easier to use a bash script or even a batch file (if you happen to run Windows).

If you want to save yourself the trouble, just save the following script as bigfile.pl:

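use strict;
use warnings;
use File::Find;

my $dir = $ARGV[0];

if ( !defined $dir || $dir eq "" )
{
    print( "usage: perl bigfile.pl <genre-path>\n" );
    exit();
}

# The output file is named after the directory, e.g. country.txt
open( OUT, ">", "${dir}.txt" ) or die "Cannot create ${dir}.txt: $!";
find( \&edits, $dir );
close( OUT );

# Called by File::Find for every entry under $dir.
# File::Find changes into each directory, so $_ holds the bare file name.
sub edits
{
    my $strFile = $_;
    return unless -f $strFile;    # skip directories

    open( FL, "<", $strFile ) or return;

    my $nFields = 0;

    while ( my $strLine = <FL> )
    {
        chomp( $strLine );

        if ( $strLine =~ /^DTITLE/ || $strLine =~ /^DGENRE/ )
        {
            $nFields++;
        }
        elsif ( $strLine =~ /^EXT/ )
        {
            # Nothing from the first EXT* field onward is needed
            last;
        }

        # Skip comment lines (this also drops any other line containing '#')
        if ( $strLine !~ /#/ )
        {
            print OUT "$strLine\n";
        }
    }

    # Mark the end of the CD once at least two DTITLE/DGENRE lines were seen
    if ( $nFields > 1 )
    {
        print OUT "=EOCD=\n";
    }

    close( FL );
}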

Simply run the above script from the root directory to which the files were extracted for each of the directories that were created, for example:

perl bigfile.pl D:\USER\Downloads\freedb\blues
perl bigfile.pl D:\USER\Downloads\freedb\classical
perl bigfile.pl D:\USER\Downloads\freedb\country
perl bigfile.pl D:\USER\Downloads\freedb\folk
perl bigfile.pl D:\USER\Downloads\freedb\jazz
perl bigfile.pl D:\USER\Downloads\freedb\misc
perl bigfile.pl D:\USER\Downloads\freedb\newage
perl bigfile.pl D:\USER\Downloads\freedb\reggae
perl bigfile.pl D:\USER\Downloads\freedb\rock
perl bigfile.pl D:\USER\Downloads\freedb\soundtrack

Each time, a big file named after the directory (with .txt appended) is created, containing the contents of all the files in the corresponding directory. Again, since there are a zillion files in each directory that need to be aggregated into one, this is a lengthy process.

What the script does is accumulate the important fields of each file, followed by a line that marks the end of the CD information. Here’s an example of the content of the country.txt file created by scanning the country directory, i.e. by running perl bigfile.pl country:

DISCID=0007ec14
DTITLE=Various / Appalachian Breakdown
DYEAR=
DGENRE=Country
TTITLE0=John Henry - Etta Baker
TTITLE1=Sally Goodin - Edd Presnell
TTITLE2=One Dime Blues - Etta Baker
TTITLE3=Railroad Bill - Etta Baker
TTITLE4=Molly Brooks - Richard Chase
TTITLE5=Cripple Creek - Hobart Smith
TTITLE6=Shady Grove Edd Presnell
TTITLE7=Skip To My Lou - Richard Chase
TTITLE8=The Girl I Left Behind Me - Richard Chase
TTITLE9=Johnson Boys - Boone Reid
TTITLE10=Sourwood Mountain - Boone Reid
TTITLE11=Goin Down The Road Feeling Bad - Etta Baker
TTITLE12=Soldier's Joy - Lacey Phillips
TTITLE13=Marching Jaybird - Lacey Phillips
TTITLE14=John Brown's Dream - Hobart Smith
TTITLE15=Pretty Polly - Hobart Smith
TTITLE16=Drunken Hiccups - Hobart Smith
TTITLE17=Bully Of The Town - Etta Baker
TTITLE18=Pateroller Song - Hobart Smith
TTITLE19=Amazing Grace - Edd Presnell
=EOCD=
DISCID=00088a13
DTITLE=Johnny Cash / Complete Columbia Album Collection, Disc 42 [Strawberry Cake]
DYEAR=1976
DGENRE=Country
TTITLE0=Big River
TTITLE2=Doin' My Time
TTITLE4=I Still Miss Someone
TTITLE6=Another Man Done Gone
TTITLE7=I Got Stripes
TTITLE9=Church in the Wildwood
TTITLE10=Medley: Church in the Wildwood-> Lonesome Valley
TTITLE12=Strawberry Cake
TTITLE14=Rock Island Line
TTITLE15=Navajo
TTITLE17=Destination Victoria Station
TTITLE18=The Fourth Man
=EOCD=
DISCID=00098b13
DTITLE=Various Artists / Songs Of The West, Vol 4 (Rhino 1993)
DYEAR=1994
DGENRE=Country
TTITLE0=Al Caiola / The Magnificent Seven
TTITLE1=Al Caiola & His Orch / Bonanza
TTITLE2=Tex Ritter / High Noon (Do Not Forsake Me)
TTITLE3=CBS Orch / Gunsmoke
TTITLE4=Hugo Mentenegro / The Good, The Bad And The Ugly
TTITLE5=ABC Orch / The Lone Ranger (William Tell Overture)
TTITLE6=Gene Autry / South Of The Border (Down Mexico Way)
TTITLE7=Frankie Laine / Rawhide
TTITLE8=Roy Rogers w Spade Cooley & His Western Swing Band / On The Old Spanis
TTITLE8=h Trail
TTITLE9=NBC Orch / Wagon Train
TTITLE10=Marty Robbins / Ballad Of The Alamo
TTITLE11=Orch & Chorus Conducted by David Buttolph / Maverick
TTITLE12=Roy Rogers / Don't Fence Me In
TTITLE13=Johnny Cash / The Rebel-Johnny Yuma
TTITLE14=Johnny Western / The Ballad Of Paladin
TTITLE15=Hugh O'Brian w Ken Darby's Orch & Chorus / Legend Of Wyatt Earp
TTITLE16=ABC Orch Conducted by Herschel Burke Gilbert / The Rifleman
TTITLE17=Bill Hayes / The Ballad Of Davey Crockett
TTITLE18=Changing Channels / Changing Channels
=EOCD=
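Note the two TTITLE8 lines in the last CD: the freedb format splits a long value across several lines that repeat the same key. A minimal sketch of reading the aggregated file back into one record per CD, joining such repeated keys, could look like this. The function name parse_records is hypothetical, and the `//` operator requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Split the text of an aggregated file into one hash per CD.
# Values of repeated keys are concatenated in the order they appear.
sub parse_records {
    my ($text) = @_;
    my @records;
    my %cd;
    for my $line ( split /\r?\n/, $text ) {
        if ( $line eq '=EOCD=' ) {
            # End of one CD: store a copy of the fields and start over
            push @records, { %cd } if %cd;
            %cd = ();
        }
        elsif ( $line =~ /^([A-Z]+\d*)=(.*)$/ ) {
            $cd{$1} = ( $cd{$1} // '' ) . $2;
        }
    }
    return @records;
}

# A few lines taken from the sample output above
my $sample = join "\n",
    'DISCID=00098b13',
    'DGENRE=Country',
    'TTITLE8=Roy Rogers w Spade Cooley & His Western Swing Band / On The Old Spanis',
    'TTITLE8=h Trail',
    '=EOCD=', '';
my @cds = parse_records($sample);
print "$cds[0]{TTITLE8}\n";
```

Keeping each CD as a hash makes it easy to pick out only the fields a later step needs.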

Finally, once we have all the data conveniently located in 10 files, we can run a script on those files to extract features that we would like to do some data mining on.

One example of such a process is to understand the genres an artist is associated with.
Since each CD submission usually includes the artist name in the DTITLE field and the genre in the DGENRE field, we’ll create a script that extracts only the artist name and suggested genre for each CD submission.

Since the data is entered by humans as free text, there is no standard format we can count on, so the script will make use of some heuristics (which were determined by browsing the data).
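The main heuristics used by the script in the next section can be sketched in isolation: take the text between "DTITLE=" and the first " /" as the artist, reject lines with control or non-ASCII characters, and reject "Various" compilations. This is only an illustrative sketch; the function name artist_from_dtitle is hypothetical.

```perl
use strict;
use warnings;

# Return the artist part of a DTITLE line, or undef when the line is
# rejected (control/non-ASCII characters, or a "Various" compilation).
sub artist_from_dtitle {
    my ($line) = @_;
    return if $line =~ /[\x00-\x1F\x7F-\xFF]/;    # only plain ASCII text
    my $artist = $line;
    $artist =~ s/^DTITLE=//;       # drop the field name
    $artist =~ s{ /.*$}{};         # drop " / Album title"
    $artist =~ s/^\s+|\s+$//g;     # trim surrounding whitespace
    return if $artist =~ /Variou/i;
    return $artist;
}

# Two DTITLE lines taken from the sample output above
for my $line (
    'DTITLE=Johnny Cash / Complete Columbia Album Collection, Disc 42 [Strawberry Cake]',
    'DTITLE=Various / Appalachian Breakdown',
  )
{
    my $artist = artist_from_dtitle($line);
    print defined $artist ? "$artist\n" : "(skipped)\n";
}
```

Matching on "Variou" rather than "Various Artists" catches the different spellings submitters use for compilations, at the cost of also rejecting any artist whose name happens to contain that text.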

Refining the data

Suppose we wish to see the list of genres an artist was associated with across all the CD submissions for that artist. To do that, we’ll go over every CD in the big files prepared in the previous step and extract only the artist name and the genre submitted for that CD.

The following script does exactly that: given the name of one of the big files (without the .txt extension), it goes over all the CDs in that file and creates another file with an Artist|Genre pair for each CD. Unfortunately, since the submissions are not in Unicode, entries containing non-ASCII characters are ignored.

Save the script below as scangenre.pl:

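use strict;
use warnings;

my $strFile = $ARGV[0];

if ( !defined $strFile || $strFile eq "" )
{
        print( "usage: perl scangenre.pl <genre-name>\n" );
        exit();
}

# Input is <name>.txt, output is <name>-genres.txt, both in the current directory
open( FL, "<", "${strFile}.txt" ) or die "Cannot open ${strFile}.txt: $!";
open( OUT, ">", "${strFile}-genres.txt" ) or die "Cannot create ${strFile}-genres.txt: $!";

my $strArtist;
my $strGenre;
my $nState = 0;    # 1 while a usable artist name is waiting for its genre
while (<FL>)
{
        chomp();

        if ( /[\x00-\x1F\x7F-\xFF]/ )
        {
                # only English please...
                next;
        }

        if ( /^DTITLE/ )
        {
                # Get artist name: the text between "DTITLE=" and " /"
                $nState = 1;
                $strArtist = $_;
                $strArtist =~ s/ \/.*$//;
                $strArtist =~ s/^DTITLE=//;
                $strArtist =~ s/^\s+//;
                $strArtist =~ s/\s+$//;
                if ( $strArtist =~ m/Variou/i )
                {
                        # Skip compilations ("Various", "Various Artists", ...)
                        $nState = 0;
                }
        }
        elsif ( /^DGENRE/ )
        {
                if ( $nState == 1 )
                {
                        # Get genre name
                        $strGenre = $_;
                        $strGenre =~ s/^DGENRE=//;
                        $strGenre =~ s/^\s+//;
                        $strGenre =~ s/\s+$//;
                        if ( $strGenre ne "" )
                        {
                                print OUT "$strArtist|$strGenre\n";
                        }
                }
                $nState = 0;
        }
}
close( FL );
close( OUT );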

For example, to make the Artist/Genre pairs based on the rock.txt file, just run perl scangenre.pl rock.

After running the above script on all 10 CD information files, we will have 10 text files which all contain pairs of Artist, Genre. The same artist could have been submitted several times for several different CDs, each time with a genre that the submitter associated with that artist.
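Before loading anything into a database, a quick look at the pairs can be had with a nested hash. The sketch below is an illustration only: the function name genres_by_artist is hypothetical, the sample lines are made up, and folding genres to lower case is an assumption.

```perl
use strict;
use warnings;

# Count how often each genre was submitted for each artist.
# Input: lines of the form "Artist|Genre".
sub genres_by_artist {
    my @lines = @_;
    my %tally;
    for my $line (@lines) {
        chomp $line;
        # Split on the first '|' only, so the genre may itself contain '|'
        my ( $artist, $genre ) = split /\|/, $line, 2;
        next unless defined $genre && length $genre;
        $tally{$artist}{ lc $genre }++;    # assumption: ignore case in genres
    }
    return \%tally;
}

# Made-up sample lines in the format written by scangenre.pl
my $tally = genres_by_artist(
    "Johnny Cash|Country\n",
    "Johnny Cash|country\n",
    "Johnny Cash|Folk\n",
);
for my $genre ( sort keys %{ $tally->{'Johnny Cash'} } ) {
    printf "%s: %d\n", $genre, $tally->{'Johnny Cash'}{$genre};
}
```

For the full data set, the lines would be read from the ten -genres.txt files instead of being passed in literally.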

In the next part, we will import the Artist,Genre data files to a database so we can have a better look at what this data shows us. Interested? On to part 2
