Post by Robert Kaye
However, it's working OK, but needs more tuning and a lot more bug
fixing. And it's written in Python...
But as far as your approach is concerned, big thumbs up. This is a
great way of doing mass tagging.
prepare for a long mail, this is gonna take some time to write :)
keep in mind that it probably took me 42 times as long to write it as
it will take you to read it :)
first i'd like to send a big thanks to Robert & Lukáš for their much
appreciated help, i'd be bald by now if it wasn't for their help with
this mail is mostly intended as an example of a different approach to
tagging music. if you intend to use this software to actually tag your
music then you do so at your own risk :)
okay, a bit too long summary of my idea:
as i've mentioned before, i want a simple, automated way of tagging as
much as possible of my music without me lifting a finger.
first i tried using the existing webservice, but i never managed to get
the tracks i wanted and more often than not i had to do several requests
to the webserver to receive all the data i needed. i felt it was
unnecessary to do more than 1 request per track so i decided to look a
bit at this lucene and see if i could come up with an idea to reduce the
number of requests.
and i came up with this simple idea: put all metadata in 1 field in a
document. then the user submits filename and metadata for a track and
the webservice simply searches that single field, not caring about what
text is the artist, what text is the track and so on. and well, it
worked surprisingly well. very often the song i was looking for was the
first hit, and almost always the song was within the first 30 songs
returned. and the great thing about this: the lucene search is damn
fast, virtually the only load you get is when you return the result to
so now i got a webservice which causes ~no load on the server and gives
me exactly what i'm looking for: a list of the songs i'm most likely
looking for. this was great, now it was up to the client to determine
which of the returned songs is the right one, thus putting all the load
on the client rather than the server.
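
to make the single field idea a bit more concrete, here's a minimal sketch in plain python. it is not the code lousy uses and it doesn't touch lucene at all: the names tokenize, build_document and search, the "all" field and the scoring (number of query words found in the field) are all assumptions made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_document(artist, album, title):
    # All metadata goes into ONE searchable field ("all", a hypothetical name).
    # The separate values are only kept so they can be returned to the client.
    return {
        "artist": artist,
        "album": album,
        "title": title,
        "all": Counter(tokenize(f"{artist} {album} {title}")),
    }

def search(index, query, limit=50):
    """Rank documents by how many distinct query words occur in the single field."""
    terms = set(tokenize(query))
    scored = []
    for doc in index:
        score = sum(1 for term in terms if term in doc["all"])
        if score:
            scored.append((score, doc))
    # Sort on the score only, so the dicts themselves are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:limit]]

index = [
    build_document("Europe", "The Final Countdown", "The Final Countdown"),
    build_document("Europe", "The Final Countdown", "Carrie"),
    build_document("The Beach Boys", "Surfin' U.S.A.", "Misirlou"),
]

# The query is just the raw filename; nothing says which part is the artist.
hits = search(index, "europe - carrie.mp3")
```

the point is that the search never needs to know which word is the artist and which is the title, it just returns a ranked list and leaves the final decision to the client.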
frankly i'd love to stop here, show it to you and claim "this is the
real deal, i tell you!", but something told me i'd never convince a lot
of people if i didn't actually give you some "proof" that this is a
"sane" solution. well, the level of sanity can be questioned i guess,
but to move on:
so with the help of a friend, Thomas Adamcik, we continued the work on a
perl tagger we made about a year ago. we got "decent" success with that
tagger back then, but we still had a lot of untagged songs. this perl
tagger was easy to modify to do requests to our new webservice instead
of using the musicbrainz library, but we pretty much had to rewrite our
matching "algorithm" from scratch. i honestly thought writing this matching
"algorithm" would be a piece of cake, well, i was wrong. even so, i
decided to push on until i got something working, i'd come way too far
to give up. i certainly don't regret spending several days on this.
right, let's move on to how it works.
the webservice, called "lousy":
why did i call it "lousy"? that doesn't sound promising...
i'm not sure why, it just popped into my head, think it's "lucene" &
"lossy" combined or something, that along with the frustration of coding
"lousy" is in fact a very simple piece of python code. all it does is
receive data from the user, search for the data in the lucene index and
return the top x (1 to 100, default 50) tracks to the user.
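
the "1 to 100, default 50" rule could be handled roughly like this. this is only a sketch, the function name parse_limit and its parameters are hypothetical and not taken from the lousy source:

```python
def parse_limit(raw, default=50, lowest=1, highest=100):
    """Return how many tracks to send back, clamped to lowest..highest.

    `raw` is whatever the client sent (possibly None or garbage); in that
    case the default is used instead of failing the request.
    """
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return default
    return max(lowest, min(highest, value))
```

clamping instead of rejecting keeps the service forgiving: a client asking for 500 tracks simply gets 100.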
i did cheat a bit and made the index using a small java program instead
of a python script, but that doesn't matter. it shouldn't be a problem
making the index with a python script.
i've set up this webservice and you can access it here:
go there and you'll get a quick intro on how it works.
further you can get the code i've used for lousy here:
this is a bzr archive so you can branch it (bzr branch
the client side tagger script:
keep in mind that Thomas is the one who has primarily worked on this, i
don't know it thoroughly, but enough to get you started.
this script is more advanced than lousy. it reads metadata from
mp3/ogg files, sends the metadata along with the filename to lousy and
matches the results returned against the filename/metadata.
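
the real matching is done in perl and is a lot more involved, but the basic step of comparing a filename against the candidates returned by the webservice can be sketched like this. everything here is an assumption for illustration: the names similarity and best_match, the "artist"/"title" keys, the use of python's difflib and the 0.6 threshold.

```python
import difflib
import os

def similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(filename, candidates, threshold=0.6):
    """Pick the candidate whose "artist - title" is closest to the filename.

    Returns None if no candidate reaches the threshold, so a bad guess
    is never used to tag a file.
    """
    stem = os.path.splitext(os.path.basename(filename))[0]
    best, best_score = None, 0.0
    for cand in candidates:
        score = similarity(stem, f"{cand['artist']} - {cand['title']}")
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None

candidates = [
    {"artist": "The Beach Boys", "title": "Misirlou"},
    {"artist": "The Beach Boys", "title": "Surfin' U.S.A."},
]
match = best_match("Beach Boys - Surfin' Usa.mp3", candidates)
```

a string comparison like this only looks at the filename, a real matcher would also weigh in the existing tags and the track length.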
instead of babbling any more about it i'll tell you where to fetch it:
it's a bzr archive: bzr branch http://home.samfundet.no/~canidae/tagger/
how to use this stuff:
when i coded this i did at some point want to code it fairly similar to
the existing webservice, but it didn't turn out as nice as i wanted it.
for now you don't need to worry about setting this up, feel free to use
mb.samfundet.no, although keep in mind that i may decide to remove it at
any time, it's only there for testing purposes.
this one requires some modules, a quick grep:
frankly, this module stuff is not my area, but i've had no trouble
getting the script to work without doing any "ugly hacks" (iirc all of
these are in the debian repository (sarge)).
Thomas does however suggest getting a newer version of
MP3::Info/MP3::Tag than the ones in the debian archive. The case may be
the same for Ogg::Vorbis::Header, i know for sure that we've had some
issues saving ogg tags (see patch/libogg-vorbis-header-perl.patch).
ok, let's hope that's settled and move on to how to tag:
1. cd tagger
2. ./bin/fresh.pl -v <path to untagged files>
don't worry, this is a "dry run", your files won't be tagged and moved,
it just shows what it would do. fresh.pl is made for testing, we're
working on making it better.
if you add "--save <path>" it should tag, rename and move your files,
but i do _not_ guarantee that it will work (if you got ogg files then
make sure you read the patch for Ogg::Vorbis::Header) so i recommend you
don't do this on files you value.
results from tests i've done:
i've primarily used this tagger on a set of 986 files, all mp3's iirc.
the files got _horrible_ metadata (if any at all), limited data in the
filename and on top of that, the filenames have been encoded between
utf-8/mac/latin1 and who knows what else.
in other words, this selection is about the worst selection you can come
how long did it take to check 986 songs?
roughly 36 mins on a dual amd mp 2200+ running both lousy and tagger.
how many songs did "tagger" recognize?
532, or roughly 54%.
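
for those who want to check the numbers, here's the arithmetic as a tiny python snippet. it only uses the figures given in this mail (986 files, 532 recognized, roughly 36 minutes), the variable names are made up:

```python
total_files = 986   # size of the test set
recognized = 532    # files the tagger recognized
minutes = 36        # rough wall-clock time for the whole run

recognition_rate = recognized / total_files
seconds_per_file = minutes * 60 / total_files

print(f"{recognition_rate:.0%} recognized, {seconds_per_file:.1f} s per file")
# prints "54% recognized, 2.2 s per file"
```

so on that machine, with lousy and tagger running side by side, a file takes a bit over 2 seconds on average.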
how many of these songs were tagged wrong?
1. Alf Proysen - Tango for to.mp3
was recognized as "Alf Prøysen - Tango for TV".
reason: "Alf Prøysen - Tango for to" doesn't exist in my db, and
the titles are strikingly similar.
2. Andrew Lloyd Webber - Phantom Of The Opera.mp3
was recognized as "Andrew Lloyd Webber - Overture".
reason: this is actually not a wrong match, the filename is wrong
3. Beach Boys - Surfin' Usa.mp3
was recognized as "The Beach Boys - Misirlou".
reason: the mp3 is cut off, 30 secs are missing, so its length happens
to match another song on the album "Surfin' U.S.A."
4. vestlandsfanden - For Livets Glade Gutter.mp3
was recognized as "Vestlandsfanden - Alvefolket".
reason: there are 3 tracks in the db that should match, however
their track lengths don't match (> +/-5 secs).
on the other hand they also got an album named the same as
the track, which confuses the current matching algorithm.
how many songs were "partially" tagged wrong?
to explain what i mean with "partially tagged wrong":
consider you got an mp3 named "europe - the final countdown.mp3" with no
how many times has this track been released?
how on earth can you possibly know which album this mp3 comes from?
you can't, not even humans can. still, i want it tagged, because the mp3
is perfectly fine (well, except for the fact that it's a mp3), i don't
care which album "tagger" thinks it's connected with, as long as it's
able to recognize the song.
and currently, tagger does. it will tag the song, but it's highly random
which album it gets connected to. this is what i mean with "partially
if you only got a single mp3 this usually won't affect you much, but
let's say i got an entire album of europe, where all tracks got no
metadata and simply are named "<artist> - <track>.mp3" (the agony!).
that's bound to make the songs get tagged on different albums, which
still, in my view, this is better than nothing.
since the songs i'm tagging come from someone else (who clearly doesn't
love his/her songs as much as we do) i don't know which album these
songs come from, and it's impossible for me to determine how many of
them got connected to the wrong album.
if both trackname and albumname are given, then it's a lot less common
for this to happen.
results from a test with more sane tags (and more files):
due to the huge collection of songs, i've not checked the entire log, it
would take days.
tagged: 6989 (87%)
don't know how many are tagged [partially] wrong, but a very brief
search did not give me the impression that it's any worse than the small
test (that is, i didn't find a single one tagged wrong, but i didn't
look very well either).
for your amusement i've put out the logs from these two tests:
http://home.samfundet.no/~canidae/scan.log (986 songs, 561k)
http://home.samfundet.no/~canidae/scan2.log (7986 songs, 4.7m)
todo (yes, i'm soon done with this mail):
lousy can be improved. since there's a minor bug with pylucene (or
python or whatever it is) i can't access files greater than 2g. the
first index i built was 2.4g, but instead of indexing every field in the
documents i just indexed the field i search in and managed to push it
just below 2g.
lousy could be improved by:
- not returning hits that don't match the given tracknum (if any)
- only returning tracks with a length +/-10 secs from the given length
- using mod_python instead of cgi so it won't open the index each search
- <add your suggestion here>
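
the first two suggestions could look something like the sketch below. this is not part of lousy today: the function name filter_hits and the "tracknum"/"length" keys (length in seconds) are assumptions, and the filtering is done on a plain list of dicts rather than on lucene results.

```python
def filter_hits(hits, tracknum=None, length=None, tolerance=10):
    """Drop hits that contradict the track number or length sent by the client.

    A filter is only applied when the client actually supplied that value,
    so a request without tracknum/length gets every hit back.
    """
    kept = []
    for hit in hits:
        if tracknum is not None and hit["tracknum"] != tracknum:
            continue
        if length is not None and abs(hit["length"] - length) > tolerance:
            continue
        kept.append(hit)
    return kept

hits = [
    {"title": "a", "tracknum": 1, "length": 200},
    {"title": "b", "tracknum": 2, "length": 205},
    {"title": "c", "tracknum": 1, "length": 230},
]
same_track = filter_hits(hits, tracknum=1, length=195)
```

doing this on the server would shrink the list the client has to go through, at the cost of losing hits when the submitted tracknum or length is wrong (a cut off mp3, for instance).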
tagger could be improved by:
- making a more sane "match.pm"
- making a decent interface
- <your suggestion right here>
phew, my head is about toast now, so i'll stop here. i hope you get the
general idea and take the time to look at this despite the clumsy setup.
i wanted to present something to you a couple of weeks ago, but it turned
out this was much harder than i anticipated.
since this "documentation" is very rough, not very well formatted and
probably not very helpful to many, do feel very free to ask about
stuff that's left unclear.
do feel even more free to play around with both lousy/tagger and improve
them, just send patches :)
i would however prefer if you mail to this list instead of directly to
me (unless the list admins disagree) as there may be someone else with a
as some of you know i'm usually around on irc, so you can give me a
hilight there as well :)
right, sorry 'bout the long mail. if it helps, it hurts me more than it
hurts you =)