Discussion:
[mb-devel] Fwd: Musicbrainz database deduplication
Niklas Wilcke
2015-01-27 17:35:28 UTC
Permalink
Hi Musicbrainz Developers,

I'm a computer scientist from Germany. At the moment I am developing a
big data deduplication framework for Apache Spark. For test purposes I
dumped [0] the MusicBrainz database to a CSV file.
If you are interested in a real deduplication of your DB, please send me
an email. All you need to do is send me your export query and a short
explanation. You will receive a result CSV file containing all duplicate
clusters. I appreciate your work and thought I might be able to contribute.

Cheers,
Niklas

[0] The export query, wrapped here for readability (note that psql's
\copy meta-command must be entered on a single line):

\copy (
  SELECT tr.id, tr.number, tr.name AS title, tr.length,
         ac.name AS artist, rc.name AS recording,
         ruc.date_year AS year, lg.name AS language
  FROM track AS tr
  LEFT JOIN artist_credit AS ac ON tr.artist_credit = ac.id
  LEFT JOIN recording AS rc ON tr.recording = rc.id
  LEFT JOIN medium AS md ON tr.medium = md.id
  LEFT JOIN release AS rl ON md.release = rl.id
  LEFT JOIN release_unknown_country AS ruc ON rl.id = ruc.release
  LEFT JOIN language AS lg ON rl.language = lg.id
) TO '/tmp/track-join.csv' WITH CSV HEADER;
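For anyone who wants to inspect the resulting dump, a minimal sketch of
reading it with Python's csv module (the sample row below is made up for
illustration and is not real MusicBrainz data; only the header matches
the columns selected by the query above):

```python
import csv
import io

# Illustrative sample mimicking the CSV produced by the \copy query above.
# The data row is invented purely for demonstration.
sample = (
    "id,number,title,length,artist,recording,year,language\n"
    "1,A1,Some Track,215000,Some Artist,Some Recording,1999,English\n"
)

# DictReader exposes each row as a dict keyed by the header columns.
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["title"], rows[0]["year"])
```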
Wieland Hoffmann
2015-02-02 13:10:54 UTC
Permalink
Post by Niklas Wilcke
I'm a computer scientist from Germany. At the moment I am developing a
big data deduplication framework for Apache Spark. For test purposes I
dumped [0] the MusicBrainz database to a CSV file.
This sounds interesting. Did you run the framework against the data as
well? If so, can you share the results openly (or even more details
about the framework itself)?
--
Wieland
Niklas Wilcke
2015-02-02 14:00:45 UTC
Permalink
Hey Wieland,

at the moment I can't share code or details about the framework, but I
plan to release it under an open source license in a few months, after I
have finished my master's thesis, which is about this framework.

A huge problem for me is understanding the schema of the MusicBrainz DB.
It is hard for me to distinguish duplicates from non-duplicates because I
have no experience with the complex schema. At the moment I use the data
exported by the query mentioned in my first mail.

For now the framework is still under construction, so there are no
results worth sharing yet. But I would be happy to share the results if
you are interested.

What I need is a sensible DB query to create a CSV file from, which will
serve as my input data for the deduplication process. It is important
that each row represents a unique entity and that only duplicates are
similar to one another. If you can provide such a query, I can process
the data and publish a CSV file with duplicate pair IDs or something
like that.
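To make the proposed output format concrete, here is a small Python
sketch of what "a CSV-style list of duplicate pair IDs" could look like.
This is not the actual framework (which is not public); the use of
difflib string similarity and the 0.9 threshold are illustrative
assumptions only:

```python
import difflib
import itertools

def duplicate_pairs(rows, threshold=0.9):
    """Return (id, id) pairs whose titles are highly similar.

    Hypothetical sketch: a real deduplicator would use blocking and
    richer features, not an all-pairs comparison on one column.
    """
    pairs = []
    for a, b in itertools.combinations(rows, 2):
        ratio = difflib.SequenceMatcher(
            None, a["title"].lower(), b["title"].lower()
        ).ratio()
        if ratio >= threshold:
            pairs.append((a["id"], b["id"]))
    return pairs

# Invented example rows; row 2 is a typo-duplicate of row 1.
rows = [
    {"id": "1", "title": "Hotel California"},
    {"id": "2", "title": "Hotel Californa"},
    {"id": "3", "title": "Stairway to Heaven"},
]
print(duplicate_pairs(rows))  # → [('1', '2')]
```

The all-pairs loop is quadratic and only workable for tiny inputs; the
point is the shape of the result, one ID pair per detected duplicate.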

Regards,
Niklas
Post by Wieland Hoffmann
Post by Niklas Wilcke
I'm a computer scientist from Germany. At the moment I am developing a
big data deduplication framework for Apache Spark. For test purposes I
dumped [0] the MusicBrainz database to a CSV file.
This sounds interesting. Did you run the framework against the data as
well? If so, can you share the results openly (or even more details
about the framework itself)?