Niklas Wilcke
2015-01-27 17:35:28 UTC
Hi MusicBrainz developers,

I'm a computer scientist from Germany. At the moment I am developing a
big data deduplication framework for Apache Spark. For testing purposes I
exported track data from the MusicBrainz database to a CSV file, using
the query in [0].

If you are interested in a real deduplication of your database, please
send me an email. All you need to do is send me your export query and a
short explanation. You will receive a result CSV file with all duplicate
clusters. I appreciate your work and thought I could maybe contribute.

Cheers,
Niklas
[0] \copy (SELECT tr.id, tr.number, tr.name AS title, tr.length, ac.name
AS artist, rc.name AS recording, ruc.date_year AS year, lg.name AS
language FROM track AS tr LEFT JOIN artist_credit AS ac ON
tr.artist_credit = ac.id LEFT JOIN recording AS rc ON tr.recording =
rc.id LEFT JOIN medium AS md ON tr.medium = md.id LEFT JOIN release AS
rl ON md.release = rl.id LEFT JOIN release_unknown_country AS ruc ON
rl.id = ruc.release LEFT JOIN language AS lg ON rl.language = lg.id) To
'/tmp/track-join.csv' With CSV HEADER;
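
To give a rough idea of what "duplicate clusters" could mean for the file exported by [0], here is a minimal Python sketch. It is not the Spark framework mentioned above, only a naive stand-in that groups rows sharing the same normalized title and artist. The function names (normalize, find_duplicate_clusters, load_rows) and the sample rows are made up for illustration; the column names are the ones the query in [0] writes into the CSV header.

```python
import csv
import io
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(text.lower().split())


def find_duplicate_clusters(rows):
    """Group rows that share a normalized (title, artist) key.

    Returns a list of clusters, each a list of track ids. Keys that occur
    only once are dropped. This is an assumed, simplistic matching rule.
    """
    clusters = defaultdict(list)
    for row in rows:
        key = (normalize(row["title"]), normalize(row["artist"]))
        clusters[key].append(row["id"])
    return [ids for ids in clusters.values() if len(ids) > 1]


def load_rows(path):
    """Read the exported CSV (with header line) into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Made-up sample rows in the column layout produced by the export query.
SAMPLE = """id,number,title,length,artist,recording,year,language
1,1,Example Song,200000,Example Artist,Example Song,2001,English
2,3,example  song,200500,example artist,Example Song,2003,English
3,2,Another Song,180000,Example Artist,Another Song,2001,English
"""

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE)))
sample_clusters = find_duplicate_clusters(sample_rows)
print(sample_clusters)
```

For the real export, load_rows('/tmp/track-join.csv') would replace the inline sample. Exact-key grouping like this misses typos and alternative spellings, which is where a proper similarity-based approach would come in.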