Discussion:
[mb-devel] Data Import Policy
Andre Wiethoff
2015-05-11 10:24:57 UTC
Hello everybody,

it's me again, with some more weird ideas ;-)
I would like to ask whether automatic metadata collection/crawling via
web data mining is acceptable for insertion into the MusicBrainz DB.

Basically I would like to work on these two automatic data
crawling/mining tasks:

1) Add links for more artists to the AMG, Amazon and BBC web pages
(which would be matched automatically by crawling the appropriate web
pages - of course very conservatively). Also interesting would be links
to the "Musixmatch" lyrics website ( https://www.musixmatch.com ). It
seems legitimate, so could it be added to the lyrics site whitelist?

2) Adding a new kind of relation (which would also need to be approved
first); I would call it "similar to". Basically I would mainly add
similarities between two artists automatically, but of course other
similarities would also be possible (song similarity, etc.). As this
kind of data is highly subjective, it might be worth considering whether
only data from automatic web/database data mining would be accepted as
input (and no manual input from users)... This kind of data is available
on AMG, Amazon and BBC, which could be crawled automatically (and only
two artist IDs would be added to the database as similar). Of course,
later on, similarity could also be calculated by an algorithm using some
scrobbling data.
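To make the idea a bit more concrete, here is a rough sketch of the kind
of record such a crawl would produce - just two artist MBIDs plus the
source the similarity claim was taken from. (This is only an
illustration; the names are made up and not an existing MusicBrainz
schema.)

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class SimilarityClaim:
        artist_mbid_a: str  # MusicBrainz ID of the "seed" artist
        artist_mbid_b: str  # MusicBrainz ID of the artist claimed to be similar
        source: str         # where the claim was crawled from, e.g. "allmusic"

    # Only the two IDs (plus the source) would end up in the database:
    claim = SimilarityClaim(
        artist_mbid_a="456eabce-d1dd-4481-a206-36ab4f2eaeb8",  # Herbert Groenemeyer
        artist_mbid_b="00000000-0000-0000-0000-000000000000",  # placeholder MBID
        source="allmusic",
    )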

Are the two scenarios permitted by the data import policy of MusicBrainz
(e.g. do they violate any copyrights, etc.)? I think the first case
shouldn't create any problems at all, as a link gives full attribution
to the data source (it is just a link).
The second case is a bit more difficult and problematic, as the data is
basically created by the respective companies and then inserted into a
new database owned by somebody else. On the other hand, only two IDs are
stored (IDs which only make sense within MusicBrainz) - can that data
violate any copyright? I think it is a bit similar to Google crawling
pages and providing the results as their own...

What do you think?

Best regards,

Andre

PS: Here is a small evaluation of the URLs stored in the system for some
link targets:

URLs total: 2117159
Discogs total: 567330
  Discogs Release: 212163
  Discogs Artist: 200194
  Discogs Master: 126747
  Discogs Label: 26405
AllMusic total: 55395
  AllMusic Artist: 29943
  AllMusic Album: 20290
  AllMusic Composition: 4216
Amazon total: 183044
  Amazon Product: 182683
  Amazon Artist: 229
BBC total: 9805
  BBC Artist: 1347
  BBC Reviews: 8208

SoundCloud: 26166
YouTube total: 29747
  YouTube User Channel: 15127
  YouTube Video: 12723

I wonder why there are not more BBC artist links, as these links are
just http://www.bbc.co.uk/music/artists/<musicbrainz artist id>. It
should be pretty easy to add them...
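For what it's worth, constructing and checking such candidate links
could be as simple as the sketch below (assuming the requests library;
the 200-status check is only a first filter, and any real bot would of
course have to follow the bot code of conduct and rate-limit itself):

    import time
    import requests

    def bbc_artist_url(artist_mbid):
        # BBC Music artist pages are keyed directly by the MusicBrainz artist ID.
        return "http://www.bbc.co.uk/music/artists/%s" % artist_mbid

    def page_exists(url):
        # Conservative first check: only report candidates that answer 200.
        response = requests.head(url, allow_redirects=True, timeout=10)
        return response.status_code == 200

    for mbid in ["456eabce-d1dd-4481-a206-36ab4f2eaeb8"]:  # example artist MBID
        url = bbc_artist_url(mbid)
        if page_exists(url):
            print("candidate link:", url)
        time.sleep(1)  # be polite to the BBC servers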
Tom Crocker
2015-05-11 14:20:40 UTC
Post by Andre Wiethoff
Hello everybody,
its again me, with some other weird ideas ;-)
I would like to ask whether an automatic metadata collection/crawling
for insertion in the Musicbrainz DB is fine, which will be created by
web data mining?
Basically I would like to work on these two automatic data crawling/mining tasks:
1) Add links for more artists to the AMG, Amazon and BBC web pages
(which will be automatically be matched by crawling the appropriate web
pages - of course very conservatively).
I think the community appreciates *very* conservative bots. You might want
to check out https://musicbrainz.org/doc/Bots,
https://musicbrainz.org/doc/Code_of_Conduct/Bots and
https://github.com/murdos/musicbrainz-bot

Post by Andre Wiethoff
Also interesting would be links
to the "Musixmatch" lyrics website ( https://www.musixmatch.com ). It
seems legit, so can it be added to the lyrics site whitelist?
You should put in a ticket for this: https://musicbrainz.org/doc/Proposals -
lyrics sites used to be handled differently from all other relationships
(they may still be), needing approval from ruaok.
Post by Andre Wiethoff
2) Adding a new kind of relation (which also would need to be approved
first), I would call it "similar to". Basically I would mainly
automatically add similarities between two artists, but of course also
other similarities would be possible (song similarity, etc.). As this
kind of data is highly subjective, it might be a thought whether only
data by automatic web/database data mining would be accepted as input
(and no manual input of users)... This kind of data is available on AMG,
Amazon and BBC, which could be automatically be crawled (and only two
artist IDs would be added to the database as similar). Of course lateron
similarity could also be calculated by an algorithm using some
scrobbling data.
This sounds dodgy to me - I'd at least want to know what the basis of their
claim of similarity was. Also, be particularly careful what you do with AMG
data! But you could always discuss the general idea on the forums. That's
probably a better place to get wider views of the community. You could also
try the style list, where new relationships used to be discussed.
Post by Andre Wiethoff
<snip>
I wonder why there are not more BBC artist links, as these links are
just http://www.bbc.co.uk/music/artists/<musicbrainz artist id>. It
should be pretty easy to add them...
Sounds like a good plan. I guess no-one's got around to it or thought the
beeb might do it themselves
Ian McEwen
2015-05-11 14:32:30 UTC
Post by Tom Crocker
Post by Andre Wiethoff
Hello everybody,
its again me, with some other weird ideas ;-)
I would like to ask whether an automatic metadata collection/crawling
for insertion in the Musicbrainz DB is fine, which will be created by
web data mining?
Basically I would like to work on these two automatic data crawling/mining tasks:
1) Add links for more artists to the AMG, Amazon and BBC web pages
(which will be automatically be matched by crawling the appropriate web
pages - of course very conservatively).
I think the community appreciate *very* conservative bots. You might want
to check out https://musicbrainz.org/doc/Bots
https://musicbrainz.org/doc/Code_of_Conduct/Bots and
https://github.com/murdos/musicbrainz-bot
Post by Andre Wiethoff
Also interesting would be links
to the "Musixmatch" lyrics website ( https://www.musixmatch.com ). It
seems legit, so can it be added to the lyrics site whitelist?
You should put in a ticket for this https://musicbrainz.org/doc/Proposals -
Lyric sites used to be different from all other relationships (they may
still be), needing approval from ruaok
Post by Andre Wiethoff
2) Adding a new kind of relation (which also would need to be approved
first), I would call it "similar to". Basically I would mainly
automatically add similarities between two artists, but of course also
other similarities would be possible (song similarity, etc.). As this
kind of data is highly subjective, it might be a thought whether only
data by automatic web/database data mining would be accepted as input
(and no manual input of users)... This kind of data is available on AMG,
Amazon and BBC, which could be automatically be crawled (and only two
artist IDs would be added to the database as similar). Of course lateron
similarity could also be calculated by an algorithm using some
scrobbling data.
This sounds dodgy to me - I'd at least want to know what the basis of their
claim of similarity was. Also, be particularly careful what you do with AMG
data! But you could always discuss the general idea on the forums. That's
probably a better place to get wider views of the community. You could also
try the style list, where new relationships used to be discussed.
I agree; this is far too vague to be used properly, if there even is a
clear definition of using such a thing properly.
Post by Tom Crocker
Post by Andre Wiethoff
<snip>
I wonder why there are not more BBC artist links, as these links are
just http://www.bbc.co.uk/music/artists/<musicbrainz artist id>. It
should be pretty easy to add them...
Sounds like a good plan. I guess no-one's got around to it or thought the
beeb might do it themselves
BBC Music links aren't to be added unless they have substantial content
that isn't just from MusicBrainz - because obviously, if you want a BBC
Music link, you can just construct one, so it's really only useful if
there's more there than on MB.

See
https://musicbrainz.org/relationship/d028a975-000c-4525-9333-d3c8425e4b54

So it's not simple to add them automatically, and this fact also
explains the limited numbers to some degree.
Paul Taylor
2015-05-11 15:06:17 UTC
Post by Andre Wiethoff
Hello everybody,
its again me, with some other weird ideas ;-)
I would like to ask whether an automatic metadata collection/crawling
for insertion in the Musicbrainz DB is fine, which will be created by
web data mining?
Basically I would like to work on these two automatic data crawling/mining tasks:
1) Add links for more artists to the AMG, Amazon and BBC web pages
(which will be automatically be matched by crawling the appropriate web
pages - of course very conservatively). Also interesting would be links
to the "Musixmatch" lyrics website ( https://www.musixmatch.com ). It
seems legit, so can it be added to the lyrics site whitelist?
PS: Here is a small evaluation of the URLs stored in the system for some link targets:
URLs total: 2117159
Discogs total: 567330
Discogs Release: 212163
Discogs Artist: 200194
Discogs Master: 126747
Discogs label: 26405
Andre

I have a particular interest in improving the links between MusicBrainz
(artists, releases and labels) and Discogs. I took the approach that it
is very difficult to link correctly 100% of the time, but it is quite
easy to find potential links that are right at least 95% of the time, so
my solution was to generate potential links and make them available so
others can submit them if they are interested. This takes an
artist-centric approach and concentrates on finding links for releases,
and on making it easier to import releases from Discogs for any artist.

I do plan to create reports showing potential artist and label links as well.

This is all available at http://albunack.net and might be of interest to
you.

Whilst adding a link doesn't modify existing data, it would be more
useful if additional data from the linked entity were also added, but of
course you then have to be even more sure it is correct. For example, if
two releases are matched on various metadata criteria but the
MusicBrainz release does not have a barcode entered and the Discogs one
does, then it makes sense to add the Discogs barcode at the same time as
linking it. It is possible to do this using the release seeding
mechanism; what is not possible is to edit existing data, i.e. you can
add a barcode but not modify an existing one.
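As a rough illustration of the seeding approach, the sketch below builds
a form that pre-fills the "add release" editor so a human can still
review and submit it. The field names are taken from the release editor
seeding documentation
(https://musicbrainz.org/doc/Development/Release_Editor_Seeding) and are
worth double-checking there; the values are made up.

    # Sketch: seed the MusicBrainz "add release" editor with a name, a barcode
    # and a Discogs link. A person reviews and submits the resulting form.
    SEED_TARGET = "https://musicbrainz.org/release/add"

    def seed_form(fields):
        inputs = "\n".join(
            '  <input type="hidden" name="%s" value="%s">' % (name, value)
            for name, value in fields.items()
        )
        return ('<form action="%s" method="post">\n%s\n'
                '  <input type="submit" value="Seed release editor">\n</form>'
                % (SEED_TARGET, inputs))

    print(seed_form({
        "name": "Example Release",                         # made-up release title
        "barcode": "1234567890123",                        # barcode found on Discogs
        "urls.0.url": "https://www.discogs.com/release/1", # placeholder Discogs URL
        "edit_note": "Seeded from a Discogs match (illustration only).",
    }))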

Paul/ijabz
Tom Crocker
2015-05-11 18:30:42 UTC
Sorry Paul. I meant to point out your site and say what I've seen of it
seems great. I just wish I could find the time to try it out more!
Paul Taylor
2015-05-11 18:40:57 UTC
Post by Tom Crocker
Sorry Paul. I meant to point out your site and say what I've seen of
it seems great. I just wish I could find the time to try it out more!
Tom

thanks, it's good to have some positive feedback - so far I've had just
about zilch

Paul
Andre Wiethoff
2015-05-12 08:19:06 UTC
Dear Paul,
Post by Paul Taylor
I have a particular interest in improving the links between Musicbrainz
(artist, releases and labels) with Discogs. I took the approach it is
very difficult to correctly link 100% of the time but it is quite easy
to find potential links that are right at least 95% of the time so my
solution was generate potential links and make them available so others
can submit them if they are interested. This takes an artist centric
approach and concentrates on finding links for releases and making it
easier to import releases from discogs as well for any artist.
I do plan to create reports showing potential artist and labels as well.
This is all available at http://albunack.net and might be of interest to
you.
Whilst adding a link doesn't modify existing data, it would be more
useful if additional data from the linked entity was also added but of
course you have to be even sure it is correct. For example if two
releases are matched on various metadata criteria but the MusicBrainz
release did not have a barcode entered and the Discogs one does then it
makes sense to add the Discogs barcode at the same as linking it. It is
possible to do this using the seed release mechanism, what is not
possible is to edit existing data, i.e you can add a barcode but not
modify an existing barcode.
Thanks for sending the information about your project, I will definitely
have a look at it!

Best regards,

Andre
Frederik “Freso” S. Olesen
2015-05-12 11:37:56 UTC
Post by Andre Wiethoff
I would like to ask whether an automatic metadata collection/crawling
for insertion in the Musicbrainz DB is fine, which will be created by
web data mining?
You've got replies on most stuff, so I'll just skip over those and add
my own comments to some things.
Post by Andre Wiethoff
2) Adding a new kind of relation (which also would need to be approved
first), I would call it "similar to". […] Of course later on
similarity could also be calculated by an algorithm using some
scrobbling data.
This should not be a relationship like the ones we currently have. This
is an entirely subjective piece of information, and some people will
consider two artists similar while others will not. There are already
projects (though I forget which, sorry) that group/cluster entities
based on relationships (which IIRC wasn't completely off), so it is
possible to do something like this with the data in MB already. There's
also AcousticBrainz, which can be used to cluster entities based on the
acoustic properties of their recordings. If we get scrobbling hooked
into our end at some point, that'd obviously be another usable source
for this, but it isn't required to get something like this going.
Post by Andre Wiethoff
I think it is a bit similar to Google crawling
and provide the results as their own...
Google does pay at least some of their data sources. I know, for one,
that Google is MetaBrainz' biggest "customer" by far (in terms of how
much money they put in the project).
--
Namasté,
Frederik “Freso” S. Olesen <http://freso.dk/>
MB: https://musicbrainz.org/user/Freso
Wiki: https://wiki.musicbrainz.org/User:Freso
Andre Wiethoff
2015-05-12 14:00:13 UTC
Hello Frederik,

thanks for your thoughts!
Post by Frederik “Freso” S. Olesen
This should not be a relationship like we currently have relationships.
This is an entirely subjective piece of information and some people will
consider two artists similar while others will not.
Yes, that is why I proposed not opening it for public editing, but
having it "computer generated" only (by whatever means, be it web
crawling or clustering on large data sets).
Post by Frederik “Freso” S. Olesen
There are already
projects (though I forget which, sorry) that group/cluster entities
based on relationships (which IIRC wasn't completely off), so it is
possible to do something like it with the data in MB already.
I wonder which relationships have been used to group/cluster the
entities? With the existing MetaBrainz data, I can only think of
scrobbling data for generating that kind of information...
I found this paper, but they used MusicBrainz only for the basic
metadata retrieval and Audioscrobbler (Last.fm) for the similarity
clustering:
http://www.sfu.ca/~shaw/papers/musicianMap-VDA09.pdf
Post by Frederik “Freso” S. Olesen
There's
also AcousticBrainz which can be used to cluster entities based on the
acoustic properties of their recordings.
I don't think that clustering on the acoustic properties will bring any
good results for now; I guess it will take another ten years until
something exists that produces results comparable to those of a human
expert (or even an advanced amateur)...
Post by Frederik “Freso” S. Olesen
If we get scrobbling hooked in
to our end at some point, that'd obviously be another usable source for
this, but it isn't required to get something like this going.
But in the end you agree that the result of such a web crawl/clustering
algorithm/whatever should be stored in the database as the final result
(for speedier access to the results) - if implemented at all? But
perhaps we should first discuss whether the new data would be beneficial
for the users (or the database)...

I thought that relationships would have been the best place to put them,
as it is in fact a relation between e.g. two artists (even though the
definition of similarity would depend on the algorithm or the page that
is crawled). E.g. Amazon will most probably use the "customers that buy
stuff from this artist also bought stuff from these other artists"
similarity measure. I am unsure which measures are used by AMG and BBC,
but most probably also some kind of clustering algorithm...

Please see the similarity results of the pages for the artist "Herbert
Grönemeyer":
http://www.allmusic.com/artist/herbert-gr%C3%B6nemeyer-mn0000956217/related
http://www.amazon.de/Herbert-Groenemeyer/e/B000APL43M
http://www.bbc.co.uk/music/artists/456eabce-d1dd-4481-a206-36ab4f2eaeb8#more
Post by Frederik “Freso” S. Olesen
Post by Andre Wiethoff
I think it is a bit similar to Google crawling
and provide the results as their own...
Google does pay at least some of their data sources. I know, for one,
that Google is MetaBrainz' biggest "customer" by far (in terms of how
much money they put in the project).
I also see this as a bit controversial, as even two IDs could be
intellectual property...
I think the biggest question is whether to allow web crawling for this
purpose at all.
Does anybody else have some insights on this?

Thank you in advance for your answers!

Best regards,

Andre
Frederik “Freso” S. Olesen
2015-05-12 14:40:25 UTC
Post by Andre Wiethoff
Post by Frederik “Freso” S. Olesen
This should not be a relationship like we currently have relationships.
This is an entirely subjective piece of information and some people will
consider two artists similar while others will not.
Yes, therefore I did propose to not open it for public editing, but
having it "computer generated" only (by whatever means, be it web
crawling or clustering on large data sets).
But you proposed it to be stored as a relationship like the current
relationships in the MusicBrainz database... (read below)
Post by Andre Wiethoff
But in the end you agree that the result of such a web crawl/clustering
algorithm/whatever should be stored in the database as final result (for
speedier access of the results) - if implemented at all? But perhaps we
should discuss at first whether the new data would be beneficial for the
users (or the database)...
In *a* db, sure, in *the* (MB) db, no. It is not objective data and it
would not be user generated. It would be far more reasonable to place it
in another (sub)project. See e.g., AcousticBrainz and CritiqueBrainz for
two MetaBrainz projects expanding on the MusicBrainz data without being
inserted directly into the MB site/data themselves. A
RecommendationBrainz or SimilarityBrainz (or, heck, maybe it could be
part of CritiqueBrainz?) would be a better fit for this.

(Also note that having it in a separate project does not mean it cannot
be used by/on MusicBrainz; e.g., CritiqueBrainz reviews are pulled in
for relevant MB releases/release groups.)
Post by Andre Wiethoff
Post by Frederik “Freso” S. Olesen
There are already
projects (though I forget which, sorry) that group/cluster entities
based on relationships (which IIRC wasn't completely off), so it is
possible to do something like it with the data in MB already.
I wonder which relationships have been used to group/cluster the
entities? […]
IIRC, all the relationships. The more times two entities are linked to
each other, the closer those two entities are considered. AFAIK, it's a
fairly simple heuristic, but given the number of relationships in the MB
db, it should give reasonable results for most fairly well-known
artists.
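A toy version of that heuristic could look like the sketch below - it
just counts how many relationships directly connect other artists to a
seed artist and treats the count as a closeness score (illustrative
only, not existing MusicBrainz code):

    from collections import Counter

    def closeness_scores(relationships, seed_artist):
        # `relationships` is an iterable of (artist_a, artist_b) MBID pairs,
        # one entry per artist-artist relationship (members, collaborations...).
        scores = Counter()
        for a, b in relationships:
            if a == seed_artist:
                scores[b] += 1
            elif b == seed_artist:
                scores[a] += 1
        return scores.most_common()  # most-linked ("closest") artists first

    # Tiny example with made-up IDs:
    rels = [("artist-1", "artist-2"), ("artist-1", "artist-2"), ("artist-1", "artist-3")]
    print(closeness_scores(rels, "artist-1"))  # [('artist-2', 2), ('artist-3', 1)]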
Post by Andre Wiethoff
Post by Frederik “Freso” S. Olesen
There's
also AcousticBrainz which can be used to cluster entities based on the
acoustic properties of their recordings.
I don't think that clustering regarding the acoustic properties will
bring any good results for now, I guess this will still take ten years
until there exist something that produces results matchable to a human
expert (or even advanced amateur)...
I wouldn't make it stand on its own, no. ABz is still very much in its
infancy and the tools and algorithms in Essentia are not yet up to par
with this massive 2+ million song dataset currently available in the ABz
database. However, ABz can give you ranges about whether a group does
mostly vocal or instrumental things, whether they're mostly high or low
BPM, whether they have a predominant mood, etc.

These aren't necessarily 100% accurate, but by combining similarity on
these values with relationship clustering, I think it may be possible to
get some interesting results (e.g., two artists with a lot of
relationships connecting them that additionally do mostly acoustic,
instrumental, happy+relaxed music are likely more similar than two
artists with no relationships connecting them, where one does mostly
instrumental and the other mostly vocal stuff).
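A very rough sketch of how the two signals could be blended (the feature
names mimic the kind of high-level values AcousticBrainz reports, but
the normalisation and the 70/30 weighting are completely made up):

    def combined_similarity(rel_count, features_a, features_b):
        # features_a / features_b are dicts like {"voice": 0.8, "bpm_high": 0.3}
        # with values in [0, 1], roughly what the high-level ABz data looks like.
        rel_score = min(rel_count, 10) / 10.0  # saturate at 10 shared relationships
        shared = set(features_a) & set(features_b)
        if shared:
            diff = sum(abs(features_a[k] - features_b[k]) for k in shared) / len(shared)
            audio_score = 1.0 - diff
        else:
            audio_score = 0.0
        return 0.7 * rel_score + 0.3 * audio_score  # arbitrary weighting

    print(combined_similarity(4, {"voice": 0.9, "bpm_high": 0.2},
                                 {"voice": 0.8, "bpm_high": 0.3}))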

When/if we get access to scrobbles, that's a third data source that can
be added to the mix, but I really do not think we need it to get started
on a similarity/recommendation engine.
--
Namasté,
Frederik “Freso” S. Olesen <http://freso.dk/>
MB: https://musicbrainz.org/user/Freso
Wiki: https://wiki.musicbrainz.org/User:Freso
Andre Wiethoff
2015-05-12 20:22:25 UTC
Hello Frederik,
Post by Frederik “Freso” S. Olesen
Post by Andre Wiethoff
But in the end you agree that the result of such a web crawl/clustering
algorithm/whatever should be stored in the database as final result (for
speedier access of the results) - if implemented at all? But perhaps we
should discuss at first whether the new data would be beneficial for the
users (or the database)...
In *a* db, sure, in *the* (MB) db, no. It is not objective data and it
would not be user generated. It would be far more reasonable to place it
in another (sub)project. See e.g., AcousticBrainz and CritiqueBrainz for
two MetaBrainz projects expanding on the MusicBrainz data without being
inserted directly into the MB site/data themselves. A
RecommendationBrainz or SimilarityBrainz (or, heck, maybe it could be
part of CritiqueBrainz?) would be a better fit for this.
(Also note that having it in a separate project does not mean it cannot
be used by/on MusicBrainz; e.g., CritiqueBrainz reviews are pulled in
for relevant MB release( group)s.)
I see. So most probably I proposed this to the wrong project?
From that point of view, a recommendation engine (a recommendation
matrix - probably a sparse matrix stored in a database) should also not
be part of MusicBrainz, but rather another project extending
MusicBrainz, right?
Post by Frederik “Freso” S. Olesen
Post by Andre Wiethoff
Post by Frederik “Freso” S. Olesen
There are already
projects (though I forget which, sorry) that group/cluster entities
based on relationships (which IIRC wasn't completely off), so it is
possible to do something like it with the data in MB already.
I wonder which relationships have been used to group/cluster the
entities? […]
IIRC, all the relationships. The more times two entities linked to each
other, the closer those two entities were. AFAIK, it's a fairly simple
heuristic, but given the amount of relationships in the MB db, it should
give reasonable results for most fairly well known artists.
Post by Andre Wiethoff
Post by Frederik “Freso” S. Olesen
There's
also AcousticBrainz which can be used to cluster entities based on the
acoustic properties of their recordings.
I don't think that clustering regarding the acoustic properties will
bring any good results for now, I guess this will still take ten years
until there exist something that produces results matchable to a human
expert (or even advanced amateur)...
I wouldn't make it stand on its own, no. ABz is still very much in its
infancy and the tools and algorithms in Essentia are not yet up to par
with this massive 2+ million song dataset currently available in the ABz
database. However, ABz can give you ranges about whether a group does
mostly vocal or instrumental things, whether they're mostly high or low
BPM, whether they have a predominant mood, etc.
These aren't necessarily 100% accurate, but combining similarity on
these values with relationship clustering, I think it may be possible to
get some interesting results (e.g., two artists with a lot of
relationships connecting them that additionally does mostly acoustic,
instrumental happy+relaxed music are likely more similar than two
artists with no relationships connecting them and one doing mostly
instrumental and the other doing mostly vocal stuff).
This is where we differ (but of course this depends on the definition of
the term "interesting" ;-)
I don't think that the relationship table will give sufficient
information to really find e.g. artists that are closely related (as
quite often only the band members are known). Combining it with a large
set of acoustic features, which are only probabilities of how "similar"
two songs are with regard to a given feature, will not improve the
result that much. I agree that you would get a list of songs (and by
extension artists) which are somewhat similar in the kind of music they
make, but this will not provide a (sorted) list of the most similar
artists/songs/whatever...
So, if the base data from the relationships is not good enough, adding
the acoustic properties will only allow grouping into very large groups
like the ones you mentioned, e.g. with/without vocals or fast/slow BPM.

Perhaps we should start by defining "similarity" first. Here is my
attempt: similarity is the probability of a user also liking
artist/song/etc. B if they like artist/song/etc. A.
(This is a user-centric view of similarity - of course each individual
user would judge differently how similar two bands are, but this is
only a probability...)
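Written out, that definition is just a conditional probability that
could be estimated from per-user listening data (the data below is
hypothetical; real numbers would need something like scrobbles):

    def similarity(liked_artists_by_user, artist_a, artist_b):
        # P(user likes B | user likes A), from per-user sets of liked artists.
        fans_of_a = [artists for artists in liked_artists_by_user.values()
                     if artist_a in artists]
        if not fans_of_a:
            return 0.0
        fans_of_both = [artists for artists in fans_of_a if artist_b in artists]
        return len(fans_of_both) / len(fans_of_a)

    data = {"u1": {"Groenemeyer", "Westernhagen"},
            "u2": {"Groenemeyer"},
            "u3": {"Westernhagen"}}
    print(similarity(data, "Groenemeyer", "Westernhagen"))  # 0.5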
Post by Frederik “Freso” S. Olesen
Post by Andre Wiethoff
Please see the similarity results of the pages for the artist "Herbert Grönemeyer":
http://www.allmusic.com/artist/herbert-gr%C3%B6nemeyer-mn0000956217/related
http://www.amazon.de/Herbert-Groenemeyer/e/B000APL43M
http://www.bbc.co.uk/music/artists/456eabce-d1dd-4481-a206-36ab4f2eaeb8#more
I found a site using MusicBrainz data for its clustering - well, it
isn't using just MusicBrainz data, but it isn't using scrobble data
either, only inter-artist relationships:
http://richseam.com/artist/m/02cskm
http://richseam.com/about-us has slightly more information on what they
are doing.
Thanks for the links!

This shows exactly why the relationships wouldn't work out, using the
example of Herbert Grönemeyer (one of Germany's big ones). The artist
who is so similar that I often can't tell them apart is Westernhagen,
who is listed on AllMusic and Amazon as related (BBC shows only four
related artists...). But analysing the connections on Richseam shows
artists like John Smith (who doesn't seem to be a real artist), Charles
Aznavour (who is neither very similar nor even singing in the same
language), Little Axe (blues!), ...; then at some point "Die
Fantastischen Vier" show up, who also sing in the same language but do
hip-hop...
In the end there are actually a few who would match somewhat, like
Philipp Poisel (via the relation "has played a concert with Grönemeyer"
- which would be the only relation that fulfils my definition of
similarity). But there is no sign of Westernhagen at all.
Just because two artists recorded their songs in the same studio doesn't
make them related...
Post by Frederik “Freso” S. Olesen
When/if we get access to scrobbles, that's a third data source that can
be added to the mix, but I really do not think we need it to get started
on a similarity/recommendation engine.
Probably I just don't know where to start creating a similarity
algorithm using only the above two feature sets (and my definition of
similarity), but please prove me wrong.
Anyway, building a recommendation engine based on the mentioned features
will absolutely not be possible (or at least not better than using some
random songs from "similar" artists - however "similar" is defined), as
there are far fewer relationships on songs than on artists...

Something completely different: it seems that some audio fingerprints
are misdetected (meaning that one fingerprint has a bunch of results
with a high score, but not all of them are the correct recording). I
tested a live version, but it also found the regular version, and in one
case even a cover by a different group - I assume that either an
algorithm has wrongly assigned the song's metadata to the recording, or
a user has entered wrong artist information...

Best regards,

Andre
Tom Crocker
2015-05-12 20:53:40 UTC
On 12 May 2015 21:22, "Andre Wiethoff" <***@exactaudiocopy.de> wrote:
...
Post by Andre Wiethoff
Something completely different: It seems that some audio fingerprints
are misdetected (meaning that one fingerprint has a bunch of results
with high score, but not all of the correct recording). I tested a live
version, but it found also the regular version and one even a cover from
a different group - I assume that either an algorithm has wrongly
assigned the songs metadata to the recording or a user has entered wrong
artist information)...
Yes, this is very common, and my understanding is that this is usually
incorrectly submitted data. Various software submits AcoustIDs, so
depending on how it's set up, what the user selects, or what their
existing data is, incorrect AcoustIDs can end up attached. Although I'd
say that if there have been a lot of submissions, the right recording
tends to have many more.
Having said that...
Sometimes two (or more) recordings will share one AcoustID. These are
usually really similar mixes. Vice versa, one recording can have
multiple AcoustIDs because, for example, the speed has changed between
released versions, or one is truncated in a way we consider insufficient
to treat them as two different recordings.
Andre Wiethoff
2015-05-13 10:29:40 UTC
Hello Tom,
Post by Tom Crocker
Post by Andre Wiethoff
Something completely different: It seems that some audio fingerprints
are misdetected (meaning that one fingerprint has a bunch of results
with high score, but not all of the correct recording). I tested a live
version, but it found also the regular version and one even a cover from
a different group - I assume that either an algorithm has wrongly
assigned the songs metadata to the recording or a user has entered wrong
artist information)...
Yes, this is very common and my understanding is this is usually
incorrectly submitted data. Various software submits acoustids so
depending on how it's set up and what the user selects or what their
existing data is, incorrect acoustids can end up attached. Although
I'd say if there's been a lot of submissions the right recording tends
to have many more.
Having said which...
Sometimes two (or more) recordings will have one acoustid. These are
usually really similar mixes. Vice versa one recording can have
multiple acoustids because for example the speed has changed between
released versions or one is truncated in a way we consider to be
insufficient to treat as two different recordings.
Thank you for the detailed explanation! I thought that it might be that
way...

If I understood correctly, the AcoustIDs themselves are stored
externally on the AcoustID.org servers.
Are there any references from MusicBrainz tables (e.g. a GUID) to the
data on AcoustID.org, or do all the links go from AcoustID.org back to
MusicBrainz (so that the AcoustID.org API needs to be called in order to
display which fingerprints are assigned to a recording on the
MusicBrainz webpage)? I didn't find any reference to it in the
CreateTables.sql script...

Thanks in advance for your answer!

Best regards,

Andre
Ian McEwen
2015-05-13 14:58:30 UTC
Post by Andre Wiethoff
Hello Tom,
Post by Tom Crocker
Post by Andre Wiethoff
Something completely different: It seems that some audio fingerprints
are misdetected (meaning that one fingerprint has a bunch of results
with high score, but not all of the correct recording). I tested a live
version, but it found also the regular version and one even a cover from
a different group - I assume that either an algorithm has wrongly
assigned the songs metadata to the recording or a user has entered wrong
artist information)...
Yes, this is very common and my understanding is this is usually
incorrectly submitted data. Various software submits acoustids so
depending on how it's set up and what the user selects or what their
existing data is, incorrect acoustids can end up attached. Although
I'd say if there's been a lot of submissions the right recording tends
to have many more.
Having said which...
Sometimes two (or more) recordings will have one acoustid. These are
usually really similar mixes. Vice versa one recording can have
multiple acoustids because for example the speed has changed between
released versions or one is truncated in a way we consider to be
insufficient to treat as two different recordings.
Thank you for the detailed explanation! I thought that it might be that
way...
If I understood correctly, the AcoustIDs itself are stored on
Acoustid.org server externally.
Are there any references from Musicbrainz tables (e.g. a GUID) to the
data on Acoustid.org, or are they link all from AcoustID.org back to
Musicbrainz (so that the AcoustId.org API need to be called in order to
display which fingerprints are assigned to a recording for display on
the Musicbrainz webpage)? I didn't found any reference for it in the
CreateTables.sql script...
All AcoustID data is stored on the AcoustID side. We've considered
setting up a system to move a view of that data back to MB as well, but
primarily so we can write reports based on that information and display
it without needing to make a request to the remote web service. At
present it's all displayed by way of a request made in JavaScript to the
AcoustID API, however.
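For reference, the same lookup can be sketched outside the browser as
well. I believe the endpoint is /v2/track/list_by_mbid with an mbid
parameter, but treat both the endpoint and the response shape as
assumptions and check the AcoustID web service documentation:

    import json
    import urllib.request

    recording_mbid = "00000000-0000-0000-0000-000000000000"  # placeholder recording MBID
    url = "https://api.acoustid.org/v2/track/list_by_mbid?mbid=%s" % recording_mbid
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    # Print whatever comes back rather than assuming a particular structure.
    print(json.dumps(data, indent=2))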
Post by Andre Wiethoff
Thanks in forward for your answer!
Best regards,
Andre
Lukáš Lalinský
2015-05-18 22:14:19 UTC
On Wed, May 13, 2015 at 3:29 AM, Andre Wiethoff <
Post by Andre Wiethoff
If I understood correctly, the AcoustIDs itself are stored on
Acoustid.org server externally.
Are there any references from Musicbrainz tables (e.g. a GUID) to the
data on Acoustid.org, or are they link all from AcoustID.org back to
Musicbrainz (so that the AcoustId.org API need to be called in order to
display which fingerprints are assigned to a recording for display on
the Musicbrainz webpage)? I didn't found any reference for it in the
CreateTables.sql script...
I generate monthly data files with the MBID-AcoustID links, but
unfortunately they are currently out of date:

http://data.acoustid.org/fullexport/2015-03-01/

(It turns out that generating database dumps that are over 100 GB
compressed is a fairly non-trivial task, and I have been experimenting
with the best setup for doing this.)

Lukas
Joseph Curtin
2015-05-18 22:39:12 UTC
Having just sourced the database: as a user, it is possible to restore
the database using Ubuntu 14.04 and the postgresql-ppa.

Would you like me to document the process?

You're going to need at least one terabyte, maybe two. It's not a small
db at all.

All that being said about the current backup process, here is my
unsolicited opinion on a better one.

Set up a replication server for backups. Every nth hour, halt the
replication server, tar the data directory, and then restart the
replication server. This might cost a pretty penny when it comes to
hosting.
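Roughly along these lines - the path and the pg_ctl invocation are
placeholders for whatever the actual setup uses, so this is only a
sketch of the stop/tar/start cycle:

    import subprocess
    import time

    DATADIR = "/var/lib/postgresql/9.4/main"  # placeholder data directory
    BACKUP = "/backups/acoustid-%s.tar.gz" % time.strftime("%Y%m%d%H%M")

    # Stop the standby, take a plain file-level copy of the data directory,
    # then start it again so it resumes streaming replication and catches up.
    subprocess.check_call(["pg_ctl", "-D", DATADIR, "stop", "-m", "fast"])
    try:
        subprocess.check_call(["tar", "-czf", BACKUP, DATADIR])
    finally:
        subprocess.check_call(["pg_ctl", "-D", DATADIR, "start", "-w"])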
Post by Lukáš Lalinský
On Wed, May 13, 2015 at 3:29 AM, Andre Wiethoff <
Post by Andre Wiethoff
If I understood correctly, the AcoustIDs itself are stored on
Acoustid.org server externally.
Are there any references from Musicbrainz tables (e.g. a GUID) to the
data on Acoustid.org, or are they link all from AcoustID.org back to
Musicbrainz (so that the AcoustId.org API need to be called in order to
display which fingerprints are assigned to a recording for display on
the Musicbrainz webpage)? I didn't found any reference for it in the
CreateTables.sql script...
I generate monthly data files with the MBID-AcoustID links, but unfortunately they are currently out of date:
http://data.acoustid.org/fullexport/2015-03-01/
(It turns out that generating database dumps that over 100GB compressed is
a fairly non-trivial task and
I have been experiment with the best setup for me to do this.)
Lukas
Lukáš Lalinský
2015-05-18 23:13:18 UTC
Post by Joseph Curtin
Having just sourced the database. As a user, it is possible to restore the
database using Ubuntu 14.04 and the postgresql-ppa.
Would you like me to document the process?
Definitely. I'd also welcome any patches/ideas on how to make the process
easier. My goal so far was only to run the data on acoustid.org. Running a
mirror was not my priority, because you need fairly expensive hardware to
do it right, so not many people would actually do that anyway.
Post by Joseph Curtin
Setup a replication server for backups. Every nth hour, halt the
replication server, tar the datadir, and then restart the replication
server. This might cost a pretty penny when it comes to hosting.
I actually already have a replicated server just for the data export. I
currently do it while the replication is running, which often causes
problems due to PostgreSQL running out of transactions (the export is
one giant day-long serialized transaction). I have actually just moved
this to a completely separate server, with the intention of stopping
replication during the process, but I need to figure out how to handle
that with regard to monitoring and things like that.

Lukas
Joseph Curtin
2015-05-19 15:36:15 UTC
Post by Lukáš Lalinský
Definitely. I'd also welcome any patches/ideas on how to make the process
easier. My goal so far was only to run the data on acoustid.org. Running a
mirror was not my priority, because you need fairly expensive hardware to
do it right, so not many people would actually do that anyway.

I hope to be able to help here. I took a quick look at what it might
take to upgrade you to 9.4 and determined that it isn't trivial, but it
should be rather straightforward. When I have the time, I'll see what I
can do.
Post by Lukáš Lalinský
I actually already have a replicated server just for the data export. I
currently do it while the replication is running, which is often causing
problems, due to PostgreSQL running out of transactions (the export is one
giant day-long serialized transaction). I have actually just moved the
server to a completely separate one with the intention to stop replication
during the process, but I need to figure out how to handle that with regard
to monitoring and things like that.
What kind of monitoring questions are you looking to answer?

If you copy the data directory, the cost would be lower. All you'll be
doing is a bit-for-bit copy, versus scheduling and dumping data within
the context of Postgres. You can transplant data directories fairly
easily, as long as you do it while the server is shut down.
Post by Lukáš Lalinský
Post by Joseph Curtin
Having just sourced the database. As a user, it is possible to restore
the database using Ubuntu 14.04 and the postgresql-ppa.
Would you like me to document the process?
Definitely. I'd also welcome any patches/ideas on how to make the process
easier. My goal so far was only to run the data on acoustid.org. Running
a mirror was not my priority, because you need fairly expensive hardware to
do it right, so not many people would actually do that anyway.
Post by Joseph Curtin
Setup a replication server for backups. Every nth hour, halt the
replication server, tar the datadir, and then restart the replication
server. This might cost a pretty penny when it comes to hosting.
I actually already have a replicated server just for the data export. I
currently do it while the replication is running, which is often causing
problems, due to PostgreSQL running out of transactions (the export is one
giant day-long serialized transaction). I have actually just moved the
server to a completely separate one with the intention to stop replication
during the process, but I need to figure out how to handle that with regard
to monitoring and things like that.
Lukas
Frederik “Freso” S. Olesen
2015-05-12 15:46:08 UTC
Post by Andre Wiethoff
Please see the similarity results of the pages for the artist "Herbert Grönemeyer":
http://www.allmusic.com/artist/herbert-gr%C3%B6nemeyer-mn0000956217/related
http://www.amazon.de/Herbert-Groenemeyer/e/B000APL43M
http://www.bbc.co.uk/music/artists/456eabce-d1dd-4481-a206-36ab4f2eaeb8#more
I found a site using MusicBrainz data for its clustering - well, it
isn't using just MusicBrainz data, but it isn't using scrobble data
either, only inter-artist relationships:
http://richseam.com/artist/m/02cskm

http://richseam.com/about-us has slightly more information on what they
are doing.
--
Namasté,
Frederik “Freso” S. Olesen <http://freso.dk/>
MB: https://musicbrainz.org/user/Freso
Wiki: https://wiki.musicbrainz.org/User:Freso