Discussion:
[mb-devel] GSoC 2015: Introduction
蔡康
2015-03-21 03:41:28 UTC
Permalink
Hi,

My name is Kang Cai. I am a graduate student at Peking University, majoring in audio
information processing. Half a year ago, I took part in the “Emotion in Music”
task at MediaEval 2014 and achieved good results.



I would like to participate in GSoC 2015. After searching for a long time, I
finally found the interesting project “AcousticBrainz”. The project’s main idea
is to achieve automatic tagging of music through semi-supervised machine
learning. For me, this project has three major challenges. The first is how to
build on the existing algorithms. The second is the “big data” aspect, which is
very different from the small datasets I have used for experiments in my lab.
The last is that this is my first time applying for an online collaborative
project, which is exciting. Although I’m not yet familiar with the project’s
existing framework, I hope I can have the chance to work on it.



Best regards,



Kang Cai
Alastair Porter
2015-03-23 09:35:21 UTC
Permalink
Hi Kang,
Thanks for your email.
Do you have more results about your emotion in music task? What was your
goal, and what did your results show?

We talked a little bit about emotion in our initial blog post:
http://blog.musicbrainz.org/2014/11/21/what-do-650000-files-look-like-anyway/
And discovered that our existing results are not that great. We definitely
want to address this topic more.

For us, there are two parts to any of these training problems. The first
part is to find a dataset that is representative of our topic. As you have
pointed out, there may be a problem with using small datasets on a
collection as large as AcousticBrainz.
Do you have any ideas how we could collect a large training set?

The second part to address is the actual training method. We're currently
using SVM, with automatic feature selection based on the features present
in our low-level data. Maybe you also have some ideas here about which
training method is most effective. What did your results in your project
show?

Regards,
Alastair
Cai Kang
2015-03-24 09:01:45 UTC
Permalink
Hi Alastair,

Thanks for your attention.


*Description of “Emotion in music” task *


The task is continuous emotion characterization: the emotional dimensions
arousal and valence (VA) should be determined for a given song continuously in
time, quantized per frame (e.g., 1 s). The organizers provide a set of music
licensed under Creative Commons from the Free Music Archive, with human
annotations. Participants upload VA predictions for the test set. The goal is
to make the Pearson correlation as high as possible and the root mean square
error as low as possible.
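
For concreteness, here is a small sketch of how those two metrics could be
computed for 1 Hz VA predictions; the arrays below are synthetic placeholders,
not MediaEval data:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
truth = rng.uniform(-1, 1, size=45)            # 45 s of 1 Hz valence annotations (placeholder)
pred = truth + rng.normal(scale=0.2, size=45)  # a hypothetical prediction

r, _ = pearsonr(truth, pred)                   # Pearson correlation
rmse = np.sqrt(np.mean((truth - pred) ** 2))   # root mean square error
print("Pearson r = %.3f, RMSE = %.3f" % (r, rmse))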


*Description of Dataset*


It uses an extension of the 744-song dataset developed for the same task at
MediaEval 2013. The annotations were collected on Amazon Mechanical Turk:
individual workers provided A-V labels for 744 30-second clips, which were
extended to 45 seconds in the annotation task to give workers additional
practice. Labels were collected at 1 Hz, and workers were given detailed
instructions describing the A-V space.


*Our working note*

http://ceur-ws.org/Vol-1263/mediaeval2014_submission_16.pdf


Our approach is also to use low-level features with SVR (support vector
regression). To model the continuous emotions better, we added a CCRF
(continuous conditional random field) on top. Our results show a high Pearson
correlation and a low root mean square error. We also tested other regressors
such as NN and KNN, but none of them performed better than SVR+CCRF. Note that
CCRF cannot be applied to static (whole-track) emotion directly.
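
A minimal sketch of the “low-level features + SVR” baseline (the CCRF smoothing
stage is omitted, and the feature/label arrays are synthetic placeholders, not
our MediaEval data):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 40))          # frame-level low-level features (placeholder)
y_arousal = rng.uniform(-1, 1, size=2000)      # per-frame arousal labels (placeholder)
y_valence = rng.uniform(-1, 1, size=2000)      # per-frame valence labels (placeholder)

# One regressor per affect dimension.
arousal_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
valence_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
arousal_model.fit(X_train, y_arousal)
valence_model.fit(X_train, y_valence)

X_test = rng.normal(size=(100, 40))
va_pred = np.column_stack([arousal_model.predict(X_test),
                           valence_model.predict(X_test)])
# A CCRF (or any temporal smoother) would then be applied over va_pred in time.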


*About the emotion topic*


For the first part, I honestly don’t have an idea yet for how to collect a
large training set with a reasonable distribution, and I am curious where the
labels for the existing 650,000 tracks come from. However, I think treating
emotion as a two-dimensional space is a good way to build a single model
instead of one model per mood: “arousal” is the level of physical response, and
“valence” is the emotional “direction” of that response. The image below shows
the details.


http://doi.ieeecomputersociety.org/cms/Computer.org/dl/trans/ta/2012/02/figures/tta20120202371.gif


For the second part, I can see that the low-level features you extract are
comprehensive, but they also make the feature dimensionality very high, which
may lead to overfitting and call for a larger dataset. I don’t know whether you
have tried dimensionality reduction methods such as PCA, LDA, or NMF; these may
help. For features, a DNN may be a good way to explore better-performing
representations, and for the classifier, some types of neural networks may also
perform well. In my view, long short-term memory recurrent neural networks
(LSTM-RNNs) have advantages in music emotion classification and regression.
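
As a simple illustration of the dimensionality-reduction idea (PCA here, on a
synthetic feature matrix standing in for the high-dimensional low-level
descriptors):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 500))               # tracks x low-level features (placeholder)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95)                   # keep enough components for ~95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)                         # fewer columns than the original feature matrix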



Best regards,

Kang
Alastair Porter
2015-03-25 14:20:13 UTC
Permalink
Hi Kang,
Thanks for the update on your project.

As we explained in the blog post, the results we reported are automatically
extracted given our existing models. Some subsequent research that we've
done indicates that many of the labels don't match with other ground truth
that we've gathered (e.g., tags on last.fm that represent mood).

We do some feature selection as part of the training process, although I
don't know all of the details of that part of the system. You can read some
more about the training system here:
https://github.com/MTG/essentia/blob/master/FAQ.md#training-and-running-classifier-models-in-gaia

You mention that arousal/valence has an advantage because you are able to
create only one model, but I'm not sure that this is a strong enough
argument on its own to use this rating system instead of independent
models. One thing we're trying to do with AcousticBrainz is to put more
"human" labels to the data that we're extracting. So, while I can see some
of the value about rating songs in AV space, we still have an interest in
specific labels as well.

I agree that finding training data for such a large dataset can be
difficult. Our experience has been that training sets of only a few hundred
samples are not giving us very promising results when applying the model to
millions of unknown tracks, even if the evaluations on a small testing set
show good results. We are planning on building some more tools to
crowd-source training sets, but this is still ongoing (and it is also one of
our project ideas for SoC).

Are you interested in a specific project for AcousticBrainz for SoC? If so,
you should outline what you want to do. Two points to keep in mind:
- It's difficult for us to get additional low-level features (since we
would need to ask the community to recompute them for us), so if you wanted
to do some model generation, the easiest source of data is the features
that we already have.
- We're not very interested in small improvements in classifier accuracy
over small training/testing datasets, as we've seen that this doesn't
appear to scale very well.

A combination of large-scale data collection plus a specific improvement to
a single classifier might be a good task.

Regards,
Alastair
Cai Kang
2015-03-26 05:05:24 UTC
Permalink
Hi Alastair,

Thanks for your helpful advice.


I quite agree with your point about treating the existing low-level features as
a stable source: recomputing them would be laborious, and there is no urgent
need to do that right now.


As you said, the biggest problem is that the training set is too small;
applying a model trained on it to millions of unknown tracks basically comes
down to luck. So a tool to crowd-source training sets is urgently needed, and
I’m interested in developing it. I have experience with online manual
annotation of music emotion, and I think designing a tagging system for users
is a good choice: a user can tag a piece of music while listening to it.

Considering that AcousticBrainz only stores feature documents and not the audio
files themselves, the tagging system would have to run on the client. The
server would just receive the tag attributes and the corresponding audio’s MBID
(perhaps obtained by tagging the file with Picard on the client). After that,
the mapping would be converted into a form that our existing tools can
understand.
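
A rough sketch of what the client-side submission could look like; the
endpoint, payload fields, and token are hypothetical, and only illustrate
sending (MBID, mood tag) pairs to a server:

import json
import urllib.request

SUBMIT_URL = "https://example.org/api/v1/mood-annotations"  # hypothetical endpoint

def submit_annotation(mbid, mood, user_token):
    """Send one (recording MBID, mood label) pair and return the HTTP status."""
    payload = json.dumps({"mbid": mbid, "mood": mood}).encode("utf-8")
    req = urllib.request.Request(
        SUBMIT_URL,
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": "Token %s" % user_token},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example call; the MBID would normally come from Picard's tags in the local file.
# submit_annotation("8f3471b5-7e6a-48da-86a9-c1c07a0f47ae", "happy", "my-token")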


Of course, a user could also upload a document of attributes in an agreed
format without listening to the music. In that case, all the client needs to do
is obtain the MBIDs and send the mapping between attributes and MBIDs to the
server.


What’s more, I think it would be cool if we could analyze individual
differences in perceived music mood through the tool. The only extra step would
be to take the user’s basic information, such as gender, age, location, and
personality, into consideration. When users build the mappings through the
tool, they could add this basic information about themselves, and we could then
study the influence of individual differences on music emotion perception using
existing algorithms.


I look forward to your feedback and suggestions.


Best regards,

Kang
Alastair Porter
2015-03-26 22:41:50 UTC
Permalink
Hi Kang,
Why do you think a local client is the best way of contributing data? I'm
not sure if you mean a client that plays random tracks to people and asks
them to classify them, or something built into an existing music player.
If the first option, what ideas do you have about stopping people from
contributing bad information or becoming bored in the task?
If the second, I have concerns that people listening to music during their
daily routine would not be interested in switching back to another
application every 3 minutes to tag a song.
We already have a number of proposals for a generic system to let people
assign tags/labels to musicbrainz ids. My suggestion is that you would have
a stronger chance of being accepted if you shifted the focus of your
proposal more towards the training evaluation that you were talking about.

Personally, I don't think age/gender/location alone is enough information to
separate individual differences between labels, but I
like the possibility of clustering people's preferences over different
classification problems. Do you have any more ideas in this direction?
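
As a rough illustration of what I mean by clustering preferences (the
user-by-track rating matrix below is synthetic, and this is not an existing
AcousticBrainz component):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
n_users, n_tracks = 200, 50
# Each cell is a user's valence rating for a shared track, on a -1..1 scale (made up).
ratings = rng.uniform(-1, 1, size=(n_users, n_tracks))

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(ratings)
print(np.bincount(kmeans.labels_))  # how many users fall into each preference cluster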

Regards,
Alastair
Cai Kang
2015-03-27 06:55:35 UTC
Permalink
Hi Alastair,

Thanks for your helpful advice.


I have updated my proposal based on your suggestions. I added a detailed plan
for visualizing model evaluation statistics and for filtering out bad
contributions. I have also added more plans for analyzing the high-level
features.


For the tagging tool, I don't think a local client is the ideal way to build it
either. However, an online tagging system may run into copyright problems, and
AcousticBrainz doesn't store the audio files on its server, so a local client
is a compromise. My understanding is that the tool's main function is tagging
in order to build datasets. If we want users to stay interested in tagging,
maybe a "scoring" or "test yourself for fun" style of tagging tool could work.


Thank you; I look forward to your opinion on my proposal.


Best regards,
Kang
Alastair Porter
2015-03-27 15:07:51 UTC
Permalink
OK, I understand now that your suggestion for a local client comes from
wanting people to be able to listen to the audio as they label it.
We could propose that people just label audio that they know - that is,
they can search for the track or album or enter an MBID and choose a label.
I'm quite sure that building a local client is not the direction we want to
go with the labeling task.

We've been talking about your proposal and are really interested in working
on some of your ideas. Are you able to join us in the #musicbrainz-devel
IRC channel to keep talking before the submission deadline?

Alastair
Post by Cai Kang
Hi Alastair,
Thanks for your helpful advice.
I have updated my proposal by reference to your suggestions. I add a
detailed plan on visualizing evaluation statistics for models and filter
the bad information people contribute. More plans of analysis on high-level
features also have been made.
For the tagging tool, I don't think a local client is the best way to
realize it. But I think online tagging system may have difficulty in
copyright problem and AcousticBrainz don't store the audio files on its
server, so the local client is a compromise solution. My understanding of
the tool's main function is tagging for building dataset. If we want user
to have interested in tagging, maybe a "scoring" or "testing for fun"
tagging tool could be applied.
Thank you and look forward to having your opinion on my proposal.
Best regards,
Kang
Post by Alastair Porter
Hi Kang,
Why do you think a local client is the best way of contributing data? I'm
not sure if you mean a client that plays random tracks to people and asks
them to classify them, or something built into an existing music player.
If the first option, what ideas do you have about stopping people from
contributing bad information or becoming bored in the task?
If the second, I have concerns that people listening to music during
their daily routine would not be interested in switching back to another
application every 3 minutes to tag a song.
We already have a number of proposals for a generic system to let people
assign tags/labels to musicbrainz ids. My suggestion is that you would have
a stronger chance of being accepted if you shifted the focus of your
proposal more towards the training evaluation that you were talking about.
Personally, I don't think only age/gender/location alone is enough
information for separating individual differences between labels, but I
like the possibility of clustering people's preferences over different
classification problems. Do you have any more ideas in this direction?
Regards,
Alastair
Post by Cai Kang
Hi Alastair,
Thanks for your helpful advice.
I quite agree with your idea of taking low-level features as a stable
source, while recomputing them is laborious and there is no urgent need for
doing that right now.
As you said, the biggest problem is that the training set is too small.
Applying the trained model to millions of unknown tracks basically depends
upon luck. So a tool to crowd-source training set is in urgent need and I’m
interested in developing it. I have the experience of online manual
annotation on music emotion and I think designing a tagging system for
users is a good choice. User can tag for a piece of music while listening
to it.
Considering that AcrousticBrainz just stores documents of information
not the audio files, the tagging system should be designed to run on
client. The server just receives the tagging attributes and the
corresponding audio’s MBID (maybe tagged by Picard on client) from the
client. After that, the mapping is converted into a form that our existing
tools can understand.
Of course, user can also upload the document of attributes in a certain
form without listening to music. In that case, all that the client needs to
do is get the MBID and send the mapping between attributes and MBID to the
server.
What’s more, I think it’s cool if we could make analysis of individual
difference on music mood through the tool. The only additional thing we
should do is taking user’s basic information such as gender\age\
location\character into consideration. When users build the mappings
through the tool, they could add the basic information of themselves. Then
we can do a research on the influence of individual difference in music
emotion perception by using existing algorithms.
Look forward to your feedbacks and suggestions.
Best regards,
Kang
Post by Alastair Porter
Hi Kang,
Thanks for the update on your project.
As we explained in the blog post, the results we reported are
automatically extracted given our existing models. Some subsequent research
that we've done indicates that many of the labels don't match with other
ground truth that we've gathered (e.g., tags on last.fm that represent
mood).
We do some feature selection as part of the training process, although
I don't know all of the details of that part of the system. You can read
https://github.com/MTG/essentia/blob/master/FAQ.md#training-and-running-classifier-models-in-gaia
You mention that arousal/valence has an advantage because you are able
to create only one model, but I'm not sure that this is a strong enough
argument on its own to use this rating system instead of independent
models. One thing we're trying to do with AcousticBrainz is to put more
"human" labels to the data that we're extracting. So, while I can see some
of the value about rating songs in AV space, we still have an interest in
specific labels as well.
I agree that finding training data for such a large dataset can be
difficult. Our experience has been that training sets of only a few hundred
samples are not giving us very promising results when applying the model to
millions of unknown tracks, even if the evaluations on a small testing set
show good results. We are planning on building some more tools to
crowd-source training sets, but this still ongoing (and also one of our
projects for SoC)
Are you interested in a specific project for AcousticBrainz for Soc? If
- It's difficult for us to get additional low-level features (since we
would need to ask the community to recompute them for us), so if you wanted
to do some model generation, the easiest source of data is the features
that we already have
- We're not very interested in small improvements in classifier
accuracy over small training/testing datasets, as we've seen that this
doesn't appear to scale very well.
A combination of large-scale data collection plus a specific
improvement to a single classifier might be a good task.
Regards,
Alastair
Cai Kang
2015-03-28 01:31:34 UTC
Permalink
Hi Alastair,
Thanks for your invitation.

From my chat with ruaok on the IRC channel, I understand that AB uses a
hierarchical classifier, and that the plan for scaling up the dataset is
semi-supervised machine learning. Based on that, here is my idea.

Step 1: Call the original labelled dataset O (for "old"). Fetch a random
sample from the millions of unlabeled tracks and take the confidence of the
SVM prediction as an evaluation. If the confidence is above a high
threshold, say 0.8, add the sample to O. If it is below 0.8 but above a
lower threshold, say 0.5, add the sample to a candidate set N (for "new").
If it is below 0.5, discard the sample, or simply reduce its probability of
being selected next time. Note that this confidence (or probability) refers
to a single layer of the hierarchical classifier, not to a global
confidence.

Step 2: When N has grown to one tenth the size of the current dataset O,
run cross-validation on the additional dataset N alone, or on the entire
dataset formed by N and O together. If the performance meets the
requirement, merge N and O into a new dataset. Then repeat steps 1 and 2;
the whole process needs no additional ground truth. A rough sketch of this
loop follows below.

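To make this concrete, here is a minimal Python sketch of steps 1 and 2.
Everything in it is a placeholder of my own (the thresholds, batch size,
target score, and a plain scikit-learn SVM); it is not AB's actual
Gaia-based pipeline, just an illustration of the loop.

# Minimal sketch of the proposed self-training loop (steps 1 and 2).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

HIGH, LOW = 0.8, 0.5   # confidence thresholds from the description above
GROWTH = 0.1           # merge once |N| reaches one tenth of |O|
TARGET = 0.7           # assumed cross-validation requirement

def self_train(X_o, y_o, X_unlabeled, rounds=5, batch_size=1000):
    X_n, y_n = [], []                                   # candidate set N
    for _ in range(rounds):
        clf = SVC(probability=True).fit(X_o, y_o)       # model trained on O
        # Step 1: score a random batch of unlabeled tracks
        # (for simplicity the batch is not removed from the pool).
        idx = np.random.choice(len(X_unlabeled),
                               size=min(batch_size, len(X_unlabeled)),
                               replace=False)
        proba = clf.predict_proba(X_unlabeled[idx])
        conf = proba.max(axis=1)
        pred = clf.classes_[proba.argmax(axis=1)]
        for x, c, y in zip(X_unlabeled[idx], conf, pred):
            if c >= HIGH:                               # confident: add to O
                X_o = np.vstack([X_o, x])
                y_o = np.append(y_o, y)
            elif c >= LOW:                              # uncertain: hold in N
                X_n.append(x)
                y_n.append(y)
            # else: drop the sample (or down-weight it for future sampling)
        # Step 2: once N is big enough, validate before merging it into O.
        if len(X_n) >= GROWTH * len(X_o):
            X_all = np.vstack([X_o, np.array(X_n)])
            y_all = np.append(y_o, y_n)
            if cross_val_score(SVC(), X_all, y_all, cv=5).mean() >= TARGET:
                X_o, y_o = X_all, y_all                 # merge N and O
            X_n, y_n = [], []
    return X_o, y_o
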
We can make some improvements on this. First, in step 2, divide N into
three parts, evaluate each of them, and keep the part with the best
performance (a short sketch of this selection step follows below). Second,
the scaling-up process can be parallelized: every step can run in parallel,
and we can also generate several final datasets in parallel and choose the
best one.

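As a small illustration of the first improvement, again with plain
scikit-learn standing in for the real models (the three-way split and the
scoring are only my assumptions):

# Split the candidate set N into three parts, score each part merged with O
# by cross-validation, and keep the best-scoring part.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def best_part_of_N(X_o, y_o, X_n, y_n, parts=3):
    splits = np.array_split(np.random.permutation(len(X_n)), parts)
    scores = []
    for part in splits:
        X_all = np.vstack([X_o, np.asarray(X_n)[part]])
        y_all = np.concatenate([y_o, np.asarray(y_n)[part]])
        # n_jobs=-1 evaluates the folds in parallel (the parallelization idea)
        scores.append(cross_val_score(SVC(), X_all, y_all, cv=5,
                                      n_jobs=-1).mean())
    best = int(np.argmax(scores))
    return splits[best], scores[best]
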
Honestly, I am not even sure whether the topic above is what you meant.
I look forward to your comments and further suggestions.

Best regards,
Kang
Alastair Porter
2015-04-06 19:47:32 UTC
Permalink
Hi Kang,
Sorry about the delay in replying; I must have missed the email.

In Step 1 you're talking about an existing SVM model. I'm not sure if you
were referring to one that we already have, or building a new one.
Your idea about semi-supervised learning is good, but we're still not sure
about some of the unknowns involved in this method. We run the risk of
getting further and further away from reality without even realising it.

Our interest is more in the hierarchical classifier, where for example we
have a first level of classifiers that broadly group a song. Taking the
genre example, these high-level groups could be something like "Jazz",
"pop/rock", "instrumental", "instrumental classical", "Indian classical".
We would classify a song as one of these and, based on the result,
re-classify it using a more specific classifier. E.g. if we choose Jazz, we
could sub-classify as "Ragtime", "Swing", "Bop", "Contemporary", etc.
We're a little unsure if this will yield good results - consider that human
category groups don't always follow similar-sounding songs. It could be
that two subtypes of Jazz are very different, while one subtype is
acoustically similar to a subtype of another high-level genre.
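
Just to illustrate the idea, a two-level setup could look roughly like the
sketch below. This is only a sketch: the plain scikit-learn SVMs and the
way the labels are passed in are placeholder assumptions, not our actual
Gaia/Essentia models.

# Hypothetical two-level (hierarchical) classification sketch: a top-level
# model picks a broad group, then a per-group model refines the label.
import numpy as np
from sklearn.svm import SVC

class HierarchicalClassifier:
    def fit(self, X, broad_labels, sub_labels):
        broad_labels = np.asarray(broad_labels)
        sub_labels = np.asarray(sub_labels)
        self.top = SVC().fit(X, broad_labels)      # e.g. Jazz vs pop/rock ...
        self.sub = {}                              # one refining model per group
        for group in np.unique(broad_labels):
            mask = broad_labels == group
            self.sub[group] = SVC().fit(X[mask], sub_labels[mask])
        return self

    def predict_one(self, x):
        group = self.top.predict([x])[0]               # e.g. "Jazz"
        return group, self.sub[group].predict([x])[0]  # e.g. ("Jazz", "Swing")
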
An addition to this may be to see whether we can automatically create
clusters of similar-sounding songs (based on features), and then check
whether those clusters match any known labels that we can find.
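
As a rough example of how we might test that, we could cluster the
low-level feature vectors and measure agreement with whatever labels are
available; k-means and the adjusted Rand index below are just one possible
choice, not a decided approach.

# Cluster songs by their feature vectors and check agreement with known labels.
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def clusters_vs_labels(features, known_labels, n_clusters=10):
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    # 1.0 = clusters reproduce the labels exactly; around 0.0 = chance level
    return adjusted_rand_score(known_labels, clusters)
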
Hope this helps. Let me know if you have any more questions.
Alastair