Andre Wiethoff
2015-06-29 19:16:12 UTC
Hello everybody,
Since the other automatic crawlers I had planned no longer made sense
(some of the features to be extracted could not be handled by the
database), I decided on a different automatic crawler.
As an experiment, I searched YouTube and Soundcloud for various
well-known artists and songs and stored the retrieved URLs. In a second
step, these URLs were crawled automatically (using youtube-dl.exe and
fpcalc.exe from AcoustID).
The result is more than 100k YouTube links and a little fewer than 200k
Soundcloud links. Of course, quite a few of these links will not contain
what their title promises, but the fingerprint information should help
to assign them to the correct recordings.
A sample entry from the resulting text files looks like:
BEGIN
DURATION=364
FINGERPRINT=AQADtEuiSFEUNRG-H-PToVOc4hfuow_h41GOPA_uw_mgHR9uHTVe_IGPY8eLC19GGD-u4tphGTlz4qpxwSqP8ye2bsa7Bd-OF70w_silXJClSJTQvyipHZqRHvfhEU-O-4cPwx-qD7mE5FVC3OkOMNfw4P3BEz72HmlVFe8luOjxLOmCczEeHl-OpscZHeKPw-RRPNYALT_aHXqCV0bzIxd6HqeIvsgv4Yd1VByRR3jQMM3RaxN-fNkxBRWHXMdDXSK0hgkniJPg8wisvDi8Cz98HJM-rJXwB_kDejnxFxql7MKDHNR59LtwkyzWOYi1wwx-mBmPH6-GO0K7wz9cZ4cuHD_-4OyR_kRP4PiBmDt66MjxwTl8rXiOkE5IQz9yubiPRzB3_BfeKDr-cfjQ57iO6IcO5-jaF7-6FBf0Ct_xK0OajUfeTPgV40d_BXYt_Aj_Db3h_DjC7GMG3TJyJscpuI4WnMSXHPlj6ImFMD_6FS_qC_6RH00A_cOj4z76w91yVMcPPzg3HM9KTMnRKyeeztByo_6H_LjQnHHRJ2XxTzR-_M2Qo8cefRBP_LhDGVXCD7oIPDdyJz2aMTt6xE8afH6gqZdwLTquwTz-QJOPx2uwR8WP_yhPNA5x__iFo3_g49rxA26hHmaPHkd1uD5kE1cOG8Z_HD_8HT4OW0qOHrUT_MOPPzi1PLATHp-wHz8ciUN_lDN8HDrhHT98o4eOMzmqnOqQBt_xLIBHRD9-Q8wzdBPx5scLc8QZ5FnwqEHd4XhWHD7CB8-Pq9ADcz3yjbiTVWj0ox7-CT088gifOHihXTFcHv2h4zVyNNXRLvnxhME1STIe1MYP_NiVwePxWgnSL1CdHemXHG3wE3mUQxfCSFSPnime5Wh-VD_y5kg6NMd9OI-HPjz-dcM-PEeyXER44kQTR5WGrqVINCeH4sgX4egF6w9-FRX6w_GMnMfLHJfgP7iNN4Wn4biGZk4a5N0O_biHE_7xG1-VEc8bQhf6Gzd0HemPk8ctnAvhk_BR4-F2XMpi5EayJFpQGRfR5sWZKsel4-ThHf2FJ8KfEyfWIkykQavCBTmDJ2rwTBt-XMN5IdC-I-0lmMdznOyRQnuOmfi6I2fxaAjJ44dOS_iRMtTRWaiOJnpzfEaOMJqPPsejI1eO0Dq65pipo7pxBXmh7IjNo8eFL-OJ98WFri-SpUfssMfJJ-g5SfClD21O9Ma_CkmjHT1y4cTvIc8T1KIILe9xB1eKrjvcHaKOfqGCH_kl5MdzVIeYDaF7PE_wGz181MPHVPgTtMtxFqnyo1dwftCLD59SuMpUHNcS5fgC_cHRJ0c0FWesIddxiYfv48f5QFSDHqfEC18iF2yqoZ-IGy_4HI7P4MmFQ7-Fp4jMDB4poY-Qq2hfoflu6MePShtcRbEgPx3M7MjTFB9wD-kPbT2e4xw062iy41eQi0_xG1UPP8hfaMcF_NBp_Li5CXeCUD4a58gzfIF2WMqPX0YOf2jz40V53EeuQMjx6Jh1PCyF_2CafSiPE0-O_DjEh-h5vD_eH1xxPnh0RHqOW6hEhgj74HqKJ_kCC31QHj6uJIGZG9qzC5WZw2gXjIm0BBcqE2EeBox9-Eefoz6Pk6AsHpV5XMcvRMcl5fikQEf1OGhYHTlCdsaD_rhq9Dw-iBvRE_pS4niGWzq-XMQZlCUcQa-Rk0Kfoz-uHU3wG111bLuGlgdLHn2KPBBVIld1vMZ97NNxfC-iHx_eI7L2wpdwXFR-7Dr6HNkFvUF0ZRKuBc_RoYlzHHm6Q69xD8cPfWji5JiIp-hP-Icn6ugjFZ1keIamC_eKnNlRGx-eKfiDMNeR5PiD8_h0NJSUaEcfHb_xGu9wFB-cRzfqF48fzD1y6RPUHzl-PNJxxJOUowsPKgfOB2eOi53Qo9pBQsyt4bSCOxt--OjxH1Wz4NnRTyFe4rSE5vThH7or-EfnzXh64cd32A1O_If-Bl-KHuK1
YldQURJw_ET_wc-htUpRscfThBKRS8zxA4cf3B52bkFsFkp75NmNf0V1XJuDS9HwXUej8ESfhcEb9MePR4ryw9mkPBjCQ8ez8Pg45Eepo9lS9CmyxE2hR9nBI8wx6XOERjt68fh9JFfmISd6JUeTs8aJK-lxSYly-EiW5wg_nHiS4uaKnWEW4uATbNfxGCfxRME_XBfC5NBwIXxw_LAcHCeMLziwHzeMH_8Q7zisHxXF4I-AUOLhV0WrBj9qpMeh4SPCH3YRTsyODz-0HznhXfhyODXITXBkwWSFkGQK7RLSvuiLUyGuH2G4hUIfepB-5MZ7tIR_7Mghwz_-HLaOD_UDHaGUY0odnMep5GB-5GEHTcrCo3YT2GGD-sIPjZ4CJYjwACghDEEAEcGcMEpI5QhDzgDEBBBIAAIAQFAKCAREQoiBiAUIUAUcEAQYYgAWBikiCmFCEEeAMKQggJhBQAECEDDMKcqUE8Q4hIQQAkmjAaUMASEIUQJ4KxAhgDJDmFBIEII4yEAKAAwiUBCivEVOAYCYUI4aIIxQQJiBjBIMSAAEJQQhYAAQiAqDgAIKAEYIsIARQQBSwDIHDCREOSCUQoAig4w4xhBAAEKIAgGIAMAoAAwzGhCGkAFGGCEAEgwQhoBjAgkggRECSIGEIsIgxKgADgBDrJNCcGoIBQ44AIAQEDFGkGDLWIK0KIYwZQxyBAGgiFWKVBKEYUIiAAEBQAmhiFACQCYBAVIwQiwTxkgEnAIKSECUAAwg44gwSAgDADOIIMIMRQAYZ5SRQhiAKLBQAQEQE8YRAAQCSBGmAGHUIaIE8eAogIRRQhgPDKCKKUcYE4QY4AghAgGhBEMMGeG8IFACIqxhgiFokMKCQEAAIIYQAgAwQAFgmOSIKAIcgAIQ4pQyRkmAiCOCCGAoABAgAwwiAAzGmEGAMIQIQMACJhQwhFhFiADOEEOVJ0IAgRgxxlGBkAFKOgCMEEgBAzxkEBGCoFWKMACAAAMwZZFyqAkHAFFKgkYURUAAQIQwhCijBCCKOASUAEIIACBwQAkjCEGCaAFREoQYgK0RCAA0hKpAGKOFUIQYJKAwCgGlBFAIAOEEMJIwBqgBBBgmFJYCEEzBA4aA5DCSRBADhFRCGAcIIE4QYAwhBBGjLLuAIAKIIghIYowCCUAGFCEMAIYUIgsKwYChCAhhhAIGOKMEQw4g4ARA4BgBABVAAEpYQEgoIIBFwAgBBAIEBCEAsJgRo4UTAhAAAADACGAJMEQQABAgoiFkuAGCEqIUMoYTSQR1gBClECBOEGCUQYICZJEiCAHKCIFIUAAQEEoAgQQCyoBBBEEINBKUAEIJIACTAA
END
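For anyone who wants to read these files, here is a minimal Python
sketch of a parser for the BEGIN/END blocks shown above. It rests on
assumptions that are not confirmed here: that every field is an
upper-case KEY=VALUE line, and that a line without a KEY= prefix (like
the second half of the fingerprint in the sample) continues the previous
value. The function name parse_entries is made up for illustration.

```python
from typing import Dict, Iterable, Iterator


def parse_entries(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per BEGIN/END block of KEY=VALUE lines."""
    entry = None     # dict being filled, or None while outside a block
    last_key = None  # most recent key, used for wrapped continuation lines
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN":
            entry, last_key = {}, None
        elif line == "END":
            if entry is not None:
                yield entry
            entry, last_key = None, None
        elif entry is None or not line:
            continue  # ignore blank lines and anything outside a block
        else:
            key, sep, value = line.partition("=")
            if sep and key.isupper():
                entry[key] = value
                last_key = key
            elif last_key is not None:
                # Assumption: a line without KEY= continues the previous value.
                entry[last_key] += line
```

Typical use would be something like
`for e in parse_entries(open("some-file.txt", encoding="utf-8")): ...`,
where the file name is a placeholder; values stay strings, so DURATION
needs an int() before it is used as a number.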
For this example, the result of the AcoustID request is:
{
"status":"ok",
"results":[
{
"recordings":[
{
"duration":360,
"title":"Strange Machines",
"id":"1ba91dc9-0ada-48b9-98f5-837d08c1b786",
"artists":[
{
"id":"004e5eed-e267-46ea-b504-54526f1f377d",
"name":"The Gathering"
}
]
},
{
"duration":364,
"title":"Strange Machines",
"id":"4eb1e450-4cc7-405e-865c-97ab1e14e6c3",
"artists":[
{
"id":"004e5eed-e267-46ea-b504-54526f1f377d",
"name":"The Gathering"
}
]
}
],
"score":0.947851,
"id":"3bd95f49-b3a6-404c-ac93-59bc014941dc"
}
]
}
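As a rough illustration of how such a response could be requested and
evaluated, the Python sketch below builds a lookup URL for the AcoustID
web service (v2 lookup with meta=recordings) and pulls the recordings out
of a decoded response like the one above. The client key, the function
names and the 0.9 score threshold are assumptions chosen for the example,
not values used in the crawl.

```python
from urllib.parse import urlencode

LOOKUP_ENDPOINT = "https://api.acoustid.org/v2/lookup"


def build_lookup_url(client_key, duration, fingerprint):
    """Return a lookup URL for one DURATION/FINGERPRINT pair."""
    params = {
        "client": client_key,      # your own AcoustID application key
        "duration": int(duration),
        "fingerprint": fingerprint,
        "meta": "recordings",      # ask for recording metadata as well
    }
    return LOOKUP_ENDPOINT + "?" + urlencode(params)


def matched_recordings(response, min_score=0.9):
    """List (recording id, title, artists, duration) for good matches.

    `response` is the decoded JSON dict; `min_score` is an arbitrary
    cut-off picked for this sketch.
    """
    if response.get("status") != "ok":
        return []
    matches = []
    for result in response.get("results", []):
        if result.get("score", 0.0) < min_score:
            continue
        for rec in result.get("recordings", []):
            artists = ", ".join(a["name"] for a in rec.get("artists", []))
            matches.append(
                (rec["id"], rec.get("title"), artists, rec.get("duration"))
            )
    return matches
```

For the response above this yields two tuples, one per recording ID,
both titled "Strange Machines" by The Gathering. Results without a
"recordings" list simply contribute nothing.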
So, for this example, the correct song title (and its recordings) was
detected. Most probably, though, only about 1 in 10 results can actually
be assigned to a recording (which might improve a bit in the future as
more fingerprints are assigned to recordings). Even so, that would give
around 30k recordings with links to a video/audio site (e.g. for
preview).
Additionally, a file named "youtube-blocked-in-germany.txt" is included
in the zip archive. The URLs in it could not be crawled (or played) from
within Germany due to copyright issues (with GEMA); most often the songs
are from UMG and SME. These links could therefore be added to the result
list if they are crawled from a country in which Google has licensed the
songs.
The raw data (just the fingerprints and URLs, without assignment to
recording IDs) can be downloaded here (beware, the download is 721 MB):
http://www.exactaudiocopy.de/fingerprint-crawling.zip
The data is public domain.
As I don't want to be involved in adding the data to the MusicBrainz
database (if this is wanted at all), I don't mind what you do with the
crawled information.
Best regards,
Andre