Mod Archive Forums

Author Topic: Related File Analysis  (Read 4377 times)

iwalton3

Related File Analysis
« on: February 17, 2020, 07:28:13 »

Hello!

I recently processed the archives from modland, modarchive, and the keygen music library. I extracted all of the samples from these files and hashed them, which allows cross-comparison between files. This means that if you have one file you like, you can find others that share samples with it. This is great for finding remixes and similar songs, and also for spotting rippers.

I'd also like to use this information to learn more about which artists and genres the module files belong to, but unfortunately I do not see any database dumps available and I don't want to scrape this site. If you are interested in helping my work, I would greatly appreciate a TSV (preferred) or CSV dump containing the MD5 sum, numeric ID, genre, and artist of the module files. (This would also make my data more useful to others.)

Regardless, you can find my analysis of the files here: https://iwalton.com/ushare/sample-index.gz

The format of the file is TSV, with the columns: File MD5, Sample MD5, Sample Length, Loop, Sample Name

Additionally, here are files that also include the names of the files:
 - Prefers Modland: https://iwalton.com/ushare/sample-index-incl-filename.gz
 - ModArchive Only: https://iwalton.com/ushare/sample-index-modarchive.gz

If you'd like to find files similar to one you know of, you can use this shell command:

grep "md5-sum-of-file" sample-index | cut -f 2 | while read -r line; do grep "$line" sample-index | sort -u; done | cut -f 1 | sort | uniq -dc | sort -h
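
For anyone who would rather not chain grep calls, the same lookup can be sketched in Python. This is only a rough equivalent, assuming the column order described above (file MD5 first, sample MD5 second); the function names `load_index` and `related_files` are made up for illustration. Unlike the shell command, it leaves the queried file out of the results, keeps files that share only a single sample, and lists the best matches first.

```python
import csv
from collections import Counter, defaultdict


def load_index(lines):
    """Build lookup tables from TSV rows.

    Assumed column order: file MD5, sample MD5, sample length, loop, sample name.
    Only the first two columns are used here.
    """
    samples_by_file = defaultdict(set)
    files_by_sample = defaultdict(set)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or truncated rows
        file_md5, sample_md5 = row[0], row[1]
        samples_by_file[file_md5].add(sample_md5)
        files_by_sample[sample_md5].add(file_md5)
    return samples_by_file, files_by_sample


def related_files(file_md5, samples_by_file, files_by_sample):
    """Return (other_file_md5, shared_sample_count) pairs, most shared first."""
    counts = Counter()
    for sample in samples_by_file.get(file_md5, ()):
        for other in files_by_sample[sample]:
            if other != file_md5:
                counts[other] += 1
    return counts.most_common()
```

To run it against the real data, open the download with `gzip.open("sample-index.gz", "rt", errors="replace")`, pass the file object to `load_index`, and then call `related_files("md5-sum-of-file", ...)`. Building both tables once is much faster than re-scanning the index for every sample, at the cost of holding the index in memory.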

Edit: Also, here are computed ReplayGain values for all of the files too. The values were calculated by converting every mod file to a flac and running python-rgain over them. https://iwalton.com/ushare/module-replaygain.gz

Edit 2: Here are guessed sample names: https://iwalton.com/ushare/sample-names
Roughly 7.6 percent of the extracted samples have guessed names. Many of them may be incorrect. These were automatically guessed with this script: https://pastebin.com/raw/Xn5N0ycs
« Last Edit: February 18, 2020, 22:32:50 by iwalton3 »

Saga Musix

Re: Related File Analysis
« Reply #1 on: February 17, 2020, 21:42:28 »

I can look into providing a dump, but probably not before the weekend.

iwalton3

Re: Related File Analysis
« Reply #2 on: February 17, 2020, 23:37:01 »

I look forward to it. I'll provide any information I find back, as well as a file with the actual modarchive ids. It would be interesting if this could identify some common samples that certain genres used. Finding remixes of existing files has already proven interesting.

iwalton3

Re: Related File Analysis
« Reply #3 on: February 25, 2020, 23:18:57 »

I was able to write a simple genre classifier using the genre information. It is decently reliable considering the number of genres, but it is far from perfect.

The results are in two files: mod_genre_results.tsv covers all of the mod files I have, while mod_genre_results_only_modarchive.tsv is limited to modarchive.
The format is TSV, with the following fields: md5, modarchive_id, genre_id, genre_name, confidence
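
A minimal sketch for reading these files in Python, assuming the five fields appear in the order listed and that the confidence column parses as a number (its scale is not stated here). `GenreGuess`, `read_genre_results`, and the `min_confidence` cutoff are hypothetical names chosen for this example.

```python
import csv
from typing import NamedTuple


class GenreGuess(NamedTuple):
    md5: str
    modarchive_id: str  # kept as text; it may be empty for non-modarchive files
    genre_id: str
    genre_name: str
    confidence: float


def read_genre_results(lines, min_confidence=0.0):
    """Parse TSV rows and keep guesses at or above min_confidence."""
    guesses = []
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 5:
            continue  # skip rows that do not have exactly five fields
        try:
            confidence = float(row[4])
        except ValueError:
            continue  # skip a header line or a malformed confidence value
        if confidence >= min_confidence:
            guesses.append(GenreGuess(row[0], row[1], row[2], row[3], confidence))
    return guesses
```

Passing an open file object for one of the two TSV files as `lines` should work, since `csv.reader` accepts any iterable of strings.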

Download: https://iwalton.com/ushare/modarchive_genre_classifier.7z

Here is some accuracy information. The classifier typically works best for genres such as trance and chiptunes, where there is a heavy correlation between genre and instruments.
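
To illustrate why a genre/instrument correlation helps, here is a toy sample-voting classifier. It is not the classifier from the download above, whose internals are not described in this post; it is just one simple way sample hashes could be turned into a genre guess. Every name in it (`train`, `classify`, the vote weighting) is an assumption made for the example.

```python
from collections import Counter, defaultdict


def train(labelled_songs):
    """labelled_songs: iterable of (genre, set of sample MD5s) pairs.

    Returns a table mapping each sample MD5 to a Counter of the genres
    of the songs it appeared in.
    """
    genre_counts_by_sample = defaultdict(Counter)
    for genre, samples in labelled_songs:
        for sample in samples:
            genre_counts_by_sample[sample][genre] += 1
    return genre_counts_by_sample


def classify(samples, genre_counts_by_sample):
    """Return (genre, confidence), or (None, 0.0) if no sample is known."""
    votes = Counter()
    for sample in samples:
        counts = genre_counts_by_sample.get(sample)
        if not counts:
            continue  # sample never seen in a labelled song
        total = sum(counts.values())
        for genre, n in counts.items():
            votes[genre] += n / total  # each known sample casts one split vote
    if not votes:
        return None, 0.0
    genre, score = votes.most_common(1)[0]
    return genre, score / sum(votes.values())
```

In a scheme like this, a song whose samples never appear in any labelled song gets no guess at all, which is one way a gap between songs tested and songs identified can arise.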

Total Songs Tested: 2000
Total Songs Identified: 1763
Identified Songs Correct: 38.46%
Total Songs Correct: 33.90%

With broad genre categories instead of subgenres:
Identified Songs Correct: 49.80%
Total Songs Correct: 43.90%
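
The two percentages in each pair are related by the number of identified songs: "Identified Songs Correct" divides by the 1763 songs that received a guess, while "Total Songs Correct" divides by all 2000 songs tested. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def overall_accuracy(total_tested, identified, identified_correct_rate):
    """Convert accuracy among identified songs into accuracy over all tested songs.

    Returns (number of correct songs, fraction of all tested songs correct).
    """
    correct = round(identified * identified_correct_rate)
    return correct, correct / total_tested


# Subgenre figures from the post: 2000 tested, 1763 identified.
correct, rate = overall_accuracy(2000, 1763, 0.3845717526942711)
```

With the subgenre figures this gives 678 correct songs, which is 33.9% of 2000. The broad-category figure works out the same way to 878 correct songs, or 43.9%.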