Spotify has been scraped — jotbook page 4

1. We are going through the great unlearning. App developers make an app today only to realize their moat is nothing but a context-retrieved .md file, and their app as a wrapper on top is nothing but an annoyance that turns people away because they have to use “yet another app”.

2. You might have seen the news that Anna’s Archive has archived, through scraping, nearly the entirety of Spotify’s song catalogue, equivalent to 99.6% of its (ambiguous, but presumably all-time) listens.

I have followed and supported Anna’s Archive for a long time. They have a very noble mission to preserve human knowledge, and generally try to provide a catalogue of all books and research papers on the planet, for free. They have, without a doubt, the most comprehensive website in the world for this.

This is also a significant contribution toward where I think the arc of humanity should bend: removing all barriers to entry for knowledge and education, and making them accessible to all.

Going completely undetected by Spotify while scraping 300 terabytes (!) from their platform is a herculean technical task. The anonymous people who made this happen could easily be earning upwards of a quarter-million dollars (or maybe even are) at a Silicon Valley big-tech company, but they chose to dedicate their time and effort instead to moving this goalpost forward for humanity. I think their effort should be celebrated more.

I don’t think this will cause any noticeable increase in piracy at all. Spotify is one of the few success stories in history that brought a HEAVILY pirated market back into the mainstream, through an incredible user experience and extremely affordable pricing. Piracy has an inconvenience attached to it, and even in lower-income markets like Pakistan, Spotify’s regional pricing makes it the best wallet-friendly choice.

Whether they are “fair” in the sense of paying more per-stream to the biggest artists compared to the rest is a debate for another time.

Splitting hairs (a lot of numbers): Though the archive catalogues the metadata (name, artist, genre etc.) of 99.9% of all songs, which is important, the actual music archived is just 33.6% of Spotify’s library. That 33.6% represents 99.6% of what people listen to, because popular songs are of course listened to far more: the remaining two-thirds of all songs on Spotify account for only 0.4% of listens.
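
To make the numbers above easier to check, here is a small sketch of the arithmetic in Python. It assumes the quoted percentages are exact and that "listens" means a share of all streams; the variable names and the final ratio are my own illustration, not figures published by the archive.

```python
# Percentages quoted in the post (assumed exact for this sketch).
audio_coverage = 33.6    # % of Spotify's songs whose audio was archived
listens_covered = 99.6   # % of all listens those archived songs account for

# What is left over: the long tail that was not archived.
songs_not_archived = round(100 - audio_coverage, 1)    # roughly two-thirds of songs
listens_not_covered = round(100 - listens_covered, 1)  # the tiny share of listens left

# Hypothetical derived figure: how many times more listens an average
# archived song gets compared to an average unarchived one.
listens_per_archived = listens_covered / audio_coverage
listens_per_unarchived = listens_not_covered / songs_not_archived
popularity_ratio = listens_per_archived / listens_per_unarchived

print(f"Songs not archived: {songs_not_archived}%")
print(f"Listens not covered: {listens_not_covered}%")
print(f"Average archived song vs. unarchived: about {popularity_ratio:.0f}x the listens")
```

The ratio is only a rough average across each group; it says nothing about how listens are spread within the archived or unarchived songs.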

Since the arrival of generative AI and platforms like Suno, the number of albums released per year has exploded, and many of them might be wholly or partially AI-generated. We might realize many years later how important it was to have a cultural archive of all music pre-AI.

Number of albums released per year:

╢████████████████████ (11.07m)
╢███████████████
╢████████████
╢█████████
╢████████
╢█████
╢███
╢██
╢██
╢██
╢██

The most popular Pakistani song by no. of all-time listens is Jhol, coming in at 792nd on the list overall. Since Spotify is by far the biggest platform and the others loosely follow its rankings, we can pretty much be certain… Jhol is the 792nd most popular song overall, and the most popular Pakistani song, in the history of music. Congrats to Maanu and Annural Khalid?!

Is Anna’s Archive “Russian-backed”? I don’t know. But I do think that, even if this is somewhat true, it is very immature fear-mongering.

The talented engineers, and I suspect the vast majority of funders and incredibly intelligent people who came together and volunteered everything to make this happen, didn’t do it because they personally knew the leadership or funders. They were in it for the mission, and if that mission is achieved it will be an incredible step forward for humanity, whoever is at the helm.

And secondly, perhaps it’s worth asking whether Western jurisdictions are long overdue for an overhaul in how they balance intellectual property against access to knowledge and culture, and whether they should use actual studies of why piracy occurs instead of relying on Cold-War-era “communism” fear-mongering to define their policies.