I feel like the title is a bit misleading, unless the person who put all HP books on Kaggle as a (supposedly) CC0-licensed data set did so as a Microsoft employee.
Nevertheless pretty egregious oversight (incompetence?) and something that shouldn't have been published.
Microsoft could have used any dataset for their blog, they could have even chosen to use actual public domain novels. Instead, they opted to use copywritten works that JK hasn't released into the public domain (unless user "Shubham Maindola" is JK's alter ego).
The licensing: If I steal something and tell you its free and yours for the taking, that feels different than a Fence (knowingly) buying stolen goods. It's obviously semantics and there should have been some better judgemend from MS, but downloading a dataset (stated as public domain) from kaggle feels spiritually different from piracy (e.g.: if someone uploads a less known, copyrighted data set to kaggle/huggingface under an incorrect license, are tutorials that use this data set a 'guide to pirating' this data set? To me, that feels like a wrong use of the term)
If it comes from a site claiming it was under a licence when it was not, the misdeed is done by the person who provided the version carrying the licence.
Just because it says "CC0" does not make it CC0. If you upload a dataset you don't have the rights to, any license declaration you make is null and void, and anyone using it as if it had that license is violating copyright
Even if MS could claim that they were acting in good faith there really isn't much legal wiggle room for that. But it doesn't even come to that because I don't think anyone would buy that they really thought that the Harry Potter books were under the CC0
If you buy a pirated book on Amazon you get to keep the book and the pirate printer is the one persecuted.
Same thing applies here.
Up to 80% off all works that are in copyright terms are accidentally in the public domain. A well known example is Night of the Living Dead. It is not your job to check that the copiright on a work you use is the correct one.
Section 116 (2) A plaintiff is not entitled by virtue of this section to any damages or to any other pecuniary remedy, other than costs, if it is established that, at the time of the conversion or detention:
(a) the defendant was not aware, and had no reasonable grounds for suspecting, that copyright subsisted in the work or other subject - matter to which the action relates;
(b) where the articles converted or detained were infringing copies--the defendant believed, and had reasonable grounds for believing, that they were not infringing copies; or
(c) where an article converted or detained was a device used or intended to be used for making articles--the defendant believed, and had reasonable grounds for believing, that the articles so made or intended to be made were not or would not be, as the case may be, infringing copies.
Does this not mean the opposite of your claim? It sounds to me that if you unwittingly bought a dodgy copy of something, the law thinks the copyright owner can get you to pay for a legit copy, but not punish you for your mistake.
In the specific case of the Harry Potter works, the fame might meet the threshold of reasonable grounds for believing, but noosphr's argument that "Up to 80% off all works that are in copyright terms are accidentally in the public domain" could grant a reasonable grounds for believing it is not.
This is one of those things that causes interesting court cases because a reasonable grounds for believing X is not the same thing as not reasonable grounds for believing not X.
Reasonable grounds for suspicion probably carries more weight here than reasonable grounds for the absence of suspicion, but cases have hung on things like this before , like the presence or absence of an Oxford comma.
To clarify: Microsoft linked to a dataset on Kaggle, which is falsely labeled CC0 (Public Domain). It's the fault of the user who uploaded the dataset and misrepresented the licensing.
Multiple failures. One on writer of blog even for a moment considering that such data set would be legal. And next for MS for hiring such a person with that poor judgement. Namely publicly posting about it on company platform. Instead of choosing some other data set.
Nevertheless pretty egregious oversight (incompetence?) and something that shouldn't have been published.