
Ostensibly copyright is there to increase economic incentives to make the things it protects, and like you said, we can massively cut it down without affecting much there. So focusing on economic viability, set it to something like 15 years for code and 20-30 for everything else. Require registration for everything and source escrow for code and digital art to be granted copyright. That would already give a wealth of code to train on, even without the people who would be fine freely giving it away. There's also government code as a relatively large public domain source for recent material.

Like I said, science has mostly been stolen, and has no business being copyrighted at all. The output of publicly funded research should immediately be public domain.

Anyway, this is beside the point that model creation is wealth creation, and so by definition not rent-seeking. Lobbying for a government-granted monopoly (e.g. copyright) is rent-seeking.



Exclusively training on 15-year-old source code would make code generation significantly less useful as APIs change.

Economic viability and utility for AI training are closely linked. Exclude all written works, including news articles etc., from the last 25 years and your model will know nothing about Facebook etc.

It’s not as bad if you can exclude stuff from copyright and then use that, but your proposal would have obvious gaps like excluding works in progress.


You wouldn't need to exclusively train on 15-year-old source code. What I said would simply grant you free access to all 15-year-old source code, but you can already train on public domain code and likely any FOSS code without any issue. Or, if courts do start deciding that models inherit copyright, at the most you might have to link a list of all of the codebases you trained on with license info. The nature of the thing is that any code it spits out is already in source form, so the only missing part is the notice.

I suppose we all exist in our own bubbles, but I don't know why anyone would need a model that knows about Facebook etc. In any case, it's not clear that you couldn't train on news articles. AFAIK currently the only legal gray area with training is when e.g. Facebook mass-pirated a bunch of textbooks. If you legally acquire the material, fitting a statistical model to it seems unlikely to run afoul of copyright law. Even without news articles, it would certainly learn something of the existence of Facebook, e.g. we are discussing it here, and as far as I know you're free to use the Hacker News BigQuery dump to your liking. Or in my proposed world, comments would naturally not be copyrighted since no one would bother to register them (and indeed a nominal fee could be charged to really make it pointless to do so). I suppose it is an important point that in addition to registration, we should again require notices, maybe including a registration ID.

Give a post-facto grace period of a couple weeks/months to register a thing for copyright. This would let you cover any work in progress that gets leaked by registering it immediately, causing the leak to become illegal.


>> It’s not as bad if you can exclude stuff from copyright and then use that

Making a copy of a news article etc. to train with is, on the face of it, copyright infringement even before you start training. Doing that for OSS is, on the other hand, fine, but there’s not that much OSS.

I think training itself could reasonably be considered fair use on a case by case basis. Training a neural network to just directly reproduce a work would be obviously problematic, for example. There’s plenty of ambiguity here.



