Ostensibly copyright is there to increase economic incentives to make things it ...

Retric · 2025-09-30T14:54:51 1759244091

Training exclusively on 15-year-old source code would make code generation significantly less useful, since APIs change.

Economic viability and utility for AI training are closely linked. Exclude all written works from the last 25 years, including news articles etc., and your model will know nothing about Facebook etc.

It’s not as bad if you can exclude stuff from copyright and then use that, but your proposal would have obvious gaps like excluding works in progress.

ndriscoll · 2025-09-30T17:19:32 1759252772

You wouldn't need to train exclusively on 15-year-old source code. What I said would simply grant you free access to all 15-year-old source code. You can already train on public domain code and likely any FOSS code without any issue; if courts do start deciding that models inherit copyright, at most you might have to link a list of all of the codebases you trained on with license info. The nature of the thing is that any code it spits out is already in source form, so the only missing part is the notice.

I suppose we all exist in our own bubbles, but I don't know why anyone would need a model that knows about Facebook etc. In any case, it's not clear that you couldn't train on news articles? AFAIK currently the only legal gray area with training is when e.g. Facebook mass-pirated a bunch of textbooks. If you legally acquire the material, fitting a statistical model to it seems unlikely to run afoul of copyright law. Even without news articles, it would certainly learn something of the existence of Facebook: e.g. we are discussing it here, and as far as I know you're free to use the Hacker News BigQuery dump to your liking. Or in my proposed world, comments would naturally not be copyrighted since no one would bother to register them (and indeed a nominal fee could be charged to really make it pointless to do so). I suppose it is an important point that in addition to registration, we should again require notices, maybe including a registration ID.

Give a post-facto grace period of a couple weeks/months to register a thing for copyright. This would let you cover any work in progress that gets leaked by registering it immediately, causing the leak to become illegal.

Retric · 2025-09-30T18:26:54 1759256814

>> It’s not as bad if you can exclude stuff from copyright and then use that

Making a copy of a news article etc. to train with is, on the face of it, copyright infringement even before you start training. Doing that for OSS is, on the other hand, fine, but there’s not that much OSS.

I think training itself could reasonably be considered fair use on a case-by-case basis, with training a neural network to just directly reproduce a work being obviously problematic, etc. There’s plenty of ambiguity here.