The biggest problem with newsreaders, IME, has been managing large numbers of feeds. Most user time is spent handling redundant stories - e.g., if you have feeds from many major news sources, for each major event you get one or more stories on each feed, saying mostly the same things.
I haven't seen a newsreader solve that problem. Has anyone tried an LLM?
The best solution I know is grouping redundant stories together, possibly hierarchically: e.g., Sports > Olympics > Figure skating > Jones performance. (Fewer feeds require fewer levels, possibly just one.)
That roughly deduplicates the stories, and by displaying them together you can compare, keep the coverage you like, and delete the rest. Otherwise, IME most user time is spent sorting through redundant stories one at a time.
But as I said, I haven't seen a newsreader do that well. It seems like a good fit for LLMs. Or maybe there's another solution besides grouping?
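A minimal sketch of the grouping idea, assuming each story already has a sentence embedding. The toy vectors and the 0.85 similarity cutoff below are illustrative, not tuned:

```python
import numpy as np

def group_redundant(embeddings, threshold=0.85):
    """Greedy grouping: each story joins the first existing group whose
    representative it is close enough to (cosine), else starts a new group."""
    reps, groups = [], []
    for i, v in enumerate(embeddings):
        v = v / np.linalg.norm(v)
        for g, r in enumerate(reps):
            if float(v @ r) >= threshold:
                groups[g].append(i)
                break
        else:
            reps.append(v)
            groups.append([i])
    return groups

# Toy vectors standing in for sentence embeddings: stories 0 and 1 are
# near-duplicates, story 2 is unrelated.
emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
print(group_redundant(emb))  # two groups: [[0, 1], [2]]
```

A real reader would display each group as one collapsed entry, with the member stories side by side for comparison.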
My YOShInOn RSS reader uses an SBERT model for classification (will I upvote this or not?) and large-scale clustering (20 k-means clusters, then show me the top N in each cluster so I get a diversity of articles).
I found some parameters where I get almost no false positives but a lot of duplicates get missed; when I lowered the threshold to make bigger clusters, I started getting false positives fast. Duplicates aren't a big problem in my system with the 110 feeds I have and the subjects I'm interested in, but insofar as they are, there tend to be structured relationships between articles: that is, site A syndicates articles from site B, but for some reason articles from site A usually get selected and site B's don't. An article from site A often links to one or more articles, often ones I don't have a feed for, and it would be nice if the system looked at the whole constellation. Stuff like that.
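The cluster-then-sample idea described above might look roughly like this. Random vectors stand in for SBERT article embeddings, the cluster count is scaled down, and "top" is just distance to the centroid here (a real system would rank by a scoring model) — a sketch, not YOShInOn's actual code:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 200 fake 16-dim "article embeddings" instead of real SBERT vectors.
X = rng.normal(size=(200, 16))

k, top_n = 5, 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Pick the top-N articles per cluster, so the final list spans all
# clusters and the reader sees a diversity of topics.
picks = []
for c in range(k):
    idx = np.flatnonzero(km.labels_ == c)
    d = np.linalg.norm(X[idx] - km.cluster_centers_[c], axis=1)
    picks.extend(idx[np.argsort(d)[:top_n]].tolist())

print(len(picks), "articles selected across", k, "clusters")
```

The diversity comes for free: even if one story dominates the day's news, it can fill at most one cluster's quota.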
Effective clustering is the really interesting technology Google News has had for a long time.
I have been attempting this exact sort of clustering solution for a few years now (on and off as a side project). Do you have source code available, or more detailed explanations/resources on how to approach this?
Edit: I just looked around for your YOShInOn RSS reader code and couldn't find it. I did find a number of references you've made to it on various forums, etc. over the years.
You specify your interests as free form text, it ranks articles by how closely they match, and you can consume your Scour feed as an RSS feed to read it in NNW.
I've been thinking of how to tackle that problem. It would require some resources but nothing too crazy. Essentially, new articles need to be indexed in some kind of vector-search-capable DB. That allows things like similarity grouping and a few other things. This is nothing new and exactly how things like Google News work. The difference here would be keeping the per-user notion of subscribing only to things they care about.
If you do the embeddings calculation centrally, it becomes shared cost. Every new article gets analyzed only once for all users.
The rest then becomes providing a new view on your RSS feeds that leverages that. You could do a lot of the expensive stuff (vector comparisons) locally actually because most users only have hundreds/thousands of articles that they care about. So, simply download the embeddings for articles and do the comparisons/grouping locally.
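As a rough illustration of how cheap the local step is: all-pairs cosine similarity over a few thousand downloaded embeddings is a single matrix multiply. The dimensions and the 0.5 cutoff below are made up, and random vectors stand in for the embeddings a client would download:

```python
import numpy as np

# Pretend the reader downloaded precomputed embeddings for the ~2000
# articles in its feeds (384-dim, like a small SBERT model), and
# L2-normalize them so dot product == cosine similarity.
rng = np.random.default_rng(1)
E = rng.normal(size=(2000, 384))
E /= np.linalg.norm(E, axis=1, keepdims=True)

# All-pairs similarity in one matrix multiply; a 2000x2000 float64
# matrix is ~32 MB, trivial on any laptop.
S = E @ E.T

# "Related articles" for article 0: everything above the cutoff
# (index 0 itself always appears, since self-similarity is 1.0).
related = np.flatnonzero(S[0] > 0.5)
```

So only the embedding computation needs to be shared/amortized; the grouping itself can stay on the client.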
This wouldn't be super hard to do. There are lots of OSS models that you can run locally as well. But they are kind of slow. So the trick is to amortize that over many users and share the burden.
The key challenge here is the finances. The centralized embeddings juggling gets costly quickly, and you need a revenue model to finance that. That's why most of this stuff is happening behind paywalls and staying kind of niche. All the "free" stuff is essentially ad sponsored.
But with some MCP layered on top and a few other bits and bobs, you could fairly easily implement an intelligent LLM-based news agent that summarizes personalized news based on exactly your own preferences and news subscriptions. I haven't really seen anything like this done right. But we technically have all the OSS tech and models to do all of this now. It's just the compute cost that kills the use case.
If that could be decentralized bittorrent style, it wouldn't actually be that much of a burden. Given enough users, distributing say thousands of article updates per minute among tens/hundreds of thousands of readers means each of them expending maybe a couple of seconds of compute once in a while to calculate embeddings for articles that they are pulling that don't have embeddings yet. If you make that eventually consistent, it's not that big of a deal if you don't get embeddings for all the new stuff right away. And any finished embeddings could be uploaded and shared. Anything popular would quickly get embeddings. And you could make the point that publishers themselves could be providing embeddings as well for their own articles. Why not? If you only publish a handful of articles, the cost is next to nothing.
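The compute-if-missing scheme sketched above boils down to a shared cache keyed by a hash of the article. Here's a toy version where a plain dict stands in for the distributed store and a hash function stands in for the slow local embedding model (both are placeholders, not a real protocol):

```python
import hashlib

# Stand-in for a DHT / bittorrent-style shared index of embeddings.
shared_store = {}

def embed(text):
    # Placeholder for a slow local model; any deterministic function
    # of the text works for the sketch.
    return [b / 255 for b in hashlib.sha256(text.encode()).digest()[:4]]

def get_embedding(article_text):
    key = hashlib.sha256(article_text.encode()).hexdigest()
    if key not in shared_store:
        # Cache miss: this reader spends the compute, then publishes
        # the result so nobody else has to.
        shared_store[key] = embed(article_text)
    return shared_store[key]

v1 = get_embedding("some article")   # first reader computes
v2 = get_embedding("some article")   # every later reader just looks up
```

Eventual consistency falls out naturally: an article without an embedding yet is simply a cache miss for whoever pulls it first.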
If I had more spare time, I might have a go at this. Sadly, I don't.
It's very interesting and thanks for laying out the issues.
One point I'd like to make: Grouping RSS feed items by similarity is very different from LLM summaries. In the former, I get the best specific, expert human information I can economically get, which is what I'm personally after; the latter keeps the economy but eliminates the value.