The biggest problem with newsreaders, IME, has been managing large numbers of feeds. Most user time is spent handling redundant stories - e.g., if you have feeds from many major news sources, for each major event you get one or more stories on each feed, saying mostly the same things.
I haven't seen a newsreader solve that problem. Has anyone tried an LLM?
The best solution I know is grouping redundant stories together, possibly hierarchically: e.g., Sports > Olympics > Figure skating > Jones performance. (Fewer feeds require fewer levels, possibly just one.)
That roughly deduplicates the stories, and by displaying them together you can compare, keep the coverage you like, and delete the rest. Otherwise, IME most user time is spent sorting through redundant stories one at a time.
But as I said, I haven't seen a newsreader do that well. It seems like a good fit for LLMs. Or maybe there's another solution besides grouping?
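A minimal sketch of the grouping idea, assuming each story already has a sentence embedding. The toy vectors and the 0.85 similarity cutoff below are illustrative, not tuned:

```python
import numpy as np

def group_redundant(embeddings, threshold=0.85):
    """Greedy grouping: each story joins the first existing group whose
    representative it is close enough to (cosine), else starts a new group."""
    reps, groups = [], []
    for i, v in enumerate(embeddings):
        v = v / np.linalg.norm(v)
        for g, r in enumerate(reps):
            if float(v @ r) >= threshold:
                groups[g].append(i)
                break
        else:
            reps.append(v)
            groups.append([i])
    return groups

# Toy vectors standing in for sentence embeddings: stories 0 and 1 are
# near-duplicates, story 2 is unrelated.
emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
print(group_redundant(emb))  # two groups: [[0, 1], [2]]
```

A real reader would display each group as one collapsed entry, with the member stories side by side for comparison.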
My YOShInOn RSS reader uses an SBERT model for classification (will I upvote this or not?) and large-scale clustering (20 k-means clusters, then show me the top N in each cluster so I get a diversity of articles).
I found some parameters where I get almost no false positives but a lot of duplicates get missed; when I lowered the threshold to make bigger clusters, I started getting false positives fast. Duplicates aren't a big problem in my system with the 110 feeds I have and the subjects I'm interested in, but insofar as they are, there tend to be structured relationships between articles: that is, site A syndicates articles from site B, but for some reason articles from site A usually get selected and site B's don't. An article from site A often links to one or more articles, often ones I don't have a feed for, and it would be nice if the system looked at the whole constellation. Stuff like that.
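The cluster-then-sample idea described above might look roughly like this. Random vectors stand in for SBERT article embeddings, the cluster count is scaled down, and "top" is just distance to the centroid here (a real system would rank by a scoring model) — a sketch, not YOShInOn's actual code:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 200 fake 16-dim "article embeddings" instead of real SBERT vectors.
X = rng.normal(size=(200, 16))

k, top_n = 5, 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Pick the top-N articles per cluster, so the final list spans all
# clusters and the reader sees a diversity of topics.
picks = []
for c in range(k):
    idx = np.flatnonzero(km.labels_ == c)
    d = np.linalg.norm(X[idx] - km.cluster_centers_[c], axis=1)
    picks.extend(idx[np.argsort(d)[:top_n]].tolist())

print(len(picks), "articles selected across", k, "clusters")
```

The diversity comes for free: even if one story dominates the day's news, it can fill at most one cluster's quota.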
Effective clustering is the really interesting technology Google News has had for a long time.
I have been attempting this exact sort of clustering solution for a few years now (on and off as a side project). Do you have source code available, or more detailed explanations/resources on how to approach this?
Edit: I just looked around for your YOShInOn RSS reader code and couldn't find it. I did find a number of references you've made to it on various forums, etc. over the years.
You specify your interests as free form text, it ranks articles by how closely they match, and you can consume your Scour feed as an RSS feed to read it in NNW.
I've been thinking of how to tackle that problem. It would require some resources but nothing too crazy. Essentially, new articles need to be indexed in some kind of vector-search-capable DB. That allows things like similarity grouping and a few other things. This is nothing new and exactly how things like Google News work. The difference here would be keeping the per-user notion of subscribing only to things they care about.
If you do the embeddings calculation centrally, it becomes shared cost. Every new article gets analyzed only once for all users.
The rest then becomes providing a new view on your RSS feeds that leverages that. You could do a lot of the expensive stuff (vector comparisons) locally actually because most users only have hundreds/thousands of articles that they care about. So, simply download the embeddings for articles and do the comparisons/grouping locally.
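As a rough illustration of how cheap the local step is: all-pairs cosine similarity over a few thousand downloaded embeddings is a single matrix multiply. The dimensions and the 0.5 cutoff below are made up, and random vectors stand in for the embeddings a client would download:

```python
import numpy as np

# Pretend the reader downloaded precomputed embeddings for the ~2000
# articles in its feeds (384-dim, like a small SBERT model), and
# L2-normalize them so dot product == cosine similarity.
rng = np.random.default_rng(1)
E = rng.normal(size=(2000, 384))
E /= np.linalg.norm(E, axis=1, keepdims=True)

# All-pairs similarity in one matrix multiply; a 2000x2000 float64
# matrix is ~32 MB, trivial on any laptop.
S = E @ E.T

# "Related articles" for article 0: everything above the cutoff
# (index 0 itself always appears, since self-similarity is 1.0).
related = np.flatnonzero(S[0] > 0.5)
```

So only the embedding computation needs to be shared/amortized; the grouping itself can stay on the client.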
This wouldn't be super hard to do. There are lots of OSS models that you can run locally as well. But they are kind of slow. So the trick is to amortize that over many users and share the burden.
The key challenge here is the finances. The centralized embeddings juggling gets costly quickly, and you need a revenue model to finance that. That's why most of this stuff is happening behind paywalls and staying kind of niche. All the "free" stuff is essentially ad sponsored.
But with some MCP layered on top and a few other bits and bobs, you could fairly easily implement an intelligent LLM-based news agent that summarizes personalized news based on exactly your own preferences and news subscriptions. I haven't really seen anything like this done right. But we technically have all the OSS tech and models to do all of this now. It's just the compute cost that kills the use case.
If that could be decentralized bittorrent style, it wouldn't actually be that much of a burden. Given enough users, distributing say thousands of article updates per minute among tens/hundreds of thousands of readers means each of them expending maybe a couple of seconds of compute once in a while to calculate embeddings for articles that they are pulling that don't have embeddings yet. If you make that eventually consistent, it's not that big of a deal if you don't get embeddings for all the new stuff right away. And any finished embeddings could be uploaded and shared. Anything popular would quickly get embeddings. And you could make the point that publishers themselves could be providing embeddings as well for their own articles. Why not? If you only publish a handful of articles, the cost is next to nothing.
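The compute-if-missing scheme sketched above boils down to a shared cache keyed by a hash of the article. Here's a toy version where a plain dict stands in for the distributed store and a hash function stands in for the slow local embedding model (both are placeholders, not a real protocol):

```python
import hashlib

# Stand-in for a DHT / bittorrent-style shared index of embeddings.
shared_store = {}

def embed(text):
    # Placeholder for a slow local model; any deterministic function
    # of the text works for the sketch.
    return [b / 255 for b in hashlib.sha256(text.encode()).digest()[:4]]

def get_embedding(article_text):
    key = hashlib.sha256(article_text.encode()).hexdigest()
    if key not in shared_store:
        # Cache miss: this reader spends the compute, then publishes
        # the result so nobody else has to.
        shared_store[key] = embed(article_text)
    return shared_store[key]

v1 = get_embedding("some article")   # first reader computes
v2 = get_embedding("some article")   # every later reader just looks up
```

Eventual consistency falls out naturally: an article without an embedding yet is simply a cache miss for whoever pulls it first.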
If I had more spare time, I might have a go at this. Sadly, I don't.
It's very interesting and thanks for laying out the issues.
One point I'd like to make: Grouping RSS feed items by similarity is very different from LLM summaries. In the former, I get the best specific, expert human information I can economically get, which is what I'm personally after; the latter keeps the economy but eliminates the value.