Hacker News
Postgres as a Search Engine (anyblockers.com)
155 points by philippemnoel on Aug 25, 2024 | 39 comments


I would add: you should look for alternative solutions when you need to search anything other than English.


Or when you care at all about optimizing your search ranking. PostgreSQL has a few decent tools for vector search and for simple things with trigrams, and that can work well for some narrow use cases. If that's all you need, great. But it doesn't give you much control over search ranking. And if search is in any way on the critical path to revenue for your product, you probably need to invest in making sure your users find what they need. That means getting a bit systematic about things, using specialized tools, and making sure you have people on your team who know how to do this.


You can use custom ranking functions (shown in the article), add weights to columns, etc.
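As a sketch (table and column names are illustrative), the built-in primitives for this are setweight and ts_rank:

```sql
-- Weighted full-text column: title matches ('A') rank above body matches ('B')
ALTER TABLE posts ADD COLUMN search tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(body,  '')), 'B')
  ) STORED;

CREATE INDEX posts_search_idx ON posts USING gin (search);

-- Rank results against a user query
SELECT id, title,
       ts_rank(search, websearch_to_tsquery('english', 'postgres search')) AS rank
FROM posts
WHERE search @@ websearch_to_tsquery('english', 'postgres search')
ORDER BY rank DESC
LIMIT 10;
```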

What techniques are you talking about that cannot be implemented in Postgres?


The hundreds of features that things like Solr and Elasticsearch/OpenSearch have. These are complex products, but mostly this is a form of necessary complexity.

To be blunt, the intersection of people that know what they are doing on this front and people that choose to use Postgres for this is pretty narrow. It does happen and I've seen a few nice things built with Postgres. But mostly it's just people using the wrong tool for the job.


This is partially true. You're right that the feature set is necessary, but it's now possible to do this in Postgres.

ParadeDB pg_search bundles Tantivy, a Lucene-inspired library, inside Postgres to give users that feature set (BM25 ranking, tokenizers, faceted search, etc.) while keeping the benefits of Postgres (your data stays normalized, no ETL, etc.)

Disclaimer: I work on ParadeDB


I’ve been using pg_search for a tranche of emails in 2+ GB of PDFs, converted to one PDF page per row. Along with some materialized views of regex-extracted timestamps, it has given us a means of data forensics that has been very fruitful.

Orchestrating this outside of a few Python scripts and a single instance of Postgres would have taken much more work!


So happy to hear you like it. We have a lot more coming up for it. Stay tuned!


I used to make this same argument, but apart from a few things I've built on Solr, it is not trivial to do general search with it. I won't even comment on ES, because they now target analytics more than search. I think it is worth exploring Postgres and other tools, since all search cases are narrow/specific (e-commerce, graphs, domain-specific documents, etc.), especially if you need facets and filtering. Multi-language support is worth considering for a tool, but products usually want better recall in their original language rather than the same results in other languages.


I agree with you, but to be fair, that's true of virtually every technology I've encountered in my career.


But tsvector supports all sorts of different languages, at least Western ones.
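For example (the full list of shipped configurations can be seen with \dF in psql):

```sql
-- Built-in Snowball stemmers cover many (mostly Western) languages
SELECT to_tsvector('english', 'managing managers');  -- both words stem to 'manag'
SELECT to_tsvector('french',  'les chevaux mangent');
SELECT to_tsvector('german',  'die Häuser der Stadt');
```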


ParadeDB pg_search supports tokenizers in all major languages


Good find/tip!

It took me a while to find it, it looks like they support it via Snowball: https://github.com/snowballstem/snowball

I wish ParadeDB exposed multi-language search capability more prominently in the docs.


pgroonga is a good Postgres option in these cases

https://supabase.com/docs/guides/database/extensions/pgroong...


It's a bit weird that the pages claim to support all languages, yet all the examples are in English. Even the documentation for Groonga, on which it's based, doesn't talk about multiple languages.

It looks like, at best, it makes Postgres aware of more characters.



I saw that they bundle MeCab and Japanese dictionaries on Windows, but that's about it for multi-language support.


Eh, why? Have you ever heard of transliteration?


Because search in different languages needs to be aware of that language's peculiarities.

Including but not limited to things that English doesn't have or only has in vestigial forms like grammatical cases, complex word morphology, declensions etc.


For example, it took Google years to properly implement search in Russian. That allowed the Russian company Yandex to win the Russian market.


Back atcha: Have you ever heard of stemming?

That'll differ depending on the language; you can't simply carry words over to another language, because the mapping may change things and the rules may be different.

For example, in English "manage / managers" are practically the same, but translating to Spanish can give you "gestionan / gerentes", which diverge very early on.
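Postgres makes this concrete: stemming is tied to a per-language text search configuration, so the same operation produces different stems depending on the language (a sketch):

```sql
-- English: both words reduce to the same stem ('manag')
SELECT to_tsvector('english', 'manage managers');

-- Spanish: the two words produce different stems, so they won't match each other
SELECT to_tsvector('spanish', 'gestionan gerentes');
```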


Alternatively you could use "gestor", which has the same stem. "Gestionar" and "gerente" have different origins, so no wonder they look very different.

But yes, most (all?) European languages have vastly more complex morphology than English.


It may be a silly question, but isn't there a really simple-to-use full-text search solution that has all the complicated multi-language tricks baked in for all major languages? Or, well, at least the European ones.

It was a really, really hard task 20 years ago, but I'd imagine that by now there must be a drop-in grep/ag replacement for natural languages: you run it once to build an index, and it takes care of all the stemming, semantic embeddings, and other clever specialized things for you. Isn't there one?

And if no, what tools/libraries do exist in this area? To make something more sophisticated than in this post?


I would recommend you give ParadeDB pg_search a look. It has support for tokenizers in all major languages. You can think of it as a next-generation tsvector for more complex workloads.

Disclaimer: I work at ParadeDB


I wrote a post about how to do full-text search in Postgres back in 2018:

https://austingwalters.com/fast-full-text-search-in-postgres...

IMO custom indexes are the real key to better accuracy and speed. That said, if you have <100M documents, the built-in search functions are great; it really depends on your speed requirements.
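For smaller datasets, even an expression index over the built-in functions goes a long way (a sketch, with illustrative names):

```sql
-- An expression index avoids even a dedicated tsvector column
CREATE INDEX docs_fts_idx ON docs
  USING gin (to_tsvector('english', body));

-- The query must use the same expression and configuration to hit the index
SELECT id FROM docs
WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'full text search');
```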


Great article, but some benchmarks/profiling are missing.

FTS and trigram search can perform quite poorly unless the data and indexes are tuned properly.
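With pg_trgm, for instance, the similarity threshold and index choice matter a lot (illustrative schema):

```sql
-- Fuzzy matching on names via trigrams
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX users_name_trgm_idx ON users USING gin (name gin_trgm_ops);

-- Raise the threshold to cut down on low-quality matches (default is 0.3)
SET pg_trgm.similarity_threshold = 0.4;

SELECT name, similarity(name, 'jonh smith') AS sim
FROM users
WHERE name % 'jonh smith'   -- % uses the threshold and can use the index
ORDER BY sim DESC;
```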


1. Compared with column storage, the performance of vectorized search is relatively poor.

2. Postgres is not serverless, so it is not easy to separate reads and writes, and not easy to auto-scale.


> 2. Postgres is not serverless, so it is not easy to separate reads and writes, and not easy to auto-scale

By the time you're hitting the limits of vertically scaling a single do-everything Postgres instance on cloud infrastructure, you're making boatloads of money and can afford to stand up something else for search. And besides, scaling horizontally with read replicas is very doable.

Though to be fair, I wouldn't implement moderately complex search on Postgres, just because there are better tools for the job. Keeping data consistent between multiple systems is "involved", though, so there's a good argument for doing search in Postgres if your needs are simple.


You don't have to be that big for mixed workloads to cause issues for a do-everything PG instance.

Imagine a scenario where read-heavy but infrequent search queries end up pushing, say, your sessions table out of cache.

Postgres has no facilities for earmarking cache for one table vs. another, so the noisy-neighbor problem is real and hard to fix. You can throw money/RAM at it, but that's needlessly expensive if some of your workloads don't require that level of performance.


You can get machines with 512 GB of RAM and 96 cores for $1,000/month, and at that point you can throw any workload at them.


Parent comment already mentioned read-replicas.

The main problem I've seen is companies letting tables grow enormous because they never partition (by year, for instance) or archive out old, stale data.
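A sketch of that kind of yearly partitioning (illustrative schema):

```sql
-- Range partitioning by year keeps indexes smaller and old data easy to detach
CREATE TABLE events (
  id bigserial,
  created_at timestamptz NOT NULL,
  payload jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024 PARTITION OF events
  FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Archiving a year later becomes a metadata operation:
-- ALTER TABLE events DETACH PARTITION events_2024;
```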


Multi-master is not common, but it's easy to have a single master and multiple read-only replicas (with failover), though of course you're going to have to configure your application's database/ORM layer to handle multiple servers. That requires a bit of effort, but then you're set up to run analytic queries on a completely different database, or a column store, later on if you choose to.

I'm not saying you never need multi-master, but I've worked on several large projects, and one Postgres database can handle a lot of traffic. My first move is to offload analytic queries to read-only instances or pull data into a column store for "offline" processing. Just make sure you don't get stuck on some ancient ORM or application framework.
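A minimal sketch of that read-only-replica setup via streaming replication (hostnames and slot names are made up):

```sql
-- On the primary: create a physical replication slot for the standby
SELECT pg_create_physical_replication_slot('standby_1');

-- On the standby: create an empty standby.signal file, then point it
-- at the primary in postgresql.auto.conf:
--   primary_conninfo  = 'host=primary.example.com user=replicator'
--   primary_slot_name = 'standby_1'
```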

There are several Kubernetes operators that are moving towards more complex topologies, so I think a lot of innovation and progress is happening somewhat outside of core Postgres itself, building on functionality already present within.


While both points are true today, there is a lot of work happening to bring both column storage and separated compute and storage to Postgres.

The pg_duckdb project has the eventual aim of implementing a column storage engine for Postgres. There are a few steps to get there, since it needs to be tied into the Postgres page storage and replication system, so it won't be solved by the first version of pg_duckdb, but the team is incredible and I believe it will happen.

Neon and OrioleDB (acquired by Supabase) are both open source and separate storage from compute. There are a few more steps before they're truly usable self-hosted, but they will get there, and some of the work they are doing will hopefully be upstreamed.


I mean, I guess, but why not just use a Lucene-based system?


Keeping it in Postgres avoids needing to denormalize the data, which is a big pain point. ParadeDB pg_search is based on Tantivy, a Lucene-inspired library, so in some ways it lets you both benefit from a Lucene-based system and keep everything in Postgres.


Because that's another thing to set up, operate and pay for.


SQLite version:

1. Full-text search with FTS5

2. Semantic search with sqlite-vec

3. Fuzzy matching with FTS5 trigram tokenizer

4. Bonus: FTS5 bm25() function
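A minimal FTS5 sketch covering points 1, 3, and 4 (illustrative schema; the trigram tokenizer requires SQLite 3.34+):

```sql
-- Full-text table using the trigram tokenizer for fuzzy substring matching
CREATE VIRTUAL TABLE docs USING fts5(title, body, tokenize = 'trigram');
INSERT INTO docs VALUES ('Postgres as a search engine', 'tsvector, pg_trgm, ...');

-- bm25() ranks results; lower values are better matches in SQLite's convention
SELECT title FROM docs WHERE docs MATCH 'search' ORDER BY bm25(docs);
```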


FTS5 is amazing.

Is there anything that's as small and easy to use as FTS5 for indexing text files or JSON documents on the file system?


I wrote about this too[1], and since rqlite[2] puts a HTTP API in front of SQLite, you've got a SQLite-backed search engine available over the network.

[1] https://www.philipotoole.com/building-a-highly-available-sea...

[2] https://rqlite.io

Disclaimer: I'm the creator of rqlite, and it's not the only piece of software to make SQLite available over the network.


I'd like to see that! I've been fiddling with SQLite and FTS5 to drop Algolia from my application.



