As someone who has actively participated in DDH for a while now, here are my vie...

romo3 · on Sept 10, 2017

Kiwix - most people are too conditioned to think that search has to happen online and don't even realize what is possible offline.

Entire web archives, such as the full dumps of Wikipedia and Stack Exchange (including media and search indexes), can be stored locally. The missing piece is Google-level search quality on the local machine. Brute-force substring search can process gigabytes in seconds nowadays, and with enterprise-grade server hardware things are reaching 1000GB/s. At this rate, there is no reason to think that in a couple of years local search of all known human knowledge can't happen on a local device at Google-level result quality.

For anyone interested in the search space, look into what's possible today in local offline search.
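
To get a feel for the brute-force claim, here is a minimal sketch of a linear substring scan over a local file. It is only an illustration, not how Kiwix or any real search engine works: the function names `count_matches` and `scan_rate` are made up for this example, and the rate it reports will depend heavily on the disk, the page cache, and the search term.

```python
import mmap
import os
import time


def count_matches(path, needle):
    """Count non-overlapping occurrences of `needle` (bytes) in the file at `path`."""
    if not needle:
        raise ValueError("needle must be non-empty")
    if os.path.getsize(path) == 0:
        return 0  # an empty file cannot be memory-mapped
    count = 0
    with open(path, "rb") as f:
        # Map the file read-only so the OS pages it in as the scan advances.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                count += 1
                pos = mm.find(needle, pos + len(needle))
    return count


def scan_rate(path, needle):
    """Return (match count, bytes scanned per second) for one full pass."""
    size = os.path.getsize(path)
    start = time.perf_counter()
    count = count_matches(path, needle)
    elapsed = time.perf_counter() - start
    rate = size / elapsed if elapsed > 0 else float("inf")
    return count, rate
```

Calling `scan_rate("dump.xml", b"offline")` on a local dump (the file name is a placeholder) gives a rough bytes-per-second figure for one machine. A plain scan like this only finds literal matches; ranking, which is the "Google-level" part, is a separate problem.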

tammer · on Sept 11, 2017

This is a great observation & seems to dovetail with technologies like IPFS.[1]

[1]: https://ipfs.io

amelius · on Sept 10, 2017

You might be right, but human knowledge is also expanding, of course. The question is: will it expand faster than hardware capabilities?

Anyway, I wish we'd see more search and NLP related posts here on HN. It deserves far more attention than it gets.

romo3 · on Sept 10, 2017

For the average person this rate does not matter. They don't need access to the cutting edge of quantum physics, astronomy, dance, art or JavaScript.

All you have to do is look at the speed at which new info is being added to Wikipedia and Stack Overflow, which is stabilizing, i.e. it is not growing as fast as it once did. Basic/foundational knowledge is more or less all covered. https://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia%...

And that sum total comes to 50-60 GB compressed. Think about that number. It's not big.

weaksauce · on Sept 10, 2017

The sum total of our collective intelligence is equal to an install of GTA V... Crazy.

panglott · on Sept 11, 2017

Wikipedia is not the sum of our collective knowledge. It's little more than the preface.

amelius · on Sept 10, 2017

We're talking about the "long tail" of information, which is also huge outside of science. Think popular culture.

curioussavage · on Sept 10, 2017

It would be awesome if you could download dumps of Wikipedia filtered by category so you can get the size down. There's probably a lot of information in there that is useless to me.

freeflight · on Sept 10, 2017

Kiwix does this, at least to a certain degree: http://wiki.kiwix.org/wiki/Content

TheGrassyKnoll · on Sept 11, 2017

Listen to Wikipedia http://listen.hatnote.com

QAPereo · on Sept 10, 2017

NLP is rightly ignored.

https://en.m.wikipedia.org/wiki/Neuro-linguistic_programming...

Edit: Fortunately I'm left feeling foolish, rather than horrified.

cantagi · on Sept 10, 2017

The GP probably meant https://en.wikipedia.org/wiki/Natural_language_processing

feelin_googley · on Sept 11, 2017

The average user's needs are so small.

You do not even need "Google level" for most of today's web users.

You can deliver what users need with respect to web search with much less than "Google level".

For example a simple "<title>" search. This is how Google started.

The entry point into the web should be search for domains. A "<title>" search can do that.

Most users today do not do much searching within websites via Google. They search for websites using Google.

Anyway, you are right about storage space and offline search but obviously that truth misaligns with the "cloud" business narrative and coaxing users to store all their personal data in datacenters instead of on their desk or in their pocket.

Expect much opposition to this simple truth.
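
The "<title>" search idea can be sketched in a few lines. This is a toy under stated assumptions: the pages and `.example` domains in the usage note are invented sample data, the helper names (`extract_title`, `build_index`, `search`) are hypothetical, and matching is a plain all-words-must-appear lookup with no ranking.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each title word to the set of domains whose title contains it."""
    index = defaultdict(set)
    for domain, html in pages.items():
        for word in tokenize(extract_title(html)):
            index[word].add(domain)
    return index


def search(index, query):
    """Return the domains whose title contains every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

Given `pages = {"wiki.example": "<title>Offline Wikipedia reader</title>", "qa.example": "<title>Programming questions and answers</title>"}`, building the index once and calling `search(index, "offline wikipedia")` returns the matching domain. Body text is ignored entirely, which is the point of the exercise and also its limit.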

exikyut · on Sept 11, 2017

http://web.archive.org/ now provides full-text search, mostly of website titles.

Try it out. You'll find that it's... it feels like a trip back to 1998.

detaro · on Sept 11, 2017

I'd say the average user especially profits from a search system that's somewhat clever and finds things even if they do not enter exactly the right query.

And searching for domains is only a tiny part of it, especially now that a lot of information is stuck in general sites with a lot of content (wikis, Q&A sites, social media sites) and not on special-interest sites. And for many generic searches the special-interest domains are various levels of spam/affiliate marketing.

hedora · on Sept 10, 2017

PCIe 3 x16 devices have a 16GB/s theoretical max, so 1000GB/s is still out of reach for single machine I/O (though it's not as though search needs anywhere near these bandwidths anyway).

colechristensen · on Sept 10, 2017

The Intel i9-7900X has 44 PCIe 3.0 lanes, and Wikipedia tells me each lane has a throughput of 984.6 MB/s, so that's ~43 GB/s in aggregate; fast compression might raise the effective rate by a small integer multiple.

https://www.intel.com/content/www/us/en/products/processors/...
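
The lane arithmetic can be checked quickly. The sketch below uses only numbers quoted in this thread (984.6 MB/s per PCIe 3.0 lane, 16 and 44 lanes, a 60 GB dump) and assumes decimal units (1 GB = 1000 MB) and that every lane is devoted to storage I/O, so it is an upper bound rather than a measurement.

```python
PER_LANE_MB_S = 984.6  # PCIe 3.0 per-lane throughput, as quoted in the thread


def aggregate_gb_s(lanes, per_lane_mb_s=PER_LANE_MB_S):
    """Theoretical aggregate bandwidth in GB/s for a given lane count."""
    return lanes * per_lane_mb_s / 1000


def scan_seconds(size_gb, bandwidth_gb_s):
    """Time to stream `size_gb` once at `bandwidth_gb_s`, ignoring CPU cost."""
    return size_gb / bandwidth_gb_s


if __name__ == "__main__":
    for lanes in (16, 44):
        bw = aggregate_gb_s(lanes)
        print(f"{lanes} lanes: {bw:.1f} GB/s, 60 GB scan in {scan_seconds(60, bw):.2f} s")
```

That works out to roughly 15.8 GB/s for an x16 device and 43.3 GB/s for 44 lanes: far short of 1000GB/s, but still enough to stream a 60 GB dump in a few seconds on paper.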

brodock · on Sept 10, 2017

AMD Threadripper has 64 lanes in all available models: https://en.wikipedia.org/wiki/Zen_(microarchitecture)

scalyr2 · on Sept 10, 2017

https://blog.scalyr.com/2014/05/searching-20-gbsec-systems-e...

joshuamorton · on Sept 10, 2017

That blog seems to imply you're using a distributed architecture, i.e. not a single machine.

carussell · on Sept 10, 2017

I've been using Google and Wolfram Alpha for these things over the years, but it has always irked me that I'm sending this info to a third party, to run through services whose code I have no way to read or improve, and that these things are only available to me if I'm online. I was really happy when I found out the DuckDuckGo Instant Answers modules' source code is open.

It's been on my list of things that I will almost definitely never take the time to actually work on, but what I wish I had is (A) a browser extension or GNOME extension that incorporates an offline version of all the DuckDuckHack modules, and (B) the same thing in an open source mobile app. (This kind of thing could just as easily live in a command line app, though, and I'd be super happy if a project maintainer incorporated them into something like GNU Units.) I looked into it, especially for (B), but I realized that the DuckDuckHack code depends on Perl.

Nib · on Sept 10, 2017

Well, about offline availability: a large number of instant answers (spices and fatheads, that is) use external APIs or indexed databases from websites, so they can't work offline.

DDG does have official (and unofficial) browser extensions and apps for iOS/Android.

carussell · on Sept 10, 2017

> Well, about offline availability, a large number of instant answers [...] can't work offline

Sure, but there are a large number of instant answers that can and do work offline because they're simple, static tables, or are self-contained—existing only to apply transformations on the input (e.g., cheatsheets, natural language unit conversions, and calculations).

> DDG does have official (and unofficial) browser extensions and apps for iOS/Android

A browser extension that just sends the query the same as it would if you hit their homepage is in the "what's the point?" category, just like mobile sites that nag you to install their app when all it does is show you the same content that is (or could be) on the mobile site itself. The "is a browser extension" is not the interesting part. "Doesn't send data to a third party" and "can operate without being connected to the network" are.
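
To make the "self-contained" category concrete, here is a minimal sketch of an offline answer for natural-language length conversions. It is not DuckDuckHack code (which is Perl); the table, the query grammar, and the `answer` function are all assumptions made up for this example. Nothing in it touches the network.

```python
import re

# Static table: how many metres one of each unit is.
TO_METRES = {
    "m": 1.0,
    "km": 1000.0,
    "cm": 0.01,
    "mi": 1609.344,
    "ft": 0.3048,
    "in": 0.0254,
}

# Spelled-out names mapped onto the keys above.
ALIASES = {
    "meter": "m", "meters": "m", "metre": "m", "metres": "m",
    "kilometer": "km", "kilometers": "km", "kilometre": "km", "kilometres": "km",
    "centimeter": "cm", "centimeters": "cm",
    "mile": "mi", "miles": "mi",
    "foot": "ft", "feet": "ft",
    "inch": "in", "inches": "in",
}

# Accepts queries shaped like "5 km in miles" or "12 inches to ft".
QUERY = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s*([a-z]+)\s+(?:in|to)\s+([a-z]+)\s*$",
    re.IGNORECASE,
)


def normalize(unit):
    """Return the canonical unit key, or None if the unit is unknown."""
    unit = unit.lower()
    unit = ALIASES.get(unit, unit)
    return unit if unit in TO_METRES else None


def answer(query):
    """Return the converted value as a float, or None if the query is not understood."""
    match = QUERY.match(query)
    if match is None:
        return None
    value = float(match.group(1))
    src = normalize(match.group(2))
    dst = normalize(match.group(3))
    if src is None or dst is None:
        return None
    return value * TO_METRES[src] / TO_METRES[dst]
```

`answer("5 km in miles")` returns a float close to 3.107, and anything the pattern does not recognize returns `None`, so a caller could fall through to other offline answers or to a network search.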

javiramos · on Sept 10, 2017

Why can't we have an intermediary search service that grabs search results from Google and posts them on a search website anonymously?

Zhyl · on Sept 10, 2017

Startpage [1] is what you're looking for.

[1] https://startpage.com

LizMcIntyre · on Sept 11, 2017

Right. StartPage.com delivers Google search results in privacy. Plus, it offers a free proxy with every search result so you can visit websites through StartPage anonymously, too.

hedora · on Sept 10, 2017

In DuckDuckGo, !g more or less does this, in that it disables search bubbling, but I think Google can see your client IP when the results are served to your browser.

LizMcIntyre · on Sept 11, 2017

Banging into Google using !G is like searching Google directly. Banging from DDG doesn't confer any privacy protections. A lot of people don't know this.

weaksauce · on Sept 10, 2017

Startpage does just that. Search for something on DDG and use !sp to search there.

FabHK · on Sept 11, 2017

Let me save you a lot of time for the future:

!s is enough to redirect to Startpage. :-)

dalf · on Sept 10, 2017

searx proxies user requests to different search engines.

https://github.com/asciimoo/searx

There are different instances: https://github.com/asciimoo/searx/wiki/Searx-instances