What every software engineer should know about search

Max Grigorev on 2017-08-03

A production search system.

Want to build or improve a search experience? Start here.

Ask a software engineer: “How would you add search functionality to your product?” or “How do I build a search engine?” You’ll probably immediately hear back something like: “Oh, we’d just launch an ElasticSearch cluster. Search is easy these days.” But is it? Numerous current products still have suboptimal search experiences. Any true search expert will tell you that few engineers have a very deep understanding of how search engines work, knowledge that’s often needed to improve search quality.

Even though many open source software packages exist, and the research is vast, the knowledge around building solid search experiences is limited to a select few. Ironically, searching online for search-related expertise doesn’t yield any recent, thoughtful overviews.

Emoji Legend

❗ “Serious” gotcha: consequences of ignorance can be deadly
🔷 Especially notable idea or piece of technology
☁️ ️Cloud/SaaS
🍺 Open source / free software 
🦏 JavaScript
🐍 Python
☕ Java
🇨 C/C++

Why read this?

Think of this post as a collection of insights and resources that could help you to build search experiences. It can’t be a complete reference, of course, but hopefully we can improve it based on feedback (please comment or reach out!).

I’ll point at some of the most popular approaches, algorithms, techniques, and tools, based on my work on general purpose and niche search experiences of varying sizes at Google, Airbnb and several startups. ❗️Not appreciating or understanding the scope and complexity of search problems can lead to bad user experiences, wasted engineering effort, and product failure. If you’re impatient or already know a lot of this, you might find it useful to jump ahead to the tools and services sections.

Some philosophy

This is a long read. But most of what we cover has four underlying principles:

🔷 Search is an inherently messy problem:

Quality, metrics, and processes matter a lot:

Use existing technologies first:

❗️Even if you buy, know the details:

Theory: the search problem

Search is different for every product, and choices depend on many technical details of the requirements. It helps to identify the key parameters of your search problem:

  1. Size: How big is the corpus (a complete set of documents that need to be searched)? Is it thousands or billions of documents?
  2. Media: Are you searching through text, images, graphical relationships, or geospatial data?
  3. 🔷 Corpus control and quality: Are the sources for the documents under your control, or coming from a (potentially adversarial) third party? Are all the documents ready to be indexed or need to be cleaned up and selected?
  4. Indexing speed: Do you need real-time indexing, or is building indices in batch is fine?
  5. Query language: Are the queries structured, or you need to support unstructured ones?
  6. Query structure: Are your queries textual, images, sounds? Street addresses, record ids, people’s faces?
  7. Context-dependence: Do the results depend on who the user is, what is their history with the product, their geographical location, time of the day etc?
  8. Suggest support: Do you need to support incomplete queries?
  9. Latency: What are the serving latency requirements? 100 milliseconds or 100 seconds?
  10. Access control: Is it entirely public or should users only see a restricted subset of the documents?
  11. Compliance: Are there compliance or organizational limitations?
  12. Internationalization: Do you need to support documents with multilingual character sets or Unicode? (Hint: Always use UTF-8 unless you really know what you’re doing.) Do you need to support a multilingual corpus? Multilingual queries?

Thinking through these points up front can help you make significant choices designing and building individual search system components.

A production indexing pipeline.

Theory: the search pipeline

Now let’s go through a list of search sub-problems. These are usually solved by separate subsystems that form a pipeline. What that means is that a given subsystem consumes the output of previous subsystems, and produces input for the following subsystems.

This leads to an important property of the ecosystem: once you change how an upstream subsystem works, you need to evaluate the effect of the change and possibly change the behavior downstream. Here are the most important problems you need to solve:

Index selection:

given a set of documents (e.g. the entirety of the Internet, all the Twitter posts, all the pictures on Instagram), select a potentially smaller subset of documents that may be worthy for consideration as search results and only include those in the index, discarding the rest. This is done to keep your indexes compact, and is almost orthogonal to selecting the documents to show to the user. Examples of particular classes of documents that don’t make the cut may include:

Spam:

oh, all the different shapes and sizes of search spam! A giant topic in itself, worthy of a separate guide. A good web spam taxonomy overview.

Undesirable documents:

domain constraints might require filtering: porn, illegal content, etc. The techniques are similar to spam filtering, probably with extra heuristics.

Duplicates:

Or near-duplicates and redundant documents. Can be done with Locality-sensitive hashing, similarity measures, clustering techniques or even clickthrough data. A good overview of techniques.

Low-utility documents:

The definition of utility depends highly on the problem domain, so it’s hard to recommend the approaches here. Some ideas are: it might be possible to build a utility function for your documents; heuristics might work, or example an image that contains only black pixels is not a useful document; utility might be learned from user behavior.

Index construction:

For most search systems, document retrieval is performed using an inverted index — often just called the index.

Query analysis and document retrieval:

Most popular search systems allow non-structured queries. That means the system has to extract structure out of the query itself. In the case of an inverted index, you need to extract search terms using NLP techniques.

The extracted terms can be used to retrieve relevant documents. Unfortunately, most queries are not very well formulated, so it pays to do additional query expansion and rewriting, like:

Ranking:

Given a list of documents (retrieved in the previous step), their signals, and a processed query, create an optimal ordering (ranking) for those documents.

Originally, most ranking models in use were hand-tuned weighted combinations of all the document signals. Signal sets might include PageRank, clickthrough data, topicality information and others.

To further complicate things, many of those signals, such as PageRank, or ones generated by statistical language models contain parameters that greatly affect the performance of a signal. Those have to be hand-tuned too.

Lately, 🔷 learning to rank, signal-based discriminative supervised approaches are becoming more and more popular. Some popular examples of LtR are McRank and LambdaRank from Microsoft, and MatrixNet from Yandex.

A new, vector space based approach for semantic retrieval and ranking is gaining popularity lately. The idea is to learn individual low-dimensional vector document representations, then build a model which maps queries into the same vector space.

Then, retrieval is just finding several documents that are closest by some metric (e.g. Eucledian distance) to the query vector. Ranking is the distance itself. If the mapping of both the documents and queries is built well, the documents are chosen not by a fact of presence of some simple pattern (like a word), but how close the documents are to the query by meaning.

Indexing pipeline operation

Usually, each of the above pieces of the pipeline must be operated on a regular basis to keep the search index and search experience current.

❗️Operating a search pipeline can be complex and involve a lot of moving pieces. Not only is the data moving through the pipeline, but the code for each module and the formats and assumptions embedded in the data will change over time.

A pipeline can be run in “batch” or based on a regular or occasional basis (if indexing speed does not need to be real time) or in a streamed way (if real-time indexing is needed) or based on certain triggers.

Some complex search engines (like Google) have several layers of pipelines operating on different time scales — for example, a page that changes often (like cnn.com) is indexed with a higher frequency than a static page that hasn’t changed in years.

Serving systems

Ultimately, the goal of a search system is to accept queries, and use the index to return appropriately ranked results. While this subject can be incredibly complex and technical, we mention a few of the key aspects to this part of the system.

A human rater. Yeah, you should still have those.

Quality, evaluation, and improvement

So you’ve launched your indexing pipeline and search servers, and it’s all running nicely. Unfortunately the road to a solid search experience only begins with running infrastructure.

Next, you’ll need to build a set of processes around continuous search quality evaluation and improvement. In fact, this is actually most of the work and the hardest problem you’ll have to solve. 🔷 What is quality? First, you’ll need to determine (and get your boss or the product lead to agree), what quality means in your case:

Metrics: Some of these concepts can be quite hard to quantify. On the other hand, it’s incredibly useful to be able to express how well a search engine is performing in a single number, a quality metric.

Continuously computing such a metric for your (and your competitors’) system you can both track your progress and explain how well you are doing to your boss. Here are some classical ways to quantify quality, that can help you construct your magic quality metric formula:

🔷 Human evaluations: Quality metrics might seem like statistical calculations, but they can’t all be done by automated calculations. Ultimately, metrics need to represent subjective human evaluation, and this is where a “human in the loop” comes into play.

❗️Skipping human evaluation is probably the most spread reason of sub-par search experiences. Usually, at early stages the developers themselves evaluate the results manually. At later point human raters (or assessors) may get involved. Raters typically use custom tools to look at returned search results and provide feedback on the quality of the results.

Subsequently, you can use the feedback signals to guide development, help make launch decisions or even feed them back into the index selection, retrieval or ranking systems. Here is the list of some other types of human-driven evaluation, that can be done on a search system:

Evaluation datasets:

One should start thinking about the datasets used for evaluation (like “golden sets” mentioned above) early in the search experience design process. How you collect and update them? How you push them to the production eval pipeline? Is there a built-in bias? Live experiments:

After your search engine catches on and gains enough users, you might want to start conducting live search experiments on a portion of your traffic. The basic idea is to turn some optimization on for a group of people, and then compare the outcome with that of a “control” group — a similar sample of your users that did not have the experiment feature on for them. How you would measure the outcome is, once again, very product specific: it could be clicks on results, clicks on ads etc. Evaluation cycle time: How fast you improve your search quality is directly related to how fast you can complete the above cycle of measurement and improvement. It is essential from the beginning to ask yourself, “how fast can we measure and improve our performance?”

Will it take days, hours, minutes or seconds to make changes and see if they improve quality? ❗️Running evaluation should also be as easy as possible for the engineers and should not take too much hands-on time.

🔷 So… How do I PRACTICALLY build it?

This blogpost is not meant as a tutorial, but here is a brief outline of how I’d approach building a search experience right now:

  1. As was said above, if you can afford it — just buy the existing SaaS (some good ones are listed below). An existing service fits if:

2. If a hosted solution does not fit your needs or resources, you probably want to use one of the open source libraries or tools. In case of connected apps or websites, I’d choose ElasticSearch right now. For embedded experiences, there are multiple tools below.

3. You most likely want to do index selection and clean up your documents (say extract relevant text from HTML pages) before uploading them to the search index. This will decrease the index size and make getting to good results easier. If your corpus fits on a single machine, just write a script (or several) to do that. If not, I’d use Spark.

You can never have too many tools.

☁️ SaaS

☁️ 🔷Algolia — a proprietary SaaS that indexes a client’s website and provides an API to search the website’s pages. They also have an API to submit your own documents, support context dependent searches and serve results really fast. If I were building a web search experience right now and could afford it, I’d probably use Algolia first — and buy myself time to build a comparable search experience.

Tools and libraries

🍺☕🔷 Lucene is the most popular IR library. Implements query analysis, index retrieval and ranking. Either of the components can be replaced by an alternative implementation. There is also a C port — 🍺Lucy.

Datasets

A few fun or useful data sets to try building a search engine or evaluating search engine quality:

References

This concludes my humble attempt to make a somewhat-useful “map” for an aspiring search engine engineer. Did I miss something important? I’m pretty sure I did — you know, the margin is too narrow to contain this enormous topic. Let me know if you think that something should be here and is not — you can reach me at forwidur@gmail.com or at @forwidur.

P.S. — This post is part of a open, collaborative effort to build an online reference, the Open Guide to Practical AI, which we’ll release in draft form soon. See this popular guide for an example of what’s coming. If you’d like to get updates on or help with with this effort, sign up here.

Special thanks to Joshua Levy, Leo Polovets and Abhishek Das for reading drafts of this and their invaluable feedback!

Header image courtesy of Mickaël Forrett. The beautiful toolbox is called The Studley Tool Chest.