Reindexing documents

Adam Hooper edited this page Jan 13, 2015 · 11 revisions

When Overview slices and dices your documents, it stores some parts of them in different places. The authoritative data store is Postgres. We store text in ElasticSearch for a speed boost; that is derived data.

This page explains how to rebuild the ElasticSearch data using the data in Postgres.

Why reindex?

You may wish to reindex:

  • If we suggest a new document mapping
  • If you want to reconfigure shards and replicas
  • If you had an unexpected failure and you aren't certain your ElasticSearch data is correct
  • If you want to perform an upgrade and this seems like an easy option

Why not reindex?

Reindexing can take a few hours, and it will slow down Overview noticeably.

Reindexing does not change the results Overview returns (unless the original data was wrong).

How to reindex

1. Learn the concepts

You need some ElasticSearch concepts:

  • A cluster is a group of ElasticSearch servers with a name.
  • An index is a place where we store documents.
  • A mapping describes how those documents are stored and indexed.
  • An alias is a name we use to refer to an index.
  • ElasticSearch runs an HTTP server on a port in the range 9200-9299 by default and a "transport" server on a port in the range 9300-9400. (When it starts up, it picks an available port from each range.)

Overview follows the ElasticSearch best practice of addressing indexes through aliases. It uses an index, documents_v1, with a mapping. It writes new documents to documents, an alias which points to documents_v1. When you create a document set with ID 1234, Overview creates an alias documents_1234 which also points to documents_v1.

Upgrading involves creating a new index -- say, documents_v2 -- and pointing all the aliases to it as we fill it with documents from Postgres. Overview will automatically forget about documents_v1 and start using documents_v2 exclusively.
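
To illustrate what "pointing an alias" at a new index means, here is a minimal sketch that builds the request body for ElasticSearch's _aliases endpoint. The reindexer moves aliases for you, and this page does not say whether it issues exactly this request. The function name alias_move_body is hypothetical.

```shell
# Print the JSON body for moving one alias from an old index to a new one.
# Sending it to POST /_aliases applies both actions in a single request.
# Arguments: alias name, old index, new index.
alias_move_body() {
  printf '{"actions":[{"remove":{"index":"%s","alias":"%s"}},{"add":{"index":"%s","alias":"%s"}}]}\n' \
    "$2" "$1" "$3" "$1"
}

# Hypothetical manual usage (normally unnecessary):
#   alias_move_body documents documents_v1 documents_v2 |
#     curl -XPOST 'http://esserver:9200/_aliases' -d @-
```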

For the purposes of this example, we'll use these settings:

  • ElasticSearch cluster name: SearchIndex (in development, it would be Dev SearchIndex)
  • Old index name: documents_v1
  • New index name: documents_v2
  • Database URL: postgres://overview:overview@dbserver:9010/overview
  • ElasticSearch HTTP server: http://esserver:9200
  • ElasticSearch transport: esserver:9300
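
If you plan to type these values more than once, it may help to keep them in shell variables. This is only a convenience sketch: the variable names and the index_url helper are made up for illustration, and Overview does not read them.

```shell
# Example settings from this page, held in shell variables.
# The names are illustrative; nothing in Overview depends on them.
ES_CLUSTER="SearchIndex"
OLD_INDEX="documents_v1"
NEW_INDEX="documents_v2"
DATABASE_URL="postgres://overview:overview@dbserver:9010/overview"
ES_HTTP="http://esserver:9200"
ES_TRANSPORT="esserver:9300"

# Print the HTTP URL of an index, given its name.
index_url() {
  printf '%s/%s\n' "$ES_HTTP" "$1"
}

# Hypothetical usage:
#   curl -XGET "$(index_url "$NEW_INDEX")/_aliases"
```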

2. Create the new index and mapping

We'll use curl to do this from the command line.

curl -XPUT 'http://esserver:9200/documents_v2' -d @- <<EOT
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "document": {
      "_id": { "path": "id" },
      "properties": {
        "id":              { "type": "long", "store": "yes" },
        "document_set_id": { "type": "long" },
        "text":            { "type": "string" },
        "supplied_id":     { "type": "string" },
        "title":           { "type": "string" }
      }
    }
  }
}
EOT

Choose the number_of_shards and number_of_replicas values most appropriate for your cluster.

Unless you know what you're doing, copy/paste the mapping from common/src/main/resources/documents-mapping.json.

You should see a response like this:

{"ok":true,"acknowledged":true}
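
If you script this step, you may want to stop when the response is not an acknowledgement. A minimal sketch, assuming the response arrives on standard input. The function name is_acknowledged is hypothetical, and the check is a plain text match rather than a JSON parser.

```shell
# Succeed only if the response on stdin contains "acknowledged":true.
# This is a text match, so it tolerates whitespace around the colon
# but does not validate that the input is well-formed JSON.
is_acknowledged() {
  grep -q '"acknowledged"[[:space:]]*:[[:space:]]*true'
}

# Hypothetical usage, where mapping.json holds the request body shown above:
#   curl -s -XPUT 'http://esserver:9200/documents_v2' -d @mapping.json |
#     is_acknowledged || echo "index creation was not acknowledged" >&2
```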

3. Run the reindexer (within checked-out Overview source code)

  1. Compile it: ./sbt upgradeReindexDocuments:stage
  2. Run it:
upgrade/reindex-documents/target/universal/stage/bin/reindex-documents \
  --database-url "postgres://overview:overview@dbserver:9010/overview" \
  --elasticsearch-url "esserver:9300" \
  --elasticsearch-cluster "SearchIndex" \
  --index-name "documents_v2"

This will take a long time. If you cancel it by mistake, run it again to resume.
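
One way to watch progress from another terminal is to ask ElasticSearch how many documents the new index holds, using its _count endpoint. This is a sketch under the assumption that the count grows as the reindexer fills the index; the helper name extract_count is hypothetical.

```shell
# Print the numeric "count" value from a _count response on stdin.
# Prints nothing if no count field is present.
extract_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical usage while the reindexer runs:
#   curl -s -XGET 'http://esserver:9200/documents_v2/_count' | extract_count
```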

4. Check the aliases have moved

  1. Upload a new document set and test that you can search it.
  2. curl -XGET 'http://esserver:9200/documents_v2/_aliases' should output a lot of aliases. That's because it's the new main index. One of those aliases should be for the document set you just created.
  3. curl -XGET 'http://esserver:9200/documents_v1/_aliases' should output {"documents_v1":{"aliases":{}}}. That proves that Overview has forgotten about it.
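
The last check can be scripted before you delete anything. A minimal sketch, assuming the _aliases response arrives on standard input in the shape shown above; the function name has_no_aliases is hypothetical.

```shell
# Succeed only if the _aliases response on stdin shows an empty alias map
# for the given index. Whitespace in the response is ignored.
# Argument: index name.
has_no_aliases() {
  index="$1"
  response=$(tr -d '[:space:]')
  [ "$response" = "{\"$index\":{\"aliases\":{}}}" ]
}

# Hypothetical usage:
#   curl -s -XGET 'http://esserver:9200/documents_v1/_aliases' |
#     has_no_aliases documents_v1 && echo "documents_v1 is no longer in use"
```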

5. Delete the old index

When you're ready: curl -XDELETE 'http://esserver:9200/documents_v1'

Remember, you're only deleting derived data. If you delete the wrong index by mistake, just run these steps again to rebuild it.
