Elasticsearch

June 28, 2016 Grzegorz Kapica


Imagine a system that stores a large amount of geolocation data and has to process unrestricted, unpredictable queries and generate a response in real time. I was recently faced with such a problem when asked if I could tune up performance to enable a system to produce responses in near real time.

The challenge involved 5 billion items collected in PostgreSQL 9.4 with the PostGIS 2.2.0 extension installed and queries such as searching for items over the whole globe that would match business criteria including geolocation with responses below 400ms.

Test platform

All tests will be run on following machine:

  • Windows 10 (64 bit)
  • Processor: Intel Core i5-3470 3.20GHz (4 cores)
  • RAM: 12 GB DDR3 1333 MHz
  • Disk: Western Digital Blue WD3200AAKS 320GB 7200 RPM.

The problem

At some point in the past, the number of records in the database started increasing very quickly. As you may have expected, the consequence of this was a progressive search performance degradation. After reaching 1 million items stored in the system, search response times became unacceptable. To make matters worse, the number of records was getting bigger and bigger every day. It became clear that performance issues needed to be investigated and resolved to bring back the user experience to an acceptable level.

At the beginning, I decided to evaluate the existing solution based on the PostgreSQL database, to have a base for comparison for modifications that would be introduced later. I was exploring what the impact of search radius on the total system response time was. The result of this measurement is presented in Figure 1, below. The time needed to complete a search request is almost directly proportional to the maximum distance parameter. Response times are acceptable only if the maximum distance is small enough, which does not satisfy our application API requirements.

Figure 1. Response time as a function of the  maximum distance from a specific constant point.
Figure 1. Response time as a function of the maximum distance from a specific constant point.


In the second step, I created a new solution in which response times would be under 400 milliseconds for every search query, no matter how wide the search radius was. The new approach should also have native support for search result pagination.

Solution

I decided to examine how Elasticsearch [1] 2.2 would handle such a challenge. I configured and started Elasticsearch single node on my development workstation. After creating a mechanism for syncing Elasticsearch with the existing database, I indexed all the item records from the database into Elasticsearch as documents of the type “item”, containing the “location” field of the type “geo_point”. You can see the structure of this documents in the Snippet 1.

1
2
3
4
5
6
7
8
9
10
11
12
13
{
    “mappings”: {
        “item”: {
            “properties”: {
                “location”: {
                    “type”: “geo_point”,
                    “lat_lon”: true,
                    “precision_step”: 5
                }
            }
        }
    }
}

Snippet 1. Structure of the “item” Elasticsearch document, which contains the “location” field of the type “geo_point”.

Embedded geo distance query

The first approach I took was to use the geo distance query embedded into Elasticsearch, which seemed to me to be the simplest and performance-optimal solution for this problem at the time. To build this query I used the GeoDistanceQueryBuilder class from the Elasticsearch Java API, which you can see in Snippet 2. The SearchCommand class represents search request parameters. The following code produces a query to find items whose distance to a specific point is less than or equal to the maximumDistance value.

1
2
3
4
5
6
7
8
9
10
11
12
public class DistanceFilter implements ElasticSearchFilter {
  @Override
  public void apply(BoolQueryBuilder query, SearchCommand cmd) {
    Point point = cmd.getGeometry().getCentroid();
    double maximumDistance = cmd.getRadius();
    GeoDistanceQueryBuilder distanceQuery = geoDistanceQuery("location")
		.distance(maximumDistance, DistanceUnit.METERS)
		.point(point.getY(), point.getX())
		.geoDistance(GeoDistance.SLOPPY_ARC);
    query.filter(distanceQuery);
  }
}

Snippet 2. Fragment of the application code responsible for query building, which filters items by their distance to a specific point.

In the Snippet 3 below, I show an example search request body sent from the application backend to the Elasticsearch server.

1
2
3
4
5
6
7
8
9
10
11
12
13
{
  "query" : {
    "bool" : {
      "filter" : [ {
        "geo_distance" : {
          "location" : [ 21.680271943749972, 51.41719033400549 ],
          "distance" : "1000.0m",
          "distance_type" : "sloppy_arc"
        }
      } ]
    }
  }
}

Snippet 3. Example search request body sent to the Elasticsearch server from the application backend.

The performance evaluation results of this solution were very disappointing and surprising to me (Figure 2). When the maximum distance parameter is less or equal to 1000 km, this solution has even worse performance than the original PostgreSQL approach. On the other hand, when the maximum distance is greater than 1000 km, the search speed increases dramatically and the response time drops significantly from 5000 to 100 milliseconds. At first I thought that it was the result of an inappropriate query configuration. After a closer look into Geo Distance Query documentation, I started changing the available configuration parameters of the query to see if it brought any speed improvement. I ran performance tests for all possible values of the “distance_type” parameter, and all of them showed very a similar performance level (Figure 2).

Figure 2. Elasticsearch Geo Distance Query execution time as a function of the maximum distance parameter for different distance computation types (Arc, Sloppy Arc, and Plane).
Figure 2. Elasticsearch Geo Distance Query execution time as a function of the maximum distance parameter for different distance computation types (Arc, Sloppy Arc, and Plane).

Experimenting with the optimize_bbox parameter also did not bring any noticeable speed improvement. This parameter decides whether to use the optimization of first running a bounding box check before the distance check. Response times for different values of the attribute are shown in Figure 3.

Figure 3. Performance comparison for different values of the “optimize_bbox” parameter in the geo distance query.
Figure 3. Performance comparison for different values of the “optimize_bbox” parameter in the geo distance query.

I ultimately realized that this sudden search time change was being caused by something else. After running a search request with the profiling mode enabled, I noticed that most of the search time was spent in the GeoPointDistanceQueryImpl class as you can see below in the Snippet 4.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
{"profile": {
    "shards": [
      {
        "id": "[HB_wCxLNSnKNobRvJozEVA][item][0]",
        "searches": [
          {
            "query": [
              {
                "query_type": "BooleanQuery",
                "lucene":"+ConstantScore(GeoPointDistanceQueryImpl: …"
                "time": "6613.512224ms",
                "breakdown": {

Snippet 4. Fragment of the distance search request response with the profiling mode enabled.

I decided to check out the source code of that class for the possible cause of this strange behavior. By analyzing the code, I found out that when the radius parameter of the distance query was bigger than 1000 km, a different shiftFactor value was being picked (Snippet 5), at the exact moment for the sudden speed increase. Unfortunately, all the values used in that code were constant, leaving no room for any configuration – and modifying source code would not be option for me.

1
2
3
4
5
6
7
8
9
10
11
12
@Override
protected short computeMaxShift() {
  final short shiftFactor;
 
  if (query.radiusMeters > 1000000) {
    shiftFactor = 5;
  } else {
    shiftFactor = 4;
  }
 
  return (short)(GeoPointField.PRECISION_STEP * shiftFactor);
}

Snippet 5. Fragment of the org.apache.lucene.search.GeoPointDistanceQueryImpl class source code.

While I was looking for the solution on the Internet, I came across an issue created on the Elasticsearch’s GitHub page [2] related to similar geo distance query performance problems. One of the solutions introduced in that thread was to downgrade the version of Elasticsearch to 1.x, but it was impossible in my case because that version lacked other, important features. The second solution was to use a script query instead of geo distance query.

Script Query

As the previously created solution did not fully satisfy the established requirements and a downgrade to an older version of Elasticsearch was not an option, I decided to utilize the scripting module embedded into Elasticsearch. I wrote custom script in Groovy language (see Snippet 6) and loaded it into Elasticsearch cluster as an indexed script. The variables lat, lon and distanceArg are parameters passed to the script during the building of the query.

1
return doc[‘location’].distance(lat, lon) <= distanceArg

Snippet 6. Custom script to filter documents by the maximum distance to specific coordinates.

The process of the building script query using the Java API is illustrated in Snippet 7.

1
2
3
4
5
6
7
8
9
10
11
@Override
public void apply(BoolQueryBuilder query, SearchCommand cmd) {
	Point point = cmd.getGeometry().getCentroid();
	Map<String, Object> params = Params.newIstance()
		.add("distance", cmd.getRadius())
		.add("lat", point.getY())
		.add("lon", point.getX())
            .build();
	Script s = new Script("distancefilter", INDEXED, "groovy", params);
	query.filter(scriptQuery(s));
}

Snippet 7. Fragment of code responsible for building script query to filter items by distance.

Performance test results of this approach were satisfactory. It offered acceptable performance, almost constant response times (Figure 4) and did not require much additional work. Although this is not an optimal solution (embedded geo-distance query performed better in some cases), I decided that it would be the final approach as it meets all established requirements.

Figure 4. The response time of the distance script query as a function of the maximum distance parameter.
Figure 4. The response time of the distance script query as a function of the maximum distance parameter.

Pagination

Search requests can match a very wide set of items and returning all of them in a single response would be a very bad idea. Luckily for me, Elasticsearch has native support for search result pagination, enabled by default, although the search response contains only the first ten matches. This can easily be modified by specifying two parameters in a search request: size, which decides the number of records in a single page and from, which indicates the number of initial results that should be skipped (defaults to zero).

I have created a solution that overrides Elasticsearch’s default pagination settings, which is shown in Snippet 8. SearchRequestBuilder is a class, shipped along with the Elasticsearch Java API, providing methods for constructing search requests. The page size should be quite small due to the performance considerations. A detailed explanation regarding speed issues related with pagination can be found in the official documentation for Elasticsearch.

1
2
3
4
5
6
7
public class Pagination implements RequestModifier {
  @Override
  public void modify(SearchRequestBuilder request, SearchCommand cmd) {	
    request.setSize(cmd.getMaxResults());
    request.setFrom((cmd.getPageNo() - 1) * cmd.getMaxResults());
  }
}

Snippet 8. The code responsible for overriding the default pagination settings in a search request.

I also investigated how search response time depends on the page size parameter by running a search query which looks up items in a one thousand kilometer radius and returns about 40,000 matched documents. As you can see in Figure 5 below, the larger the page size is, the more time is required to complete the request. It is worth noticing that Elasticsearch uses a cache mechanism, so response times are lower for a repeatedly called search query with the same parameters.

Figure 5. Response time as a function of page size. Other search request parameters are constant.
Figure 5. Response time as a function of the page size. Other search request parameters are constant.

Conclusion

I was assigned to evaluate and improve the item search performance in an existing web application. Since a search request could match a vast set of items, I was forced to choose a solution which supported result pagination. I managed to establish that the main performance bottleneck was a geo-distance filter. In order to meet search time requirements, I created a new solution based on Elasticsearch 2.2.

Adding support for pagination was a trivial task, but it is worth noting that setting an upper limit for page size could require practice. Improving the geo-distance search speed was a more complicated and time-consuming task for me. I must also add that I was completely green in Elasticsearch when I was assigned to this task. Although I encountered performance issues with the new approach, I managed to fulfill the speed requirements and response times are now below 400 milliseconds as expected.

Figure 6. Geo-distance search performance comparison of different solutions.
Figure 6. Geo-distance search performance comparison of different solutions.

I’ve put a chart with performance comparison of different solutions above. The blue line represents the response times of my final solution. As you might have already noticed, it isn’t an optimal approach, especially when the search radius is small enough, but it shows the best overall performance and almost constant response times. It also may only be a temporary problem because it seems like it is Elasticsearch’s version specific issue and could be fixed in future releases.

References

  1. “Elasticsearch reference”, [Access: April 8, 2016].
  2. “Geo_distance query, performance regression (from 1.7 to 2.2)”, [Access: April 12, 2016].

Last posts