2020-03-21

Hibernate Search - Aggregation DSL

Sometimes, you don’t just need to list query hits directly: you also need to group and aggregate the hits.

For example, almost any e-commerce website has some form of "faceting", which is a simple kind of aggregation. On the "book search" page of an online bookshop, beside the list of matching books, you will find "facets", i.e. counts of matching documents in various categories. These categories can be taken directly from the indexed data, e.g. the genre of the book (science-fiction, crime fiction, …), or derived from it, e.g. a price range ("less than $5", "less than $10", …).

Aggregations allow just that (and, depending on the backend, much more): they allow the query to return "aggregated" hits.
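
To make the idea of a facet concrete, here is a minimal plain-Java sketch of what a "count by genre" facet computes over a list of hits. It does not use Hibernate Search at all: the Genre enum, the Book class and the sample titles are made up for illustration, and a real backend computes these counts inside the index rather than in application code.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class FacetCountSketch {

    // Hypothetical stand-ins for indexed data; not part of Hibernate Search.
    enum Genre { SCIENCE_FICTION, CRIME_FICTION }

    static final class Book {
        final String title;
        final Genre genre;

        Book(String title, Genre genre) {
            this.title = title;
            this.genre = genre;
        }
    }

    // Groups the hits by genre and counts the hits in each group.
    static Map<Genre, Long> countByGenre(List<Book> hits) {
        return hits.stream()
                .collect( Collectors.groupingBy( book -> book.genre, TreeMap::new, Collectors.counting() ) );
    }

    public static void main(String[] args) {
        List<Book> hits = List.of(
                new Book( "Robot title 1", Genre.SCIENCE_FICTION ),
                new Book( "Robot title 2", Genre.SCIENCE_FICTION ),
                new Book( "Robot title 3", Genre.CRIME_FICTION ) );
        // Prints {SCIENCE_FICTION=2, CRIME_FICTION=1}
        System.out.println( countByGenre( hits ) );
    }
}
```

The resulting map has the same shape as the Map<Genre, Long> returned by the terms aggregation in the examples of this post.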

Aggregations can be configured when building the search query:

Defining an aggregation in a search query

SearchSession searchSession = Search.session( entityManager );

AggregationKey<Map<Genre, Long>> countsByGenreKey = AggregationKey.of( "countsByGenre" ); // <1>

SearchResult<Book> result = searchSession.search( Book.class ) // <2>
        .where( f -> f.match().field( "title" ) // <3>
                .matching( "robot" ) )
        .aggregation( countsByGenreKey, f -> f.terms() // <4>
                .field( "genre", Genre.class ) )
        .fetch( 20 ); // <5>

Map<Genre, Long> countsByGenre = result.getAggregation( countsByGenreKey ); // <6>

<1> Define a key that will uniquely identify the aggregation. Make sure to give it the correct type (see <6>).
<2> Start building the query as usual.
<3> Define a predicate: the aggregation will only take into account documents matching this predicate.
<4> Request an aggregation on the genre field, with a separate count for each genre: science-fiction, crime fiction, … If the field does not exist or cannot be aggregated, an exception will be thrown.
<5> Fetch the results.
<6> Retrieve the aggregation from the results as a Map, with the genre as key and the hit count as value of type Long.

Defining an aggregation in a search query — object-based syntax

SearchSession searchSession = Search.session( entityManager );

SearchScope<Book> scope = searchSession.scope( Book.class );

AggregationKey<Map<Genre, Long>> countsByGenreKey = AggregationKey.of( "countsByGenre" );

SearchResult<Book> result = searchSession.search( scope )
        .where( scope.predicate().match().field( "title" )
                .matching( "robot" )
                .toPredicate() )
        .aggregation( countsByGenreKey, scope.aggregation().terms()
                .field( "genre", Genre.class )
                .toAggregation() )
        .fetch( 20 );

Map<Genre, Long> countsByGenre = result.getAggregation( countsByGenreKey );

terms: group by the value of a field
Counting hits grouped by the value of a field


AggregationKey<Map<Genre, Long>> countsByGenreKey = AggregationKey.of( "countsByGenre" );
SearchResult<Book> result = searchSession.search( Book.class )
        .where( f -> f.matchAll() )
        .aggregation( countsByGenreKey, f -> f.terms()
                .field( "genre", Genre.class ) )
        .fetch( 20 );
Map<Genre, Long> countsByGenre = result.getAggregation( countsByGenreKey );


Counting hits grouped by the value of a field, without converting field values
AggregationKey<Map<String, Long>> countsByGenreKey = AggregationKey.of( "countsByGenre" );
SearchResult<Book> result = searchSession.search( Book.class )
        .where( f -> f.matchAll() )
        .aggregation( countsByGenreKey, f -> f.terms()
                .field( "genre", String.class, ValueConvert.NO ) )
        .fetch( 20 );
Map<String, Long> countsByGenre = result.getAggregation( countsByGenreKey );

Setting the maximum number of returned entries in a terms aggregation
AggregationKey<Map<Genre, Long>> countsByGenreKey = AggregationKey.of( "countsByGenre" );
SearchResult<Book> result = searchSession.search( Book.class )
        .where( f -> f.matchAll() )
        .aggregation( countsByGenreKey, f -> f.terms()
                .field( "genre", Genre.class )
                .maxTermCount( 1 ) )
        .fetch( 20 );
Map<Genre, Long> countsByGenre = result.getAggregation( countsByGenreKey );

Including values from unmatched documents in a terms aggregation
AggregationKey<Map<Genre, Long>> countsByGenreKey = AggregationKey.of( "countsByGenre" );
SearchResult<Book> result = searchSession.search( Book.class )
        .where( f -> f.matchAll() )
        .aggregation( countsByGenreKey, f -> f.terms()
                .field( "genre", Genre.class )
                .minDocumentCount( 0 ) )
        .fetch( 20 );
Map<Genre, Long> countsByGenre = result.getAggregation( countsByGenreKey );

Excluding the rarest terms from a terms aggregation
AggregationKey<Map<Genre, Long>> countsByGenreKey = AggregationKey.of( "countsByGenre" );
SearchResult<Book> result = searchSession.search( Book.class )
        .where( f -> f.matchAll() )
        .aggregation( countsByGenreKey, f -> f.terms()
                .field( "genre", Genre.class )
                .minDocumentCount( 2 ) )
        .fetch( 20 );
Map<Genre, Long> countsByGenre = result.getAggregation( countsByGenreKey );

With the Lucene backend, due to limitations of the current implementation, using any order other than the default one (by descending count) may lead to incorrect results. See HSEARCH-3666 for more information.

Ordering entries by ascending value in a terms aggregation
AggregationKey<Map<Genre, Long>> countsByGenreKey = AggregationKey.of( "countsByGenre" );
SearchResult<Book> result = searchSession.search( Book.class )
        .where( f -> f.matchAll() )
        .aggregation( countsByGenreKey, f -> f.terms()
                .field( "genre", Genre.class )
                .orderByTermAscending() )
        .fetch( 20 );
Map<Genre, Long> countsByGenre = result.getAggregation( countsByGenreKey );

Ordering entries by descending value in a terms aggregation
AggregationKey<Map<Genre, Long>> countsByGenreKey = AggregationKey.of( "countsByGenre" );
SearchResult<Book> result = searchSession.search( Book.class )
        .where( f -> f.matchAll() )
        .aggregation( countsByGenreKey, f -> f.terms()
                .field( "genre", Genre.class )
                .orderByTermDescending() )
        .fetch( 20 );
Map<Genre, Long> countsByGenre = result.getAggregation( countsByGenreKey );

Ordering entries by ascending count in a terms aggregation
AggregationKey<Map<Genre, Long>> countsByGenreKey = AggregationKey.of( "countsByGenre" );
SearchResult<Book> result = searchSession.search( Book.class )
        .where( f -> f.matchAll() )
        .aggregation( countsByGenreKey, f -> f.terms()
                .field( "genre", Genre.class )
                .orderByCountAscending() )
        .fetch( 20 );
Map<Genre, Long> countsByGenre = result.getAggregation( countsByGenreKey );
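
As a rough mental model of how minDocumentCount, the default order (by descending count) and maxTermCount combine, the following plain-Java sketch applies the same three steps to an already computed map of counts. This is only an assumption about the observable result, not a description of how either backend implements it; the class and method names are hypothetical, and the tie-break on the term is an arbitrary choice made here to keep the output deterministic.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class TermsOptionsSketch {

    // Keeps entries with at least minDocumentCount hits, orders them by
    // descending count (ties broken by term), then keeps the first maxTermCount.
    static Map<String, Long> select(Map<String, Long> counts, long minDocumentCount, int maxTermCount) {
        return counts.entrySet().stream()
                .filter( e -> e.getValue() >= minDocumentCount )
                .sorted( Map.Entry.<String, Long>comparingByValue( Comparator.reverseOrder() )
                        .thenComparing( Map.Entry.comparingByKey() ) )
                .limit( maxTermCount )
                .collect( Collectors.toMap( Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new ) );
    }

    public static void main(String[] args) {
        Map<String, Long> counts = Map.of( "crime", 1L, "science-fiction", 3L, "fantasy", 0L );
        // minDocumentCount 0 keeps genres without any matching document.
        System.out.println( select( counts, 0, 10 ) ); // {science-fiction=3, crime=1, fantasy=0}
        // minDocumentCount 2 drops the rarest genres.
        System.out.println( select( counts, 2, 10 ) ); // {science-fiction=3}
        // maxTermCount 1 keeps only the first entry.
        System.out.println( select( counts, 0, 1 ) ); // {science-fiction=3}
    }
}
```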

range: group by ranges of values for a field

Range aggregations are not available on String-based fields.

Counting hits grouped by range of values for a field
AggregationKey<Map<Range<Double>, Long>> countsByPriceKey = AggregationKey.of( "countsByPrice" );
SearchResult<Book> result = searchSession.search( Book.class )
        .where( f -> f.matchAll() )
        .aggregation( countsByPriceKey, f -> f.range()
                .field( "price", Double.class ) // <1>
                .range( 0.0, 10.0 ) // <2>
                .range( 10.0, 20.0 )
                .range( 20.0, null ) // <3>
        )
        .fetch( 20 );
Map<Range<Double>, Long> countsByPrice = result.getAggregation( countsByPriceKey );

<1> Define the path and type of the field whose values should be considered.
<2> Define the ranges to group hits into. The range can be passed directly as the lower bound (included) and upper bound (excluded). Other syntaxes exist to define different bound inclusion (see other examples below).
<3> null means "to infinity".
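
The bound semantics described above (lower bound included, upper bound excluded, null meaning unbounded) can be illustrated with a small plain-Java sketch. It does not use Hibernate Search; the class name, the string labels used as map keys and the sample prices are made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RangeCountSketch {

    // Counts prices per range. Each range is a {lower, upper} pair where the
    // lower bound is included, the upper bound is excluded, and a null bound
    // means "unbounded" on that side.
    static Map<String, Long> countByRange(List<Double> prices, Double[][] ranges) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for ( Double[] range : ranges ) {
            Double lower = range[0];
            Double upper = range[1];
            long count = prices.stream()
                    .filter( p -> ( lower == null || p >= lower ) && ( upper == null || p < upper ) )
                    .count();
            counts.put( "[" + lower + ", " + upper + ")", count );
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Double> prices = List.of( 4.99, 10.0, 15.5, 20.0, 42.0 );
        Double[][] ranges = { { 0.0, 10.0 }, { 10.0, 20.0 }, { 20.0, null } };
        // Prints {[0.0, 10.0)=1, [10.0, 20.0)=2, [20.0, null)=2}
        System.out.println( countByRange( prices, ranges ) );
    }
}
```

Note how a price of exactly 10.0 lands in the second range and a price of exactly 20.0 in the third, because upper bounds are excluded.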

Counting hits grouped by range of values for a field — passing Range objects
AggregationKey<Map<Range<Double>, Long>> countsByPriceKey = AggregationKey.of( "countsByPrice" );
SearchResult<Book> result = searchSession.search( Book.class )
        .where( f -> f.matchAll() )
        .aggregation( countsByPriceKey, f -> f.range()
                .field( "price", Double.class )
                .range( Range.canonical( 0.0, 10.0 ) )
                .range( Range.between( 10.0, RangeBoundInclusion.INCLUDED,
                        20.0, RangeBoundInclusion.EXCLUDED ) )
                .range( Range.atLeast( 20.0 ) )
        )
        .fetch( 20 );
Map<Range<Double>, Long> countsByPrice = result.getAggregation( countsByPriceKey );

With Range.canonical(Object, Object), the lower bound is included and the upper bound is excluded.
Range.between(Object, RangeBoundInclusion, Object, RangeBoundInclusion) is more verbose, but allows setting the bound inclusion explicitly.
Range also offers multiple static methods to create ranges for a variety of use cases ("at least", "greater than", "at most", …).

With the Elasticsearch backend, due to a limitation of Elasticsearch itself, all ranges must have their lower bound included (or null) and their upper bound excluded (or null). Otherwise, an exception will be thrown.

If you need to exclude the lower bound, or to include the upper bound, replace that bound with the immediate next value instead. For example with integers, .range( 0, 100 ) means "0 (included) to 100 (excluded)". Call .range( 0, 101 ) to mean "0 (included) to 100 (included)", or .range( 1, 100 ) to mean "0 (excluded) to 100 (excluded)".
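
The "immediate next value" workaround can be sketched for integers as follows. This helper is hypothetical and not part of Hibernate Search; it only shows the arithmetic, and it uses Math.addExact so that a bound of Integer.MAX_VALUE fails loudly instead of overflowing.

```java
final class CanonicalBoundsSketch {

    // Converts integer bounds with arbitrary inclusion into the canonical form
    // "lower included, upper excluded" by moving to the immediate next value.
    // Returns { canonicalLower, canonicalUpper }.
    static int[] toCanonical(int lower, boolean lowerIncluded, int upper, boolean upperIncluded) {
        int canonicalLower = lowerIncluded ? lower : Math.addExact( lower, 1 );
        int canonicalUpper = upperIncluded ? Math.addExact( upper, 1 ) : upper;
        return new int[] { canonicalLower, canonicalUpper };
    }

    public static void main(String[] args) {
        // "0 (included) to 100 (included)" becomes .range( 0, 101 )
        int[] closed = toCanonical( 0, true, 100, true );
        System.out.println( closed[0] + ", " + closed[1] ); // 0, 101
        // "0 (excluded) to 100 (excluded)" becomes .range( 1, 100 )
        int[] open = toCanonical( 0, false, 100, false );
        System.out.println( open[0] + ", " + open[1] ); // 1, 100
    }
}
```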

Counting hits grouped by range of values for a field, without converting field values
AggregationKey<Map<Range<Instant>, Long>> countsByReleaseDateKey = AggregationKey.of( "countsByReleaseDate" );
SearchResult<Book> result = searchSession.search( Book.class )
        .where( f -> f.matchAll() )
        .aggregation( countsByReleaseDateKey, f -> f.range()
                // Assuming "releaseDate" is of type "java.util.Date" or "java.sql.Date"
                .field( "releaseDate", Instant.class, ValueConvert.NO )
                // Before 1970
                .range( null,
                        LocalDate.of( 1970, 1, 1 )
                                .atStartOfDay().toInstant( ZoneOffset.UTC ) )
                // From 1970 (included) to 2000 (excluded)
                .range( LocalDate.of( 1970, 1, 1 )
                                .atStartOfDay().toInstant( ZoneOffset.UTC ),
                        LocalDate.of( 2000, 1, 1 )
                                .atStartOfDay().toInstant( ZoneOffset.UTC ) )
                // From 2000 onwards
                .range( LocalDate.of( 2000, 1, 1 )
                                .atStartOfDay().toInstant( ZoneOffset.UTC ),
                        null )
        )
        .fetch( 20 );
Map<Range<Instant>, Long> countsByReleaseDate = result.getAggregation( countsByReleaseDateKey );
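
The repeated LocalDate-to-Instant expression in the example above could be factored into a small helper to keep the range bounds readable. The sketch below is plain java.time code; the class and method names are hypothetical and not part of Hibernate Search.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

final class InstantBoundsSketch {

    // Returns the instant at midnight UTC on January 1st of the given year.
    static Instant startOfYearUtc(int year) {
        return LocalDate.of( year, 1, 1 ).atStartOfDay().toInstant( ZoneOffset.UTC );
    }

    public static void main(String[] args) {
        System.out.println( startOfYearUtc( 1970 ) ); // 1970-01-01T00:00:00Z
        System.out.println( startOfYearUtc( 2000 ) ); // 2000-01-01T00:00:00Z
    }
}
```

With such a helper, a bound could be written as startOfYearUtc( 1970 ) instead of repeating the full conversion for every range.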

Elasticsearch: fromJson

Defining a native Elasticsearch JSON aggregation as a JsonObject
JsonObject jsonObject =
        /* ... */;
AggregationKey<JsonObject> countsByPriceHistogramKey = AggregationKey.of( "countsByPriceHistogram" );
SearchResult<Book> result = searchSession.search( Book.class )
        .extension( ElasticsearchExtension.get() )
        .where( f -> f.matchAll() )
        .aggregation( countsByPriceHistogramKey, f -> f.fromJson( jsonObject ) )
        .fetch( 20 );
JsonObject countsByPriceHistogram = result.getAggregation( countsByPriceHistogramKey );

The aggregation result is a JsonObject.


Defining a native Elasticsearch JSON aggregation as a JSON-formatted string
AggregationKey<JsonObject> countsByPriceHistogramKey = AggregationKey.of( "countsByPriceHistogram" );
SearchResult<Book> result = searchSession.search( Book.class )
        .extension( ElasticsearchExtension.get() )
        .where( f -> f.matchAll() )
        .aggregation( countsByPriceHistogramKey, f -> f.fromJson( "{"
                        + "\"histogram\": {"
                                + "\"field\": \"price\","
                                + "\"interval\": 10"
                        + "}"
                + "}" ) )
        .fetch( 20 );
JsonObject countsByPriceHistogram = result.getAggregation( countsByPriceHistogramKey );

The aggregation result is a JsonObject.
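
Since hand-concatenated JSON is easy to get wrong, the string passed to fromJson could be produced by a small helper. The sketch below is an assumption made for illustration, not part of Hibernate Search: the class and method names are hypothetical, and the field name is inserted without any JSON escaping, so it should only be used with trusted, simple field names.

```java
final class HistogramJsonSketch {

    // Builds the JSON for an Elasticsearch histogram aggregation on the given
    // field with the given interval. The field name is NOT escaped.
    static String histogramJson(String field, int interval) {
        return "{\"histogram\": {\"field\": \"" + field + "\", \"interval\": " + interval + "}}";
    }

    public static void main(String[] args) {
        // Prints {"histogram": {"field": "price", "interval": 10}}
        System.out.println( histogramJson( "price", 10 ) );
    }
}
```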
