Backend - Integrated Apache Solr for Data Indexing and Discovery (<500 ms search on a 1-million-row dataset)

The Data Search capability we introduced last month needed a server-side data indexing and discovery mechanism to enable fast searching of data and insights. We benchmarked Solr and Elasticsearch and found Solr significantly faster in our tests and more robust in the area of NLP/ML, so we integrated Solr into Germain.

Benchmark

  • Indexed 1 million rows in less than 4 minutes on one of our own 3-year-old desktops, without any tuning

  • Searches across these 1 million rows take at most 500 ms
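
A search-latency check like the one above could be reproduced with a small script along these lines. This is a minimal sketch, not the harness we used: it assumes a Solr instance exposing the standard `/select` handler, and the host, the core name `germain_data`, and the helper names are hypothetical. It reports both the client-side wall time and the `QTime` value Solr returns in its response header.

```python
import json
import time
import urllib.parse
import urllib.request


def build_select_url(base_url, core, query, rows=10):
    """Build a URL for Solr's standard /select search handler."""
    params = urllib.parse.urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/select?{params}"


def http_get(url):
    """Fetch the raw response body over HTTP."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()


def timed_search(url, fetch=http_get):
    """Run one search and report client-side and server-side latency."""
    start = time.perf_counter()
    body = fetch(url)
    wall_ms = (time.perf_counter() - start) * 1000.0
    data = json.loads(body)
    return {
        "wall_ms": wall_ms,                            # measured by this client
        "qtime_ms": data["responseHeader"]["QTime"],   # reported by Solr
        "num_found": data["response"]["numFound"],
    }


# Example call (hypothetical host and core name, adjust to your deployment):
#   url = build_select_url("http://localhost:8983/solr", "germain_data", "timeout error")
#   print(timed_search(url))
```

The `fetch` parameter is injectable so the timing logic can be exercised without a running server. Wall time includes network and JSON transfer overhead, so it will normally be higher than `QTime`.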


Comparison of Apache Solr and Elasticsearch



|  | Solr | Elasticsearch |
|---|---|---|
| Index speed, 1 million rows (out of the box) | ~4 min | ~22 min |
| Index speed, 1 million rows (with simple optimizations) | not tested | ~8 min |
| Index size | ~500 MB | ~750 MB |
| Requires additional tool/software to pull from DB and insert into search platform | No | Yes (Logstash) |
| Simple query API | Yes | Yes |
| Built-in scheduler for updates | No | Yes (Logstash) |
| Returns entire document as search result | Yes | Yes |
| Full-text search features (misspelling, synonyms, ..) | Yes (very advanced) | Yes |
| Overall application | Text search | Analytical querying, filtering, and grouping |
| Nested documents support | No | Yes |
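
To illustrate the misspelling tolerance listed under full-text search features, the sketch below builds a fuzzy query string using the `term~N` syntax of Solr's standard query parser, where N is the maximum edit distance (0 to 2). It is an illustration under assumptions, not the query Germain issues: the field name `content` and the helper names are hypothetical, and the escaping rule is modeled on the set of characters the standard parser treats as special.

```python
import re

# Characters the standard query parser treats as syntax; each is backslash-escaped.
_SPECIAL_CHARS = re.compile(r'([\\+\-!():^\[\]"{}~*?|&;/])')


def escape_term(term):
    """Backslash-escape query-syntax characters inside a single search term."""
    return _SPECIAL_CHARS.sub(r"\\\1", term)


def fuzzy_query(text, field="content", max_edits=1):
    """Turn free text into an AND of fuzzy term clauses, e.g. content:serch~1."""
    if max_edits not in (0, 1, 2):
        raise ValueError("max_edits must be 0, 1, or 2")
    terms = text.split()
    return " AND ".join(f"{field}:{escape_term(t)}~{max_edits}" for t in terms)
```

For example, `fuzzy_query("serch latncy")` produces `content:serch~1 AND content:latncy~1`, which would match documents containing terms within one edit of each misspelled word. Synonym handling is configured on the Solr side in the field's analysis chain and is not covered by this sketch.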