Solr

A cluster is a group of nodes managed together by ZooKeeper. When a new core is added to a cluster, it registers with ZooKeeper, which then keeps track of the status of that core.

Within each shard, one replica acts as the leader. A leader is similar to a “master” node in a traditional master/slave setup, and is responsible for making sure its replicas are kept up to date with the same information it holds. A replica is similar to a “worker” or “slave” node: it contains a copy (a “replica”) of the index and can serve queries against it. This provides a level of failover and redundancy.
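To see the state that ZooKeeper is tracking, you can ask the Collections API for the cluster status, which lists every shard, its replicas, and which replica is currently the leader. The snippet below is a minimal sketch, assuming Python with the requests library, a collection named test_collection, and Solr listening on datanode1 at port 8983; the host and collection names are placeholders for this setup.

```python
# Minimal sketch: inspect shards, replicas, and leaders via the Collections API.
# Host, port, and collection name are assumptions for this environment.
import requests

SOLR_URL = "http://datanode1.example.com:8983/solr"

resp = requests.get(
    f"{SOLR_URL}/admin/collections",
    params={"action": "CLUSTERSTATUS", "collection": "test_collection", "wt": "json"},
)
resp.raise_for_status()
collection = resp.json()["cluster"]["collections"]["test_collection"]

# Each shard lists its replicas; exactly one replica per shard carries leader="true".
for shard_name, shard in collection["shards"].items():
    for replica_name, replica in shard["replicas"].items():
        role = "leader" if replica.get("leader") == "true" else "replica"
        print(shard_name, replica_name, replica["node_name"], role)
```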

As updates are made to the index, they are distributed across the cluster to the relevant shards and their replicas. Queries are likewise distributed across replicas, and load balancing and failover are handled automatically using the cluster state kept in ZooKeeper.
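As a small illustration, a query can be sent to any node in the cluster: SolrCloud fans it out to one replica of each shard and merges the results before responding. The node list and collection name below are assumptions for this environment; this is a sketch, not a definitive client.

```python
# Sketch: send a query to an arbitrary node; SolrCloud routes it across shards.
# Node URLs and collection name are placeholders for this setup.
import random
import requests

nodes = [
    "http://datanode1.example.com:8983/solr",
    "http://datanode2.example.com:8983/solr",
    "http://datanode3.example.com:8983/solr",
]

node = random.choice(nodes)  # any node will do
resp = requests.get(
    f"{node}/test_collection/select",
    params={"q": "*:*", "rows": 5, "wt": "json"},
)
resp.raise_for_status()
print(resp.json()["response"]["numFound"], "documents found")
```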

When planning your cluster, it’s often best to overshard: start with more shards per node than you expect to need in production, then move shards to new hardware as they grow too large to share a single node. This strategy lets you grow without having to split the index into new shards later. Solr does allow you to split shards, but starting with a higher number of shards also gives you the benefit of increased parallelism during your implementation phases.
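As a rough sketch of what “moving a shard to new hardware” looks like in practice: add a replica of the shard on the new node and, once it is active, delete the old replica, all through the Collections API. The node, shard, and replica names below are placeholders, not values from this installation.

```python
# Hedged sketch: relocate one shard by adding a replica on new hardware,
# then removing the old replica. All names here are placeholders.
import requests

SOLR_URL = "http://datanode1.example.com:8983/solr"
ADMIN = f"{SOLR_URL}/admin/collections"

# 1. Create a copy of shard1 on the new node.
requests.get(ADMIN, params={
    "action": "ADDREPLICA",
    "collection": "test_collection",
    "shard": "shard1",
    "node": "datanode4.example.com:8983_solr",
    "wt": "json",
}).raise_for_status()

# 2. Once the new replica is active, drop the replica on the old node.
requests.get(ADMIN, params={
    "action": "DELETEREPLICA",
    "collection": "test_collection",
    "shard": "shard1",
    "replica": "core_node1",
    "wt": "json",
}).raise_for_status()
```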

Clusters can be resized if necessary. The Collections API lets you create and modify collections with plain HTTP requests.
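Below is a minimal sketch of such requests, again assuming Python with requests and placeholder collection, config, and shard names; the configuration set is assumed to already be uploaded to ZooKeeper.

```python
# Sketch: create a collection, then resize it by splitting a shard,
# using the Collections API over HTTP. Names and counts are assumptions.
import requests

SOLR_URL = "http://datanode1.example.com:8983/solr"
ADMIN = f"{SOLR_URL}/admin/collections"

def collections_api(action, params):
    """Send a Collections API request and return the parsed JSON response."""
    params = dict(params, action=action, wt="json")
    resp = requests.get(ADMIN, params=params)
    resp.raise_for_status()
    return resp.json()

# Create a collection with 3 shards and 2 replicas per shard.
collections_api("CREATE", {
    "name": "test_collection",
    "collection.configName": "test_config",
    "numShards": 3,
    "replicationFactor": 2,
})

# Later, split a shard that has grown too large.
collections_api("SPLITSHARD", {"collection": "test_collection", "shard": "shard1"})
```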

Depending on the size and use case of your Solr environment, you can either install Solr on dedicated nodes (for larger workloads and collections) or co-locate it on the same nodes as the DataNodes. For this installation I have decided to install Solr on the 3 DataNodes.

Solr, a.k.a. HDP Search, is part of the HDP-Utils repository (see http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_search/index.html).
