Monday, December 26, 2016

ElasticSearch - NoSQL Search Engine

In the world of big data, it is important that we provide the user/customer a simple tool to find the right pieces of information quickly. A search engine is exactly suited for this requirement. In the e-commerce world it is especially important, as the user has lots of options to buy from and retail companies have a wide variety of products to offer the customer.

We have many search engines in the market; some are commercial and licensed, like Oracle Endeca, and some are open-source search engines, like Solr and ElasticSearch. Solr and ElasticSearch offer a rich and largely similar feature set. Both use Apache Lucene as the core component for indexing the data.

What is Elastic Search?
It is an open-source (but operated by a single company) search engine. It is built with cloud-based search in mind: the indexed data is spread across different nodes, with replica nodes to cover the failure of any node. It also has built-in node synchronization. When a new node is added to the cluster, it is brought up to date, the system is rebalanced, and then the node is allowed to serve queries. It provides easy horizontal scaling, is a completely REST-API-based search engine, and has very high indexing throughput, which opens it up to many different use cases.

Key Point to Know
·        In ES, a piece of data is called a "document"; we can compare it to a row in an RDBMS.
·        ES is a schema-less search engine.
·        An ES document can be structure-less; it is not mandatory to follow a given structure in a document.
·        Each piece of data in a document is called a field; it is like a column in a row when compared to a DB.
·        We can still describe the document structure in a mapping file.
·        We still have index and type, which can reduce the scope of data change and search, which in turn improves performance.
·        All actions are performed over the REST API, including updating settings.
·        All API calls take data in JSON format.
·        ES supports nested documents, as we will see in the examples.

Best use case of ElasticSearch

ElasticSearch is used in different technology stacks. As it is widely popular for text-based searching, it is used as the search engine in log-analysis stacks such as ELK (ElasticSearch, Logstash, Kibana), which are used as data-analysis tools to extract trends and reports.

As it has very high write throughput, it is also used as a search layer for NoSQL data, where the back end is a NoSQL database such as Cassandra. ES fills the gap in data-search ability: because it is good at very high indexing throughput, data changes are consumed quickly and easily. The ElasticSearch engine has plugins which can be used to synchronize the two.

It is used in the e-commerce world as a search engine, as it supports facets (aggregations), autocomplete, and fuzzy search. It is not yet as popular there as Solr and Endeca, but slowly we see a few retail sites powered by ElasticSearch.

Getting Started with ES
Before getting started, let us install a Java JRE (as ElasticSearch is developed in Java) and Fiddler (or any tool to create REST requests) or a POST plugin in the browser.
  • Download ElasticSearch from the ElasticSearch site.
  • Unzip the archive. We can find folders such as bin, config, data, lib, logs, modules, plugins.
    • config contains 'elasticsearch.yml', which provides configuration for nodes, backup, etc.
    • bin contains the scripts to start the search engine.
  • To start the search engine, go to the bin folder and run the elasticsearch.bat file (elasticsearch on Linux/Mac).
  • The default port for search is 9200 (it can be changed in elasticsearch.yml).
  • Go to the browser and access http://localhost:9200/
  • You should see a page which provides cluster information, including the lucene_version.
So now the ES engine is up and running with the default settings. The next step will be loading the data to search from.
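For reference, a minimal sketch of the kind of settings 'elasticsearch.yml' accepts; the cluster and node names here are made-up placeholders, only http.port matches the default mentioned above:

```yaml
# Hypothetical example values; adjust to your environment.
cluster.name: my-demo-cluster   # nodes with the same cluster.name join the same cluster
node.name: node-1               # human-readable name for this node
http.port: 9200                 # the default REST port mentioned above
```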

Data Indexing
As we know, the ES engine is API based; we have two kinds of APIs for data upload.
·        Single document create, update and delete.
·        Bulk create, update and delete.
Single document
          ES has APIs which perform an action on a single document; they can be used when we have to operate on one document at a time.
curl -XPOST http://localhost:9200/<index>/<type>/1 -d '{
"Id": "1",
"Name": "Pradeep",
"Address": {
"Street": "sapient office",
"City": "Bangalore",
"Zip code": "560098",
"Country": "India"
},
"Location": [34.05, -118.98],
"Rating": "4.5"
}'

Here the index can be, for example, sapientOffice and the type employee. Using the above curl we can create a new record or update the existing document at id=1; both operations use the POST method.
The address is an example of a nested document, which is supported by ES.

curl -XDELETE http://localhost:9200/<index>/<type>/1 will delete the document from ES.
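The same single-document call can be sketched from code. This is a hypothetical sketch (the index/type names are assumed to match the employee example above, and actually posting would need a local ES node on port 9200); it is shown mainly to make the URL layout and JSON body explicit:

```python
import json

# Assumed index/type names, matching the employee example above.
index, doc_type, doc_id = "sapientoffice", "employee", 1
url = "http://localhost:9200/%s/%s/%s" % (index, doc_type, doc_id)

document = {
    "Id": "1",
    "Name": "Pradeep",
    "Address": {                      # nested document, supported by ES
        "Street": "sapient office",
        "City": "Bangalore",
        "Zip code": "560098",
        "Country": "India",
    },
    "Location": [34.05, -118.98],
    "Rating": "4.5",
}

body = json.dumps(document)           # this is what goes after -d in the curl call
print(url)
print(body)
```

Posting `body` to `url` (for example with the `requests` library) would create or replace document 1, just like the curl call above.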

Bulk Document
          ES also provides an API for bulk upload of data for indexing. The bulk API expects newline-delimited JSON: an action line followed by the document source, one pair per document. Below is the syntax of the API for bulk upload of data.
curl -XPOST http://localhost:9200/<index>/<type>/_bulk --data-binary '
{"index":{}}
{"Name": "Pradeep", "Address": {"Street": "sapient office","City": "Bangalore","Zip code": "560098","Country": "India"},"Location": [34.05, -118.98],"Rating": "4.5"}
{"index":{}}
{"Name": "Pradeep", "Address": {"Street": "sapient office","City": "Bangalore","Zip code": "560098","Country": "India"},"Location": [34.05, -118.98],"Rating": "4.5"}
{"index":{}}
{"Name": "Pradeep", "Address": {"Street": "sapient office","City": "Bangalore","Zip code": "560098","Country": "India"},"Location": [34.05, -118.98],"Rating": "4.5"}
'

Here again the index can be sapientOffice and the type employee. Using the above curl we can create new records or update existing documents; both operations use the POST method.

curl -XDELETE http://localhost:9200/<index> will delete the whole index, and with it all the documents under it, from ES.
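Because the bulk body is newline-delimited JSON rather than one JSON object, it is easy to get wrong by hand. A small sketch of building it programmatically (the document list here is a trimmed-down assumption, not the full employee records):

```python
import json

# Hypothetical documents to index in one bulk call.
docs = [
    {"Name": "Pradeep", "Rating": "4.5"},
    {"Name": "Pradeep", "Rating": "4.5"},
    {"Name": "Pradeep", "Rating": "4.5"},
]

lines = []
for doc in docs:
    lines.append(json.dumps({"index": {}}))  # action line: index into the URL's index/type
    lines.append(json.dumps(doc))            # source line: the document itself

bulk_body = "\n".join(lines) + "\n"          # the bulk API requires a trailing newline
print(bulk_body)
```

Posting `bulk_body` to http://localhost:9200/&lt;index&gt;/&lt;type&gt;/_bulk would index all three documents in one request.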

So now we know how to load data into ES; let us see how to get the data back from ES.

ES Query
          One of the key functions of a search engine is how fast we can retrieve the data and how relevant the data is. ES provides different query syntaxes for fetching the data, which can be modified to suit our requirement.

Again, the query to fetch the data goes over API calls, and the request and response are in JSON format. ES provides a rich, flexible query language called the query DSL (domain-specific language), which allows us to build much more complicated, robust queries.

All the search-related queries are under the "_search" API endpoint.

Let us see different kinds of queries.
1.     The query below will return all documents under all types and all indices.
curl -XGET http://localhost:9200/_search -d '{
"query": {
"match_all": {}
}
}'

2.     The query below will return all documents under all types of an index.
curl -XGET http://localhost:9200/<index>/_search -d '{
"query": {
"match_all": {}
}
}'

3.     The query below will return all documents under a type of an index.
curl -XGET http://localhost:9200/<index>/<type>/_search -d '{
"query": {
"match_all": {}
}
}'

4.     The query below will return all documents under the type of an index for the search term "Pradeep" anywhere (any field) in the document.
curl -XGET http://localhost:9200/<index>/<type>/_search -d '{
"query": {
"query_string": {
"query": "Pradeep"
}
}
}'

5.     The query below will return all documents under the type of an index for the search term "Pradeep" in the Name field or the address's street field of the document.
curl -XGET http://localhost:9200/<index>/<type>/_search -d '{
"query": {
"query_string": {
"query": "Pradeep",
"fields": ["Name", "Address.Street"]
}
}
}'
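A sketch of the field-scoped query_string body built in code; the field names are assumed to match the employee document indexed earlier:

```python
import json

query = {
    "query": {
        "query_string": {
            "query": "Pradeep",
            # Nested fields are addressed with dot notation:
            "fields": ["Name", "Address.Street"],
        }
    }
}
print(json.dumps(query, indent=2))
```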

Using filters (provide boundaries for the search)
6.     The query below will return all documents under the type of an index that match the search term "Pradeep" in the Name field or the address's street field, and that also have a rating in the given range.
curl -XGET http://localhost:9200/<index>/<type>/_search -d '{
"query": {
"filtered": {
"filter": {
"range": {
"Rating": {
"gte": 4.0
}
}
},
"query": {
"query_string": {
"query": "Pradeep",
"fields": ["Name", "Address.Street"]
}
}
}
}
}'

7.     The query below will return all documents under the type of an index that have a rating in the given range.
curl -XGET http://localhost:9200/<index>/<type>/_search -d '{
"query": {
"filtered": {
"filter": {
"range": {
"Rating": {
"gte": 4.0
}
}
}
}
}
}'













Thursday, May 12, 2016

Buzzz Word "NO-SQL"

The buzzword NO-SQL: yes, for the last few months I have been hearing a lot about NO-SQL, but knew nothing about it. This is one more blog out of the thousands you can find online. It is just a collection of all the information I got while trying to learn what NO-SQL is. The aim here is not to give complete information, but to introduce you to the NO-SQL world: the important terms, the different types and other related information.

What is NO-SQL?

NO-SQL is a class of databases which store or manage unstructured data. It is very important to understand what a traditional database is. A traditional database is an RDBMS, a relational database. In this kind of database we generally store structured data: we define tables and columns, and each column has a type. When we want to store any data, we convert the data into the format the table definition expects; when we do not have data for a column, we generally store null in it. We also have relations between the tables, using which we combine the data while fetching it with queries.
In NO-SQL we define some structure too; in some databases it is called a document, in some a table. These structures are more helpful for fetching data than for defining rules for data storage. NO-SQL databases do not have relationships between the documents/tables; each piece of data is a separate entity. The data may have relations in the Java world or elsewhere, but they are not defined in the database. NO-SQL databases do not have joins when fetching data; instead we need to run different queries to fetch the related data.

Why to use No-SQL Databases?

NO-SQL databases are lightweight, easily scalable and high performance, and can have zero downtime.

  • Lightweight Databases: The RAM and memory taken by the database itself is very small. I remember that a NO-SQL database such as MongoDB was one of the preferred databases for mobile applications to store data locally; it has different versions for that purpose.
  • Easily Scalable: Even RDBMS databases are scalable, but there is a small difference between scaling up and scaling out. Scaling up is increasing the hardware of the same machine, whereas scaling out is adding one more machine to the cluster to increase capacity. Even RDBMS databases have the ability to scale out, but it is not an easy step. Most NO-SQL databases run in a cluster, and every machine in the cluster can be a simple desktop machine like the ones we use.
  • High Performance: Most NO-SQL databases bet on high performance, and you can find a lot of results which show that the response time of these databases is lower. But I find some of these comparisons to be apples to oranges, as we have different types of NO-SQL databases, each suited to different needs. NO-SQL databases also have their own advantages and disadvantages in how data is fetched; one of them is the lack of table joins while fetching data.
  • Zero Downtime: One more reason to go to a NO-SQL database is zero downtime, which means the cluster can be designed in such a way that if some machines fail while running, or are down for maintenance, other machines in the cluster can serve the data for the request, and the outside system (the requester) does not feel the effect of the database issue. This is an inbuilt feature of most of these databases, as it comes from the core design on which NO-SQL systems are developed.


What is Normalization and De-Normalization?

Most relational databases store data in normalized form; in simple words, no data is duplicated. Instead, data is linked between tables using relations like foreign keys. When we want to find related data, we generally use the join keyword in the query, join the tables and fetch the data. The advantage here is that we reduce data duplication, which in turn saves disk space, and we can update the data in only one place or table.
In the NO-SQL world, it is said that disk space is cheaper than CPU. NO-SQL does not understand relations between data and does not provide joins, so we cannot join while fetching the data. To overcome this difficulty, the advice is to create the smaller tables your queries require and to duplicate the data. For example, if your queries have a where clause on first name and percentage scored, and these live in different tables, just create one more table with both columns and add the data there; while fetching, you can read from this table, and if needed fetch the remaining records from the respective tables.
But some NO-SQL databases allow embedded rows inside one record; those are generally the document-oriented NO-SQL databases.
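The first-name/percentage example above can be made concrete. A hypothetical sketch (all names and scores are made up) showing normalized rows versus a denormalized, query-shaped copy:

```python
# Normalized: two "tables", linked by student id (as an RDBMS would do).
students = {1: {"first_name": "Asha"}, 2: {"first_name": "Ravi"}}
scores = {1: {"percentage": 91.0}, 2: {"percentage": 78.5}}

# Denormalized: one query-shaped table duplicating both fields,
# so a "where first_name ... and percentage ..." query needs no join.
by_name_and_score = [
    {"student_id": sid,
     "first_name": students[sid]["first_name"],
     "percentage": scores[sid]["percentage"]}
    for sid in students
]

# The query now reads from the one duplicated table directly.
top = [r for r in by_name_and_score if r["percentage"] >= 80.0]
print(top)
```

The duplication costs disk space but saves the join at read time, which is exactly the trade-off described above.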

No-SQL Classification

NO-SQL is not a database; it is a classification of databases based on what kind of data is stored and some other features. We have different kinds of databases under the NO-SQL or Big Data umbrella.
  • Column: Column-oriented NO-SQL databases store data column-wise; if the data is person data like FirstName, LastName, EmailId, Password, then the data will be stored and fetched in that column-wise way.
  • Document: Document-oriented NO-SQL databases store the data as a document. When we say document, it is a text document and not binary data like images and other files; mostly it will be a JSON object. The whole JSON object is added to the database, and as output/response the database returns JSON objects.
  • Key-value: Key-value databases are ones which work like a big hash map, where all data is stored as key and value. They work like session-maintaining containers, but add extra value as they can persist the data.
  • Graph: These databases store graph data, such as hierarchical data, where the data is node based.
  1. Column: Accumulo, Cassandra, Druid, HBase, Vertica
  2. Document: Apache CouchDB, Clusterpoint, Couchbase, DocumentDB, HyperDex, Lotus Notes, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB
  3. Key-value: Aerospike, Couchbase, Dynamo, FairCom c-treeACE, FoundationDB, HyperDex, MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis, Riak, Berkeley DB
  4. Graph: AllegroGraph, InfiniteGraph, MarkLogic, Neo4J, OrientDB, Virtuoso, Stardog
  5. Multi-model: Alchemy Database, ArangoDB, CortexDB, FoundationDB, MarkLogic, OrientDB

Where to use No-SQL databases?

NO-SQL databases are used mainly when the data is huge and needs 100% availability. These databases are often used for report generation and data analysis, where the data is huge and reports are generated from it. As scalability is easy, they suit databases that keep growing, like capturing user actions, log data and audit data. They are also used as back-end databases for microservice architectures where each service needs specific data. As some of the databases use JSON objects, they are used for lightweight web applications where data is stored directly as JSON and read back as JSON. Some NO-SQL databases (key-value type) are used to store and maintain sessions across multiple application servers.


Why is this part of ATG Blog?

Coming Soon.... “ATG to Cassandra (No-SQL) integration plugin code”.