
So, what is Hadoop?

Windows Azure HDInsight is Microsoft’s cloud-based Apache Hadoop service. But what is Hadoop?
 
According to the Apache™ Hadoop® website, the Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.  It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
 
In simple terms, Hadoop is a distributed computing platform that lets you rapidly gain insight from massive amounts of data with relatively little investment.
 

Hadoop includes these core modules:

 

1. Hadoop Common

a. Hadoop Common is a collection of components and interfaces that support the other Hadoop modules, including:
i. Distributed file system and I/O operation interfaces
ii. General parallel computation interfaces
iii. Logging
iv. Security management
 

2. Hadoop Distributed File System (HDFS™)

a. HDFS is a distributed filesystem that provides rapid access and storage for large volumes of data.  In contrast to single-disk filesystems, distributed filesystems partition storage across a number of machines, which is necessary when datasets outgrow the storage capacity of a single machine.
 

3. Hadoop YARN

a. YARN (Yet Another Resource Negotiator), introduced in Hadoop 2.0, is a framework for job scheduling and cluster resource management.  In addition to core Hadoop MapReduce tasks, YARN enables non-MapReduce tasks to work within a Hadoop installation.  YARN is composed of the following components:
i. ResourceManager
ii. ApplicationMaster
 

4. Hadoop MapReduce

a. MapReduce 2.0 (MRv2) is a YARN-based system for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of coordinated nodes.  In the map phase, data is filtered and sorted into pieces that can be processed independently.  The reduce phase aggregates the results of the map phase to produce the desired output.
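
The map and reduce phases described above can be sketched in plain Python. This is a single-process illustration of the programming model, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and a real job would run these steps in parallel across the nodes of a cluster.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the grouped values, here by summing the counts per word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    sample = ["big data on hadoop", "hadoop processes big data"]
    print(word_count(sample))
```

Because each input line is mapped independently and each word is reduced independently, both phases could be spread across many machines, which is the property Hadoop relies on.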
 
Dozens of companies maintain their own Hadoop packages, which are referred to as distributions.  As you might expect, distributions differ in the Hadoop features and functionality they support.
 
Windows Azure HDInsight is based on the Hortonworks Data Platform (HDP) Hadoop distribution.  The Hortonworks ecosystem of features and functionality is pictured below.
 

Source: http://hortonworks.com/products/hdp/, Hortonworks Inc.
 
As of this writing, HDInsight implements the following subset of the Hortonworks Data Platform (HDP) Hadoop ecosystem.
 
 
Component | Description
Apache Hive | Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via a SQL-like interface for large datasets stored in HDFS.
Apache Pig | A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for processing these programs.
Apache Sqoop | A tool that speeds and eases movement of data in and out of Hadoop. It provides reliable parallel loading for a variety of popular enterprise data sources.
Apache Oozie | A Java web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
Apache HCatalog | A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
Apache Templeton | Provides a REST-like web API for HCatalog and related Hadoop components. Developers make HTTP requests to access Hadoop MapReduce, Pig, Hive, and HCatalog DDL from within applications.

Source: Microsoft Inc.
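
To make the Templeton entry above more concrete, here is a minimal sketch of how an application might build a request URL for that REST-like API. It assumes the defaults of a stock Templeton (WebHCat) installation, namely port 50111 and the /templeton/v1 path prefix; an HDInsight cluster may expose the service at a different address and require authentication. The host name, user name, and the helper templeton_url are hypothetical.

```python
from urllib.parse import urlencode, urlunsplit


def templeton_url(host, resource, user, port=50111, **params):
    """Build a Templeton (WebHCat) URL for a resource such as 'status'.

    Assumes a stock install: plain HTTP, port 50111, '/templeton/v1' prefix.
    """
    query = urlencode({"user.name": user, **params})
    return urlunsplit(("http", f"{host}:{port}", f"/templeton/v1/{resource}", query, ""))


if __name__ == "__main__":
    # Hypothetical host and user; no request is sent here.
    print(templeton_url("headnode.example.com", "status", "hadoopuser"))
    print(templeton_url("headnode.example.com", "ddl/database", "hadoopuser"))
```

The resulting URL could then be fetched with any HTTP client; the sketch stops short of sending the request so that it runs without a cluster.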
 

Business/Technical Value of Hadoop

 
“Big Data” is the term used for large and complex data sets that are difficult to process with traditional data processing tools.  The business value of Hadoop is the ability to analyze “Big Data”.  The applicable data sources for Hadoop analysis are too many to list and ever growing.  However, the following data types, along with the applicable Hadoop process flow for each, are defined in the white paper “Business Value of Hadoop…as seen through data” (Hortonworks, Inc., June 2013).  This paper is well worth reading in its entirety; it is listed in the References section.
 


1. Clickstream Data

Source: “Business Value of Hadoop…as seen through data”, Hortonworks Inc., June 2013

2. Sentiment Data

Source: “Business Value of Hadoop…as seen through data”, Hortonworks Inc., June 2013

3. Server Log Data

Source: “Business Value of Hadoop…as seen through data”, Hortonworks Inc., June 2013

4. Sensor Data

Source: “Business Value of Hadoop…as seen through data”, Hortonworks Inc., June 2013

5. Location Data

Source: “Business Value of Hadoop…as seen through data”, Hortonworks Inc., June 2013

6. Unstructured Text

 
Sector | Stored Terabytes per Firm (>1000 employees) | Estimate of Unstructured Terabytes per Firm (using 80% share assumption) | Library of Congress Equivalents per Firm
Securities and investment | 3,866 | 3,093 | 13
Banking | 1,931 | 1,545 | 7
Communications and media | 1,792 | 1,434 | 6
Utilities | 1,507 | 1,206 | 5
Government | 1,312 | 1,050 | 4
Discrete manufacturing | 967 | 774 | 3
Insurance | 870 | 696 | 3
Process manufacturing | 831 | 665 | 3
Resource industries | 825 | 660 | 3
Transportation | 801 | 641 | 3
Retail | 697 | 558 | 2
Wholesale | 536 | 429 | 2
Health care providers | 370 | 296 | 1
Education | 319 | 255 | 1
Professional services | 278 | 222 | <1
Construction | 231 | 185 | <1
Consumer & recreational services | 150 | 120 | <1
Source: “Big data: The next frontier for innovation, competition, and productivity”, McKinsey Global Institute, May 2011
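
The unstructured estimate in the table follows from the stored total by the 80% share assumption named in the column header. A minimal sketch of that arithmetic is shown here; the function name is illustrative, and rounding to whole terabytes is an assumption that reproduces the rows checked.

```python
def estimate_unstructured_tb(stored_tb, unstructured_share=0.80):
    """Estimate unstructured terabytes as a fixed share of stored terabytes."""
    return round(stored_tb * unstructured_share)


# Stored terabytes per firm, taken from three rows of the table.
stored_tb_per_firm = {
    "Securities and investment": 3866,
    "Banking": 1931,
    "Retail": 697,
}

estimates = {
    sector: estimate_unstructured_tb(tb)
    for sector, tb in stored_tb_per_firm.items()
}
print(estimates)
```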
 
In addition to the above list, these data sources are also candidates for Hadoop analysis.
 
1. Climate data
2. Demographics
3. Enterprise data
4. Financial transactions
5. Science & statistics
6. Social media
7. Web services
 
HDInsight is an excellent tool for analyzing the aforementioned data sources.  HDInsight Hadoop clusters can be built and taken down in minutes, which allows you to quickly build a cluster, run your analysis, and tear the cluster down.  This is an economical approach since you are charged only for the processing time you use.  In terms of results, HDInsight integrates with traditional Microsoft Business Intelligence and Data Platform tools (Excel PowerPivot, Excel PowerView, SQL Server Analysis Services, etc.).  This extensibility allows for easy mash-up of HDInsight results with traditional enterprise data sources.
 

References

Apache Hadoop project website
http://hadoop.apache.org/
 
Hortonworks website
http://hortonworks.com/
 
White Paper: “Business Value of Hadoop…as seen through data”
Publisher: Hortonworks Inc.
Publication Date: June 2013
 
Windows Azure website
http://www.windowsazure.com/en-us/






 

Author

Wiz E. Wig, Mascot & Director of Magic