Where Technology Meets Staffing
Call us anytime: 1-917-720-1080

What is Apache Hadoop?

This is an open-source software framework.  The framework supports data-intensive distributed applications.  It also supports the running of applications on large clusters of commodity hardware. 

What it does…

The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.

In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system is designed so that node failures are automatically handled by the framework.

It enables applications to work with thousands of computation-independent computers and petabytes (after terabyte) of data.

Hadoop is written in the Java programming language and is a top-level Apache project being built and used by a global community of contributors.  Hadoop and its related projects (Hive, HBase, Zookeeper, and so on) have many contributors from across the ecosystem. Though Java code is most common, any programming language can be used with “streaming” to implement the “map” and “reduce” parts of the system.

Hadoop Will Be Used More in Real-Time Applications

Hadoop’s capabilities now make it possible to stream data into the cluster and analyze it in an interactive fashion—both in real time. Hadoop was purpose-built for cost-effective analysis of data sets as enormous as the World Wide Web

  • .possible to stream data into a cluster and analyze it
    • Interactive fashion
    • Real time
  • Built for cost-effective analysis of huge data sets web has to offer 

Revenue-Generating Uses Overtaking Cost-Saving Applications

Hadoop has always been a good fit for applications that process massive amounts of data for predictive modeling and other analytics. More and more, organizations will use these applications to generate revenue by more accurately targeting—and in some cases adapting—products and services

  • Companies currently using this have seen a decrease in cost with data analytics
    • Through accuracy of data received
    • Real-team response – can react quicker & more efficiently

Hadoop Pulls Away From Other Big Data Analytics Alternatives

Hadoop will distance itself from MongoDB, Cassandra, Couchbase and the numerous NoSQL options to become the safe choice. In stark contrast to the fractured and niche-oriented nature of the alternatives, Hadoop offers a uniform approach to a broad set of APIs for big data analytics (including MapReduce, query languages and database access, with easy integration of leading analytic and search platforms) along with an expanding ecosystem to deliver a wide range of services.

  • Other competitors are niche products (MongoDB, Cassandra, Couchbase)
  • Hadoop offers uniform approach to broad set of APIs
    • It also continues to expand its ecosystem daily to deliver wide range of services

Hadoop Expertise Growing Rapidly, but Talent Shortage Remains

The need for data scientists and operations personnel is growing fast, but it is not yet keeping up with the demand. A quick look at sites such as, Monster, Glassdoor, Careerbuilder and others show a high number of data-scientist-type jobs open. noted that every year since 2008, the need for Hadoop administrators has increased and eclipsed similar jobs

  • Talent is not available to use this product
  • Since 2008 the need for Hadoop administrators has significantly eclipsed similar jobs

Where to find this talent…..

SQL-based Tools for Hadoop Will Continue to Expand

The talent pool with structured query language skills is well-established and will drive Hadoop’s support of SQL. SQL-like languages, such as HiveQL and DrQL, are examples of tools that are making Hadoop accessible to the large SQL-fluent community.

  • SQL-like languages used by Hadoop like HiveQL & DrQL are making SQL engineers good source of talent to convert to Hadoop administrators

HBase Will Become a Popular Platform for Blob Stores

HBase is a large-scale, distributed database built on top of the Hadoop Distributed File System (HDFS). One application that is particularly well-suited for HBase is binary large objects (BLOB) stores. HBase is Hadoop’s open-source, nonrelational, distributed database modeled after Google’s BigTable and written in Java. These BLOBs require large databases with rapid retrieval. BLOBs typically are images, audio clips or other multimedia objects, and storing BLOBs in a database enables a variety of innovative applications. One example is a digital wallet—which enables users to upload their credit card images, checks and receipts for online processing; the technology eases banking, purchasing and lost-wallet recovery.

  • Hbase is large-scale, distributed database built on top of Hadoop Distributed File System
  • Binary large objects is great fit
  • HBase is Hadoop’s open source, nonrelational distributed database
  • BLOBs require large databases with rapid retrieval.
  • Blobs enables a variety of innovative applications: DIGITAL WALLET
    • Digital wallet stores credit card images, checks and receipts for online purchasing

HBase Will Emerge as an Attractive Platform for Lightweight OLTP

Facebook Messages, which combines messages, chat and email into a real-time conversation, is the first application in Facebook to use HBase in production. As a result, we will see more Hadoop deployments involved in lightweight online transaction processing (OLTP).

  • Hbase is currently being used for facebook messages, chat and email and turns it into a real-time conversation.
  • Facebook is using it so everyone will follow suit 

Hadoop’s Future Is in the Clouds

Hadoop will become one of the killer apps for cloud adoption. The number of Hadoop clusters offered by cloud vendors is going through an intense uptick as organizations tap Hadoop as a killer application.

  • Number of hadoop clusters offered by cloud vendors is going through the roof
    • Due to its ability to process large amounts of data & analyze it quickly

Why is Hadoop Important???

eWeek cites Hadoop’s ability to analyze data in real-time, its cost-effectiveness in relation to the value of predictive analytics, Hadoop-optimized hardware, and the increasing number of practitioners with Hadoop expertise. This will drive the demand for data scientists, as suggested by a PC World article that states,

“A keyword search for Hadoop on the IT careers site turned up nearly 1,000 jobs posted within the past month, many from software vendors.”

Who is currently using Hadoop?

Google, Yahoo and IBM. 

This is largely being used for search engines and advertising.  They data being collected gives these companies better pictures of what advertising efforts are useful and which ones are a waste.  They also are able to gather more search data and analyze it quickly.  Thousands of people search Google every hour and there is a huge amount of data with that.  This enables ALL the data to be used.

Connect with your local Talener office today