Azure Insider - Hadoop and HDInsight: Big Data in Windows Azure

September 2013

Let’s begin with a bold assertion: “If you, your startup, or the enterprise you work for aren’t saving massive quantities of data to disk for current and future analysis, you are compromising your effectiveness as a technical leader.” Isn’t it foolish to base important business decisions on gut instinct alone, rather than real quantitative data?

There are many reasons why big data is so pervasive. First, it’s amazingly cheap to collect and store data in any form, structured or unstructured, especially with the help of products such as Windows Azure Storage services. Second, it’s economical to leverage the cloud to provide the needed compute power, running on commodity hardware, to analyze this data. Finally, big data done well provides a major competitive advantage to businesses, because it makes it possible to extract undiscovered information from vast quantities of unstructured data. The purpose of this month’s article is to show how you can leverage the Windows Azure platform, in particular the Windows Azure HDInsight Service, to solve big data challenges.
Barely a day goes by without some hyped-up story in the IT press, and sometimes even in the mainstream media, about big data. Big data refers simply to data sets so large and complex that they’re difficult to process using traditional techniques such as data cubes, de-normalized relational tables and batch-based extract, transform and load (ETL) engines, to name a few. Advocates talk about extracting business and scientific intelligence from petabytes of unstructured data that might originate from a variety of sources: sensors, Web logs, mobile devices and the Internet of Things, or IoT (technologies based on radio-frequency identification (RFID), such as near-field communication, barcodes, Quick Response (QR) codes and digital watermarking). IoT changes the definition of big: we’re now talking about exabytes of data a day!

Does big data live up to all the hype? Microsoft definitely believes it does and has bet big on big data. First, big data leads to better marketing strategies, replacing gut-based decision making with analytics based on real consumer behavior.
Terkaly and Villalobos jointly present at large industry conferences. They encourage readers of Windows Azure Insider to contact them for availability. Reach them at [email protected] or [email protected].
Second, business leaders can improve strategic decisions, such as adding a new feature to an application or Web site, because they can study the telemetry and usage data of applications running on a multitude of devices. Third, it helps financial services detect fraud and assess risk. Finally, though you may not realize it, it’s big data technologies that are typically used to build recommendation engines (think Netflix).
Recommendations are often offered as a service on the Web or within big companies to expedite business decisions. The really smart businesses are collecting data today without even knowing what type of questions they’re going to ask of the data tomorrow. Big data really means data analytics, which has been around for a long time. While there have always been huge data stores being mined for intelligence, what makes today’s world different is the sheer variety of mostly unstructured data.
Fortunately, products like Windows Azure bring great economics, allowing just about anyone to scale up compute power and apply it to vast quantities of storage, all in the same datacenter. Data scientists describe the new data phenomenon as the three Vs: velocity, volume and variety. Never has data been created with such speed, at such volume, and with so little defined structure. The world of big data contains a large and vibrant ecosystem, but one open source project reigns above them all, and that’s Hadoop.
Hadoop is the de facto standard for distributed data crunching.

The MapReduce Model

MapReduce is the programming model used to process huge data sets; essentially it’s the “assembly language” for Hadoop, so understanding what it does is crucial to understanding Hadoop. MapReduce algorithms are written in Java and split the input data set into independent chunks that are processed by the map tasks in a completely parallel manner. The framework sorts the output of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
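To make that flow concrete, here’s a toy, single-process Python sketch of the map, shuffle and reduce phases. This is illustrative only, not Hadoop code: the CSV field positions and the sample records are made up, and a real job distributes each phase across the cluster.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map task: emit a (category, 1) pair for each input record."""
    for line in lines:
        fields = line.split(",")
        if len(fields) > 1:
            yield (fields[1], 1)  # assume the category is the second CSV field

def shuffle(pairs):
    """Framework step: sort map output and group the values by key."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    """Reduce task: sum the counts for each key."""
    for key, values in grouped:
        yield key, sum(values)

# Hypothetical crime records (incident number, category, ...).
records = ["1,THEFT,...", "2,ASSAULT,...", "3,THEFT,..."]
result = dict(reduce_phase(shuffle(map_phase(records))))
print(result)  # {'ASSAULT': 1, 'THEFT': 2}
```

The same three-stage shape (map, framework-managed shuffle/sort, reduce) is what the Java MapReduce code later in the article expresses, just with Hadoop’s Mapper and Reducer classes instead of plain functions.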
The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks. Ultimately, most developers won’t author low-level Java code for MapReduce. Instead, they’ll use advanced tooling that abstracts the complexities of MapReduce, such as Hive or Pig. To gain an appreciation of this abstraction, we’ll take a look at low-level Java MapReduce and at how the high-level Hive query engine, which HDInsight supports, makes the job much easier.

Why HDInsight?
HDInsight is an Apache Hadoop implementation that runs in globally distributed Microsoft datacenters. It’s a service that allows you to easily build a Hadoop cluster in minutes when you need it, and tear it down after you run your MapReduce jobs. As Windows Azure Insiders, we believe there are a couple of key value propositions of HDInsight. The first is that it’s 100 percent Apache-based, not a special Microsoft version, meaning that as Hadoop evolves, Microsoft will embrace the newer versions. Moreover, Microsoft is a major contributor to the Hadoop/Apache project and has provided a great deal of its query optimization know-how to the query tooling, Hive.
The second aspect of HDInsight that’s compelling is that it works seamlessly with Windows Azure Blobs, mechanisms for storing large amounts of unstructured data that can be accessed from anywhere in the world via HTTP or HTTPS. HDInsight also makes it possible to persist the metadata of table definitions in SQL Server so that when the cluster is shut down, you don’t have to re-create your data models from scratch. Figure 1 depicts the breadth and depth of Hadoop support in the Windows Azure platform.

Figure 1 Hadoop Ecosystem in Windows Azure

On top is the Windows Azure Storage system, which provides secure and reliable storage and includes built-in geo-replication for redundancy of your data across regions.
Windows Azure Storage includes a variety of powerful and flexible storage mechanisms, such as Tables (a NoSQL, key-value store), SQL database, Blobs and more. It supports a RESTful API that allows any client to perform create, read, update, delete (CRUD) operations on text or binary data, such as video, audio and images.
This means that any HTTP-capable client can interact with the storage system. Hadoop directly interacts with Blobs, but that doesn’t limit your ability to leverage other storage mechanisms within your own code.
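As a sketch of what such a REST call looks like, the helper below assembles the URL and headers for a Put Blob request against the Blob service. The account and container names are hypothetical, the content type is assumed, and the authentication that a real call requires (a SharedKey signature or SAS token) is omitted here.

```python
def build_put_blob_request(account, container, blob_name):
    """Assemble the URL and headers for a Put Blob REST call.

    Note: a real request must also carry authentication, either an
    Authorization header with a SharedKey signature or a SAS token
    appended to the URL; that part is omitted in this sketch.
    """
    url = f"https://{account}.blob.core.windows.net/{container}/{blob_name}"
    headers = {
        "x-ms-blob-type": "BlockBlob",  # create a block blob
        "Content-Type": "text/csv",     # assumed content type for our data file
    }
    return url, headers

# Hypothetical account and container names.
url, headers = build_put_blob_request("mystorage", "crimedata", "SFPDIncidents.csv")
print(url)
```

Any HTTP stack can then issue a PUT with those headers and the file contents as the body, which is exactly why “any HTTP-capable client can interact with the storage system.”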
The second key area is Windows Azure support for virtual machines (VMs) running Linux. Hadoop runs on top of Linux and leverages Java, which makes it possible to set up your own single-node or multi-node Hadoop cluster.
This can be a tremendous money saver and productivity booster, because a single VM in Windows Azure is very economical. You can actually build your own multi-node cluster by hand, but it isn’t trivial and isn’t needed when you’re just trying to validate some basic algorithms. Setting up your own Hadoop cluster makes it easy to start learning and developing Hadoop applications.
Moreover, performing the setup yourself provides valuable insight into the inner workings of a Hadoop job. If you want to know how to do it, see the blog post, “How to Install Hadoop on a Linux-Based Windows Azure Virtual Machine,” at. Of course, once you need a larger cluster, you’ll want to take advantage of HDInsight, which is available today in Preview mode. To begin, go into the Windows Azure portal and sign in.
Next, select Data Services | HDInsight | Quick Create. You’ll be asked for a cluster name, the number of compute nodes (currently four to 32 nodes) and the storage account to which to bind. The location of your storage account determines the location of your cluster. Finally, click CREATE HDINSIGHT CLUSTER.
It will take 10 to 15 minutes to provision your cluster. The time it takes to provision is not related to the size of the cluster. Note that you can also create and manage an HDInsight cluster programmatically using Windows PowerShell, as well as through cross-platform tooling on Linux- and Mac-based systems. Much of the functionality in the command-line interface (CLI) is also available in an easy-to-use management portal, which allows you to manage the cluster, including the execution and management of jobs on the cluster.
You can download Windows Azure PowerShell as well as the CLIs for Mac and Linux at. Then set up your VM running CentOS (a version of Linux), along with the Java SDK and Hadoop.

Exploring Hadoop

To experiment with Hadoop and learn about its power, we decided to leverage publicly available data from. Specifically, we downloaded a file containing San Francisco crime data for the previous three months and used it as is. The file includes more than 33,000 records (relatively small by big data standards) derived from the SFPD Crime Incident Reporting system. Our goal was to perform some simple analytics, such as calculating the number and type of crime incidents.
Figure 2 shows part of the output from the Hadoop job that summarized the crime data.

# Make a folder for the input file.
hadoop fs -mkdir /tmp/hadoopjob/crimecount/input
# Copy the data file into the folder.
hadoop fs -put SFPDIncidents.csv /tmp/hadoopjob/crimecount/input
# Create a folder for the Java output classes.
mkdir crimecountclasses
# Compile the Java source code.
javac -classpath /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.3.0.jar:/usr/lib/hadoop-0.20-mapreduce/hadoop-core-2.0.0-mr1-cdh4.3.0.jar -d crimecountclasses CrimeCount.java
# Create a jar file from the compiled Java code.
jar -cvf crimecount.jar -C crimecountclasses/ .
# Submit the jar file as a Hadoop job, passing in the class path as well as
# the input folder and output folder.
# NOTE: HDInsight users can use 'asv:///SFPDIncidents.csv' instead of
# '/tmp/hadoopjob/crimecount/input' if they uploaded the input file
# (SFPDIncidents.csv) to Windows Azure Storage.
hadoop jar crimecount.jar org.myorg.CrimeCount /tmp/hadoopjob/crimecount/input /tmp/hadoopjob/crimecount/output
# Display the output (the results) from the output folder.
hadoop fs -cat /tmp/hadoopjob/crimecount/output/part-00000

Now you have an idea of the pieces that make up a minimal Hadoop environment, as well as what MapReduce Java code looks like and how it ends up being submitted as a Hadoop job at the command line. Chances are, at some point you’ll want to spin up a cluster to run some big jobs using higher-level tooling like Hive or Pig, then shut it down. This is what HDInsight is all about: it makes that easy, with built-in support for Pig and Hive. Once your cluster is created, you can work at the Hadoop command prompt or you can use the portal to issue Hive and Pig queries.
The advantage of these queries is that you never have to delve into Java and modify MapReduce functions, perform the compilation and packaging, or kick off the Hadoop job with the .jar file. Although you can remote in to the head node of the Hadoop cluster and perform these tasks (writing Java code, compiling it, packaging it up as a .jar file, and using the .jar file to execute it as a Hadoop job), this isn’t the optimal approach for most Hadoop users; it’s too low-level. The most productive way to run MapReduce jobs is to leverage the Windows Azure portal in HDInsight and issue Hive queries (assuming Pig isn’t a better technical fit for the task). You can think of Hive as higher-level tooling that abstracts away the complexity of writing MapReduce functions in Java; it’s really nothing more than a SQL-like scripting language. Queries written in Hive get compiled into Java MapReduce functions.
Moreover, because Microsoft has contributed significant portions of optimization code for Hive in the Apache Hadoop project, chances are that queries written in Hive will be better optimized and will run more efficiently than handcrafted code in Java. You can find an excellent tutorial at. All of the Java and script code we presented previously can be replaced with the tiny amount of code in Figure 5.
It’s remarkable how three lines of code in Hive can efficiently achieve the same or better results than that previous code.

-- Hive does a remarkable job of representing native Hadoop data stores
-- as relational tables so you can issue SQL-like statements.
-- Create a pseudo-table for data loading and querying.
CREATE TABLE sfpdcrime(
  IncidntNum string,
  Category string,
  Descript string,
  DayOfWeek string,
  CrimeDate string,
  CrimeTime string,
  PdDistrict string,
  Resolution string,
  Address string,
  X string,
  Y string,
  CrimeLocation string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load data into the table.
LOAD DATA INPATH 'asv://[email protected]/SFPDIncidents.csv' OVERWRITE INTO TABLE sfpdcrime;

-- Count all rows.
SELECT COUNT(*) FROM sfpdcrime;

-- Ad hoc query to aggregate and summarize crime types.
SELECT Descript, COUNT(*) AS cnt FROM sfpdcrime GROUP BY Descript ORDER BY cnt DESC;

There are some important points to note about Figure 5. First, notice that these commands look like familiar SQL statements, allowing you to create table structures into which you can load data. What’s particularly interesting is the loading of data from Windows Azure Storage services.
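For intuition about what the final GROUP BY query computes, it behaves like this in-memory aggregation. This is a Python sketch over a few hypothetical rows; the real table has all the columns from the CREATE TABLE statement, and Hive actually compiles the query into distributed MapReduce tasks rather than running it in one process.

```python
from collections import Counter

# Hypothetical rows: (IncidntNum, Category, Descript).
rows = [
    ("1", "LARCENY/THEFT", "GRAND THEFT FROM LOCKED AUTO"),
    ("2", "ASSAULT", "BATTERY"),
    ("3", "LARCENY/THEFT", "GRAND THEFT FROM LOCKED AUTO"),
]

# SELECT Descript, COUNT(*) AS cnt ... GROUP BY Descript ORDER BY cnt DESC
counts = Counter(descript for _, _, descript in rows)
for descript, cnt in counts.most_common():
    print(descript, cnt)
```

The Counter plays the role of the reduce phase (summing per key), and `most_common` supplies the ORDER BY cnt DESC ordering.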
Note the asv prefix in the load statement in Figure 5. ASV stands for Azure Storage Vault, which you can use as a storage mechanism to provide input data to Hadoop jobs. As you may recall, during the provisioning of an HDInsight cluster, you specified one or more Windows Azure Storage service accounts. The ability to leverage Windows Azure Storage services in HDInsight dramatically improves the usability and efficiency of managing and executing Hadoop jobs.
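For illustration, here’s a small Python sketch that breaks such a URI apart, assuming the asv://container@storagehost/path form suggested by the load statement above (the container and account names here are hypothetical):

```python
from urllib.parse import urlparse

def parse_asv_uri(uri):
    """Split an asv:// URI into (container, storage host, blob path).

    Sketch only: assumes the asv://container@host/path form and does
    no validation of the scheme or the components.
    """
    parts = urlparse(uri)
    container, _, host = parts.netloc.partition("@")
    return container, host, parts.path.lstrip("/")

# Hypothetical container and storage account.
c, h, p = parse_asv_uri("asv://mycontainer@mystorage.blob.core.windows.net/SFPDIncidents.csv")
print(c, h, p)
```

In other words, the asv prefix is just an addressing scheme: the container and account identify which bound storage account Hadoop should read from, and the path names the blob that serves as job input.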
We’ve only scratched the surface in this article. There’s a significant amount of sophisticated tooling that supports and extends HDInsight, and a variety of other open source projects you can learn about at the Apache Hadoop portal. Your next steps should include watching the Channel 9 video “Make Your Apps Smarter with Azure HDInsight” at. If your goal is to remain competitive by making decisions based on real data and analytics, HDInsight is there to help.

The Hadoop Ecosystem

Once you leave the low-level world of writing MapReduce jobs in Java, you’ll discover an incredible, highly evolved ecosystem of tooling that greatly extends Hadoop’s capabilities.
For example, Cloudera and Hortonworks are successful companies with business models based on Hadoop products, education and consulting services. Many open source projects provide additional capabilities, such as machine learning (ML); SQL-like query engines that support data summarization and ad hoc querying (Hive); data-flow language support (Pig); and much more.
Here are just some of the projects that are worth a look: Sqoop, Pig, Apache Mahout, Cascading and Oozie. Microsoft offers a variety of tools as well, such as Excel with PowerPivot, Power View, and ODBC drivers that make it possible for Windows applications to issue queries against Hive data. Visit to see a fascinating visual of the Hadoop ecosystem.