Archive for June, 2015

Data lakes for data science

Steve Hoenisch

Solutions Architect and White Paper Writer at EMC Emerging Technologies Division

Big data can be challenging for an enterprise organization, because big data affects data scientists, application developers, and infrastructure managers differently. Each of these specialists has different needs when it comes to analytic frameworks and storage infrastructure.

A data lake is a storage strategy to collect data in its native format in a shared storage infrastructure, making data available to different analytics applications, teams, and devices over common protocols. The notion of an EMC Isilon data lake sets the stage for a discussion of the kind of architecture that best supports the enterprise data science program pipeline and the newer, highly scalable big data tools. You can find this discussion in a new white paper, Data Lakes for Data Science: Integrating Analytics Tools with Shared Infrastructure for Big Data.

This blog post highlights the impact that data science has on an enterprise organization, and the considerations for decision makers to keep in mind about analytics frameworks and storage infrastructure. For details about data lake solutions and examples, refer to the white paper.

The impact of data science on the enterprise

Implementing an enterprise data science program to analyze big data involves two overarching, interrelated requirements:

  1. The flexibility to use the analytics tool that works best for the dataset on hand.
  2. The flexibility to use the analytics tool that best serves your analytical objectives.

Several aspects of the data science pipeline highlight these requirements:

  1. When you begin to collect data to solve a problem, you might not know the characteristics of the dataset, and those characteristics might influence the analytics framework that you select.
  2. When you have a dataset, but have not yet identified a problem to solve or an objective to fulfill, you might not know which analytics tool or method will best serve your purpose.

Analytics frameworks

With the traditional solution of the data warehouse and business intelligence system (DW/BI), these requirements are well known, as the following passage from Margy Ross and Ralph Kimball’s book, “The Data Warehouse Toolkit,” illustrates:

“The DW/BI system must adapt to change. User needs, business conditions, data, and technology are all subject to change. The DW/BI system must be designed to handle this inevitable change gracefully so that it doesn’t invalidate existing data or applications. Existing data and applications should not be changed or disrupted when the business community asks new questions or new data is added to the warehouse.”

However, unknown business problems and varying datasets demand a flexible approach to choosing the analytics framework that will work best for a given project or situation.

In particular, DW/BI systems have difficulty adapting to the demands of big data. In the face of new business requirements to collect and analyze large sets of unstructured data, DW/BI systems have become barriers to change. Why?

Because a data warehouse or relational database management system (RDBMS) is not capable of scaling to handle the volume and velocity of big data and does not satisfy some key requirements of a big data program, such as handling unstructured data. The schema-on-write requirements of an RDBMS impede the storage of a variety of data.
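The difference between fixing a schema when data is written and applying one when data is read can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the `sales` table, and the function names are hypothetical, SQLite stands in for a relational store, and a list of JSON lines stands in for data kept in its native format.

```python
import json
import sqlite3

# Hypothetical records from different sources; the second one carries a
# field ("device") that the relational schema never anticipated.
records = [
    {"customer": "alice", "amount": 12.5},
    {"customer": "bob", "amount": 7.0, "device": "mobile"},
]


def load_schema_on_write(rows):
    """Insert rows into a fixed table; return (stored count, rejected rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    rejected = []
    for row in rows:
        # Column names are interpolated only because this is a toy example.
        columns = ", ".join(row)
        placeholders = ", ".join("?" for _ in row)
        try:
            conn.execute(
                f"INSERT INTO sales ({columns}) VALUES ({placeholders})",
                list(row.values()),
            )
        except sqlite3.OperationalError:
            # The table has no such column, so the row cannot be stored as-is.
            rejected.append(row)
    stored = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
    conn.close()
    return stored, rejected


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time by projecting records onto chosen fields."""
    result = []
    for line in raw_lines:
        record = json.loads(line)
        result.append({field: record.get(field) for field in fields})
    return result


# Schema-on-read: keep every record in its native form (JSON lines here).
raw_store = [json.dumps(r) for r in records]

stored, rejected = load_schema_on_write(records)
by_device = read_with_schema(raw_store, ["customer", "device"])
```

In this sketch the fixed table turns away the record with the unexpected field, while the raw store keeps both records and lets each analysis decide later which fields it cares about.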

Indeed, the sheer variety of data requires a variety of tools—and different tools are likely to be used during the different phases of the data science pipeline. Common tools include Python, the statistical computing language R, and visualization software, such as Tableau. But the framework that many businesses are rapidly adopting is Apache Hadoop.

Analytics tools such as Apache Hadoop, Apache Hive, and Apache Spark underpin the data science pipeline. At each stage of the workflow, data scientists work to clean their data, extract aspects of it, aggregate it, explore it, model it, sample it, test it, and analyze it. With such work come many use cases, and each use case demands the tool that best fits the task, so different tools may be put to use at different stages of the pipeline.
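As a rough illustration of a few of these stages (clean, aggregate, sample), here is a minimal Python sketch that uses only the standard library. The dataset, field names, and function names are hypothetical; in practice each stage might run in a different tool, such as Hive or Spark, against the same stored data.

```python
import random
import statistics
from collections import defaultdict

# Hypothetical raw measurements; None and negative values stand in for
# the dirty data a real pipeline would have to handle.
raw = [
    {"site": "east", "latency_ms": 120},
    {"site": "east", "latency_ms": None},
    {"site": "west", "latency_ms": 80},
    {"site": "west", "latency_ms": -5},
    {"site": "east", "latency_ms": 100},
    {"site": "west", "latency_ms": 90},
]


def clean(rows):
    """Drop rows with missing or negative readings."""
    return [
        r for r in rows
        if r["latency_ms"] is not None and r["latency_ms"] >= 0
    ]


def aggregate(rows):
    """Group readings by site and compute the mean latency per site."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["site"]].append(r["latency_ms"])
    return {site: statistics.mean(values) for site, values in groups.items()}


def sample(rows, k, seed=0):
    """Draw a reproducible random sample for exploration or testing."""
    return random.Random(seed).sample(rows, k)


cleaned = clean(raw)
means = aggregate(cleaned)
subset = sample(cleaned, 2)
```

Each function reads the output of the previous stage without changing the raw input, which mirrors the idea of keeping source data in its native form while different tools work on it.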

Storage infrastructure and the data lake

The infrastructure of any data storage system must support data access over multiple protocols so that many tools running on different operating systems, whether on a compute cluster or a user’s workstation, can access the stored data.

The flexibility of a data lake empowers the IT infrastructure to serve the rapidly changing needs of the business, the data scientists, and the big data tools. If the storage solution is flexible enough to support many big data activities, it can yield a sizable return on the investment.

For more information, including examples of data science studies conducted in enterprise environments, read the white paper, “Data Lakes for Data Science: Integrating Analytics Tools with Shared Infrastructure for Big Data.”

Start a conversation about Isilon content

Have a question or feedback about Isilon content? Visit the online EMC Isilon Community to start a discussion. If you have questions or feedback about this blog, or comments about the video specifically, contact us at isi.knowledge@emc.com. To provide documentation feedback or request new content, contact isicontent@emc.com.

New EMC Isilon support content for May 2015

Kirsten Gantenbein

Principal Content Strategist at EMC Isilon Storage Division

Check out new EMC Isilon customer support content published in the month of May. Each month I’ll post a summary of newly published content for Isilon customers, as well as the top 10 most viewed knowledgebase articles.

New Isilon support content

Here is the new customer support content that was published in May 2015. For example, you’ll find new Isilon Community articles about OneFS target code, NFS improvements, and L3 cache best practices. We also have a new technical demo video about the Superna application for disaster recovery, and a new data science white paper.

CONTENT TYPE                              TITLE AND LINK
ClusterTalk Podcast                       Episode 3
Isilon Community (ECN) Uptime Info Hub    EMC Technical Advisories (ETAs) for Isilon OneFS
Isilon Community (ECN) Uptime Info Hub    OneFS L3 Cache Performance and Best Practices
Isilon Community (ECN) Uptime Info Hub    Upgrading to OneFS Target Code
Isilon Community (ECN) Blog               OneFS Job Engine & Distributed Work Allocation
Isilon Community (ECN) Blog               NFS Improvements in OneFS 7.2
White Paper                               Data Lakes for Data Science: Integrating Analytics Tools with Shared Infrastructure for Big Data
Video                                     Technical Demo: Superna Eyeglass for Isilon Version 1.2

Most viewed knowledgebase (KB) articles

  1. Product Impacts of Upcoming Leap Second UTC adjustment on June 30th 2015 (197322)
  2. ETA 199379: UPDATE: Isilon OneFS: Microsoft security update MS15-027 may cause data to be unavailable to SMB clients that are authenticated to Isilon clusters through an Active Directory server that relies on the NTLM authentication protocol (199379)
  3. ESA-2014-146 (193304)
  4. OneFS 7.1.1.2 SMB and Authentication Rollup Patches (196928)
  5. UPDATE: ETA 193819: EMC Isilon nodes: Mars-K+ drives may stop responding and be automatically smartfailed from Isilon nodes (193819)
  6. OneFS: Best practices for NFS client settings (90041)
  7. OneFS: How to reset the CELOG database and clear all historical events (16586)
  8. OneFS: How to safely shut down an Isilon cluster prior to a scheduled power outage (16529)
  9. ETA 200097: Isilon OneFS 7.1.1.0 – 7.1.1.3 and 7.2.0.0 – 7.2.0.1: Attempts to upgrade SSD drive firmware using an Isilon Drive Support Package may result in data loss on clusters that have the L3 cache feature enabled (200097)
  10. OneFS: How to reimage a node using a USB flash drive (16582)

Tell us what you want to know! Contact us with questions or feedback about this blog at isi.knowledge@emc.com. To provide documentation feedback or request new content, contact isicontent@emc.com.

Check out the new EMC Isilon podcast

Kirsten Gantenbein

Principal Content Strategist at EMC Isilon Storage Division

If you enjoy listening to technology-related podcasts while commuting on the bus or working out at the gym, there’s a new technology podcast about EMC Isilon that you can add to your listening queue.

The EMC Isilon ClusterTalk podcast was created by Chris Adiletta and Scott Pinzon of EMC Isilon, who also serve as its charismatic hosts. Each monthly hour-long episode features regular segments and expert guests. “Podcast discussions can be more frank and free-wheeling than in a more formal setting, so they provide a great way to address tech issues realistically,” says Scott.

From left to right, ClusterTalk hosts and creators Chris Adiletta and Scott Pinzon

You can download the latest episode now from iTunes or listen on Stitcher.

Why a podcast?

There are several channels you can follow to get technical information about EMC Isilon products. For example, you can download documentation from the EMC Online Support site (login required), follow @EMCIsilon on Twitter for news and updates, and ask product-related questions on our Isilon Community forum. Now you can also listen to the ClusterTalk podcast to learn tips for getting the most performance, efficiency, and insight from your EMC Isilon OneFS clusters.

“We wanted a way to connect with a large audience of customers over our passion for Isilon, the big data industry, and all of the ways that technology is pushing the boundaries of human capability,” says Chris.

Each episode features a cool command, a popular topic on the Isilon Community, and data storage-related news. You can also hear me each month on the “Hidden Gems” segment, where I reveal a new and intriguing bit of customer support content.

Scott, who also serves as the audio engineer, explains what he loves about the podcast format. “Audio is a fantastic medium for the mind. With sounds, we can help listeners imagine worlds that would require a Hollywood movie budget to create visually, or let them feel like we’re all hanging out discussing big data over beer. Podcasts are terrific for anyone who wants to always be learning!”

For more information, visit the podcast hub on the Isilon Community or see the show notes for individual episodes.

Feedback

We value your feedback on this podcast. Listeners can also ask questions for Chris and Scott to address on the podcast. You can submit your questions by sending an email to clustertalk@emc.com or leaving a community comment. You can also leave your feedback on this podcast by rating it on iTunes.
