Archive for March, 2015

Cluster capacity advice from an EMC Isilon expert

Kirsten Gantenbein
Principal Content Strategist at EMC Isilon Storage Division

Avoiding scenarios where your cluster reaches maximum capacity is crucial for making sure it runs properly. Our Best Practices for Maintaining Enough Free Space on Isilon Clusters and Pools guide contains information to help Isilon customers keep their clusters running smoothly.

However, common misperceptions about cluster capacity persist: for example, that it’s easy to delete data from a cluster that is 100 percent full, or that reserving space with Virtual Hot Spare (VHS) for smartfailing a drive is unnecessary.

To clarify these issues and other concerns about cluster capacity, I interviewed one of Isilon’s top experts on this topic, Bernie Case. Bernie is a Technical Support Engineer V in Global Services at Isilon, with many years of experience working with customers who experience maximum cluster capacity scenarios. He is also a contributing author to the Best Practices for Maintaining Enough Free Space on Isilon Clusters and Pools guide. In this blog post, Bernie answers questions about cluster capacity and provides advice and solutions.

Q: What are common scenarios in the field that lead to a cluster reaching capacity?

A: The typical scenarios involve increased data ingest, which can come from either a normal or an unexpected workflow. If you’re adding or replacing nodes to add capacity and it takes longer than expected, a normal workflow will continue writing data to the cluster—possibly causing the cluster to reach capacity. Or a drive or node fails on an already fairly full cluster, which requires the Job Engine to run a FlexProtect (or FlexProtectLin) job to re-protect data, interrupting normal SnapshotDelete jobs. [See EMC Isilon Job Engine to learn more about these jobs.] Finally, I’ve seen snapshot policies that create a volume of snapshots that takes a long time to delete, even after the snapshots expire. [See Best Practices for Working with Snapshots for snapshot schedule tips.]

Q: What are common misperceptions about cluster capacity?

A: Some common misconceptions include:

  • Filling 95 percent of a 1 PiB cluster still leaves about 50 TiB of space. That’s plenty for our workflow. We won’t fill that up.
  • Filling up one tier and relying on spillover to another tier won’t affect performance.
  • The SnapshotDelete job should be able to keep up with our snapshot creation rate.
  • Virtual Hot Spare (VHS) is not necessary in our workflow; we need that space for our workflow.
  • It’s still very easy to delete data when the cluster is 100 percent full.
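The first misconception above is worth checking with arithmetic: 5 percent of 1 PiB is about 51 TiB, and a steady workflow can consume that quickly. A minimal sketch (the ingest rate is a made-up figure for illustration only):

```python
# Free space left at 95% utilization of a 1 PiB cluster,
# and how long a steady ingest would take to consume it.
PIB = 1024  # TiB per PiB

total_tib = 1 * PIB          # 1 PiB cluster
free_tib = total_tib * 0.05  # 5% free at 95% utilization

# Hypothetical ingest rate, for illustration only.
ingest_tib_per_day = 5.0
days_until_full = free_tib / ingest_tib_per_day

print(f"Free space: {free_tib:.1f} TiB")          # 51.2 TiB
print(f"Days until full: {days_until_full:.1f}")  # 10.2 days
```

At even a modest 5 TiB/day, that "plenty of space" is gone in under two weeks.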

Q: What are the ramifications of a full cluster?

A: When a cluster reaches full capacity, you’re dealing primarily with data-unavailability situations, where data might still be readable but not writable. For example, a customer can lose the ability to run SyncIQ policies, because those policies write data into the root file system (/ifs). They also can’t make cluster configuration changes, because those configurations are stored within /ifs.

Finally, a remove (rm) command for deleting files may not function when a cluster is completely full, requiring support intervention.

Q: What should a customer do immediately if their cluster is approaching 90-95 percent capacity?

A: Do whatever you can to slow down the ingesting or retention of data, including moving data to other storage tiers or other clusters, or adjusting snapshot policies. To gain a little bit of temporary space, make sure that VHS is not disabled.

Call your EMC account team to prepare for more storage capacity; do this at around 80-85 percent capacity. It takes time to get those nodes on-site, and you don’t want any downtime.
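The thresholds in this advice—order capacity at 80-85 percent, act urgently at 90-95 percent—can be encoded in a simple monitoring check. A sketch under those assumptions (the function name and messages are ours, not OneFS output):

```python
def capacity_action(used_fraction: float) -> str:
    """Map cluster utilization to the advice in this post:
    start the purchase conversation at 80%, act urgently at 90%."""
    if used_fraction >= 0.90:
        return "urgent: slow ingest, adjust snapshot policies, free space now"
    if used_fraction >= 0.80:
        return "plan: contact your account team to add capacity"
    return "ok: keep monitoring"

print(capacity_action(0.87))  # plan: contact your account team to add capacity
```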

VHS should always be enabled in SmartPools settings to set aside space for a drive failure. By default, VHS reserves space for 1 virtual drive, with reserved space set to 0 percent of total storage. For more information on these settings, see KB 88964 on the EMC Online Support site.
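As a rough planning model of what VHS sets aside: the reservation is driven by a number of virtual drives and a percentage of total storage. The sketch below assumes the larger of the two applies; this is a simplification for capacity planning, not OneFS’s exact accounting, and the drive size is illustrative:

```python
def vhs_reserved_tib(num_virtual_drives: int, drive_size_tib: float,
                     percent_of_total: float, total_tib: float) -> float:
    """Simplified model of the VHS reservation: the larger of
    (virtual drives x drive size) and (percent of total storage).
    An approximation for planning, not OneFS's exact logic."""
    by_drives = num_virtual_drives * drive_size_tib
    by_percent = total_tib * (percent_of_total / 100.0)
    return max(by_drives, by_percent)

# Defaults from the post: 1 virtual drive, 0% of total storage.
print(vhs_reserved_tib(1, 4.0, 0.0, 1024.0))  # 4.0 TiB held back
```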

Q: What are the most effective short-term solutions for managing or monitoring cluster capacity?

A: Quotas are an effective way to see real-time storage usage within a directory, particularly if you put directories in specific storage tiers or node pools. Leverage quotas wherever you can.

The TreeDelete job [in the Job Engine] can quickly delete data, but make sure that the data you’re deleting isn’t just going into a snapshot!
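The snapshot caveat matters because deleting a file that a snapshot still references does not free its blocks; they are charged to the snapshot until it expires. A toy model of that accounting:

```python
def space_freed_tib(file_size_tib: float, referenced_by_snapshot: bool) -> float:
    """Toy model: deleting a file frees space immediately only if no
    snapshot still references its blocks; otherwise the blocks move to
    snapshot usage until the snapshot expires or is deleted."""
    return 0.0 if referenced_by_snapshot else file_size_tib

print(space_freed_tib(2.0, False))  # 2.0 -- space comes back right away
print(space_freed_tib(2.0, True))   # 0.0 -- nothing freed until expiry
```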

Q: What are the most effective long-term solutions to implement from the best practices guide?

A: Make sure event notifications are properly configured, so that when jobs fail or drives fail, you’ll know it and can take immediate action. In addition to notifications and alerts, you can use Simple Network Management Protocol (SNMP) to monitor cluster space for an additional layer of protection.

InsightIQ and the FSAnalyze job [which the system runs to create data for InsightIQ’s file system analytics tools] can give great views into storage usage and change rate, over time, particularly in terms of daily, monthly, or weekly data ingest.

Q: Is there anything you would like to add?

A: Cluster-full situations where the rm command doesn’t work are sometimes alarming. In a file system such as OneFS, a file deletion often requires a read-modify-write cycle for metadata structures, in addition to the usual unlinking and garbage collection that occurs within the file system. Getting out of that situation can be challenging and sometimes time-consuming. Resolving it requires a support call—and a remote session, which can be a big problem for private clusters.

Sometimes accidents happen or a node can fail, which can push a cluster to the limit of capacity thresholds. Incidents such as these can occasionally lead to data unavailability situations that can halt a customer’s workflow. Being ready to add capacity at 80-85 percent can prevent just this sort of situation.

Start a conversation about Isilon content

Have a question or feedback about Isilon content? Visit the online EMC Isilon Community to start a discussion. If you have questions or feedback about this blog, or comments about the video specifically, contact us at isi.knowledge@emc.com. To provide documentation feedback or request new content, contact isicontent@emc.com.


New EMC Isilon support content for February 2015

Kirsten Gantenbein
Principal Content Strategist at EMC Isilon Storage Division

Are you interested in new and relevant EMC® Isilon® customer support content? Each month I’ll post a summary of newly published content for Isilon customers, as well as the top 10 most viewed knowledgebase articles.

New Isilon support content

Here are new customer support documents that were published in February 2015. For example, you’ll find new Info Hubs on the Isilon Community and new KB articles. Login to the EMC Online Support site is required for all content except Isilon Community content and videos.

  • Isilon Community (ECN) Blog: WORM for Hadoop on Isilon using SmartLock
  • Isilon Community (ECN) Info Hub: HD400 – Info Hub
  • Isilon Community (ECN) Info Hub: InsightIQ – Info Hub
  • Isilon Community (ECN) Info Hub: Updated Uptime – Info Hub
  • Release Notes: Isilon OneFS 7.0.2 MR Release Notes
  • KB Article: How to configure OneFS to allow for FTPS connections (174371)
  • KB Article: OneFS 7.1.1.2: SMB and Authentication Rollup Patch (196928)
  • KB Article: OneFS: The FTP Service (vsftpd) supports clear text authentication by default (197800)
  • White Paper: EMC Isilon Best Practices for Hadoop Data Storage (OneFS 7.2)

Most viewed knowledgebase (KB) articles

Check out February’s top 10 most viewed KB articles. Our top two articles help you plan and prepare for upgrading from OneFS 6.5.

  1. OneFS 7.0 and 7.1 Pre-Upgrade Check utility (89525)
  2. Patches to provide pre-upgrade configuration checks for OneFS 6.5.4 and 6.5.5.0 – 6.5.5.8 upgrades to OneFS 7.0 or 7.1 (88766)
  3. NEW Product Impacts of Upcoming Leap Second UTC adjustment on June 30th 2015 (197322)
  4. NEW OneFS 7.1.1.2: SMB and Authentication Rollup Patch (196928)
  5. Best practices for NFS client settings (90041)
  6. How to upload files to Isilon Technical Support (16759)
  7. How to reset the CELOG database and clear all historical events (16586)
  8. OneFS sysctl commands (89334)
  9. OneFS 6.5 and later: How to safely shut down an Isilon cluster prior to a scheduled power outage (16529)
  10. Master Article: Remote Proactive (RCM) (193448)


Tell us what you want to know! Contact us with questions or feedback about this blog at isi.knowledge@emc.com. To provide documentation feedback or request new content, contact isicontent@emc.com.


Decode the EMC Isilon HD400 Node

Kirsten Gantenbein
Principal Content Strategist at EMC Isilon Storage Division

If you’re looking to archive several petabytes of data, you might want to consider the new EMC Isilon HD400 node. It’s the highest-capacity Isilon node currently available: you can store up to 354 TB of data on 59 hard drives in 4U of rack space.
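The 354 TB figure squares with the drive count. A quick check, assuming the 6 TB hard drives the HD400 shipped with:

```python
drives = 59
drive_size_tb = 6  # assumed per-drive capacity for the HD400
print(drives * drive_size_tb)  # 354
```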

To learn more about the external and internal components of the HD400 node, watch the video, Decode the Node: EMC Isilon HD400.

HD400 and OneFS 7.2

The HD400 node was released with the Isilon OneFS operating system 7.2. The OneFS 7.2 release provides support for the following items required for the HD400:

  • New protection levels: New requested protection levels are available in OneFS 7.2 to account for the increased capacity of HD400 nodes. For example, the default protection level for node pools on the HD400 node is “3d:1n1d,” which means that data is protected in case 3 drives fail or if 1 node and 1 drive fail.
  • L3 cache: The HD400 node includes 800 GB of solid-state drive (SSD) storage, which is primarily used for L3 cache. This helps to reduce cache cycling times to improve system performance. For more information, see the L3 Cache Overview topic in the OneFS 7.2 Web Administration Guide or OneFS 7.2 CLI Administration Guide.
  • New drive layout: Disks in the HD400 are arranged in a grid orientation because the drives are inserted top-down into the node chassis. To view the new grid orientation in the OneFS 7.2 web administration interface, go to Dashboard > Cluster Status and click on the ID number. This will take you to the Node Status view, where you can scroll down to view the grid orientation.

    HD400 grid in the OneFS 7.2 web administration interface.

If you have questions about the HD400 node, join the Ask the Expert session on HD400 and OneFS 7.2 that continues through March 8, 2015 in the Isilon Community. Look through the discussion thread for useful information. Or post a question, and you’ll get answers from Isilon hardware and software experts, partners, and customers.

HD400 documentation

If you’re looking for an HD400 specification sheet, a hardware installation guide, or an HD400 field replacement unit (FRU) video, visit the HD400 Info Hub in the Isilon Community. This information hub curates links to the latest and most relevant documentation for installing, maintaining, and servicing HD400 nodes.

The HD400 Info Hub on the Isilon Community.

Start a conversation about Isilon content

Have a question or feedback about Isilon content? Visit the online EMC Isilon Community to start a discussion. If you have questions or feedback about this blog, or comments about the video specifically, contact us at isi.knowledge@emc.com. To provide documentation feedback or request new content, contact isicontent@emc.com.
