If you’ve started using Isilon’s Self-Service Platform, you know how useful its health checks and troubleshooting assistance can be. If you aren’t familiar with the Self-Service Platform, see our previous blog post.
EMC Isilon’s Serviceability team is continually working to improve the tool by adding new capabilities. For example, one new feature in version 1.5.6 checks for outdated patches installed on the cluster. Another new feature tells you whether any EMC Technical Advisories (ETAs) might affect your cluster.
To make sure that you have the most up-to-date version of the tool, visit the Self-Service Platform Info Hub. On the Info Hub, you can download the tool, read the latest documentation, and start a discussion with the development team.
As always, we look forward to hearing from you about your experience.
Let us know!
Let us know what you think. If you have feedback for us about this or any other Isilon technical content, email us at email@example.com. And thank you!
Do you have an opinion about the technical content that EMC Isilon publishes? The EMC Isilon Information Experience team, which produces documentation, release notes, videos, white papers, and more, wants to hear from you.
Let us know how we’re doing. RSVP for our Ask the Expert event on Isilon Product Community, starting July 27, 2015 and continuing through August 7. During this event, you can submit your questions, opinions, and ideas to a forum discussion thread. Answers will be submitted by the Isilon Information Experience team.
What is the “Ask the Expert” forum?
Ask the Expert (ATE) events are regularly scheduled forums that cover many topics and products. Previous ATE events include Scale-out Data Lakes and SMB Protocol Support. In this special session, content professionals, including our Director of Information Experience, our blogger and social media lead, and several content developers, will answer the questions we receive from you.
You can ask us about anything related to our technical content, such as:
How can I be notified about the latest Isilon content?
How do you decide what content to publish?
How do I share my idea for a great paper/blog/article with you?
What is an Info Hub and why should I care?
What’s in it for you?
The EMC Isilon Information Experience team will post a summary of our ATE session findings. It will contain a roadmap for when you might expect to see the changes you request, if we can accommodate them, and an honest answer if we cannot.
For years, the global economy has been in transition from goods, to information, to knowledge. In particular, the need for trust grows as customers interact with content more often through more digital platforms and channels. Knowledge is now currency AND product. We recognize that our first contact with you may be through content, and we need to build trust through content.
The best way we can build trust with you is to exchange ideas, and the EMC Isilon Ask the Expert event on technical content is a great way to start the conversation. We hope to talk to you soon!
However, there are common misperceptions about cluster capacity, such as the notion that it’s easy to delete data from a cluster that is 100 percent full. Another misunderstanding: using Virtual Hot Spare (VHS) to reserve space for smartfailing a drive is not always necessary.
To clarify these issues and other concerns about cluster capacity, I interviewed one of Isilon’s top experts on this topic, Bernie Case. Bernie is a Technical Support Engineer V in Global Services at Isilon, with many years of experience working with customers who experience maximum cluster capacity scenarios. He is also a contributing author to the Best Practices for Maintaining Enough Free Space on Isilon Clusters and Pools guide. In this blog post, Bernie answers questions about cluster capacity and provides advice and solutions.
Q: What are common scenarios in the field that lead to a cluster reaching capacity?
A: The typical scenarios involve increased data ingest, which can come from either a normal or an unexpected workflow. If you’re adding a new node or replacing nodes to add capacity, and it takes longer than expected, a normal workflow will continue to write data into the cluster, possibly causing the cluster to reach capacity. Or a drive or node fails on an already fairly full cluster, so the Job Engine must run a FlexProtect (or FlexProtectLin) job to re-protect data, which interrupts normal SnapshotDelete jobs. [See EMC Isilon Job Engine to learn more about these jobs.] Finally, I’ve seen snapshot policies that create a volume of snapshots that takes a long time to delete even after snapshot expiration. [See Best Practices for Working with Snapshots for snapshot schedule tips.]
Q: What are common misperceptions about cluster capacity?
A: Some common misconceptions include:
95 percent of a 1 PiB cluster still leaves about 50 TiB of space. That’s plenty for our workflow. We won’t fill that up.
Filling up one tier and relying on spillover to another tier won’t affect performance.
The SnapshotDelete job should be able to keep up with our snapshot creation rate.
Virtual Hot Spare (VHS) is not necessary in our workflow; we need that space for our workflow.
It’s still very easy to delete data when the cluster is 100 percent full.
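The arithmetic behind the first misconception is easy to check. The following is a minimal sketch in Python that uses binary units (1 PiB = 1024 TiB); the function name is hypothetical and chosen only for illustration.

```python
# Binary units: 1 PiB = 1024 TiB.
TIB_PER_PIB = 1024


def free_space_tib(total_pib: float, percent_used: float) -> float:
    """Return the free space, in TiB, of a cluster that is percent_used full."""
    if not 0 <= percent_used <= 100:
        raise ValueError("percent_used must be between 0 and 100")
    return total_pib * TIB_PER_PIB * (100 - percent_used) / 100


if __name__ == "__main__":
    # Show how quickly the remaining space shrinks on a 1 PiB cluster.
    for percent in (85, 90, 95, 98):
        print(f"{percent}% used: {free_space_tib(1, percent):.1f} TiB free")
```

At 95 percent the sketch reports roughly 51 TiB free, which matches the "about 50 TiB" figure. Whether that is "plenty" depends entirely on how fast the workflow writes data.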
Q: What are the ramifications of a full cluster?
A: When a cluster reaches full capacity, you’re dealing primarily with data unavailability situations, where data might be readable but not writable. For example, a customer might be unable to run SyncIQ policies, because those policies write data into the root file system (/ifs). You also can’t make cluster configuration changes, because those configurations are stored within /ifs.
Finally, a remove (rm) command for deleting files may not function when a cluster is completely full, requiring support intervention.
Q: What should a customer do immediately if their cluster is approaching 90-95 percent capacity?
A: Do whatever you can to slow down the ingesting or retention of data, including moving data to other storage tiers or other clusters, or adjusting snapshot policies. To gain a little bit of temporary space, make sure that VHS is not disabled.
Call your EMC account team to prepare for more storage capacity. You should do this at around 80-85 percent capacity. It does take time to get those nodes on-site, and you don’t want any downtime.
VHS options should always be selected to set aside space for a drive failure. At a minimum, reserve 1 virtual drive (the default value), with the percent-of-total-storage setting at 0%. For more information on these default values, see KB 88964 on the EMC Online Support site.
Q: What are the most effective short-term solutions for managing or monitoring cluster capacity?
A: Quotas are an effective way to see real-time storage usage within a directory, particularly if you put directories in specific storage tiers or node pools. Leverage quotas wherever you can.
The TreeDelete job [in the Job Engine] can quickly delete data, but make sure that the data you’re deleting isn’t just going into a snapshot!
Q: What are the most effective long-term solutions to implement from the best practices guide?
A: Make sure you have event notifications properly configured, so that when jobs fail, or drives fail, you’ll know it and can take immediate action. In addition to notifications and alerts, you can use Simple Network Management Protocol (SNMP) to monitor cluster space, for an additional layer of protection.
InsightIQ and the FSAnalyze job [which the system runs to create data for InsightIQ’s file system analytics tools] can give great views into storage usage and change rate, over time, particularly in terms of daily, monthly, or weekly data ingest.
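To illustrate why change rate matters, here is a minimal Python sketch that estimates average daily growth from a few usage samples and projects how many days remain before a chosen threshold. It is not how InsightIQ computes anything; the function names, the sample data, and the simple straight-line projection are all assumptions made for illustration.

```python
from datetime import date


def average_daily_growth(samples):
    """Return average growth in TiB per day between the first and last sample.

    samples is a list of (date, used_tib) tuples ordered by date.
    """
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    elapsed_days = (last_day - first_day).days
    if elapsed_days <= 0:
        raise ValueError("samples must span at least one day")
    return (last_used - first_used) / elapsed_days


def days_until_threshold(used_tib, total_tib, growth_tib_per_day,
                         threshold_percent=85.0):
    """Return the days left before usage crosses threshold_percent of capacity."""
    threshold_tib = total_tib * threshold_percent / 100
    if used_tib >= threshold_tib:
        return 0.0
    if growth_tib_per_day <= 0:
        return float("inf")  # usage is flat or shrinking
    return (threshold_tib - used_tib) / growth_tib_per_day


if __name__ == "__main__":
    # Hypothetical usage samples for a 1 PiB (1024 TiB) cluster.
    samples = [(date(2015, 1, 1), 700.0), (date(2015, 1, 31), 760.0)]
    growth = average_daily_growth(samples)
    remaining = days_until_threshold(760.0, 1024.0, growth)
    print(f"Growth: {growth:.1f} TiB/day, about {remaining:.0f} days to 85%")
```

A straight-line projection like this ignores bursts and seasonal ingest, so treat the result as a rough planning figure rather than a forecast.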
Q: Is there anything you would like to add?
A: Cluster-full situations where the rm command doesn’t work are sometimes alarming. In a file system such as OneFS, a file deletion often requires a read-modify-write cycle for metadata structures, in addition to the usual unlinking and garbage collection that occurs within the file system. Getting out of that situation can be challenging and sometimes time-consuming. Resolving it requires a support call—and a remote session, which can be a big problem for private clusters.
Sometimes accidents happen or a node can fail, which can push a cluster to the limit of capacity thresholds. Incidents such as these can occasionally lead to data unavailability situations that can halt a customer’s workflow. Being ready to add capacity at 80-85 percent can prevent just this sort of situation.
Start a conversation about Isilon content
Have a question or feedback about Isilon content? Visit the online EMC Isilon Community to start a discussion. If you have questions or feedback about this blog, or comments about the video specifically, contact us at firstname.lastname@example.org. To provide documentation feedback or request new content, contact email@example.com.
[Editor’s note: Two corrections about the number of serial ports and the front panel light were made to the original Decode the Node videos published on 7/31, and both videos were republished on 8/18 and 8/19. This article was updated to include links to the latest videos.]
To become better acquainted with these Isilon nodes, we have new videos where we decode the Isilon S210 node and the Isilon X410 node. Each video provides the node’s basic specifications, and shows you what all of the node’s components look like, inside and out, including:
Front of the node, with and without the front panel
Back of the node illustrating power supplies, power button, LEDs, and each connection port
Inside of the node
Wonder what’s inside the node?
We show you all the components located inside the node to satisfy your curiosity and increase your understanding about what makes the node run. We open the node for you, because the node must never be opened except by an EMC certified Customer Support Engineer.
Sometimes node parts have to be replaced. For example, if EMC Isilon Technical Support determines that an external part has failed (such as a power supply, a hard drive, or the node’s front panel) and the node does not need to be opened, they will send a new part and you can replace the part yourself. If an internal node component fails (such as a PCIe card, DIMMs, or a fan) and the node needs to be opened, then only an EMC certified Customer Support Engineer may open the node to replace this type of component. You should never open the node and replace internal components yourself. Isilon Technical Support will schedule the replacement procedure for an internal part at your convenience.
The following new features and enhancements help improve performance for most OneFS workflows:
SMB Multichannel support
OneFS 7.1.1 supports the Multichannel feature of SMB 3.0, which establishes a single SMB session over multiple network connections. SMB Multichannel enables increased throughput, connection failure tolerance, and automatic discovery. To take advantage of this new feature, client computers must be configured with Microsoft Windows 8 or later, or Microsoft Windows Server 2012 or later with supported network interface cards (NICs). For more information, see the SMB Multichannel section of the OneFS 7.1.1 Web Administration Guide and OneFS 7.1.1 CLI Administration Guide.
SmartFlash caching
In OneFS, level 1 (L1) cache uses random access memory (RAM) to store copies of system metadata and files requested from front-end networks. Level 2 (L2) cache uses RAM to store copies of file system metadata for files that are stored on the node that owns the data. SmartFlash, or level 3 (L3) cache, uses solid-state drives (SSDs) to hold file data and metadata released from L2 cache, increasing the total size of cache memory available in a cluster as well as the speed at which you can retrieve data. In OneFS 7.1.1, SmartFlash is enabled by default for new node pools.
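The idea of a larger, slower tier that catches data released from a smaller, faster one can be shown with a toy model. The Python sketch below is not OneFS code and makes no claim about how OneFS implements caching; the class names, the LRU eviction policy, and the capacities are all assumptions chosen to illustrate the concept.

```python
from collections import OrderedDict

_MISSING = object()  # sentinel that distinguishes "not cached" from a cached None


class LruTier:
    """A fixed-size cache tier that evicts its least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return _MISSING
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        """Store an entry and return the (key, value) pairs that were evicted."""
        self.entries[key] = value
        self.entries.move_to_end(key)
        evicted = []
        while len(self.entries) > self.capacity:
            evicted.append(self.entries.popitem(last=False))
        return evicted


class TwoTierCache:
    """Toy model: a small RAM tier (L2) backed by a larger SSD tier (L3)."""

    def __init__(self, l2_capacity, l3_capacity):
        self.l2 = LruTier(l2_capacity)
        self.l3 = LruTier(l3_capacity)

    def read(self, key, load_from_disk):
        """Return (value, source), where source is 'L2', 'L3', or 'disk'."""
        value = self.l2.get(key)
        if value is not _MISSING:
            return value, "L2"
        value = self.l3.get(key)
        source = "L3"
        if value is _MISSING:
            value = load_from_disk(key)
            source = "disk"
        # Entries released from L2 drop into L3 instead of being discarded.
        for old_key, old_value in self.l2.put(key, value):
            self.l3.put(old_key, old_value)
        return value, source
```

In this model, a block that falls out of the RAM tier can still be served from the SSD tier on the next read, which avoids a trip to spinning disk.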
NDMP backup performance improvements
OneFS 7.1.1 uses multiple threads to restore files, making data transfer occur as fast as the tape backup device can deliver it. Additional operational enhancements improve throughput when transferring small files.
SyncIQ® performance enhancements
To allow multiple SyncIQ workers to replicate a single file simultaneously, SyncIQ now allows for file splitting, where a large file is split into segments, each of which is processed in parallel by a different thread.
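The general technique of splitting a large file into segments and processing them in parallel can be sketched in a few lines of Python. This is only an illustration of the concept, not SyncIQ's implementation; the function names, segment size, and worker count are assumptions.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_ranges(file_size, segment_size):
    """Return (offset, length) pairs that together cover file_size bytes."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [(offset, min(segment_size, file_size - offset))
            for offset in range(0, file_size, segment_size)]


def copy_segment(src_path, dst_path, offset, length):
    """Copy one byte range from the source file to the same offset in the destination."""
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        src.seek(offset)
        dst.seek(offset)
        dst.write(src.read(length))


def parallel_copy(src_path, dst_path, segment_size=4 * 1024 * 1024, workers=4):
    """Copy a file by handing each segment to a worker thread."""
    file_size = os.path.getsize(src_path)
    # Pre-size the destination so each worker can write at its own offset.
    with open(dst_path, "wb") as dst:
        dst.truncate(file_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(copy_segment, src_path, dst_path, offset, length)
                   for offset, length in split_ranges(file_size, segment_size)]
        for future in futures:
            future.result()  # re-raise any error from a worker
```

Because every segment has a fixed offset, workers never write to overlapping regions and need no coordination beyond the initial split.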
Security and access zone enhancements
The following enhancements have been made to increase security and support Hadoop workflows:
Access zone enhancements
Access zones have been restructured to enforce best practices and improve security. In OneFS 7.1.1, a root or a base directory must be designated for each access zone. SMB shares must subscribe to a single access zone, and access zones can no longer be used to share data. OneFS 7.1.1 also prevents access to non-system zones through NFS, SSH, and the OneFS web administration interface.
To support security for Hadoop workflows and enable multiple unstructured datasets to be hosted on a single cluster, access zones now support an HDFS namespace per access zone. This means that you can now run multiple separate HDFS namespaces on the same cluster. Stay tuned for an upcoming ISI Knowledge blog post on this topic.
Self-encrypting drive enhancements
This release of OneFS expands the availability of self-encrypting drives (SEDs) to provide data at-rest encryption capabilities across the entire node family. In addition to the 3TB and 4TB SEDs, OneFS 7.1.1 introduces a 900GB SAS SED HDD for S-Series nodes and an 800GB SED SSD for all supported nodes. For details, see the Isilon Product Availability Guide.
Auditing enhancements
In OneFS 7.1.1, audit system configuration information can be forwarded to the audit log file for storage and analysis.
Role based access control enhancements
New privileges have been added to the role based access control (RBAC) feature in OneFS 7.1.1, such as ISI_PRIV_IFS_BACKUP and ISI_PRIV_IFS_RESTORE. These privileges can be assigned to roles that enable users to back up and restore files that they don’t have explicit permissions to.
Manageability and drive firmware updates
The following OneFS 7.1.1 features make it easier to manage your Isilon cluster and obtain the latest drive firmware:
Microsoft Windows administrators with the correct privileges can remotely administer a share through the MMC shared folders snap-in feature. This enables an administrator to connect to an access zone and directly manage all shares within that zone. To take advantage of this functionality, the Isilon cluster must be joined to an Active Directory domain from which the MMC console can be invoked.
Drive Support Package for non-disruptive drive firmware updates
Drive support packages determine and apply updates for the drive’s firmware automatically, and eliminate the need to apply a patch and reboot the node when you replace or add drives. You can also configure alerts to indicate when you need to update your drive firmware. Review the Isilon Drive Support Package 1.0 release notes for information about system requirements and installation instructions.
For complete details about all of the OneFS 7.1.1 features and enhancements, including changes in functionality, fixed issues, and known issues in this release, refer to the OneFS 7.1.1 Release Notes.
OneFS 7.1.1 documentation and new guides available
OneFS Migration Tools Guide
This guide describes how to migrate data from NetApp filers and EMC VNX storage systems to EMC Isilon clusters using the isi_vol_copy and isi_vol_copy_vnx tools.
OneFS API Reference
This guide—combining the former Platform and RAN API References—describes how the Isilon OneFS application programming interface (API) provides access to configure the cluster and access the data on the cluster. This guide also provides a list of all available API resource URLs, HTTP methods, and parameter and object descriptions.
When you experience technical difficulties with your EMC® Isilon® cluster, you want to quickly find the source of the issue and resolve it. Some issues, such as data integrity errors, require immediate attention from EMC Isilon Technical Support. However, there are issues that you can effectively troubleshoot yourself.
Learn the techniques to become more effective at troubleshooting. Tim Wright, Technical Support Engineer, will cover specific troubleshooting scenarios and tools during the EMC World 2014 session, “Advanced Troubleshooting of EMC Isilon Clusters.” If you’re attending EMC World 2014 in Las Vegas, Nevada, you can attend his sessions on the following dates:
May 5, 3:00 PM – 4:00 PM
May 8, 10:00 AM – 11:00 AM
Tim’s session will cover:
Understanding OneFS architecture
Understanding the types of problems you may encounter
File system, such as space issues, correct protection levels, and snapshot count and schedules
For more information about session dates, times, and locations, visit the Session Catalog on the EMC World 2014 website. If you are unable to attend EMC World 2014, let us know which troubleshooting issues you would like to learn more about on this blog by sending an email to firstname.lastname@example.org.
It’s important to maintain enough free space on your EMC® Isilon® cluster to ensure that data is protected and workflows are not disrupted. At a minimum, you should have one node’s worth of free space available in case you need to protect data on a failing drive.
When your Isilon cluster fills up to more than 90% capacity, cluster performance is affected. Several issues can occur when your cluster fills up to 98% capacity, such as substantially slower performance, failed file operations, the inability to write or delete data, and the potential for data loss. It might take several days to resolve these issues. If you have a full cluster, nearly full cluster, or need assistance with maintaining enough free space, contact EMC Isilon Technical Support.
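The capacity levels described above can be turned into a simple check. The following Python sketch is only an illustration: the status labels and function names are hypothetical, the 80 percent planning level reflects the advice elsewhere on this blog to arrange more capacity at around 80-85 percent, and reading usage with `shutil.disk_usage` assumes the cluster is mounted on the machine that runs the script.

```python
import shutil

# Percent-used levels taken from the guidance on this blog.
PLAN_EXPANSION_PERCENT = 80.0  # start arranging additional capacity
DEGRADED_PERCENT = 90.0        # performance is affected above this level
CRITICAL_PERCENT = 98.0        # failed operations and potential data loss


def capacity_status(percent_used: float) -> str:
    """Map a percent-used value to a coarse status label."""
    if percent_used >= CRITICAL_PERCENT:
        return "critical"
    if percent_used > DEGRADED_PERCENT:
        return "degraded"
    if percent_used >= PLAN_EXPANSION_PERCENT:
        return "plan-expansion"
    return "ok"


def percent_used(path: str) -> float:
    """Return the percent of capacity used by the file system that holds path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    # "." is a stand-in; point this at wherever the cluster is mounted.
    current = percent_used(".")
    print(f"{current:.1f}% used: {capacity_status(current)}")
```

A check like this complements, but does not replace, the event notifications and InsightIQ reports described in this post.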
To prevent your cluster from becoming too full, monitor your cluster capacity. There are several ways to do this. For example, you can configure email event notification rules in the EMC Isilon OneFS® operating system to notify you when your cluster is reaching capacity. Watch the video “How to Set Up Email Notifications in OneFS When a Cluster Reaches Capacity” for a demonstration of this procedure.
Another way to monitor cluster capacity is to use EMC Isilon InsightIQ™ software. If you have InsightIQ licensed on your cluster, you can run FSAnalyze jobs in OneFS to create data for InsightIQ’s file system analytics tools. You can then use InsightIQ’s Dashboard and Performance Reporting to monitor cluster capacity. For example, Performance Reports enable you to view information about the activity of the nodes, networks, clients, disks, and more. The Storage Capacity section of a performance report displays the used and total storage capacity for the monitored cluster over time (Figure 1).
Figure 1: The Storage Capacity section of a Performance Report in InsightIQ 3.0.
For more information about InsightIQ Performance Reports, see the InsightIQ User Guides, which can be found on the EMC Online Support site.
Make sure all nodes in a node pool or disk pool are compatible
If you have a node pool that contains a mix of different node capacities, you can receive “cluster full” errors even if only the smallest node in your node pool reaches capacity. To avoid this scenario, ensure that nodes in each node pool or disk pool are of compatible types. Read the best practices guide for information about node compatibility and for a procedure to verify that all nodes in each node pool are compatible.
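A deliberately simplified model shows why the smallest node can become the limit. The Python sketch below assumes that writes are spread evenly across all nodes in a pool; that is an assumption made for illustration, not a description of the OneFS layout algorithm, and the function names and capacities are hypothetical.

```python
def usable_capacity_even_striping(node_capacities_tib):
    """Return the pool capacity reachable before the smallest node is full.

    Assumes every node receives the same amount of data.
    """
    if not node_capacities_tib:
        raise ValueError("at least one node is required")
    return min(node_capacities_tib) * len(node_capacities_tib)


def stranded_capacity(node_capacities_tib):
    """Return the raw capacity left unused when the smallest node fills up."""
    usable = usable_capacity_even_striping(node_capacities_tib)
    return sum(node_capacities_tib) - usable


if __name__ == "__main__":
    mixed_pool = [36, 36, 72]  # hypothetical node sizes in TiB
    print("Raw capacity:", sum(mixed_pool), "TiB")
    print("Usable before the smallest node fills:",
          usable_capacity_even_striping(mixed_pool), "TiB")
    print("Stranded:", stranded_capacity(mixed_pool), "TiB")
```

Under this model, a pool of two 36 TiB nodes and one 72 TiB node reports full with a quarter of its raw capacity still unused, which is why compatible node types matter.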
Enable Virtual Hot Spare
Virtual Hot Spare (VHS) keeps space in reserve in case you need to move data off of a failing drive (smartfail). VHS is enabled by default. For more information about VHS, read the knowledgebase article, “OneFS: How to enable and configure Virtual Hot Spare (VHS) (88964)” (requires login to the EMC Online Support site).
Spillover allows data that is being sent to a full pool to be diverted to an alternate pool. If you have licensed EMC Isilon SmartPools™ software, you can designate a spillover location. For more information about SmartPools, read the OneFS Web Administration Guide.
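The decision that spillover makes can be sketched as follows. This Python example is a conceptual illustration only, not SmartPools behavior in detail; the `Pool` class, the `place_write` function, and the pool names and sizes are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pool:
    name: str
    total_tib: float
    used_tib: float

    def free_tib(self) -> float:
        return self.total_tib - self.used_tib


def place_write(size_tib: float, target: Pool,
                spillover: Optional[Pool] = None) -> Pool:
    """Return the pool that receives the write, preferring the target pool."""
    for pool in (target, spillover):
        if pool is not None and pool.free_tib() >= size_tib:
            pool.used_tib += size_tib
            return pool
    raise OSError("no space left in the target pool or the spillover pool")


if __name__ == "__main__":
    fast = Pool("fast", total_tib=10.0, used_tib=9.5)
    archive = Pool("archive", total_tib=100.0, used_tib=20.0)
    # The 1 TiB write does not fit in the nearly full target pool.
    print(place_write(1.0, fast, archive).name)
```

Without a spillover pool, the same write fails outright once the target pool is full.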
If you want to scale out your storage to add more free space, contact your sales representative.
If you have questions or feedback about this blog or the video described in it, send an email to email@example.com. To provide documentation feedback or request new content, send an email to firstname.lastname@example.org.
To open a service request for EMC® Isilon® Technical Support, you’ll need to provide a node serial number. You can easily retrieve this information from the back of the node (or the front of A100 nodes), from the OneFS web administration interface, or from the OneFS command-line interface.