Powerful Data Is Shared Data―Conquering the Silo Sprawl

James Petter, VP EMEA, Pure Storage

According to IDC, spending on data-intensive AI systems in the Middle East & Africa (MEA) region will grow at a CAGR of 32 percent between 2016 and 2021, reaching US$114.22 million in 2021. Projects range from automated customer service agents and shopping and product recommendations to health and safety use cases such as automated cyber-threat detection and AI-powered medical research, diagnosis and treatment.

Some have referred to this opportunity as the “Fourth Industrial Revolution.” That’s a massive understatement. The last industrial revolution was driven by the assembly line―a feat of strategic engineering that helped build a car faster. Today we’re talking about feats of engineering that allow cars to drive themselves. It’s less apples to oranges and more apples to atoms! A new generation of tools, fuelled by an ability to ingest, store and deliver unprecedented amounts of data, is driving a tidal wave of innovation previously relegated to the realm of science fiction.

Data’s role in the future of business cannot be overstated. According to a survey conducted by MIT Technology Review and commissioned by Pure Storage, an overwhelming 87% of leaders across MEA say data is the foundation for making business decisions, and 80% believe it is key to delivering results for customers. But acknowledging the importance of data and putting data to work are two separate things. To put the latter in perspective, a recent study conducted by Baidu showed its dataset needed to increase by a factor of 10 million in order to lower its language model’s error rate from 4.5% to 3.4%. That’s 10,000,000x more data for roughly one percentage point of progress.
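To see just how steep that trade-off is, here is a back-of-the-envelope calculation. It assumes, purely for illustration, that error falls as a power law in dataset size; the exponent and the extrapolation below are derived from the two figures above, not taken from Baidu’s published analysis.

```python
import math

# Fit error ~ N^(-alpha) to the two cited data points: the error rate
# falls from 4.5% to 3.4% when the dataset grows by a factor of 10 million.
# The power-law form is an illustrative assumption, not Baidu's model.
growth = 10_000_000           # dataset growth factor
e_before, e_after = 4.5, 3.4  # error rates, in percent

alpha = math.log(e_before / e_after) / math.log(growth)
print(f"implied scaling exponent: {alpha:.4f}")  # ~0.017

# At that exponent, halving the error again would take another
# 2**(1/alpha) times more data -- an astronomical amount.
print(f"data growth needed to halve error: {2 ** (1 / alpha):.2e}")  # ~2e17
```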

All this research points to one thing—to innovate and survive in a business environment that is increasingly data-driven, organizations must design their IT infrastructure with data in mind and have complete, real-time access to that data.

Unfortunately, mainstream storage solutions were designed for the world of disk and have historically helped create silos of data. There are four classes of silos in the world of modern analytics―the data warehouse, the data lake, streaming analytics, and AI clusters. A data warehouse requires massive throughput. Data lakes deliver a scale-out architecture for storage. Streaming analytics goes beyond the batch jobs of a data lake, requiring storage to deliver multi-dimensional performance regardless of data size (small or large) or I/O type (random or sequential). Finally, AI clusters, powered by tens of thousands of GPU cores, require storage to be massively parallel as well, servicing thousands of clients and billions of objects without data bottlenecks.

As a consequence, too much data today remains stuck in a complex sprawl of silos. Each is useful for its original task, but in a data-first world, silos are counter-productive. Siloed data cannot do work for the business unless it is being actively managed and moved between systems.

Modern intelligence requires a data hub—an architecture designed not only to store data, but to unify, share and deliver data. Unifying and sharing data means that the same data can be accessed by multiple applications at the same time with full data integrity. Delivering data means each application has the full performance of data access that it requires, at the speed of today’s business.
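As a concrete illustration, the sketch below shows what “unify and share” could look like from an application’s point of view: one job writes a result through a file interface while another reads the same bytes as an S3 object, with no copy or export step in between. The mount path, endpoint and bucket names are hypothetical placeholders, and whether the file and object views coincide this way depends on the platform.

```python
import boto3  # standard AWS-style S3 client; the endpoint below is a placeholder

# Application A: an analytics job writes results through a file
# interface (e.g., an NFS mount exposed by the storage platform).
with open("/mnt/datahub/results/daily.csv", "w") as f:
    f.write("region,revenue\nEMEA,1250000\n")

# Application B: a cloud-native service reads the very same data as an
# S3 object, with full integrity and no intermediate copy pipeline.
s3 = boto3.client("s3", endpoint_url="https://datahub.example.com")
obj = s3.get_object(Bucket="results", Key="daily.csv")
print(obj["Body"].read().decode())
```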

A data hub is a data-centric storage architecture that powers data analytics and AI. It is built on four foundational elements:

  • High throughput for both file and object storage. Backup and data warehouse appliances require massive throughput across mainstream, file-based workloads and cloud-native, object-based applications.
  • True scale-out design. The power of a data lake is its native scale-out architecture, which allows batch jobs to scale limitlessly while software, not the user, manages resiliency and performance.
  • Multi-dimensional performance. Data is unpredictable and can arrive at any speed—therefore, organizations need a platform that can process any data type with any access pattern.
  • Massively parallel. The computing industry has seen a drastic shift from serial to parallel technologies built to mimic the human brain, and storage must keep pace by serving many concurrent clients with mixed access patterns (a toy sketch follows this list).
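The sketch below is a toy illustration of the last two requirements: many workers hit the same dataset concurrently, half issuing small random reads and half issuing large sequential ones. The file path and sizes are invented for the example; a real evaluation would use a dedicated benchmarking tool such as fio.

```python
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor

PATH = "/mnt/datahub/sample.bin"  # hypothetical dataset on the platform
SIZE = os.path.getsize(PATH)

def worker(i: int) -> None:
    with open(PATH, "rb") as f:
        if i % 2:  # odd workers: 4 KiB reads at random offsets
            for _ in range(1_000):
                f.seek(random.randrange(0, max(1, SIZE - 4096)))
                f.read(4096)
        else:      # even workers: 1 MiB sequential reads to end of file
            while f.read(1 << 20):
                pass

# A platform built for modern analytics should sustain this mixed,
# concurrent load without one access pattern starving the other.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(worker, range(32)))
print(f"32 mixed-pattern workers finished in {time.perf_counter() - start:.1f}s")
```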

A true data hub must have all four qualities, as all four are essential to unifying data. A data hub may have other features, like snapshots and replication, but if any of the four is missing from a storage platform, it isn’t built for today’s challenges and tomorrow’s possibilities. For example, if a storage system delivers high-throughput file access and is natively scale-out, but needs a second system with S3 object support for cloud-native workloads, then the unification of data is broken and the velocity of data is crippled. It is not a data hub.

For organizations that simply want to keep data stored, a data hub does not replace data warehouses or data lakes. But for those looking to unify and share their data across teams and applications, a data hub identifies the key strengths of each silo, integrates their unique features, and provides a single, unified platform for the business.

Think of storage like a bank or an investment. We put our money in banks or the stock market because we want it to work for us. Modern organizations need to do the same with data, and they should speak to their preferred vendors to see how they can help.