Even more so than in 2012, so-called “Big Data” will increasingly be an overarching theme in all IT sectors and verticals. Humans produce 2.5 quintillion bytes (2.5×10¹⁸) of data every day and that number is on the rise.
Fortunately, innovative solutions for storing all of this data go hand in hand with virtualization, improving performance and simplifying management tasks for IT professionals. In particular, data deduplication, data virtualization, and data management virtualization, though very different from server and desktop virtualization, provide huge benefits to businesses struggling with data overload, especially when they are already leveraging traditional virtualization technologies.
The concept of data deduplication was first popularized and brought to scale by the company Data Domain (now part of storage giant EMC). The idea is that data is frequently repeated across storage media, wasting space and increasing the time and bandwidth required for backups and file transfers. Data deduplication algorithms identify large matching byte patterns which may represent files, parts of files, or even entire directories, and store only a single copy of such patterns with all other occurrences replaced by a logical reference to the original.
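The core mechanism can be sketched in a few lines. The following is a toy illustration, not any vendor's actual implementation: data is split into fixed-size blocks, each block is fingerprinted, and a block whose fingerprint has been seen before is stored as a reference rather than a second copy. (Production systems typically use variable-size, content-defined chunking; fixed blocks are used here only for brevity.)

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed-size chunking


class DedupStore:
    """Toy content-addressed store: each unique block is kept once,
    and each file is recorded as an ordered list of block fingerprints."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes (stored exactly once)
        self.files = {}   # filename -> list of fingerprints

    def put(self, name, data):
        refs = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            # store the block only if this byte pattern is new;
            # repeats become logical references to the original
            self.blocks.setdefault(fp, block)
            refs.append(fp)
        self.files[name] = refs

    def get(self, name):
        # reassemble the file by following the references
        return b"".join(self.blocks[fp] for fp in self.files[name])
```

Storing two largely identical files (say, a VM image and its lightly modified backup) through this store keeps only one physical copy of each shared block, which is exactly where the space savings come from.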
This becomes quite significant for virtualized environments where entire virtual machines may be backed up or copied regularly, and where incremental backups of VMs can contain significant repetition. Similarly, transactional data and other large, dynamic data stores are often replicated in near real time to allow reporting and analytics on non-production data without the delays caused by traditional approaches to ETL/backup. In any of these cases, the storage overhead can become quite significant and benefit considerably from deduplication.
Just as machine virtualization abstracts the hardware for a given operating system, data virtualization abstracts the physical volumes and data stores in which data might reside. It allows reporting, analytics, and data mining tools to view disparate sources of data as a single logical data store. Thus, a weekly sales and marketing report might present data from a POS system, a CRM database, website analytics, and an ecommerce backend with the reporting tool designed to simply pull from what appears to be a single sales/marketing database even though the actual data may reside in an Oracle OLTP database, two MySQL databases, and flat files downloaded from Google Analytics.
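The pattern behind such a report can be sketched as a thin facade. The backends below are hypothetical stand-ins (simple callables rather than real Oracle/MySQL connections), but the shape is the point: the reporting tool queries one logical table name and never learns where the rows physically live.

```python
class VirtualDataLayer:
    """Toy data-virtualization facade: callers query one logical table
    while the rows actually come from several physical sources."""

    def __init__(self):
        self.mappings = {}  # logical table name -> list of fetch callables

    def register(self, table, fetch):
        # attach another physical source behind the same logical name
        self.mappings.setdefault(table, []).append(fetch)

    def query(self, table):
        # unify rows from every backing source into a single result set
        rows = []
        for fetch in self.mappings.get(table, []):
            rows.extend(fetch())
        return rows


# hypothetical backends standing in for a POS system and an ecommerce store
layer = VirtualDataLayer()
layer.register("sales", lambda: [{"source": "pos", "amount": 120}])
layer.register("sales", lambda: [{"source": "ecommerce", "amount": 80}])
```

A weekly report would then simply call `layer.query("sales")`; swapping a MySQL source for flat files means re-registering one fetch function, with no change to the reporting code.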
Data management virtualization
Data management virtualization is the next logical step beyond data virtualization. Data virtualization allows a single application-layer interface to access many data sources. Data management virtualization then applies business rules and various transformations to that virtualized data. Regardless of the data's physical location, it enables backup, archiving, deduplication, aggregation, business intelligence, and other common data management processes to be applied uniformly.
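Conceptually, this layering can be sketched as a pipeline: rows are pulled from any number of physical sources (the data virtualization step), then a list of management rules is applied in order (the data management step). The sources and rules below are hypothetical placeholders chosen for illustration.

```python
class ManagedDataLayer:
    """Sketch of data management virtualization: management rules are
    applied uniformly to rows gathered from any number of sources."""

    def __init__(self, sources, rules):
        self.sources = sources  # callables returning lists of row dicts
        self.rules = rules      # transformations applied in order

    def run(self):
        # data virtualization step: gather rows regardless of location
        rows = [row for fetch in self.sources for row in fetch()]
        # data management step: apply each business rule in sequence
        for rule in self.rules:
            rows = rule(rows)
        return rows


def drop_duplicates(rows):
    """Example rule: record-level deduplication across all sources."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```

Backup, archiving, or aggregation would slot in as further rules in the same list, which is what makes the approach attractive: each practice is written once and applied everywhere.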
Although data management carries with it a well-defined set of best practices, carrying out these practices across the large number of data sources common to mid-sized businesses and large enterprises is both complicated and expensive in terms of time, effort, computation, and programming. Recent advances in the field, though, are easing the process and making use of new approaches to data virtualization to ensure the integrity, usability, and availability of critical data, particularly in heterogeneous and/or virtualized environments.
Data management is a critical function in a data-rich organization. Server and application virtualization needn’t be an impediment to actually making use of the data businesses collect at a lightning pace.