What are Materialized Views?
A Materialized View (MV) is a database object that contains the result of a precomputed query. This object exists for a precise reason: to boost performance and efficiency and reduce network load. As database queries are usually the bottleneck in modern database-intensive applications, MVs are crucial to improving application performance.
Furthermore, as MVs contribute to better performance, they bring down costs, especially that of a data warehouse. In those environments, one often finds its very frequent to find complex aggregation and intensive joins on huge tables containing billions of rows.
This is useful when, for example, business intelligence (BI) users have dashboards that need analytical data extraction (OLAP), one needs to execute internet of things (IoT) processing, or one is running an extract-transform-load process (ETL). Another classical use case, combined with pre-joining collections and data aggregation, is using this kind of database object for data-filtering.
There is an important distinction between materialized views and views. The primary difference pertains to the physical allocation of data. An MV is actually a DB object and, like all tables, contains data. Conversely, a view is a virtual-only table, which does not contain anything. In fact, with views, data is created only when it gets questioned.
Generally speaking, we should create an MV when:
- Query results are needed often: Creating an MV means increasing storage cost together with further maintenance. This is not a worthwhile investment if the MV is not queried frequently. Better performance for queries that are executed usually impacts a lot on user experience. It is a precise cost/benefit balancing game to be played with full awareness.
- Query results could be outdated: We shouldn’t rely on MVs for queries that need real-time data. If stringent consistency of the transaction is at play, any delay on the data update should be considered a big point of concern. Even if there are several mechanisms to automatically refresh MVs (especially in smart data warehouses), adding a level of caching on top of the only source of truth will always bring some data masking.
- The query is expensive and uses lots of resources: While it’s true that it’s possible to rely on MVs to execute simple data filtering or little aggregations, the real point of these objects’ existence is to decrease costs. There is no point in increasing database storage occupation for operations that can be done in other ways.
In all the other cases, we should rely on virtual views or database caching. In some sense, database caching is similar to MVs. Query results are precomputed in both cases. But, as you can imagine, caching is a static process: data cannot be cached if input filters change dynamically, or you need to have lots of cached data.
In all modern data warehouses, it’s possible to specify the refreshing mechanism when creating an MV. Usually, it can be automatic or manual. The automatic mode will check if any difference exists among the source tables of the warehouse and update the MV accordingly, reasoning on the delta of data. In some cases, a best-effort approach could be implemented, refreshing views only when really needed.
Advantages
The main advantages of the MV are performance and efficiency. When using data warehouses, those objects are usually used for a repeatable and predictable workload. Thanks to them, it’s possible to compute once and query many times. Let’s enumerate the principal advantages of MVs:
- Performance: If a standard query consumes a lot of resources (processing time and storage space), it’s possible to cut costs and increase performance with MVs. It’s also possible to increase performance even more by creating indexes on the MVs’ relevant columns. It also reduces the execution time for simple queries which is great for situations where query computation cost is high but the resulting data set is small.
- Data Compactness: Usually, relying on MVs simplifies query logic. Complex join and aggregation could become a single select with some static filters. This action will greatly reduce business logic errors and increase data compactness.
- Access: MVs can be created in the same database where the base tables exist or in a different database as well. This is especially useful if the query is on a remote database. By replicating data, it’s possible to provide local access to the target data.
- Caching: It’s possible to use caching on an MV to further optimize performance and costs. This can be especially useful when the query on the MV is executed with static input parameters.
Challenges
Everything comes with a cost. Here, those are the main challenges that MVs bring along with their advantages:
- Restricted Syntax: MVs use a restricted SQL syntax and a limited set of aggregation functions. Not every query is immediately convertible into a materialized view. All modern data warehouse and database vendors are trying to expand the syntax as much as possible to cover all use cases.
- Costs: Data in MVs is not updated in real-time by default. Modern data warehouses can always show the latest state of the tables using a best-effort approach. This optimization, unfortunately, can increase costs and latency.
- Maintenance: DML data editing on tables on which the materialized views exist has to be replicated onto the materialized view itself. This will always increase maintenance costs even if, with data warehouses, this task is automatically managed by the DBMS itself. It is usually part of the creation cost of MVs.
- Obsolete Data: MVs can be refreshed manually or automatically in a scheduled fashion. In the latter case, it’s difficult to choose an appropriate refresh schedule. If you refresh data too frequently, it may greatly increase the cost of the executions. Refreshing data with long delays can bring obsolete data to users or, if using a best-effort approach, long latencies, and increased costs.
Generally speaking, having two sources of information with the same result is against the standard third normal form of databases. This can also impact applications that use ORM. It will cause different objects to have the same semantic value. The same information could be retrieved from two different sources.