A data-intensive application is an application that makes heavy use of data in all its heterogeneous forms, and this intensity of data handling can be measured in several ways. Nowadays, the vast majority of modern applications could be considered data-intensive. Generally speaking, we can call an application data-intensive if data is its primary challenge and the source of almost all of its business value. Furthermore, every application could become data-intensive, and arguably, most applications that are not should strive to adopt a data-intensive approach.
A common trap when thinking about these kinds of applications is to focus on the size of the data sets handled. This, however, is not really what makes an application data-intensive. After all, if we had an application that used one petabyte of data, but all that data was static and never changed, we could probably get away with storing it on a single machine. The challenge with data-intensive applications is not necessarily the amount of data they use, but rather the fact that the data is constantly changing and often needs to be processed in real time.
Data-intensive applications are typically built around one or more core pieces of functionality that require access to large amounts of data. For example, a social networking site like Facebook needs to quickly retrieve and process information about the relationships between different users. A search engine like Google needs to index the billions of web pages on the Internet so that users can find the information they are looking for. And a fraud detection system, like those used by credit card companies, needs to analyze large numbers of transactions in real time, looking for patterns that might indicate fraudulent activity.
In each of these cases, the functionality of the application is directly related to its ability to process large amounts of data quickly and effectively. These applications focus on packaging consumer-grade analytics experiences in a robust and responsive way. Having evolved from traditional analytics, and being delivered by software engineering teams, in these applications the data, and the experience around it, is the product. Ultimately, they add value to existing products through purposeful analytics experiences intended to improve operations and efficiency.
Such applications benefit from cloud offerings, including storage services and managed analytics platforms that support data apps. Cloud offerings also provide a number of other advantages for data-intensive applications, such as easy integration with other cloud-based services and the ability to scale up or down quickly in response to changes in demand.
For example, Netflix runs much of its infrastructure on Amazon Web Services, using the Simple Storage Service (S3) for storage and the Elastic Compute Cloud (EC2) to run its services. Dropbox, similarly, stored its users' file contents on S3 for years before migrating to its own purpose-built infrastructure.
While the use of cloud services is not required for data-intensive applications, it can provide a number of significant advantages in terms of cost, speed, and scalability.
Every product, application, or service actively used by clients has access to various kinds of data:
- Users' proprietary data: consciously entered and owned by the user, such as business data, profile data, or configuration data
- Third-party data: data retrieved from third-party systems (usually a Data Management Platform), used to enrich proprietary user data with insightful information
- Audit data: data generated during the usage of the application itself, recording what the user has done and how
It's easy to imagine how all these kinds of data could be aggregated to generate new business insights. Some examples include predicting what the user needs, customizing the user experience, or making targeted decisions on the user's behalf.
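As a toy illustration of that aggregation, the following Python sketch crosses proprietary profile data with audit data to flag users who might want to upgrade; all field names and the scoring rule are invented for the example.

```python
# Toy aggregation of two of the data kinds listed above:
# proprietary profile data + audit (usage) data -> a business signal.
profiles = {"u1": {"plan": "pro"}, "u2": {"plan": "free"}}
audit_log = [
    {"user": "u1", "action": "export"},
    {"user": "u2", "action": "export"},
    {"user": "u2", "action": "export"},
]

# Count how often each user touches a paid-tier feature...
export_counts: dict[str, int] = {}
for event in audit_log:
    if event["action"] == "export":
        export_counts[event["user"]] = export_counts.get(event["user"], 0) + 1

# ...and cross it with their plan to predict who might upgrade.
upgrade_candidates = [
    user for user, count in export_counts.items()
    if count >= 2 and profiles[user]["plan"] == "free"
]
print(upgrade_candidates)  # ['u2']
```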
The developer and the software architect of a data-intensive application combine several tools to work with constantly evolving data: data from disparate systems, data of various types (structured, unstructured, binary, etc.), and data of varying speeds, sizes, and shapes. Application developers are increasingly becoming data engineers: they should be accustomed to working with abstractions and virtualizations of data systems to support this diversity of tools and structures, extending their computations across many different products. This includes integrating with data platforms through APIs, SDKs, and SQL.
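As a minimal sketch of that kind of integration, the snippet below pulls JSON records from a hypothetical HTTP API and lands them in a local SQL table; the endpoint, table, and payload shape are assumptions made purely for illustration.

```python
import json
import sqlite3
from urllib.request import urlopen

def ingest(url: str, conn: sqlite3.Connection) -> int:
    """Fetch JSON events from an HTTP API and land them in a SQL table."""
    with urlopen(url) as resp:  # API integration
        # assumed payload shape: [{"user_id": ..., "action": ...}, ...]
        events = json.load(resp)
    conn.executemany(  # SQL integration
        "INSERT INTO events (user_id, action) VALUES (:user_id, :action)",
        events,
    )
    conn.commit()
    return len(events)

conn = sqlite3.connect("app.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (user_id TEXT, action TEXT)")
# ingest("https://api.example.com/v1/events", conn)  # hypothetical endpoint
```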
Scalability, reliability, and performance are the three main concerns for any data system. Unfortunately, the more data intensity grows, the more challenging these fundamental characteristics of an excellent data-intensive application become to implement.
Scalability is what brought the idea of distributed data systems to life. Vertically scaling a centralized data server is an option, but only up to the limits of standard commodity hardware; past that limit, the cost of specialized hardware climbs an exceptionally steep curve. We should also consider that scalability for data-intensive applications can take various forms. It depends, of course, on the specific need at hand: more storage space, faster data replication across the cluster, or new geographically distributed nodes.
The level of reliability a standard data-intensive application needs depends on predefined SLAs (Service Level Agreements). Usually, we want the application to handle any error as quickly as possible; a common pattern for doing so is sketched below. Maintainability, meanwhile, becomes increasingly complex as more tools and heterogeneous systems are added and aggregated into our application. The aim, therefore, should be simplicity and an effortless maintenance procedure.
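A common, simple pattern for handling transient errors quickly, without hammering an already struggling data system, is retrying with exponential backoff. The sketch below is a generic Python version; the retried operation, the exception type, and the tuning parameters are placeholders, not prescriptions.

```python
import random
import time

def with_retries(operation, attempts: int = 5, base_delay: float = 0.1):
    """Run operation, retrying transient failures with jittered backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            # jittered exponential backoff: ~0.1s, ~0.2s, ~0.4s, ...
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Usage: result = with_retries(lambda: conn.execute(query).fetchone())
```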
Finally, let's talk about performance. In most applications, we want real-time access and instant changes. There are several techniques we can adopt to optimize data access:
- Indexes: standard database indexes can help speed up data access. Developers should, however, pay close attention to how many indexes they create, or writes can become costly (see the sketch after this list).
- Materialized views: also known as pre-computed queries. Here too, developers should weigh their use cases before reaching for this kind of optimization; the flip side is a considerable increase in the storage space used.
- Caches: database caching is similar to materialized views, in that query results are pre-computed in both cases. The main difference is that caching is a static process: data cannot be cached if the input filters change dynamically or if you need to cache large amounts of data.
- Geographically distributed database replicas: a CDN-like structure for data systems is challenging to maintain but grants substantial performance improvements.
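To make the first and third of these techniques concrete, here is a minimal sketch using Python's built-in sqlite3 module: an index turns a per-customer lookup into a B-tree search instead of a full scan, and an in-process cache memoizes a pre-computed aggregate. The schema and names are invented for the example, and, as noted above, the cache is only valid while the underlying data stays static.

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 0.5) for i in range(100_000)],
)

# Index: speeds up the per-customer lookup below, at the cost of
# slightly slower writes -- add indexes only for queries you run often.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# Cache: memoize the aggregate. Valid only while the data is static;
# any write to `orders` would require invalidating the cache.
@lru_cache(maxsize=1024)
def customer_total(customer_id: int) -> float:
    row = conn.execute(
        "SELECT SUM(total) FROM orders WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    return row[0] or 0.0

print(customer_total(42))  # hits the database
print(customer_total(42))  # served from the in-process cache
```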