Programmatic advertising depends on collecting and processing massive amounts of data. A typical real-time bidding platform collects trillions of first-party data points each month and feeds them into machine learning models that become more precise and cost-effective with every new insight. Managing that continuous accumulation of information is a core part of running a programmatic platform.
There’s a truism in data science that practitioners spend around 80% of their time preparing and managing data, leaving only 20% for analysis. At Motive, we still spend plenty of time organizing data, but we’ve honed our approach so we can devote more time to analyzing for insights rather than just cleaning and consolidating data objects. Getting there hasn’t always been easy as we built out our programmatic platform on Amazon Web Services’ cloud storage. Below you’ll find some best practices for collecting, storing, and archiving large amounts of data effectively, as well as some tricks we’ve learned to make the process more efficient.
1. It starts with an ad request
When a user opens an app that monetizes through advertising, the app generates an ad request. Ad requests contain basic, anonymous information about the user, including details like device type, publisher name, and publisher category. DSPs (demand-side platforms) collect this information, along with details of the user’s journey from the ad impression they are served through to the install and any post-install actions, and use it to make buying decisions that serve relevant ads. Given the sheer volume of advertising, this generates a ton of data that needs to be stored and analyzed constantly.
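To make this concrete, here is a minimal sketch of what such an anonymous ad request payload might look like. The field names are illustrative only, not any particular exchange’s schema, and the check for personal identifiers is a hypothetical helper, not a real validation step in our pipeline.

```python
# A hypothetical, simplified ad request payload. Field names are
# illustrative; real requests follow an exchange schema such as OpenRTB.
ad_request = {
    "device_type": "smartphone",           # basic device info, no personal identifiers
    "os": "android",
    "publisher_name": "example_news_app",  # the app serving the ad
    "publisher_category": "news",
    "country": "US",                       # coarse geo, not a precise location
}

def is_anonymous(request: dict) -> bool:
    """Return True if none of the fields we treat as personal
    identifiers appear in the request (illustrative check only)."""
    personal_fields = {"name", "email", "phone", "precise_location"}
    return personal_fields.isdisjoint(request)
```

The point of the sketch is that everything in the payload is coarse-grained: enough for a DSP to make a buying decision, but nothing that identifies the individual user.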
2. Extract it – Transform it – Load it
Once a batch of data is collected, it’s taken through an ETL process. ETL stands for extract, transform, and load: the process of taking large amounts of data, formatting it, and loading it into a data warehouse or data management platform (DMP) for storage. Because people engage with mobile devices around the world throughout the day, batches of data are collected continuously, fed through the ETL process, and then sorted either into long-term storage or into storage that is easily accessed for further analysis.
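The three ETL stages can be sketched as a small in-memory pipeline. This is a toy illustration under assumed inputs (tab-separated log lines, a Python list standing in for the warehouse), not our actual production code.

```python
def extract(raw_lines):
    """Extract: parse raw log lines (assumed tab-separated) into records."""
    for line in raw_lines:
        ts, device, publisher = line.rstrip("\n").split("\t")
        yield {"timestamp": ts, "device": device, "publisher": publisher}

def transform(records):
    """Transform: normalize fields and drop malformed records."""
    for rec in records:
        if not rec["publisher"]:
            continue  # discard incomplete rows
        rec["device"] = rec["device"].lower()
        yield rec

def load(records, warehouse):
    """Load: append cleaned records to the warehouse (a plain list here,
    standing in for a real data warehouse or DMP)."""
    warehouse.extend(records)
    return warehouse
```

Chaining the generators, `load(transform(extract(batch)), warehouse)`, processes each batch in a single streaming pass, which is the same shape a real pipeline takes even when the stages run on distributed infrastructure.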
3. Use lifecycle policies to sort your data efficiently
Automating the sorting process is key to storing large amounts of data efficiently. To do this, we’ve set up lifecycle policies on our Amazon S3 (Simple Storage Service) buckets so that data is filtered into different S3 storage classes based on its relevance to our analytics. Data waiting to be picked up by our ETL pipeline sits in the standard S3 storage class, where it’s kept handy for a while in case of necessary backfills or ETL mishaps. Eventually, our lifecycle rules migrate it from hot, standard storage to a cheaper, slightly chilled storage class: Standard-IA (infrequent access). Standard-IA saves on storage costs while still letting us restore large amounts of data in an emergency. For ad hoc data analysis requests, we leverage Amazon Athena for quick, cheap, one-off queries and visualizations.
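A lifecycle rule like the one described above can be expressed with boto3, AWS’s Python SDK. The bucket name, `logs/` prefix, and 30-day threshold below are illustrative choices, not our actual configuration.

```python
# Sketch of an S3 lifecycle rule: after 30 days, transition objects
# under logs/ from Standard to Standard-IA. Values are illustrative.
LIFECYCLE_CONFIG = {
    "Rules": [
        {
            "ID": "tier-processed-logs-to-ia",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},  # only our raw log objects
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
            ],
        }
    ]
}

def apply_lifecycle(bucket_name: str):
    """Apply the configuration to a bucket (requires AWS credentials)."""
    import boto3
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=LIFECYCLE_CONFIG,
    )
```

Once the rule is in place, S3 performs the transitions itself; no pipeline code has to move objects between storage classes.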
4. Don’t be afraid to archive your data
In the case of data, it might be surprising, but more isn’t always better. Collecting hundreds of terabytes of data and letting it sit in storage indefinitely, taking up space and racking up costs, is both wasteful and inefficient. As we built out our ETL process, we went from keeping huge amounts of data in S3’s Standard and Standard-IA storage classes for rapid retrieval to keeping only a small fraction there and moving the majority to Amazon’s Glacier storage class, which makes retrieval slower but costs far less. This created a trade-off between retrieval time and cost, but in the end we found that the most recent data was the most useful for our purposes. In the name of quality over quantity, we moved the large amounts of historical data we had in S3 into these less accessible archives, so we could focus our analytical expertise on analyzing and integrating only the most relevant data into our predictive modeling. Leveraging these contrasting storage solutions has allowed us to process dozens of terabytes of data each day while storing hundreds of terabytes of historical log-level data cost-effectively.
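Archiving and retrieval can both be sketched in the same style as the earlier lifecycle rule. The 90-day threshold, prefix, and Bulk retrieval tier below are assumptions for illustration; `restore_object` is the real boto3 call S3 uses to thaw an archived object for a limited time.

```python
# Sketch: archive old logs to Glacier via a lifecycle rule, and restore
# an archived object on demand. Day counts and names are illustrative.
GLACIER_RULE = {
    "ID": "archive-old-logs",
    "Status": "Enabled",
    "Filter": {"Prefix": "logs/"},
    "Transitions": [
        {"Days": 90, "StorageClass": "GLACIER"},  # cheap, slow-to-retrieve tier
    ],
}

def restore_from_glacier(bucket: str, key: str, days: int = 7):
    """Ask S3 to temporarily restore an archived object (requires AWS
    credentials). The Bulk tier is the cheapest and slowest option."""
    import boto3
    s3 = boto3.client("s3")
    s3.restore_object(
        Bucket=bucket,
        Key=key,
        RestoreRequest={
            "Days": days,  # how long the restored copy stays available
            "GlacierJobParameters": {"Tier": "Bulk"},
        },
    )
```

This is the trade-off in code form: the lifecycle rule makes archival automatic and cheap, while retrieval becomes an explicit, asynchronous request rather than an instant read.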