Using Big Data & Machine Learning to personalize ads and increase conversion

The work done by dataspectrum was just what you expect from top-notch professionals: all the problems were solved accurately, promptly and efficiently. Look-alike recommendation and keyword-extraction/text-summarization systems were developed from business requirements to a complete solution running in production. I would certainly recommend dataspectrum as the right guys to work with on a high-standards project.

Client: a middleman between advertisers and websites

The client is an Open Real-Time Bidding platform. Put simply, it’s a middleman between advertisers and website owners (webmasters). A webmaster allocates a place for ads (for example, a 300x200 px banner in the top right corner of every page). The platform sends an ad to be placed in that banner. If a user clicks the banner, the platform transfers money (cost per click) from the advertiser to the webmaster, keeping a commission for itself.
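The money flow for a single click can be sketched as follows. This is only an illustration: the function name `settle_click` and the 20% commission rate are assumptions made for the example, not figures from the client.

```python
from decimal import Decimal


def settle_click(cost_per_click: Decimal, commission_rate: Decimal) -> dict:
    """Split the cost of one click between the webmaster and the platform.

    commission_rate is a hypothetical fraction (e.g. 0.20) kept by the platform.
    """
    if not (Decimal("0") <= commission_rate <= Decimal("1")):
        raise ValueError("commission_rate must be between 0 and 1")
    # Round the commission to whole cents.
    commission = (cost_per_click * commission_rate).quantize(Decimal("0.01"))
    return {
        "advertiser_pays": cost_per_click,
        "platform_keeps": commission,
        "webmaster_receives": cost_per_click - commission,
    }


# Hypothetical numbers: a 0.50 cost per click with a 20% commission.
split = settle_click(Decimal("0.50"), Decimal("0.20"))
print(split)
```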

Problem: which of thousands of possible ads to show to a particular user?

There are thousands of advertisers and thousands of webpages. Which ad should be shown on a particular webpage? Baseball bats would most likely convert worse than dresses if the ad is placed on a beauty blog page, while on a sports-related page it is the opposite. That's where Dataspectrum comes in.

Key Highlights


Digital Advertising


  • Headquarters: USA
  • Operating: Worldwide

Technologies in Use

  • Python, Scikit-learn, TensorFlow, Pandas, Jupyter, IPython, matplotlib, NumPy, Flask, R, Spark, HDFS, Parquet, Java 8, JUnit, Spring Framework (Boot, Integration, Batch, Security, MVC), Gradle
  • Deployment: Docker
  • Environment: AWS, Ubuntu

Big Data Scale

  • 5 TB and growing by tens of gigabytes daily

Solution: big data + machine learning

One of the most important problems big data analysis solves is: “How do you understand what your customers need (even if they don’t know it themselves)?”
And the answer is: “Take a new customer and recommend what the most similar previous customers have bought.”
That’s exactly what Amazon & Netflix do when deciding what to recommend to you. For the client we did exactly the same. An ad impression for us is just what a product suggestion is for Amazon: thousands of possible options, but you can show only one.
When we first met Eugene, they had nothing but an idea of what they wanted. Starting with deep industry research, we decided on technologies, built a proof of concept and eventually came up with a fully automated, non-invasive, microservice-based end-to-end solution. Four major steps can be distinguished: collecting and storing raw data, ETL, analyzing the data, and predicting conversion for new users. Let’s elaborate.
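The “most similar previous customers” idea can be illustrated with a tiny nearest-neighbour sketch. It is not the production model (a neural network, described further down), and the feature names, ad names and toy data are all invented for the example.

```python
import math
from collections import Counter


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_ad(new_user: dict, history: list, k: int = 2) -> str:
    """Pick the ad clicked most often by the k most similar past users.

    history is a list of (user_features, clicked_ad) pairs.
    """
    ranked = sorted(history, key=lambda item: cosine(new_user, item[0]), reverse=True)
    votes = Counter(ad for _, ad in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: page-category visit counts per user and the ad they clicked.
history = [
    ({"sports": 5, "news": 1}, "baseball_bat"),
    ({"sports": 3}, "baseball_bat"),
    ({"beauty": 4, "fashion": 2}, "dress"),
    ({"fashion": 5}, "dress"),
]
new_user = {"beauty": 2, "fashion": 1}
best_ad = recommend_ad(new_user, history)
print(best_ad)
```

With this toy history, a user who mostly visits beauty and fashion pages is matched to the users who clicked the dress ad, which mirrors the baseball-bat versus dress example from the problem statement.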

Collecting and storing raw data

The first thing to do is collect the full history of ad impressions & conversions bound to user profiles. A tracking pixel is used to identify returning users. All data (tens of gigabytes daily) is stored as logs in Parquet files on Amazon S3, which is cheap to store and fast to process.
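A rough sketch of what a single logged event might look like is given below. The cookie handling, the field names and the date-partitioned key layout are assumptions made for illustration; the actual log schema is not described here, and the timestamp is arbitrary.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Optional


def resolve_user_id(cookie_value: Optional[str]) -> tuple:
    """Return (user_id, is_returning). A missing cookie means a new user."""
    if cookie_value:
        return cookie_value, True
    return uuid.uuid4().hex, False


def build_event(user_id: str, ad_id: str, page_url: str, event_type: str,
                ts: datetime) -> dict:
    """Build one impression or click record ready to be appended to a log."""
    if event_type not in ("impression", "click"):
        raise ValueError("event_type must be 'impression' or 'click'")
    return {
        "user_id": user_id,
        "ad_id": ad_id,
        "page_url": page_url,
        "event_type": event_type,
        "ts": ts.isoformat(),
    }


def partition_key(ts: datetime, prefix: str = "events") -> str:
    """Date-based key prefix so that one day of logs can be read in isolation."""
    return f"{prefix}/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"


ts = datetime(2017, 3, 14, 9, 30, tzinfo=timezone.utc)  # arbitrary example time
user_id, is_returning = resolve_user_id(None)  # no cookie yet: a new id is issued
event = build_event(user_id, "ad-42", "https://example.com/blog", "impression", ts)
print(json.dumps(event))
print(partition_key(ts))
```

Partitioning keys by date is one common way to keep daily batches cheap to scan; whether the real storage layout looks like this is not stated in the case study.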

Extract-transform-load (ETL)

Once a particular ad has been shown to random users enough times to reach sufficient statistical power, it’s time to analyze it. Before the look-alike model can be built, the history of each user who was shown the ad has to be extracted, “cleaned” and transformed so that all significant features are present and nothing irrelevant is left. But the history is spread all over the logs, back to the very beginning. That’s terabytes of data and counting. How to deal with that? The answer is Apache Spark, a modern general-purpose engine for big-data processing. Since Spark keeps intermediate results in memory, it can run up to 100x faster than Hadoop MapReduce. It’s also much easier to deploy and configure than Hadoop, and it has a much better API, which drastically cuts development time and reduces the risk of bugs.
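The shape of the transformation (not its scale) can be shown in plain Python on an in-memory list. At terabyte scale the same grouping would run as a distributed Spark job; the event fields and the choice of page-category counts as features are assumptions made for this sketch.

```python
from collections import defaultdict


def build_training_rows(events: list, ad_id: str) -> list:
    """Turn raw log events into one (user_id, features, label) row per user shown ad_id.

    Features are page-category impression counts taken from the user's whole
    history across all ads; the label is 1 if the user clicked ad_id, else 0.
    """
    history = defaultdict(lambda: defaultdict(int))
    shown, clicked = set(), set()
    for e in events:
        # "Cleaning": skip records with a missing user or page category.
        if not e.get("user_id") or not e.get("category"):
            continue
        if e.get("event_type") == "impression":
            history[e["user_id"]][e["category"]] += 1
        if e.get("ad_id") == ad_id:
            if e.get("event_type") == "impression":
                shown.add(e["user_id"])
            elif e.get("event_type") == "click":
                clicked.add(e["user_id"])
    return [(uid, dict(history[uid]), int(uid in clicked)) for uid in sorted(shown)]


# Hypothetical toy log.
events = [
    {"user_id": "u1", "ad_id": "dress", "category": "beauty", "event_type": "impression"},
    {"user_id": "u1", "ad_id": "dress", "category": "beauty", "event_type": "click"},
    {"user_id": "u1", "ad_id": "bat", "category": "sports", "event_type": "impression"},
    {"user_id": "u2", "ad_id": "dress", "category": "sports", "event_type": "impression"},
    {"user_id": None, "ad_id": "dress", "category": "beauty", "event_type": "impression"},
    {"user_id": "u3", "ad_id": "bat", "category": "sports", "event_type": "impression"},
]
rows = build_training_rows(events, "dress")
for row in rows:
    print(row)
```

Only users who were actually shown the target ad end up in the output, but their features are built from everything they did, which is why the whole log has to be scanned.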

Building the model

Now that we have all the necessary data in the right format, it’s time to build the main part: the look-alike model. It’s also the hardest part. There are hundreds of different machine learning approaches, each with dozens of hyperparameters to tune, hundreds of input features to choose from and even more ways to combine and preprocess them. All the computing power on the planet wouldn’t be enough for a full grid search. It’s called data science for a reason: there is no upper bound for performance, only the state of the art. After months of research, tests & trials we settled on a variation of a feed-forward deep neural network with correction for the imbalanced dataset (clicks are far rarer than impressions). Once the model is ready, a previously unseen user can be passed to it as input, and it returns the probability of that user clicking the ad.
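The case study does not say which imbalance correction was used. One common option is class weighting, sketched below with a weighted binary cross-entropy; the 2% click rate and all names are assumptions made for the example.

```python
import math
from collections import Counter


def balanced_class_weights(labels: list) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def weighted_log_loss(labels: list, probs: list, weights: dict) -> float:
    """Binary cross-entropy where each sample is scaled by its class weight."""
    eps = 1e-12  # keeps log() away from zero
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum


# Hypothetical data: 2 clicks out of 100 impressions.
labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(labels)

# A lazy model that predicts "almost never clicks" for everyone.
always_low = [0.01] * 100
unweighted = weighted_log_loss(labels, always_low, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, always_low, weights)
print(weights, unweighted, weighted)
```

Without weights, the lazy model looks good because it is right about the 98 non-clicks. With weights, its mistakes on the two clicks dominate the loss, so training is pushed towards actually recognising likely clickers.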

Delivering solution: Horizontally scalable stateless Docker-based microservices

We strongly believe that IT should serve the business, not the other way around. Today’s business demands high resilience, agility and velocity, and IT solutions must accommodate that by being:

  1. Modular (separation of concerns)
  2. Non-invasive (already existing IT infrastructure shouldn’t be changed)
  3. Easy to support by future generations (modifying a particular business scenario shouldn’t demand understanding of the whole system, only of the corresponding submodule)

Ideally, the technology stack is chosen for each module independently, so that appropriate tools are used for each specific task; that drastically increases development pace and reduces the number of bugs. All of these are properties of a microservice architecture. The whole solution is a black box to the rest of the team: they only need to know which REST endpoint to call to get the result (the click probability for a particular user). We used Docker as the container engine. The whole process consists of 11 distinct microservices written in 3 different languages and using over 10 major frameworks.
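A stateless prediction endpoint of this kind might look roughly like the sketch below. It uses only the Python standard library rather than any framework from the list above, and the `/predict` path, the JSON shape and the stand-in scoring rule are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict_click_probability(features: dict) -> float:
    """Hypothetical stand-in for the trained model: a fixed rule, not a real score."""
    return 0.8 if features.get("category") == "sports" else 0.1


def handle_predict(raw_body: bytes) -> tuple:
    """Pure request logic: returns (HTTP status, JSON-serialisable payload)."""
    try:
        body = json.loads(raw_body)
        features = body["features"]
        if not isinstance(features, dict):
            raise TypeError("features must be an object")
    except (ValueError, KeyError, TypeError):
        return 400, {"error": 'expected JSON like {"features": {...}}'}
    return 200, {"click_probability": predict_click_probability(features)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8080) -> None:
    """Start the service; meant to be called from the container entry point."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()


status, payload = handle_predict(b'{"features": {"category": "sports"}}')
print(status, payload)
```

Because the handler keeps no state between requests, any number of identical containers can sit behind a load balancer, which is what makes the service horizontally scalable.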

Result: conversion increased by 21%

A vast number of ads are shown to random users just to generate statistics. The result could be further improved if those users were chosen not randomly but according to some logic, for example by finding the semantically most similar ad that already has a look-alike model and reusing that model.
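As a rough illustration of that idea, the sketch below picks the already-modelled ad whose text is closest to a new ad. Bag-of-words cosine similarity is only a crude proxy for semantic similarity, and the ad ids and texts are invented for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-case word counts; a deliberately simple text representation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar_modelled_ad(new_ad_text: str, modelled_ads: dict) -> str:
    """Return the id of the already-modelled ad whose text is closest to the new ad."""
    new_vec = bag_of_words(new_ad_text)
    return max(
        modelled_ads,
        key=lambda ad_id: cosine(new_vec, bag_of_words(modelled_ads[ad_id])),
    )


# Hypothetical ads that already have a look-alike model.
modelled_ads = {
    "bat-01": "aluminium baseball bat for youth league",
    "dress-07": "summer floral dress with free shipping",
}
closest = most_similar_modelled_ad("wooden baseball bat and glove set", modelled_ads)
print(closest)
```

The model of the closest ad could then be used to choose the first audience for the new ad instead of showing it to purely random users.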