The work done by dataspectrum was just what you expect from top notch professionals: all the problems were solved accurately, promptly and efficiently. Look-alike recommendation and keyword-extraction/text-summarization systems were developed from business requirements to complete solution running in production. I would certainly recommend dataspectrum as the right guys to work with in a high standards project.
Platform.io is an Open Real-Time Bidding Platform. Put simply, it’s a middleman between advertisers and website owners (webmasters). Webmaster allocates a place for ads (for example, 300x200 px banner in the top right corner of every page). Platform.io sends an ad to be placed in that banner. If user have clicked the banner platfrom.io transfers money (cost per click) from advertiser to webmaster leaving some comission for themselves.
There are thousands of advertisers and thousands of webpages. What ad to show at particular webpage? Baseball bats would most likely convert worse than dresses if ad is placed on a beauty blog page. While for sports-related page it's the opposite. That's where Dataspectum comes in.
One of the most important problems big data analysis solves is:
“How to understand what your customers need (even if they don’t know themselves)?”
And the answer is: “Take a new customer and recommend him what has been bought by the most similar previous customers.”
That’s exactly what Amazon & Netflix do when deciding what to recommend you. For platform.io we did exactly the same. Ad impression for us is just what product suggestion for Amazon is - thousands of possible options but you can show only one.
When we first met Eugene they had nothing but the sole idea of what they want. Starting with a deep industry research we’ve decided on technologies, built a proof-of-concept and eventually came up with a fully-automated non-invasive microservice-based end-to-end solution. Four major steps could be distinguished: collecting and storing raw data, ETL, analyzing it, predicting conversion for new users. Let’s elaborate.
First thing to do is collect full history of ads impressions & conversions bound to user profiles. Tracking pixel is used to identify returning users. All data (tens of gigabytes daily) is stored as logs in parquet files on Amazon S3 - cheap to store & fast to process.
When particular ad have been shown to random users enough times to get sufficient statistical power it’s time to analyze it. Prior to building the look-alike model history of each user who was shown ad have to be extracted, “cleaned” and transformed so that all significant features are present and nothing irrelevant left. But the history is spread all over the logs, up to the very beginning. That’s terabytes and counting of data. How to deal with that? The answer is Apache Spark - a modern general-purpose engine for big-data processing. Since Spark stores intermediate calculations in memory it performs up to 100x faster than Hadoop. It’s also much easier to deploy and configure than Hadoop. And it has way better API which drastically reduces development time and reduces risk of bugs.
Now that we have all necessary data in the right format it’s time to build the main part - the look-alike model. It’s also the hardest part. There are hundreds of different machine learning approaches each having dozens of hyperparameters to tune, hundreds of input features to choose from and even more options to combine and preprocess them. All planet’s computer power won’t be enough to do full gridsearch. It’s called data science for a reason - there is no upper bound for performance, only state-of-the-art. After months of research, tests & trials we stopped at a certain variation of a feed-forward deep neural network with imbalanced dataset correction. When model is ready a previously unseen user may be passed to it as an input and probability of him clicking the ad would be returned.
We strongly believe that IT should serve the business, not the opposite. Today’s business demands high resilience, agility and velocity. And IT solutions must accommodate that - be:
Vast amount of ads is shown to random users just to generate some statistics. Result could be further improved if those users were chosen not randomly but according to some logic. For example by finding semantically most similar ad with already built look-alike-model.