The work done by dataspectrum was exactly what you expect from top-notch professionals: all the problems were solved accurately, promptly, and efficiently. Look-alike recommendation and keyword-extraction/text-summarization systems were developed from business requirements all the way to a complete solution running in production. I would certainly recommend dataspectrum as the right people to work with on a high-standards project.
Platform.io is an open real-time bidding platform. Put simply, it's a middleman between advertisers and website owners (webmasters). A webmaster allocates a place for ads (for example, a 300x200 px banner in the top right corner of every page), and Platform.io sends an ad to be placed in that banner. When a user clicks the banner, Platform.io transfers money (the cost per click) from the advertiser to the webmaster, keeping some commission for itself.
Understanding a user's intentions is crucial for ad personalization (choosing, out of thousands of possible ads, the one that is most interesting to a particular user). The best way to do this is by summarizing their browsing history and thereby finding their “hottest” topics. For example, if we find out that 35% of the webpages John visits contain ‘baseball’, ‘bat’, or ‘new york yankees’, then we assume that John is fond of baseball and it's sensible to show him Yankees merchandise ads. This is needed not only for automatic ad recommendation but also for humans to read and get insights about their audience.
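The browsing-history summarization described above can be sketched as a simple keyword-hit counter. Everything here is illustrative: the topic lexicon, the page texts, and the plain substring matching are assumptions, not the production system (which, as described later, uses boilerplate removal, parsing, and TF-IDF rather than a fixed keyword list).

```python
from collections import Counter

# Hypothetical topic lexicon: topic -> trigger keywords.
TOPICS = {
    "baseball": {"baseball", "bat", "new york yankees"},
    "cooking": {"recipe", "oven", "ingredients"},
}

def interest_profile(visited_pages):
    """Share of visited pages that mention each topic's keywords."""
    hits = Counter()
    for text in visited_pages:
        lowered = text.lower()
        for topic, keywords in TOPICS.items():
            if any(kw in lowered for kw in keywords):
                hits[topic] += 1
    total = len(visited_pages)
    return {topic: count / total for topic, count in hits.items()}

pages = [
    "Yankees win again as the baseball season heats up",
    "Best bat grips reviewed",
    "A simple pasta recipe with five ingredients",
    "Stock market news",
]
print(interest_profile(pages))  # {'baseball': 0.5, 'cooking': 0.25}
```

Naive substring matching like this would also fire on words such as “combat” containing “bat”; a real pipeline matches on tokens after parsing, which is one reason the steps below matter.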
Natural Language Processing (NLP) is a hard task, mainly because human language requires human intelligence (or true artificial intelligence, which as of 2016 doesn't exist) to be understood. In other words, there is no exact algorithm for it, the way there is for real-number multiplication or 3D-object rendering. So heuristics and machine learning have to be used. From 10,000 feet, the process looks like this:
Boilerpipe is a neat library designed exactly for that. Under the hood it runs a small machine-learned classifier over shallow text features such as text density and link density.
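To make the idea concrete, here is a toy illustration of the kind of shallow features such a classifier looks at; this is not boilerpipe's actual code, and the thresholds are made-up assumptions. The intuition: navigation blocks are short and almost entirely links, while article paragraphs are long and mostly plain text.

```python
def link_density(block_text, anchor_text):
    """Fraction of a block's words that sit inside <a> links."""
    words = len(block_text.split())
    anchor_words = len(anchor_text.split())
    return anchor_words / words if words else 1.0

def looks_like_content(block_text, anchor_text,
                       min_words=10, max_link_density=0.33):
    """Crude content-vs-boilerplate decision on two shallow features."""
    return (len(block_text.split()) >= min_words
            and link_density(block_text, anchor_text) <= max_link_density)

nav = "Home News Sports Contact"          # every word is a link
nav_anchors = "Home News Sports Contact"
para = ("The Yankees clinched the series last night after a tense "
        "ninth inning that kept the whole stadium on its feet")
para_anchors = ""                          # no links in the paragraph

print(looks_like_content(nav, nav_anchors))    # False
print(looks_like_content(para, para_anchors))  # True
```

Boilerpipe combines many such features in a trained model rather than fixed thresholds, which is why it generalizes across page layouts.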
SyntaxNet is a state-of-the-art syntactic parser open-sourced by Google.
A custom recurrent neural network (LSTM) is utilized. About ten different features are used as input; POS tags are among the most important.
In information retrieval, TF-IDF (short for term frequency–inverse document frequency) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. The TF-IDF value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word across the corpus, which adjusts for the fact that some words are simply more common in general.
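The definition above fits in a few lines of code. This is a minimal sketch using the plain `tf * log(N/df)` weighting; production systems (and libraries like scikit-learn) typically add smoothing and normalization on top.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of tokenized documents. Returns one term->score dict per doc."""
    n = len(corpus)
    df = Counter()                      # document frequency per term
    for doc in corpus:
        df.update(set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return scores

docs = [
    "yankees win the baseball game".split(),
    "the stock market fell".split(),
]
scores = tf_idf(docs)
# 'the' appears in every document, so idf = log(2/2) = 0 and it scores 0;
# 'baseball' is rare in the corpus, so it gets a positive weight.
print(scores[0]["the"])       # 0.0
print(scores[0]["baseball"])  # > 0
```

Ranking a page's terms by their TF-IDF score is exactly how frequent-but-uninformative words (“the”, “click”, “page”) get pushed below genuinely topical keywords.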
The purpose and usefulness of Word2vec is to group the vectors of similar words together in vector space. That is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, such as the context of individual words, and it does so without human intervention. Given enough data, usage, and contexts, Word2vec can make highly accurate guesses about a word's meaning based on past appearances. Those guesses can be used to establish a word's association with other words (e.g. “man” is to “boy” what “woman” is to “girl”), or to cluster documents and classify them by topic.
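The “man is to boy what woman is to girl” relation is literally vector arithmetic: the offsets man − boy and woman − girl point the same way, so woman ≈ man − boy + girl. Here is a self-contained illustration with hand-picked 2-d toy vectors; real Word2vec embeddings have hundreds of dimensions and are learned from text, but the analogy mechanics are the same.

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))

# Toy embeddings: axis 0 ~ "adult", axis 1 ~ "female" (illustrative only).
vecs = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "boy":   [0.5, 0.0],
    "girl":  [0.5, 1.0],
}

# man - boy + girl should land closest to "woman".
query = [m - b + g for m, b, g in zip(vecs["man"], vecs["boy"], vecs["girl"])]
best = max((w for w in vecs if w != "girl"),
           key=lambda w: cosine(query, vecs[w]))
print(best)  # woman
```

With trained embeddings the same query runs against tens of thousands of words, and nearest-neighbor search replaces the brute-force `max`.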
The end solution consists of numerous independent microservices, each responsible for its own specific task (collecting, parsing, processing, etc.). Sometimes called the “Unix way”, this approach has lots of advantages. None of the services share any state, so they can be instantly scaled to any capacity with a click of a mouse: no conditions to meet and nothing to worry about. Docker is used as the container engine.
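A stateless-microservices deployment like the one described could be sketched with a Docker Compose file along these lines. The service names and replica counts are purely illustrative assumptions, not the actual production configuration.

```yaml
# Hypothetical compose sketch: one stateless service per pipeline stage.
services:
  collector:        # fetches pages from the browsing log
    image: platform/collector
    deploy:
      replicas: 4   # no shared state, so scaling is just a number change
  parser:           # strips boilerplate, runs parsing and keyword extraction
    image: platform/parser
    deploy:
      replicas: 8
  processor:        # computes TF-IDF scores and intention profiles
    image: platform/processor
    deploy:
      replicas: 2
```

Because no service holds state, bumping `replicas` is all the “scaling to any capacity” takes; the stages communicate through whatever queue or API sits between them.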
Keyword extraction and intention calculation are an integral part of the look-alike recommendation system, which gave a 21% conversion boost (case study). They are also used as a tool for advertisers to get insights about their audience, although KPIs for that are hard to measure.