Detecting Bots with IP Size Distribution Analysis

Kylie Jenner reportedly makes $1 million per paid Instagram post, and Selena Gomez is a close second with over $800K per sponsored post. Just this year, location-based marketing is predicted to grow to $24.4 billion in ad spending. Nearly half of advertisers plan on using influencer marketing this year as real click rates can translate into purchased products and services.

No alt text provided for this image

As such, this market is ripe for cyber-attacks. However, one way to detect these hackers is to look at the IP size distribution or the number of users that are sharing the same source IP. IP size distributions are created from 1) actual users 2) sponsored providers that provide fraudulent clicks and 3) bot-masters with botnets. The good news is that most machine-generated attacks share an anomalous deviation from the expected IP size distribution. 

However, bots are changing every day as they become more similar to human usage. Gen 1 bots surfaced from in-house scripts but can usually be detected by the absence of cookies. Gen 2 bots are scrappy and can typically be found by the absence of JavaScript firing. Gen 3 bots look like browsers (as compared to Gen 1and 2 bots), but can still be detected using challenge tests and fingerprinting. However, Gen 4 bots look more like human usage with their non-linear mouse movements.

No alt text provided for this image

Security frameworks, supported by machine learning techniques, have been implemented to automatically detect and group deviations. Most detection methods for these Gen 4 bots can be detected with behavioral analysis. Frameworks aggregate statistics around network traffic for investigation recommendations. For example, anomaly detection algorithms can be written to find unusual patterns that do not fit with expected behavior. Code can be written to run MapReduce in parallel processing, assigning a distinct cookie ID for each created click. Then a regression model can be used to compare the IP rates using Poisson distribution with a diverse explanatory model to count the unique cookies and measure the entropy relative to the distribution so that the accurate IP size can be determined. This data can also be analyzed using linear regression and percentage regression techniques to help identify the true IP size.

No alt text provided for this image

Some people have also leveraged historical data in helping create accurate IP size distributions. In this day and age, even a lack of historical data or constant cache cleaning can be used as an input to machine learning techniques to find hackers. However, these methods do depending on securing the click data to run the code to find the source of the fraudulent clicks or bonet behavior.

No alt text provided for this image

The next-generation bots are likely to have more advanced artificial intelligence (AI) making them harder to detect. As a result, AI-based bot detections algorithms need to stay on the leading edge to keep a fair playing field and prevent harm to society.

#Bots #IPDistributionSize #CyberSecurity #BigData