Machine Learning and Extracting Knowledge from Big Data

The Resource Description Framework (RDF) is essentially an application of Extensible Markup Language (XML) that describes Internet resources, such as a website and its content. RDF descriptions are called metadata because they are data about data, like a site map or the date a page was last updated. RDF is built on a model of statements made about web resources. The framework is essential because it makes it easier for developers to build products that use that metadata.
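A rough sketch of RDF's statement model, using plain Python tuples to stand in for triples (the URL and the Dublin Core-style property names below are illustrative, not output from a real RDF library):

```python
# Each RDF statement is a (subject, predicate, object) triple: the subject
# is a web resource, the predicate names a property, and the object holds
# the value. All names and URLs here are hypothetical examples.
triples = [
    ("https://example.org/index.html", "dc:title", "Example Home Page"),
    ("https://example.org/index.html", "dc:creator", "Jane Doe"),
    ("https://example.org/index.html", "dc:date", "2019-08-01"),
]

def describe(resource, triples):
    """Collect all metadata statements made about one resource."""
    return {pred: obj for subj, pred, obj in triples if subj == resource}

print(describe("https://example.org/index.html", triples)["dc:creator"])
# Jane Doe
```

A consumer of this metadata never needs to know how the page itself is structured; it only queries the statements, which is what makes the model convenient for developers.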

A study by Casteleiro et al. (2016) explored whether distributed word representations learned by machine learning algorithms could be applied to the terms of the Cardiovascular Disease Ontology. This study was important because it demonstrated the benefit of using terms from ontology classes to obtain other term variants. The study also opened up research into the feasibility of methods that can scale with big data and enable automated machine learning analysis.

Sajjad, Bajwa, and Kazmi’s (2019) research investigated rule engines and rule generation in the era of big data. They proposed a method for handling the semantic complexity of rules and then automatically generating an RDF model of those rules to help in analyzing big data. Specifically, they used a machine learning technique to classify Semantics of Business Vocabulary and Rules (SBVR) rules and map them to the RDF model. Challenges for the research included automatically parsing the rules as well as interpreting their semantics. Mapping the vocabulary to RDF syntax in order to verify the RDF schema also proved successful, but challenging. Their work did show that it is possible to check a set of big data rules for consistency through automated tools. However, these scholars also found a need for a method to semantically analyze rules to help with testing and validation as rules change. Their system generates an ontology model that is useful in interpreting a set of rules. This research supports not only the semantic understanding of rules but also the generation of an RDF model of rules that provides support for querying.

#MachineLearning #Knowledge #BigData #RDF #XML


Casteleiro, M. A., Demetriou, G., Read, W. J., Prieto, M. J. F., Maseda-Fernandez, D., Nenadic, G., … & Stevens, R. (2016). Deep Learning meets Semantic Web: A feasibility study with the Cardiovascular Disease Ontology and PubMed citations. In ODLS (pp. 1-6).

Sajjad, R., Bajwa, I. S., & Kazmi, R. (2019). Handling Semantic Complexity of Big Data using Machine Learning and RDF Ontology Model. Symmetry, 11(3), 309.

Detecting Bots with IP Size Distribution Analysis

Kylie Jenner reportedly makes $1 million per paid Instagram post, and Selena Gomez is a close second at over $800K per sponsored post. This year alone, location-based marketing is predicted to grow to $24.4 billion in ad spending. Nearly half of advertisers plan to use influencer marketing this year, as real click rates can translate into purchased products and services.


As such, this market is ripe for cyber-attacks. However, one way to detect these attackers is to look at the IP size distribution, that is, the number of users sharing the same source IP. IP size distributions are created from 1) actual users, 2) sponsored providers that supply fraudulent clicks, and 3) bot-masters with botnets. The good news is that most machine-generated attacks show an anomalous deviation from the expected IP size distribution.
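As a rough illustration of the idea (with a hypothetical click log, IPs, and cookie IDs), an IP's "size" can be estimated by counting the distinct cookies seen behind it, and an IP whose click volume is far out of proportion to its size stands out:

```python
from collections import Counter

# Hypothetical click log of (source_ip, cookie_id) pairs. An IP's "size"
# is the number of distinct users (approximated here by cookies) behind it.
clicks = [
    ("10.0.0.1", "c1"), ("10.0.0.1", "c2"), ("10.0.0.1", "c3"),
    ("10.0.0.2", "c4"),
    ("10.0.0.3", "c5"), ("10.0.0.3", "c5"), ("10.0.0.3", "c5"),
]

def ip_sizes(clicks):
    """Map each source IP to the number of distinct cookies behind it."""
    users = {}
    for ip, cookie in clicks:
        users.setdefault(ip, set()).add(cookie)
    return {ip: len(cookies) for ip, cookies in users.items()}

sizes = ip_sizes(clicks)
clicks_per_ip = Counter(ip for ip, _ in clicks)

# An IP producing many clicks from very few distinct users deviates from
# the expected IP size distribution; the threshold of 2 is arbitrary.
suspicious = [ip for ip in sizes if clicks_per_ip[ip] / sizes[ip] > 2]
print(suspicious)  # ['10.0.0.3']
```

Real systems would compare each IP against a modeled distribution rather than a fixed ratio, but the signal is the same: many clicks, few users.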

However, bots are changing every day as they become more similar to human users. Gen 1 bots surfaced from in-house scripts and can usually be detected by the absence of cookies. Gen 2 bots are scrappy and can typically be caught by the absence of JavaScript firing. Gen 3 bots look like browsers (compared with Gen 1 and 2 bots), but can still be detected using challenge tests and fingerprinting. Gen 4 bots, however, look more like human users, with their non-linear mouse movements.


Security frameworks, supported by machine learning techniques, have been implemented to automatically detect and group these deviations. Most Gen 4 bots can be detected with behavioral analysis. Frameworks aggregate statistics on network traffic to produce investigation recommendations. For example, anomaly detection algorithms can be written to find unusual patterns that do not fit expected behavior. Code can be run in parallel with MapReduce, assigning a distinct cookie ID to each click. A regression model can then compare IP rates using a Poisson distribution with diverse explanatory variables, counting the unique cookies and measuring the entropy of the distribution so that the true IP size can be estimated. This data can also be analyzed with linear regression and percentage regression techniques to help identify the true IP size.
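The entropy measurement mentioned above can be sketched as follows (illustrative data only; a production system would feed this into the regression model rather than thresholding directly):

```python
import math
from collections import Counter

def cookie_entropy(cookies):
    """Shannon entropy (bits) of the cookie distribution behind one IP.

    A genuinely shared IP spreads clicks across many distinct cookies
    (high entropy); a bot replaying a few cookies yields near-zero entropy.
    """
    counts = Counter(cookies)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

human_ip = ["c1", "c2", "c3", "c4"]  # four distinct users behind one IP
bot_ip = ["c9", "c9", "c9", "c9"]    # one cookie replayed for every click

print(cookie_entropy(human_ip))      # 2.0 bits: plausible shared IP
print(cookie_entropy(bot_ip) < 0.5)  # True: flag for investigation
```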


Some researchers have also leveraged historical data to help create accurate IP size distributions. These days, even a lack of historical data or constant cache clearing can be used as an input to machine learning techniques to find attackers. However, these methods do depend on securing the click data needed to trace the source of the fraudulent clicks or botnet behavior.


The next generation of bots is likely to have more advanced artificial intelligence (AI), making them harder to detect. As a result, AI-based bot detection algorithms need to stay on the leading edge to keep a fair playing field and prevent harm to society.

#Bots #IPDistributionSize #CyberSecurity #BigData