The Resource Description Framework (RDF) is a standard model for describing Internet resources like a website and its content, and it is commonly serialized in Extensible Markup Language (XML). RDF descriptions are called metadata since they are typically data about data, like a particular site map or the date a page was last updated. RDF is based on the idea of modeling statements about web resources, where each statement links a resource to a property and a value. It is essential because the framework makes life easier for developers who build products using that metadata.
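To make the idea of statements about resources more concrete, here is a minimal Python sketch. The page URL, the `Triple` type, and the `to_ntriples` helper are hypothetical names chosen for illustration, the two property names are borrowed from the Dublin Core terms vocabulary, and the output only approximates the N-Triples text format for plain string values.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # URI of the resource being described
    predicate: str  # URI of the property
    obj: str        # the value (a plain string literal in this sketch)


DCTERMS = "http://purl.org/dc/terms/"
page = "http://example.org/index.html"  # hypothetical page

# Metadata ("data about data") for one page: its title and last update date.
triples = [
    Triple(page, DCTERMS + "title", "Example home page"),
    Triple(page, DCTERMS + "modified", "2019-06-01"),
]


def to_ntriples(t: Triple) -> str:
    """Render one statement as an N-Triples-style line (string literals only)."""
    escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{t.subject}> <{t.predicate}> "{escaped}" .'


for t in triples:
    print(to_ntriples(t))
```

A real project would more likely use an RDF library than hand-rolled strings; the point here is only that each line is one statement a machine can read.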
A study by Casteleiro et al. (2016) explored the ability to derive distributed word representations from machine learning algorithms using terms from the Cardiovascular Disease Ontology. This study was critical because it demonstrated the benefits of using terms from ontology classes to obtain other term variants. The study opened up research into the feasibility of different methods that can scale with big data and enable the automation of machine learning analysis.
Sajjad, Bajwa, and Kazmi’s (2019) research looked at rule engines and rule production in the era of big data. They proposed a method to handle the semantic complexity in rules and then automatically generate an RDF model of the rules to help in analyzing big data. Specifically, they used a machine learning technique to classify Semantics of Business Vocabulary and Business Rules (SBVR) rules and map them to the RDF model. Challenges for the research included the automatic parsing of the rules as well as their semantic interpretation. Mapping the vocabulary to the RDF syntax to verify the RDF schema also proved successful, but challenging. Their work showed that it is possible to check a set of big data rules for consistency through automated tools. However, these scholars also found a need for a method to semantically analyze rules to help with testing and validation when rules change. Their system builds an ontology model that can be useful in the interpretation of a set of rules. This research not only supports the semantic understanding of rules, but also generates an RDF model of the rules that provides support for querying.
Casteleiro, M. A., Demetriou, G., Read, W. J., Prieto, M. J. F., Maseda-Fernandez, D., Nenadic, G., … & Stevens, R. (2016). Deep Learning meets Semantic Web: A feasibility study with the Cardiovascular Disease Ontology and PubMed citations. In ODLS (pp. 1-6).
Sajjad, R., Bajwa, I. S., & Kazmi, R. (2019). Handling Semantic Complexity of Big Data using Machine Learning and RDF Ontology Model. Symmetry, 11(3), 309.
Kylie Jenner reportedly makes $1 million per paid Instagram post, and Selena Gomez is a close second with over $800K per sponsored post. Just this year, location-based marketing is predicted to grow to $24.4 billion in ad spending. Nearly half of advertisers plan on using influencer marketing this year as real click rates can translate into purchased products and services.
As such, this market is ripe for cyber-attacks. However, one way to detect these hackers is to look at the IP size distribution, that is, the number of users sharing the same source IP. IP size distributions are created from 1) actual users, 2) sponsored providers that supply fraudulent clicks, and 3) bot-masters with botnets. The good news is that most machine-generated attacks share an anomalous deviation from the expected IP size distribution.
Security frameworks, supported by machine learning techniques, have been implemented to automatically detect and group deviations. Most of these Gen 4 bots can be detected with behavioral analysis. Frameworks aggregate statistics around network traffic to produce recommendations for investigation. For example, anomaly detection algorithms can be written to find unusual patterns that do not fit expected behavior. Code can be written to run MapReduce jobs in parallel, assigning a distinct cookie ID to each click. Then a regression model based on the Poisson distribution, with a diverse set of explanatory variables, can be used to count the unique cookies per IP and measure the entropy of their distribution so that the IP size can be estimated accurately. This data can also be analyzed using linear regression and percentage regression techniques to help identify the true IP size.
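As a rough illustration of the idea, the sketch below counts unique cookies per IP as a stand-in for IP size, measures the entropy of clicks across those cookies, and flags IPs whose click volume is improbably high under a simple Poisson assumption. It is a simplified stand-in rather than the regression approach described above, and the function names, the `clicks_per_user` rate, and the `alpha` threshold are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def ip_stats(clicks):
    """clicks: iterable of (ip, cookie_id) pairs.

    Returns {ip: (total_clicks, unique_cookies, entropy_bits)}.
    """
    per_ip = defaultdict(Counter)
    for ip, cookie in clicks:
        per_ip[ip][cookie] += 1
    stats = {}
    for ip, cookies in per_ip.items():
        total = sum(cookies.values())
        # Shannon entropy of how clicks are spread across cookies on this IP.
        entropy = -sum((n / total) * math.log2(n / total) for n in cookies.values())
        stats[ip] = (total, len(cookies), entropy)
    return stats


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), with lam > 0."""
    cdf = sum(math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1)) for i in range(k))
    return max(0.0, 1.0 - cdf)


def flag_anomalies(stats, clicks_per_user=2.0, alpha=1e-3):
    """Flag IPs with far more clicks than expected for their estimated size.

    clicks_per_user and alpha are hypothetical tuning values.
    """
    flagged = []
    for ip, (total, size, _entropy) in stats.items():
        if poisson_tail(total, clicks_per_user * size) < alpha:
            flagged.append(ip)
    return sorted(flagged)
```

In this toy version, an IP with three cookies and six clicks looks ordinary, while an IP with a single cookie and forty clicks stands out. A production system would also need to account for users who clear their cookies.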
Some people have also leveraged historical data to help create accurate IP size distributions. In this day and age, even a lack of historical data or constant cache cleaning can be used as an input to machine learning techniques to find hackers. However, these methods do depend on securing the click data needed to run the code that finds the source of the fraudulent clicks or botnet behavior.
The next-generation bots are likely to have more advanced artificial intelligence (AI), making them harder to detect. As a result, AI-based bot detection algorithms need to stay on the leading edge to keep a fair playing field and prevent harm to society.
Do you have a drawer at home full of
cords? The challenge is that while the wireless charging market is in demand,
some strides still need to be made in terms of functionality. According to IHS,
the wireless power market is estimated to grow to one billion charging units by
2020. It’s not a new concept, and early adoption was seen with the
In 2017, Disney Research showcased the
ability to charge your device while the receiver is across the room, similar to
how WiFi works on a computer. Apple tried to enter the marketplace a while back,
but because of their charging needs, the phones got too hot. Also, the length
of time it took to charge the device was originally an issue.
The way charging typically works is
that there is a charging dock that has a transmitting coil that induces current
into a receiver coil on your smart device, which in turn charges the
battery. Early products required the exact positioning of the device on
the charger to work, but then ‘free positioning’ became a popular concept in
charging. The key to the first ‘free positioning’ concepts was the ability to
have a transmittable back surface on the device. Some smartphones have a
surface made of glass which helped, but the obvious drawback is when you drop
your phone. The industry continues to struggle with how to charge devices at a
It is easy to imagine a world where
wireless charging is available everywhere from hotels to airports. Some have
even imagined roads embedded with wireless charging to support electric cars.
Different solutions have hit the market
from chargeable phone cases to Disney’s vision of wireless charging hotspots.
Companies like Logitech and Corsair are currently selling wireless charging
technology that is transmitted via a mouse pad. Also, Apple recently re-entered
the market by filing a patent that shows how devices can transmit wireless
power to nearby smart devices without using a wireless charging mat, instead
just connecting wirelessly to your home computer.
This year could be a hot patent space
for companies trying to find their foothold in this future market.
According to Gartner, the market for cloud computing will expand to 623 billion USD by 2030. With this growth, there is increased demand on cloud providers to make the best use of their resources in terms of performance efficiency. Three fairly common approaches to performance analysis are experiment-based performance analysis, discrete event-simulation-based performance analysis, and stochastic model-based performance analysis. Of these, stochastic model-based performance analysis is the preferred method due to its lower cost and timeliness. Sakr and Gaber developed a simplified three-pool cloud architecture-based stochastic model to address some of the common barriers encountered with scaling.
Stochastic Model & Three-Pool Cloud Architecture
Sakr and Gaber proposed a simplified
model based on Markov’s stochastic model to support scalability and
tractability at a lower cost. Their model leverages a three-pool cloud
architecture that has several interacting sub-models, including a resource
provisioning decision engine, virtual machine sub-models, and pool sub-models.
The three-pool cloud architecture includes the concepts of
the hot pool sub-model, warm pool sub-model, and cold pool sub-model. Which
pool is used depends on how the response time and power consumption are
sorted. Each sub-model can be used to represent a machine in the pool. The
hot pool sub-model addresses the group needing maximum power with low response
times. The warm pool sub-model addresses machines in sleep mode that are
waiting for the next run. Finally, the cold pool sub-model has machines that
are in the off state and have minimum power needs and longer response times.
Figure 1 shows Sakr and Gaber’s model of what happens when there is a service
request and how the resource provisioning decision engine tries to leverage the
various pool sub-models.
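The flow of a service request through the pools can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Pool` and `ProvisioningEngine` names, the machine counts, and the startup delays are hypothetical, the search order (fastest pool first) is assumed, and the request buffer is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    free_machines: int
    startup_delay_s: float  # hypothetical time before a machine can serve

    def try_allocate(self) -> bool:
        """Take one machine from the pool if any is free."""
        if self.free_machines > 0:
            self.free_machines -= 1
            return True
        return False


class ProvisioningEngine:
    """Tries the fastest pool first and rejects when every pool is exhausted."""

    def __init__(self, pools):
        # Assumed search order: increasing response time (hot, warm, cold).
        self.pools = sorted(pools, key=lambda p: p.startup_delay_s)

    def provision(self):
        for pool in self.pools:
            if pool.try_allocate():
                return pool.name, pool.startup_delay_s
        return None  # no machine available: the request is rejected


engine = ProvisioningEngine([
    Pool("hot", free_machines=1, startup_delay_s=0.0),
    Pool("warm", free_machines=1, startup_delay_s=30.0),
    Pool("cold", free_machines=1, startup_delay_s=600.0),
])
for _ in range(4):
    print(engine.provision())
```

With one machine per pool, the first three requests land in the hot, warm, and cold pools in turn, and the fourth is rejected.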
The biggest challenge to the model is
the potential for service request rejection. When the buffer is at capacity or
the resource provisioning decision engine cannot find an available machine due
to capacity issues, request rejection is possible. However, there is some
understanding of the service request rejection probability that can be helpful.
For example, the longer the mean service time, the higher the likelihood
that service requests are rejected. Conversely, if the capacity of the machines
is increased, the potential for service rejection can decrease.
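The link between service time, capacity, and rejection can be illustrated with the classic Erlang B loss formula. This is a textbook queueing result swapped in for illustration (it assumes Poisson arrivals and no buffer), not the stochastic model described above, and the arrival rate and machine counts are made-up numbers.

```python
def erlang_b(servers: int, offered_load: float) -> float:
    """Probability that a request is rejected in a loss system.

    offered_load = arrival_rate * mean_service_time (in Erlangs).
    Uses the standard recursion B(n) = a*B(n-1) / (n + a*B(n-1)), B(0) = 1.
    """
    b = 1.0  # with zero servers, every request is rejected
    for n in range(1, servers + 1):
        b = offered_load * b / (n + offered_load * b)
    return b


arrival_rate = 2.0  # hypothetical requests per minute
for mean_service_time in (1.0, 2.0, 4.0):  # minutes
    for machines in (4, 8):
        p = erlang_b(machines, arrival_rate * mean_service_time)
        print(f"service={mean_service_time} min, machines={machines}: reject={p:.3f}")
```

Reading down the output, rejection rises as the mean service time grows and falls when more machines are available, which matches the intuition in the paragraph above.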
An additional challenge for the
three-pool cloud architecture model is that, to work effectively, both service
requests and the machines have to be of the same type. However, virtual
provisioning sub-models can be used with different machines where the machines
are grouped into classes, and each class is thus represented by a pool so that
the individual pool is homogeneous.
Development of Applications
The advantage of the Stochastic Model
and pool architecture is that the pools can be strategically used to identify
performance bottlenecks through what-if analysis and capacity planning. For
example, the Symbolic Hierarchical Automated Reliability and Performance Evaluator (SHARPE)
is a modeling application that can look at performance, reliability, and
availability. This application has been installed at over 450 sites and lets
users choose different what-if analyses with alternative algorithms depending
on their objectives. Also, three-pool architecture can be useful for data
recovery efforts in the sense that mission-critical information can be in the
hot pool for faster recovery whereas less critical data can be stored in the
Cloud services are growing
exponentially in demand putting increased pressure on finding both timely and
cost-effective solutions to address performance issues. The Stochastic Model is
desirable because of its low cost and timeliness as compared to other models.
The Stochastic Model leverages a three-pool cloud architecture with pool
sub-models to process service requests. While there are some challenges related
to rejecting service requests and heterogeneous machines, the model can be
adapted to handle these challenges. Several applications have demonstrated the
promise of the pool architecture in strategically managing performance
Big data contains patterns that can inform companies about their customers and vendors, as well as help improve their business processes. Some of the biggest companies in the world, like Facebook, have used the MapReduce framework as a tool for their cloud computing applications, sometimes by implementing Hadoop, an open-source implementation of MapReduce. MapReduce was designed by Google for parallel distributed computing of big data.
Before MapReduce, companies needed to pay data modelers and buy supercomputers to produce timely big data insights. MapReduce has been an important development in helping businesses solve complex problems across big data sets, like determining the optimal price for products, understanding the return on investment of advertising, performing long-term predictions, and mining web clicks to inform product and service development.
MapReduce works across a network of low-cost commodity machines, making actionable business insights more accessible than ever before. It is a strong computational tool for solving problems that involve things like pattern matching, social network analysis, log analysis, and clustering.
The logic behind MapReduce is basically dividing big problems into small, manageable tasks that are then distributed to hundreds or even thousands of server nodes. The server nodes operate in parallel to generate results. From a programming standpoint, this involves writing a map script, where the data is mapped into a collection of key-value pairs, and writing a reduce script that runs over all pairs with the same key. One challenge is the time it takes to convert and break the data into the new key-value pairs, which increases latency.
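The map and reduce steps are easier to see in a toy example. The sketch below counts words on a single machine in plain Python; the function names and sample documents are made up for illustration, and a real framework would run the map and reduce calls on many nodes and handle the grouping step itself.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) key-value pair for every word."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values that share the same key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


documents = ["big data big insights", "data drives insights"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Because each document can be mapped independently and each key can be reduced independently, both steps can be spread across as many machines as are available.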
Hadoop is Apache’s open-source implementation of the MapReduce framework. In addition to the MapReduce distributed processing layer, Hadoop uses HDFS for reliable storage and YARN for resource management, and it has flexibility in dealing with structured and unstructured data. New nodes can be added easily to Hadoop without downtime, and if a machine goes down, data can be easily retrieved. Hadoop can be a cost-efficient solution for big data processing, allowing terabytes of data to be analyzed within minutes.
However, cloud platforms like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform offer similar MapReduce components where the operational complexity is handled by the cloud vendors instead of the individual businesses. Hadoop was known for its strong combination of computation with storage, but cloud-based object stores on platforms like AWS can now take the place of HDFS, while compute can still run on orchestration technology like Kubernetes instead of YARN. With this shift to cloud vendors, there have been increased concerns around the long-term vision for Hadoop.
Hortonworks was a data software company that supported open-source software, primarily Hadoop. In January 2019, Hortonworks closed an all-stock $5.2 billion merger with Cloudera. While Cloudera also supports open-source Hadoop, it offers a proprietary management suite that is supposed to help with both installation and deployment, whereas Hortonworks was 100% open-source. In May 2019, another Hadoop provider, MapR, announced it was looking for a new source of funding. On June 6, 2019, Cloudera’s stock declined 43% and the CEO left the company.
Understanding the advantages and disadvantages of the MapReduce framework and Hadoop in big data analytics is helpful for making informed business decisions as this field continues to evolve. In terms of the drawbacks of Hadoop, Monte Zweben, the CEO of Splice Machine, which creates relational databases for Hadoop, says, “When we need to transport ourselves to another location and need a vehicle, we go and buy a car. We don’t buy a suspension system, a fuel injector, and a bunch of axles and put the whole thing together, so to speak. We don’t go get the bill of materials.”
Internet of Things (IoT) devices are multiplying as technology costs decrease and smart device sales increase. Generally speaking, if there is a device with an on and off switch, there is a good chance that it will become part of the IoT movement. IoT architecture includes the sensors on devices, the Internet, and the people that use the applications.
IoT devices are connected through Internet infrastructure and different wireless networks. Smart devices by themselves are not that good at dealing with massive amounts of data, let alone learning from the data received and generated. Currently the data from IoT devices is relatively basic because of the small computing power and limited capacity to store data on most devices. However, that basic data gets transferred to a data processing center that has more advanced computing capability to produce desired business insights.
IoT smart devices require unique addresses that allow them to connect to the Internet. There are some challenges in providing these new addresses given the growing number of smart devices. Internet Protocol version 4 (IPv4) has the capacity for about 4.3 billion addresses. Gartner estimates that by 2020, the world will have over 26 billion connected devices. However, there are several thought-leaders working on a unified addressing scheme for IoT that may help solve this bottleneck.
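The size of the gap is easy to check with a few lines of Python. The sketch below uses the standard library's `ipaddress` module to count the full IPv4 address space and compares it with the 26 billion device estimate quoted above; the variable names are just for illustration.

```python
import ipaddress

# The entire IPv4 space is a 32-bit range, so it holds 2**32 addresses.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses
projected_devices = 26_000_000_000  # the Gartner estimate quoted above

print(f"IPv4 addresses: {ipv4_total:,}")
print(f"Devices per address: {projected_devices / ipv4_total:.1f}")
```

That works out to roughly six devices for every possible address, and the usable pool is smaller still because some ranges are reserved for private networks and other special purposes.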
IoT applications also have bottlenecks around the quality of the current artificial intelligence algorithms. For example, increasing transparency and reducing bias in algorithms continues to pique the interest of citizens and could pose challenges to some proprietary business models. With machine learning, producing training sets that are actually representative of targeted populations also remains a challenge.
There are some additional obstacles related to the physical path of the transmission media. For example, IoT devices can receive or transmit data using a variety of technologies, from RFID to Bluetooth. The common problems associated with these kinds of transmission media, from bandwidth to interference, also create problems for IoT. Optimizing transmission media is a challenge in IoT applications when it comes to supporting and sustaining networks.
Security is also an ongoing concern for IoT since the basic data feeds into a receiver on the internet. Many IoT devices are low-powered, constrained devices, making them more susceptible to attack. Security challenges of IoT include ensuring that the data has not been changed during transmission and protecting data from unwanted exposure. The World Economic Forum estimates that if a single cloud provider was successfully attacked, it could cause $50 billion to $120 billion of economic damage. With the growth of poorly protected devices on a shared infrastructure, there is a wide attack surface for hackers, where IoT botnets could direct swarms of traffic through a variety of connected devices like thermometers and sprinklers. A recent State of IoT Security Research report shared that 96 percent of businesses and 90 percent of customers think there should be IoT security regulations. As public confidence in security decreases while IoT sales increase, this is likely to result in regulatory reform.
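One common way to check that data has not been changed in transit is to attach a keyed hash (an HMAC) to each message. The sketch below uses Python's standard `hmac` and `hashlib` modules; the shared key, the sensor payload, and the `sign` and `verify` helper names are hypothetical, and a real deployment would also need secure key storage on the device.

```python
import hashlib
import hmac


def sign(key: bytes, payload: bytes) -> str:
    """Compute an HMAC-SHA256 tag for the payload."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(key: bytes, payload: bytes, tag: str) -> bool:
    """Return True only if the payload still matches the tag."""
    return hmac.compare_digest(sign(key, payload), tag)


key = b"shared-device-secret"  # hypothetical key known to device and receiver
reading = b'{"sensor": "thermometer-7", "temp_c": 21.5}'
tag = sign(key, reading)

print(verify(key, reading, tag))                             # untouched message
print(verify(key, reading.replace(b"21.5", b"99.9"), tag))   # tampered message
```

Note that this only addresses integrity. Protecting the data from unwanted exposure additionally requires encryption, for example by sending the readings over TLS.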
IoT allows businesses to solve problems and even delight their customers by leveraging the intelligence of connected devices. While there is always uncertainty and risk involved with new technology, and customer confidence around IoT may be hit or miss, the promise of IoT is a fully connected world where devices connect together and with people to enable action that has never before been possible.
If the Internet is a bookstore, the World Wide Web is the collection of books within that store. The Web is a collection of information which can be accessed via the Internet. The Web was created in 1989 by Sir Tim Berners-Lee and remained quiet through the 1990s, but as users increased, companies like Google started to develop algorithms to better index content, which eventually led to the concept of SEO (a significant driver of the Internet today). Sir Tim Berners-Lee’s initial vision of the Web was explained in a document called “Information Management: A Proposal,” but today, with Facebook and social media, the focus has also shifted toward the Web as a communication tool.
Back in 1989, Sir Tim Berners-Lee wrote about three fundamental technologies that are still foundational to the Web today: HTML, URI, and HTTP. HTML refers to the markup language of the Web, URI is like the address or URL, and HTTP supports the retrieval of linked items across the Web. These core technologies used in Web 1.0 are responsible for today’s large-scale web data. Back in Web 1.0, bottlenecks included web pages that were only understandable by a human. Web 1.0 was also slow, and pages needed to be refreshed often. In retrospect, it is easier to identify that Web 1.0 had servers as a major bottleneck and lacked a sound systems design with networked elements. Nonetheless, Web 1.0 is referred to as the “web of content” and was critical to the development of Web 2.0.
Web 2.0 began in 1999 and let people contribute, modify, and aggregate content using a variety of applications from blogs to wikis. This was revolutionary in the sense that the web moved from being focused on content to being focused on the communication space, where content was created by individual users instead of just being produced for individual users. Web 2.0 embraced the reuse of collective information, crowdsourcing, and new methods for data aggregation. In terms of online architecture, Web 2.0 drove collaborative knowledge construction where networking became more critical to driving user interaction. At the same time, issues of open access and reuse of free data started to surface. Performance issues were encountered with frequent database access, which put a strain on Web 2.0’s scalability. The good news is that Web 1.0 bottlenecks on the database server side were eliminated with the ability to have databases on RAM disks and high-performance multi-core processors that supported enhanced multi-threading. However, alongside the benefits of Web 2.0’s flexible web design, creative reuse, and collaborative content development, new bottlenecks were created by the increased volume of user-generated content.
Web 3.0 started around 2003 and was termed the “web of context.” Web 3.0 is the era of defined data structures and the linking of data to support knowledge searching and automation across a variety of applications. Web 3.0 is also still referred to as the “semantic” Web, which was revolutionary in the sense that it shifted the focus to having the Web read not only by people, but also by machines. In this spirit, different models of data representation surfaced, like the concept of nodes, which led to the scaling of web data. One of the challenges of the Web 3.0 data models was that the location and extraction processes turned into a bottleneck.
Web 4.0 began around 2012 and was named the “web of things.” Web 4.0 further evolved the concept of the Web into a symbiotic web that focused more on the intersection of machines and humans. At this point, Internet of Things devices, smart home and health monitoring devices started to contribute to big data. Mobile devices and wireless connections helped support data generation, and cloud computing took a stronghold in helping users both create and control their data. However, bottlenecks were created with the multiple devices, gadgets and applications that were connected to Web 4.0 along with changing Internet of Things protocols and exponentially growing big data logs.
Web 5.0 is currently referred to as the “symbiont web” or the web of thoughts. It was designed in a decentralized manner where devices could start to find other interconnected devices. Web 5.0 creates personal servers for the personal data stored on a smart device like a phone, tablet, or robot. This enables the smart device to scan the 3D virtual environment and use artificial intelligence to better support the user. The bottleneck in Web 5.0 becomes the memory and calculation power each interconnected smart device needs to process the billions of data points required for artificial intelligence. Web 5.0 is recognized for emotional integration between humans and computers. However, the algorithms involved in understanding and predicting people’s behavior have also created a bottleneck for Web 5.0.
Where will Web evolution end? One thing is for sure, data generation is increasing year after year. To continue to get new functionality out of the evolving Web, new bottlenecks need to be addressed. There are a variety of future considerations as it relates to anticipated bottlenecks from encoding strategies to improving querying performance. However, the best way to predict what will happen in the future is to invent it.
Shannon Block is an entrepreneur, mother and proud member of the global community. Her educational background includes a B.S. in Physics and B.S. in Applied Mathematics from George Washington University, M.S. in Physics from Tufts University and she is currently completing her Doctorate in Computer Science. She has been the CEO of both for-profit and non-profit organizations. Follow her on Twitter @ShannonBlock or connect with her on LinkedIn.
As the elderly population rises, so do the medical care costs
that come with treating those who need to be served. Medicare provides
insurance to those 65 and older to help with the financial burden of
healthcare. Medicare costs about $588 billion and is expected to increase
by 18% in the next decade. Healthcare fraud is estimated by NHCAA to be as
much as 10% of total healthcare spend, which for Medicare would be $58.8
billion. Fraudulent claims include patient abuse or neglect, as well
as billing for services that were not received. By using publicly
available claims data, machine learning can be used to help detect fraud in the
Medicare system, helping reduce the cost to taxpayers.
Machine learning is a subset of artificial intelligence that
can find a fraudulent needle in the haystack by applying continuous learning
algorithms. With each instance that the algorithm is right about a
fraudulent transaction, that information goes back into the equation, making it
smarter. The same happens when the algorithm is wrong.
Using unsupervised machine learning on publicly available
datasets is a growing trend with great potential. The publicly available
Medicare claims data has 37 million cases. In machine learning, an
essential part of the process is labeling as it affects both the data quality
and the performance of the model. Different researchers have created the
labels for fraud and non-fraud by mapping the data with other publicly
available resources like the National Provider Identifier and List of Excluded
Individuals and Entities database. The 37 million cases can then be reduced to
under 4 million that can be run through the machine learning algorithm to help
identify fraudulent providers.
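The labeling step described above amounts to joining the claims against a list of excluded providers. Here is a minimal sketch of that join; the field names (`npi`, `billed_usd`), the sample rows, and the `label_claims` helper are hypothetical and do not reflect the actual layout of the public Medicare or exclusion files.

```python
def label_claims(claims, excluded_npis):
    """Mark a claim as fraud when its provider NPI appears on the exclusion list."""
    excluded = set(excluded_npis)  # set lookup keeps the join fast on large files
    return [{**claim, "fraud": claim["npi"] in excluded} for claim in claims]


# Made-up rows standing in for the public claims data.
claims = [
    {"npi": "1000000001", "billed_usd": 1200.0},
    {"npi": "1000000002", "billed_usd": 87.5},
    {"npi": "1000000003", "billed_usd": 4300.0},
]
excluded_npis = ["1000000003"]  # made-up entry standing in for the exclusion list

labeled = label_claims(claims, excluded_npis)
print(sum(row["fraud"] for row in labeled), "of", len(labeled), "claims labeled as fraud")
```

Since only providers that were actually caught appear on an exclusion list, labels built this way are a conservative estimate of fraud rather than ground truth.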
For example, unsupervised machine learning has been used
successfully on Florida’s Medicare data to detect anomalies in Medicare
payments using regression techniques and Bayesian modeling. Also, decision
tree and logistic regression with random undersampling class distributions have
provided some promising results. Initial results have indicated that
having more non-fraud cases has helped the model learn better and produce more
accurate results between fraud and non-fraud cases.
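Random undersampling itself is simple to sketch: keep every fraud case and only a random sample of the far more numerous non-fraud cases. In the example below, the `random_undersample` helper, the `fraud` field name, and the `ratio` parameter (non-fraud cases kept per fraud case) are assumptions made for illustration.

```python
import random


def random_undersample(rows, label_key="fraud", ratio=1.0, seed=0):
    """Keep all minority (fraud) rows and a random sample of majority rows.

    ratio is the number of majority rows kept per minority row.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    minority = [r for r in rows if r[label_key]]
    majority = [r for r in rows if not r[label_key]]
    keep = min(len(majority), int(len(minority) * ratio))
    sample = minority + rng.sample(majority, keep)
    rng.shuffle(sample)
    return sample


# Made-up data: 5 fraud cases among 100 claims.
rows = [{"id": i, "fraud": i < 5} for i in range(100)]
balanced = random_undersample(rows, ratio=1.0)
skewed = random_undersample(rows, ratio=4.0)
print(len(balanced), len(skewed))
```

Raising `ratio` keeps more non-fraud cases in the training set, which is one way to experiment with the class distributions mentioned above.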
Using machine learning to detect fraud is
game-changing. Machine learning allows humans to be notified early on in
the fraud attempt, stopping losses earlier in the process. Keeping a
continuous watch on publicly available data can go a long way in helping
minimize fraudulent claims and accelerate the time to prosecute
A multi-billion dollar industry exists from the buying and selling of your healthcare data. Certain state exceptions under federal privacy rules allow hospital data to be sold to data brokers. Private companies are seeking to gain access to your medical records to advance their mission, but sometimes also to make a quick buck.
The right of businesses to profit from health information without patient permission has been previously upheld by the United States Supreme Court. For example, in the 1990s, a data broker was selling data to some big pharmaceutical companies on what individual providers were prescribing to patients. These pharmaceutical companies then used that information to provide targeted marketing to prescribers for the purposes of increasing drug sales. However, once patients started to understand and voice their complaints, a couple of states passed legislation to limit the trade of prescriber specific information. But, the data broker objected so the case went to the Supreme Court and was won by the data broker on the grounds of free speech.
While the practice of buying and selling medical data is technically acceptable under the Health Insurance Portability and Accountability Act (HIPAA) because the data is supposed to be anonymous, one of the challenges with the increasing number of these deals is that patient privacy is at risk, since it is now easier to re-identify deidentified records by piecing them together with unstructured data sources like Facebook, Twitter, and other social media platforms.
However, it is also important to note that not all data brokers have misguided intent. There are many organizations in this space with honorable missions. For example, Sloan Kettering made a deal to sell pathology samples to Paige.AI to develop artificial intelligence to help in finding a cure to cancer. In the case of curing cancer, the patient’s medical data is being used to increase the quality of care. However, data brokers do not currently have any fiduciary responsibilities to patients.
There are some considerations that health systems can put in place to help reinforce ethical best practices:
1. Only enter into a data transfer deal if it benefits patients
2. Have a separate agreement form from the consent form that patients complete for their normal healthcare
3. Asking the patient for permission to sell their data should be done by the third party vendor to ensure that there is no misunderstanding or abuse of the patient/provider relationship
4. Any default consent options should be that patients do not elect to have their data sold
5. Consent language should be worded in an easy-to-understand fashion, and potentially offered in video form, so that patients can clearly understand usage, risks, and their options
6. Transparency should be provided to the patients and healthcare staff on how the records are being used, who owns the data, and in what way it will be used, especially if there is a financial gain for the health system
Last year, GlaxoSmithKline, a large pharmaceutical company, came under global scrutiny when it tried to invest $300 million in 23andMe, due to concerns around the lack of transparency about what data was being shared, combined with the lack of choice for patients to participate.
Given that researchers predict that healthcare data will grow faster than data in manufacturing, financial services, or media, experiencing a compound annual growth rate of 36 percent through 2025, these issues are likely to continue to surface for governing bodies as well as public policy influencers.
What has been your experience with data brokers? How do you think this will play out in the future?
According to the federal government, data on 3.5 million people was exposed in the healthcare data breaches reported in June 2019. The majority of that exposure came from Dominion National, which claims the incident may have started as early as April 2010. The data accessed included enrollment data, demographic data, and associated dental and vision information. Similarly, LabCorp and Quest Diagnostics reported in June 2019 that an unauthorized user had accessed their vendor payment system, affecting nearly 8 million and 12 million patients, respectively. These alarming numbers do not even include encrypted data that is lost by organizations, since HIPAA does not consider the loss of encrypted data a breach. The United States healthcare system as a whole lost $6.2 billion in 2016 from data breaches, with the average data breach costing a company $2.2 million. Research from IBM Security found that in 2018, the cost to healthcare organizations was $408 per record, up from $380 per record in 2017.
According to a HIMSS 2019 Cybersecurity Survey, 59 percent of all data breaches in the past 12 months started with phishing, which is when an attacker masquerades as a reputable person in an email or other communication. Cybercriminals also often change their approach and are now increasingly using techniques powered by artificial intelligence. In response, healthcare organizations are actively deploying artificial intelligence solutions to combat suspicious activities, as well as increasing employee education and cloud-based security.
There are some basic techniques that healthcare organizations should be deploying in addition to conducting risk assessments and providing employee education. For example, healthcare organizations should:
Take time to understand cloud service-level agreements, retain ownership of data that can be accessed in the event of a crash, and ensure service-level agreements comply with state privacy laws
Establish subnet wireless networks for guests and other public types of activity
Use multi-factor authentication on employee devices
Use business associate agreements to help distribute risk and clarify vendor reporting requirements
Have a “bring your own device” policy based on current best practices, like having complex password requirements and policies that can be enforced
Plan for the unexpected in thinking about how long the healthcare organization can function in different areas without data, while also having an emergency solution for back-up information and data restoration
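As one small example of a policy that can actually be enforced, the sketch below checks a password against a complexity rule. The 12-character minimum, the four character classes, and the `password_problems` helper are assumptions chosen for illustration, not a recommendation from any particular standard.

```python
import string


def password_problems(password: str, min_length: int = 12):
    """Return a list of rule violations; an empty list means the password passes."""
    problems = []
    if len(password) < min_length:
        problems.append(f"shorter than {min_length} characters")
    # Hypothetical policy: require at least one character from each class.
    checks = {
        "a lowercase letter": string.ascii_lowercase,
        "an uppercase letter": string.ascii_uppercase,
        "a digit": string.digits,
        "a symbol": string.punctuation,
    }
    for description, alphabet in checks.items():
        if not any(ch in alphabet for ch in password):
            problems.append(f"missing {description}")
    return problems


print(password_problems("Correct-Horse-42"))
print(password_problems("password"))
```

Returning the list of specific problems, rather than a simple pass or fail, makes it easier to give employees useful feedback when they set a password.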
These tips can be incorporated into the organization’s cybersecurity framework. There are benefits to thinking through these strategies before they are mandated, so that the organization has an effective cyber-defense program that protects both patients and the organization.