Introduction - GDPR, AI and CLAUDETTE
The General Data Protection Regulation (GDPR, Dutch: AVG) is the EU law on the protection of personal data. After the GDPR took effect on 25 May 2018, you probably received an email or two informing you that the online services you use had amended their privacy policies. How many of them have you actually read?
In a study funded by the European Consumer Organization (BEUC), fourteen large companies were selected and their privacy policies were analyzed. Put together, these fourteen policies are about 80,000 words long. For us as customers, reading through all of them takes a very long time; there is simply too much to read. Does this mean that our rights will remain just a nice theory on paper? Not necessarily. Help might come from a technology that some fear will bring more harm than good: artificial intelligence.
In this blog, I will discuss how the CLAUDETTE project can help to analyze terms of service, how it is set up and how the first results can be interpreted.
The CLAUDETTE project
The CLAUDETTE project (Automated CLAUse DETeCTOR) was established to try to automate the legal analysis of the terms of service and privacy policies of online platforms and services. If machines can detect spam, translate from one language to another, operate driverless cars and trade stocks, the question arises whether a machine can also assist lawyers who are trying to protect the rights of consumers. The answer is: “yes, most probably they can!”. The CLAUDETTE project consists of a preliminary study, finished roughly a month after the GDPR came into effect. By then, most online platforms and services had just amended their privacy policies. The researchers read as many privacy policies as they could, and tried to train the machine to analyze them as well as possible. Their findings were both promising (regarding the possibility of having AI-powered tools assist human lawyers in evaluating privacy policies) and alarming (regarding the content of the privacy policies studied). Hence, they decided to make the results, preliminary as they were, immediately available to the public.
After the GDPR came into effect, many online platforms and services reviewed their privacy policies and shared them with their visitors. Because a visitor usually does business with many online platforms and services, they have to read a large number of privacy policies to fully understand what those online platforms and services do with their personal data. To take it even further, there are many visitors doing business with many online platforms and services, so there are many privacy policies to be read by a large group of people. That is where information technology - AI in particular - can assist us.
In traditional programming, a software engineer trying to solve a problem would sit down and work out, step by step, how a computer should perform the task. Some tasks, like mathematical problems, are easy; others, like automatic translation and image recognition, are hard. Machine learning turns this idea upside down. Instead of telling a computer how to perform a particular task, a programmer feeds it an enormous amount of data, covering both inputs and the desired outputs, and lets the machine figure out by itself how to do the task, and sometimes even what the task is. Successful applications of machine learning, such as spam filters, machine translation and voice recognition, are already used by countless consumers. The challenge now is this: can we also apply machine learning to evaluate privacy policies against what the GDPR specifies?
How do you evaluate privacy policies?
Privacy policies should be comprehensive (regarding the information they provide), comprehensible (regarding the form of expression), and substantively compliant (regarding the types of processing they address). Let me explain each aspect in a bit more detail:
- Comprehensiveness of information: the policy should include all the information that is required (by articles 13 and 14 of the GDPR);
- Substantive compliance: the policy should only allow for types of processing of personal data that are compliant with the GDPR;
- Clarity of expression: the policy should be framed in understandable and precise language.
With regard to these three aspects, we can distinguish two levels of achievement:
- Optimal achievement: In this case, the policy clearly meets the GDPR requirements;
- Suboptimal achievement: In this case, the policy fails to clearly meet the GDPR requirements for the aspect in question. We distinguish two levels of suboptimal achievement:
  - Questionable achievement: it may be reasonably doubted that the suboptimal policy reaches the threshold required by the GDPR;
  - Insufficient achievement or no achievement: the suboptimal policy clearly fails to reach the threshold required by the GDPR.
With the aspects and levels of achievement set, it is possible to evaluate privacy policies and to make a statement about their quality. Let me give you some examples for each of the three aspects mentioned above:
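To make the scheme concrete, here is a minimal Python sketch of the three aspects and the levels of achievement. The names `Aspect`, `Level` and `suboptimal_aspects`, as well as the sample assessment, are my own illustrative assumptions and are not taken from the CLAUDETTE project.

```python
from enum import Enum


class Aspect(Enum):
    COMPREHENSIVENESS = "comprehensiveness of information"
    SUBSTANTIVE_COMPLIANCE = "substantive compliance"
    CLARITY = "clarity of expression"


class Level(Enum):
    OPTIMAL = "optimal"
    QUESTIONABLE = "questionable"
    INSUFFICIENT = "insufficient"

    @property
    def is_suboptimal(self) -> bool:
        # Questionable and insufficient are the two suboptimal levels.
        return self is not Level.OPTIMAL


def suboptimal_aspects(assessment: dict) -> list:
    """Return the aspects whose level is questionable or insufficient."""
    return [aspect for aspect, level in assessment.items() if level.is_suboptimal]


# Hypothetical assessment of one policy, for illustration only.
assessment = {
    Aspect.COMPREHENSIVENESS: Level.QUESTIONABLE,
    Aspect.SUBSTANTIVE_COMPLIANCE: Level.OPTIMAL,
    Aspect.CLARITY: Level.INSUFFICIENT,
}

for aspect in suboptimal_aspects(assessment):
    print(f"{aspect.value}: {assessment[aspect].value}")
```

Keeping the level separate from the aspect means the same three levels can be reused for every aspect.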
Comprehensiveness of information
"The controller will use a variety of third-party service providers to help us provide services related to Our Platform and the Payment Services. Service providers may be located inside or outside of the European Economic Area (“EEA”). In particular, our service providers are based in Europe, India, Asia Pacific, and North and South America."
This clause fails along the aspect of comprehensiveness since it does not identify the recipients of the information.
Clarity of expression
"When you as a Guest submit a booking request, certain information about you is shared with the Host (and Co-Host, if applicable), including your full name, the full name of any additional Guests, your cancellation history, and other information you agree to share."
This clause fails along the aspect of clarity since it does not specify what information will be transmitted to the Host, in addition to the items expressly mentioned (“certain information (…) including”).
Substantive compliance
"The controller may provide information to its vendors, consultants, marketing partners, research firms, and other service providers or business partners."
This clause fails along the aspect of substantive compliance since it does not specify under what conditions and for which compatible purposes the data will be communicated to third parties, or who these third parties are.
The purpose of the machine learning system proposed by the CLAUDETTE project is to automatically identify clauses that appear to be defective along at least one of the aspects above, and in this way to support experts by preselecting the clauses they should critically examine. For this purpose, the computer system has to be trained to recognize such clauses and is therefore provided with a set of examples. The examples consist of policies in which relevant clauses have been appropriately tagged, distinguishing their category and whether they are optimal or defective. In some cases, a single clause may fall into different categories and consequently have multiple tags. An example of this is shown below:
"We automatically collect log data and device information when you access and use Our Platform, even if you have not created an Account or logged in. That information includes, among other things: details about how you’ve used Our Platform (including if you clicked on links to third party applications), IP address, access dates and times, hardware and software information, device information, device event information, unique identifiers, crash data, cookie data, and the pages you’ve viewed or engaged with before or after using Our Platform."
This clause fails on both the aspects of clarity and substantive compliance: on the one hand it vaguely refers to the information being collected through the expression “among other things”; on the other hand it allows for types of processing having no relevant purpose.
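A clause with several tags can be turned into training data by encoding it as one yes/no label per category. The sketch below shows that idea in Python; the category names, the `TaggedClause` class and the helper functions are hypothetical and do not reproduce the project's actual annotation scheme.

```python
from dataclasses import dataclass, field

# Hypothetical category names, one per aspect discussed above.
CATEGORIES = ["comprehensiveness", "substantive_compliance", "clarity"]


@dataclass
class TaggedClause:
    text: str
    # Categories in which an expert tagged the clause as defective.
    defective_in: set = field(default_factory=set)


def to_label_vector(clause: TaggedClause) -> list:
    """Encode the tags as one 0/1 label per category (multi-label target)."""
    return [1 if category in clause.defective_in else 0 for category in CATEGORIES]


def needs_review(clause: TaggedClause) -> bool:
    """A clause is preselected for the expert if it has at least one tag."""
    return bool(clause.defective_in)


log_clause = TaggedClause(
    text="We automatically collect log data and device information ...",
    defective_in={"clarity", "substantive_compliance"},
)

print(to_label_vector(log_clause), needs_review(log_clause))
```

With this encoding, a separate yes/no classifier can be trained for each category, so a clause carrying two tags simply counts as a positive example for two classifiers.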
For each of the three aspects, several types of clauses were identified during the project:
| Aspect | Type of clauses | # of specific clauses |
For each of these 23 (specific) clauses, a verdict can be given after evaluation, as seen in the examples shown above.
Automated analysis of privacy policies by CLAUDETTE
So, is it possible to automate the evaluation of privacy policies against the GDPR, using the three aspects and their clauses? The CLAUDETTE project concludes that the answer to this question is "yes".
A web crawler was developed that monitors the privacy policies of a list of online services. The data retrieved by the crawler was then processed using supervised machine learning. In particular, a Support Vector Machine-based classifier was implemented and trained on a data set annotated by experts. This data set contains over 3,500 sentences taken from 14 privacy policies. The accuracy of the classifier was evaluated using a standard leave-one-document-out procedure, showing encouraging precision/recall in several subtasks. The analysis done by the participants in the CLAUDETTE project indicates that the task of identifying problematic clauses in these kinds of documents can, in principle, be automated, but that a larger and more sophisticated training data set is necessary.
When the CLAUDETTE project evaluated the 14 privacy policies (June 2018), the study suggested that the privacy policies of online platforms and services still had a significant margin for improvement. None of the 14 analyzed privacy policies got close to meeting the standards put forward by the GDPR. Unsatisfactory treatment of the information requirements, large numbers of sentences employing vague language, and an alarming number of “problematic” clauses cannot be deemed satisfactory.
The results that were achieved are, however, promising. Clearly, more research is needed, especially on creating more training data to optimize the machine learning algorithm. Still, the study suggests that the fields of legal informatics and applied machine learning can be pushed forward significantly, and that doing so can help make it possible to use AI to evaluate privacy policies against the GDPR.
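To illustrate what a leave-one-document-out evaluation looks like, here is a small Python sketch. Note the assumptions: the `train` function is a deliberately simple keyword-based stand-in and not a Support Vector Machine, and the three tiny "policies" are invented sentences of my own, not data from the study.

```python
def train(sentences):
    """Stand-in learner (NOT an SVM): keep words seen only in flagged sentences."""
    flagged_words, clean_words = set(), set()
    for text, flagged in sentences:
        (flagged_words if flagged else clean_words).update(text.lower().split())
    return flagged_words - clean_words


def predict(model, text):
    """Flag a sentence if it contains any word the model learned."""
    return any(word in model for word in text.lower().split())


def leave_one_document_out(corpus):
    """corpus maps a document name to a list of (sentence, flagged) pairs."""
    tp = fp = fn = 0
    for held_out, test_sentences in corpus.items():
        # Train on every document except the one being tested.
        train_sentences = [
            pair for name, doc in corpus.items() if name != held_out for pair in doc
        ]
        model = train(train_sentences)
        for text, flagged in test_sentences:
            guess = predict(model, text)
            tp += int(guess and flagged)
            fp += int(guess and not flagged)
            fn += int(flagged and not guess)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented toy corpus: True marks a sentence an expert would flag.
corpus = {
    "policy_a": [("we may share certain information with partners", True),
                 ("you can delete your account at any time", False)],
    "policy_b": [("we may collect other information among other things", True),
                 ("contact our data protection officer by email", False)],
    "policy_c": [("certain information may be shared with other parties", True),
                 ("you can contact us by email at any time", False)],
}

precision, recall = leave_one_document_out(corpus)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The point of holding out a whole document, rather than random sentences, is that the classifier is always tested on a policy it has never seen, which is how it would be used in practice. The scores on such a toy corpus say nothing about real-world performance.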
Let me conclude by sharing a link, containing the preliminary results of the CLAUDETTE study, with the evaluation of the 14 selected online platforms and services as a first attempt to, one day, automate the (legal) evaluation of the privacy policies, under GDPR, using machine learning.