Data ethics: privacy and bias.

We live in a society in which data has replaced oil as the most commercially valuable resource.

The ability to aggregate huge amounts of data allows large technology corporations to create new business models and protect them with high barriers to entry. On a smaller scale, other companies need data to make their products and services more valuable to their customers, or to deliver them at lower operational cost. Managing these data raises at least two issues that are highly relevant to their exploitation:
  • Privacy, in the strictest sense of respect for the personal information the data may contain.
  • Bias, or the mismatch between the mathematical model and reality, caused by the data used to train the model.

Privacy

Respect for the privacy of individuals is strongly regulated, at least in the European Union, which establishes levels of ‘sensitivity’ for each piece of data and the treatment that must be applied at each level. We can all agree that a salesperson’s sales figures are not as sensitive as his or her marital status or possible health problems. But sometimes this is not so obvious, or situations arise that, taken out of their initial context, affect people’s privacy.

In the United States, the recording of the metadata of all telephone calls made by its citizens has been debated. Note: the metadata, not the voice itself. At first glance this might not seem to reveal private information, but later studies showed that such metadata alone made it possible to detect when a person was being unfaithful to their partner, had lost their job, or was suffering from a serious illness. Sometimes it is hard to imagine how far the most ‘innocent’ data can take us.

Another example: European guidelines prohibit tracking an ‘identifiable’ person. What would happen if our company installed GPS tracking devices in company vehicles in order to improve the route planning and tracking of its sales representatives? When an employee uses the vehicle for personal reasons, we would also be watching that part of their life. Is it legal? How can we avoid falling into illegality?

In other cases, only technology can resolve a moral or legal dilemma. Imagine that we want our hospital to participate in a clinical study on cancer screening together with other reference medical centers. Is it possible to do so without violating the privacy of our patients’ medical data? This is the domain of federated learning, and it is a solved case, but that will not always be so.

It is often said that when an Internet service is free, we are the product, or rather, our data is ‘the payment’. Most likely, giants like Google or Facebook know more about a person than he or she can imagine, and this information is used to target the information and advertising that reaches us.
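The federated learning idea mentioned above can be sketched in a few lines: each hospital trains on its own private data and shares only model parameters, never patient records, while a coordinator averages the updates. This is a minimal illustration with synthetic data; the function names and the simple logistic model are assumptions for the sketch, not the API of any real federated learning framework.

```python
# Minimal sketch of federated averaging (FedAvg-style), assuming a
# simple logistic-regression model. All data here is synthetic.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One hospital refines the shared weights on its private data
    (logistic regression via gradient descent)."""
    w = weights.copy()
    for _ in range(epochs):
        preds = 1 / (1 + np.exp(-X @ w))       # sigmoid
        grad = X.T @ (preds - y) / len(y)      # logistic-loss gradient
        w -= lr * grad
    return w

def federated_round(global_w, hospitals):
    """The coordinator averages the locally trained weights,
    weighted by each hospital's sample count."""
    updates, sizes = [], []
    for X, y in hospitals:
        updates.append(local_update(global_w, X, y))
        sizes.append(len(y))
    return np.average(updates, axis=0, weights=np.array(sizes, float))

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0])
# Three hospitals, each holding private (features, label) data
hospitals = []
for _ in range(3):
    X = rng.normal(size=(200, 2))
    y = (X @ true_w + rng.normal(scale=0.1, size=200) > 0).astype(float)
    hospitals.append((X, y))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, hospitals)
print(w)  # learned weights; no hospital ever exposed its raw data
```

The key design point is that only the weight vectors cross organizational boundaries; raw patient rows never leave each hospital. Real deployments add secure aggregation and differential privacy on top of this basic loop.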

Bias

As mentioned earlier, a model is said to be biased when it does not reflect the underlying reality. The consequences of this bias can be severe, especially when it involves factors that our society considers immoral to discriminate on, such as race or sexual orientation.

Where does the bias come from? To train a model we need data, lots of data, but it will never be ‘all’ the data. Bias can creep in through the way the data is collected or processed. It can be accidental, for example when a faulty temperature sensor reports incorrect readings. It can come from poorly defined data collection, for example training a face recognition model only on photos from a single geographic area. It can also come from erroneous preparation of the data by the analyst, whether by eliminating a relevant factor (age, in the case of diseases) or by keeping an inappropriate one (race, in credit scoring).

Consider the case of a doctor in the USA who, following the medical scoring model implemented in her hospital, denied a Black patient the operation she herself considered the best option. When she changed the patient’s race to white, the model recommended the operation. This exposed the presence of racial bias in the models in use, and studies were subsequently carried out to demonstrate it.

Although the measures to avoid bias come fundamentally from a very careful definition and treatment of the data collection and preparation phase, some measures can also be applied in the very conception of the model: for example, ‘helping’ the model not to overweight certain conditions by imposing counterweights in its objective, or guarding against overfitting (over-training), which occurs when a model adapts so closely to its training data, especially data insufficiently representative of the scenario to be analyzed, that it fails to generalize. As we can see, the dilemma facing the company is not an easy one.
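The check the doctor performed, changing only the patient’s race and re-scoring, is a counterfactual audit that can be automated. The sketch below is entirely illustrative: the `biased_score` model, its weights, and the decision threshold are invented to show the technique, and are not taken from any real scoring system.

```python
# Hypothetical audit sketch: flip only a protected attribute and compare
# the model's recommendation, all else held equal. The model is a toy.
def biased_score(patient):
    """Toy stand-in for a deployed scoring model; the penalty on
    'race' is invented to illustrate a hidden, unjustified bias."""
    score = 0.02 * (100 - patient["age"]) + 0.5 * patient["severity"]
    if patient["race"] == "black":   # the hidden penalty
        score -= 0.4
    return score

def counterfactual_flip_test(model, patient, attr, values, threshold=1.0):
    """Return the recommendation for each value of the protected
    attribute; differing outputs signal bias in the model."""
    results = {}
    for v in values:
        probe = dict(patient, **{attr: v})     # copy with one field changed
        results[v] = model(probe) >= threshold
    return results

patient = {"age": 60, "severity": 0.9, "race": "black"}
outcome = counterfactual_flip_test(biased_score, patient,
                                   "race", ["black", "white"])
print(outcome)  # {'black': False, 'white': True} — race alone flipped the decision
```

Audits like this only detect bias; removing it still requires revisiting the training data and the model’s objective, as discussed above.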
These challenges involve legal, procedural, technical and even moral issues. If you need help defining and designing these mechanisms, OGA can help you.

Do not miss the opportunity to attend, next April 22 at the Tech Park in Malaga, the First Meeting on ‘Ethics in artificial intelligence’, where, with the help of APD, we will have the presence of Carmen Artigas, Secretary of State for Digitalization and AI. It will be a meeting point where, together with prominent speakers, we can discuss these interesting and increasingly important aspects of the digital transformation that the application of AI is bringing to our economy and our society.

About the author

Jaime Nebrera
Big Data Consultant / Project Manager at oga

Consultant specialized in new technologies and Big Data.

A pioneer in Spain in the use of cutting-edge technologies such as Apache Kafka and Druid, he has extensive experience in the design of innovative technological products.