Optimizing the use of Pandas
One of the most common problems that we find when working with datasets is the time that the Pandas library can take to load the data. As an example we are going to take a dataset with 15Million records and 35 columns. If we load our dataset of example in the usual way with the function read_csv of pandas:
We check that it has taken 47.4 seconds to load the complete dataset. It may seem little 47 seconds, but if it is something that we have to do every day or even several times a day and not only one person but for example a team of 4-5 engineers can become a little desperate and quite expensive (5 engineers if they load the dataset on average 2 times a day -> 2370 seconds a week, more than half an hour lost in loading datasets).
There are different alternatives available that allow us to do the same, but with greater efficiency, for example, the datatable library, let’s see it in action with our dataset:
As we can see it now takes only 26 seconds to load the same dataset, we have saved almost 50% of time.
Veamos un ejemplo de cómo podemos ahorrar memoria fácilmente, que es otro de los grandes problemas cuando trabajamos con datasets complejos. Los diferentes tipos de datos que tenemos en pandas son:
|Str or Mix
|Text or Mix of numeric and non-numeric values
|Floating Point Values
|True or False
|Date and Hour
|Difference between two dates
|Finite list of text values
The most consuming type is the dtype object, then float and int. A variable instantiated with any of the available types consumes a part of the memory of the computer we are using, so the idea is to declare each variable we have in the type that consumes less memory. From the official documentation of numpy we can see all the available data types:
|Boolean, true or false, stored in a byte
|Default integer, usually int64 or int32
|As in C
|Integer used for indexing
|Byte (-128 a 127)
|Integer (-32768 to 32767)
|Integer (-2147483648 to 2147483647)
|Integer (-9223372036854775808 to 9223372036854775807)
|Positive integer (0 to 255)
|Positive integer (0 to 65535)
|Positive integer (0 to 4294967295)
|Positive integer (0 to 18446744073709551615)
|Floating comma, half-precision, exponent with 5 bits, mantissa with 10 bits
|Floating comma, exponent with 8 bits, mantissa with 23 bits
|Floating comma, double precision, exponent with 11 bits, mantissa with 52 bits
|Complex number represented by two 32-bit floating point numbers
|Complex number represented by two 64-bit floating point numbers
Normally we are interested in using float16, float32 and init8, int16 or init32. We create a function to make this conversion, the function what it is going to do is for each column of our dataframe we check if it is of type integer(int) or floating point(float), we extract the minimum value and the maximum value for each column and we go checking from smaller to bigger if it could fit in the data type that less memory consumes, for example if the minimum value of the column is -100 and the biggest is 15.000 we can transform it to int16 and so on, let’s see the execution of our function with our example dataset:
We have saved 29.2% of memory.
We check that we have the same dataset:
As we can see we have the same rows, columns and data.
Acerca del autor
Juan Manuel Pruaño
With over 15 years of proven experience as a service provider specialized in the SAP technical module (SAP BASIS), he has worked for top level consulting companies and end customers both in Europe and the USA.
In the last 10 years he has developed his activity in highly secured SAP environments, subject to strong legal restrictions and continuously audited.
He is currently part of the OGA team as CTO / AI Lead and contributes with his multidisciplinary technical expertise.