Optimizing the use of Pandas

Loading Datasets

One of the most common problems that we find when working with datasets is the time that the Pandas library can take to load the data. As an example we are going to take a dataset with 15Million records and 35 columns. If we load our dataset of example in the usual way with the function read_csv of pandas:
Optimizing Pandas. Loading datasets
  We check that it has taken 47.4 seconds to load the complete dataset. It may seem little 47 seconds, but if it is something that we have to do every day or even several times a day and not only one person but for example a team of 4-5 engineers can become a little desperate and quite expensive (5 engineers if they load the dataset on average 2 times a day -> 2370 seconds a week, more than half an hour lost in loading datasets). There are different alternatives available that allow us to do the same, but with greater efficiency, for example, the datatable library, let’s see it in action with our dataset:
Optimizing Pandas. Loading datasets - datatable
As we can see it now takes only 26 seconds to load the same dataset, we have saved almost 50% of time.

Saving RAM

Veamos un ejemplo de cómo podemos ahorrar memoria fácilmente, que es otro de los grandes problemas cuando trabajamos con datasets complejos.  Los diferentes tipos de datos que tenemos en pandas son:  
Pandas dtype Python type Uso
object Str or Mix Text or Mix of numeric and non-numeric values
int64 int Integer Values
float64 float Floating Point Values
bool bool True or False
datetime64 datetime Date and Hour
timedelta Doesn’t exist Difference between two dates
category Doesn’t exist Finite list of text values
  The most consuming type is the dtype object, then float and int. A variable instantiated with any of the available types consumes a part of the memory of the computer we are using, so the idea is to declare each variable we have in the type that consumes less memory. From the official documentation of numpy we can see all the available data types:  
Data Type Description
bool_ Boolean, true or false, stored in a byte
int_ Default integer, usually int64 or int32
intc As in C
intp Integer used for indexing
int8 Byte (-128 a 127)
int16 Integer (-32768 to 32767)
int32 Integer (-2147483648 to 2147483647)
int64 Integer (-9223372036854775808 to 9223372036854775807)
uint8 Positive integer (0 to 255)
uint16 Positive integer (0 to 65535)
uint32 Positive integer (0 to 4294967295)
unit64 Positive integer (0 to 18446744073709551615)
float_ Floating Point
float16 Floating comma, half-precision, exponent with 5 bits, mantissa with 10 bits
float32 Floating comma, exponent with 8 bits, mantissa with 23 bits
float64 Floating comma, double precision, exponent with 11 bits, mantissa with 52 bits
complex_ Complex
complex64 Complex number represented by two 32-bit floating point numbers
complex128 Complex number represented by two 64-bit floating point numbers
  Normally we are interested in using float16, float32 and init8, int16 or init32. We create a function to make this conversion, the function what it is going to do is for each column of our dataframe we check if it is of type integer(int) or floating point(float), we extract the minimum value and the maximum value for each column and we go checking from smaller to bigger if it could fit in the data type that less memory consumes, for example if the minimum value of the column is -100 and the biggest is 15.000 we can transform it to int16 and so on, let’s see the execution of our function with our example dataset:
Optimizing Pandas. Function
We have saved 29.2% of memory. We check that we have the same dataset:
Optimizing Pandas. Checking datasets
As we can see we have the same rows, columns and data.

Acerca del autor

Autor
Juanma Pruaño oga
Juan Manuel Pruaño
Senior Data Scientist en oga

With over 15 years of proven experience as a service provider specialized in the SAP technical module (SAP BASIS), he has worked for top level consulting companies and end customers both in Europe and the USA.

In the last 10 years he has developed his activity in highly secured SAP environments, subject to strong legal restrictions and continuously audited.

He is currently part of the oga team as Senior Data Scientist and contributes with his multidisciplinary technical expertise.