Introduction
Anyone who wants to enter the world of data science must be familiar with at least one of the two predominant programming languages in the area: R or Python.
The CPE Institute published a note discussing which of the two is more convenient (link to the note). This article focuses on the packages or libraries most commonly used with Python for data science.
What Is a Library?
When programming, you write code in a particular language, in this case Python, which a computer executes to produce the result the programmer intends.
In data analysis, this usually involves importing and cleaning a dataset, generating a visualization or dashboard, or training and implementing a machine learning model.
Most data analysts would find any of these objectives unattainable if they had to write from scratch all the code the computer needs to perform each action, since doing so is extremely laborious and, in many cases, complex.
Therefore, the programming community devised a way to collaborate on this problem: libraries, also called packages.
A library or package is simply a collection of files containing code written by another programmer or team of programmers and made available to the community. If I need to perform a data analysis task, such as importing a CSV file, I do not have to code every step the computer must take to load that file into a Python object. Instead, I import a package containing a function that another programmer wrote, and I reach the same goal with very little effort.
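As a rough illustration of this idea, the sketch below parses the same CSV text twice: once with a naive hand-written split, and once with the `csv` module that ships with Python. The sample text and variable names are invented for the example; the point is only that the library already handles details (here, a comma inside a quoted field) that a quick hand-rolled version misses.

```python
import csv
import io

# Sample CSV text invented for this example; note the comma inside the quoted field.
raw = 'name,comment\nAna,"Likes Python, R"\n'

# Hand-written approach: splitting on commas breaks the quoted field apart.
naive_rows = [line.split(",") for line in raw.strip().splitlines()]

# Library approach: csv.reader already knows the quoting rules.
library_rows = list(csv.reader(io.StringIO(raw)))

print(naive_rows[1])    # three pieces, the quoted field was split incorrectly
print(library_rows[1])  # two fields, as intended
```

Covering every such edge case by hand is exactly the laborious work that importing an existing library avoids.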
This methodology is essential for optimizing programming processes. If every programming project had to start from scratch, without any predefined functions or classes, each one would take far longer to complete.
The same applies to data analysis, and it is especially important in this discipline, since its professionals do not always have solid training in programming and computer science.
Fortunately, with knowledge of the appropriate libraries, excellent results can be achieved without knowing in depth how their functions and classes work (although it is never a bad idea to learn). The main Python packages for data science are described below, along with their purpose.
Pandas
The first step in any data analysis is typically to access the data. For everything related to descriptive analysis and data cleaning, the best-known Python package is Pandas.
This library provides methods to load datasets from various sources and a class to store them: the DataFrame. Individual columns can be analyzed through a class called Series.
Created in 2008, Pandas gives the analyst a huge range of functions for manipulating data through a friendly and direct syntax.
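A minimal sketch of that workflow might look as follows. The CSV text, column names, and the choice to fill missing ages with the median are all assumptions made for illustration, not a recommended cleaning recipe.

```python
import io

import pandas as pd

# Small dataset invented for this example; Luis has a missing age.
csv_text = """name,age,city
Ana,34,Lima
Luis,,Quito
Marta,29,Lima
"""

# Load the data into a DataFrame (read_csv also accepts file paths).
df = pd.read_csv(io.StringIO(csv_text))

# Cleaning step: fill the missing age with the median of the known ages.
df["age"] = df["age"].fillna(df["age"].median())

# A single column is a Series, with its own descriptive methods.
ages = df["age"]
print(ages.describe())

# Descriptive analysis: mean age per city.
by_city = df.groupby("city")["age"].mean()
print(by_city)
```

Here `df` is a DataFrame and `ages` is a Series, the two classes mentioned above.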
NumPy
Data science usually requires mathematical and statistical techniques, and the library par excellence for this is NumPy. This project, which began in 2005 and was among the most downloaded Python libraries in 2021, is established as one of the main tools for data science.
It is especially useful for working with matrices (multidimensional arrays), on which it performs calculations very efficiently. It also includes a large number of built-in mathematical functions needed in data science.
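The following sketch shows the kind of matrix calculation this refers to. The score matrix and weights are invented for the example; the operations themselves (`mean`, broadcasting subtraction, and the `@` matrix product) are standard NumPy.

```python
import numpy as np

# Hypothetical matrix: one row per student, one column per exam.
scores = np.array([
    [80.0, 90.0, 70.0],
    [60.0, 75.0, 85.0],
    [95.0, 88.0, 92.0],
])

# Built-in statistics: mean of each column (axis=0 runs down the rows).
exam_means = scores.mean(axis=0)

# Vectorized arithmetic: subtract each exam's mean from every row, no loop needed.
centered = scores - exam_means

# Matrix-vector product: weighted final grade per student.
weights = np.array([0.5, 0.3, 0.2])
final_grades = scores @ weights

print(exam_means)
print(final_grades)
```

Each of these lines operates on the whole array at once, which is where NumPy's efficiency comes from compared with looping over Python lists.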
Beautiful Soup
Data scientists often obtain information from non-traditional sources. One of them is scraping, that is, extracting data from web pages, and the best-known package for this is Beautiful Soup. With it, you can easily extract the information contained in a web page's HTML.
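As a small sketch of how this works, the example below parses an HTML snippet written inline for the purpose, rather than a real website. Beautiful Soup parses HTML but does not download it, so fetching a live page would require a separate HTTP library. The table contents and the `prices` id are invented.

```python
from bs4 import BeautifulSoup

# Inline HTML invented for this example, standing in for a downloaded page.
html = """
<html><body>
  <h1>Fruit prices</h1>
  <table id="prices">
    <tr><th>Fruit</th><th>Price</th></tr>
    <tr><td>Apples</td><td>1.20</td></tr>
    <tr><td>Pears</td><td>0.95</td></tr>
  </table>
</body></html>
"""

# Parse with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Extract the page heading.
title = soup.find("h1").get_text(strip=True)

# Walk the table rows and collect (fruit, price) pairs.
rows = []
for tr in soup.find("table", id="prices").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has <th> cells only, so it is skipped
        rows.append((cells[0], float(cells[1])))

print(title)
print(rows)
```

The resulting list of tuples can then be handed to a library such as Pandas for analysis.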
Matplotlib & Seaborn
Another primary need in data analysis is the generation of visualizations, and here Matplotlib and Seaborn are unavoidable. Both libraries are widely used in data science: Matplotlib is the older and better known, while Seaborn is a newer package built on top of Matplotlib. The two therefore work well together.
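One way to see how the two libraries fit together is to draw on the same figure with both. In the sketch below, the data and column names are invented, and the `Agg` backend is chosen only so the script runs without a display; Matplotlib creates the figure and the left plot, and Seaborn draws onto the right-hand Matplotlib axes.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is required
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data invented for this example.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 70, 74, 83],
    "group": ["A", "B", "A", "B", "A", "B"],
})

# Matplotlib creates the figure and both axes.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# Plain Matplotlib line plot.
ax_left.plot(df["hours_studied"], df["exam_score"], marker="o")
ax_left.set_title("Matplotlib")

# Seaborn scatter plot drawn on a Matplotlib axes, colored by group.
sns.scatterplot(data=df, x="hours_studied", y="exam_score", hue="group", ax=ax_right)
ax_right.set_title("Seaborn")

# Render the figure to an in-memory PNG (a file path would also work).
fig.tight_layout()
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Because Seaborn returns ordinary Matplotlib objects, titles, labels, and layout can still be adjusted with Matplotlib calls afterwards.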
Dash
The next step in data visualization is to generate an interactive dashboard. Dash is an excellent tool for this task in Python. The library is built on front-end packages for web development.
Its main use is to build an interactive interface that connects the analyst's data processing with user input, usually to filter or segment the information.
Scikit-learn
This package began development in 2007 as a Google Summer of Code project and has become the reference for training machine learning models in Python.
Scikit-learn is synonymous with machine learning in Python and provides a wealth of modeling methods, along with data preprocessing and feature engineering tools.
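To give a sense of what this looks like in practice, here is a minimal sketch that combines a preprocessing step with a model. The choice of the bundled iris dataset, standard scaling, and logistic regression is an assumption made for illustration, not a recommendation from the article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small example dataset bundled with the library (no download needed).
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows to evaluate the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Pipeline: scale the features, then fit a classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Fraction of correct predictions on the held-out rows.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Swapping `LogisticRegression` for another estimator leaves the rest of the code unchanged, since the library's models share the same `fit`, `predict`, and `score` interface.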