Plotting in Python with Seaborn: Line plot

Analysis
Visualization
Python
Published

February 22, 2023

Introduction

In the previous post of this series on visualization with Seaborn, we were introduced to the power of visualization and dove into distribution plots. In this post we are going to focus on the line plot. A line plot shows how one variable changes along another, typically time on the x-axis and the measured value on the y-axis, with consecutive points joined by a line. Seaborn has a nifty function called lineplot(), which is dedicated to this type of plot.

Loading libraries

Most people are familiar with plotting in Python using Matplotlib, whose interface was modelled on MATLAB. Python also has an extremely handy library for data visualization called Seaborn. Seaborn is built on top of Matplotlib, so you will also need to import Matplotlib.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

sns.set_theme()

Dataset

We are going to use the flights dataset, which holds twelve years of monthly airline passenger data (1949 to 1960). We do not need to download this dataset by hand: the sns.load_dataset function fetches it from seaborn's online example-data repository (an internet connection is needed the first time) and loads it into our session. We only need to specify flights as the name of the dataset and assign the result to df:

df = sns.load_dataset('flights')
df
     year month  passengers
0    1949   Jan         112
1    1949   Feb         118
2    1949   Mar         132
3    1949   Apr         129
4    1949   May         121
..    ...   ...         ...
139  1960   Aug         606
140  1960   Sep         508
141  1960   Oct         461
142  1960   Nov         390
143  1960   Dec         432

[144 rows x 3 columns]

The printed df shows that the flights dataset is made up of three variables: year, month, and the number of passengers, covering January 1949 to December 1960 and arranged in long format. To draw a line plot from long-form data, we simply assign the x and y variables:

fig = plt.figure()
sns.lineplot(data=df, x="year", y="passengers")
plt.ylabel('Passengers')
plt.show()
Figure 1: Annual number of flight passengers

However, looking at Figure 1, we notice that a confidence interval is also plotted even though we did not ask for one. The reason is that each year has twelve monthly records, so lineplot() aggregates them: the line is the mean for each year and the shaded band is a 95% confidence interval around that mean, which reflects the variability of passengers within that year. If we want to draw only a line, we can filter to a single month; in this case I plot passengers for July only over the same period.

First, we filter the dataset to July and assign the result to df_july. The line of code below keeps only the July passenger records from 1949 to 1960.

df_july = df[df.month == 'Jul']
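A quick way to confirm that the filter removes the need for aggregation is to count the rows per year before and after. The sketch below does this on a small made-up frame with the same three columns; the passenger values are placeholders, not the real flights figures, and toy and toy_july are illustrative names.

```python
import pandas as pd

# Made-up stand-in for the flights data: 3 years x 3 months (placeholder values)
toy = pd.DataFrame({
    "year": [1949] * 3 + [1950] * 3 + [1951] * 3,
    "month": ["Jun", "Jul", "Aug"] * 3,
    "passengers": [10, 20, 30, 40, 50, 60, 70, 80, 90],
})

# Before filtering: several rows per year, so lineplot() has to aggregate
rows_per_year = toy.groupby("year").size()

# Same boolean-mask filter as in the post
toy_july = toy[toy.month == "Jul"]

# After filtering: exactly one row per year, so there is nothing to average
july_rows_per_year = toy_july.groupby("year").size()
print(rows_per_year.tolist(), july_rows_per_year.tolist())
```

With one observation per x value there is no spread to estimate, which is why no band is expected in the plot of the filtered data.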

Then plot.

fig = plt.figure()
sns.lineplot(data=df_july, x="year", y="passengers")
plt.ylabel('Passengers')
plt.show()
Figure 2: Number of flight passengers for July during the period

Because I am still learning how to deal with dates and times in Python, I simply switch to R, as Quarto allows me to swap code between the two languages within the same document. I switch to R code and convert the pandas data frame to an R data frame using the reticulate package (Ushey, Allaire, and Tang 2020). The package exposes Python objects inside an R chunk through the py object:

tb = reticulate::py$df

Then, within an R chunk, we can add a new column, date, containing the date for each month:

tb = tb |>
  dplyr::mutate(
    date = seq(lubridate::my(011949), lubridate::my(121960), by = "month")
  )

tb |> head()
  year month passengers       date
1 1949   Jan        112 1949-01-01
2 1949   Feb        118 1949-02-01
3 1949   Mar        132 1949-03-01
4 1949   Apr        129 1949-04-01
5 1949   May        121 1949-05-01
6 1949   Jun        135 1949-06-01

The printed data frame has one added variable, date. Having created this dataset, we can convert it back to a pandas data frame by simply accessing it through the r object within a Python chunk:

pdf = r.tb
pdf
       year month  passengers        date
0    1949.0   Jan       112.0  1949-01-01
1    1949.0   Feb       118.0  1949-02-01
2    1949.0   Mar       132.0  1949-03-01
3    1949.0   Apr       129.0  1949-04-01
4    1949.0   May       121.0  1949-05-01
..      ...   ...         ...         ...
139  1960.0   Aug       606.0  1960-08-01
140  1960.0   Sep       508.0  1960-09-01
141  1960.0   Oct       461.0  1960-10-01
142  1960.0   Nov       390.0  1960-11-01
143  1960.0   Dec       432.0  1960-12-01

[144 rows x 4 columns]
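For readers who would rather stay in Python, a date column like this can also be built with pandas. The sketch below is one possible approach, not the method used in this post: it joins year and month into strings such as 1949-Jan and parses them with pd.to_datetime. It uses the first six rows shown in the R output above, and it assumes the month abbreviations are English, since the %b directive is locale dependent. The names head and expected are illustrative.

```python
import pandas as pd

# First six rows of the flights data, as printed in the R output above
head = pd.DataFrame({
    "year": [1949] * 6,
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "passengers": [112, 118, 132, 129, 121, 135],
})

# Build "1949-Jan", "1949-Feb", ... and parse them; the day defaults to 1
head["date"] = pd.to_datetime(
    head["year"].astype(str) + "-" + head["month"].astype(str),
    format="%Y-%b",
)

# For an unbroken monthly series, date_range gives the same dates,
# much like seq(..., by = "month") in R ("MS" means month start)
expected = pd.date_range("1949-01-01", periods=len(head), freq="MS")
print(head)
```

Applied to the full df, the same to_datetime call should produce a date column that can be passed straight to sns.lineplot as x, without the round trip through R.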

We can now plot a line using the new date column we just created instead of year.

fig = plt.figure()
sns.lineplot(data=pdf, x="date", y="passengers")
plt.ylabel('Passengers')
plt.show()
Figure 3: Monthly number of flight passengers

Figure 2 and Figure 3 are similar, but while Figure 2 used year on the x-axis, Figure 3 used date. You can see that Figure 3 clearly shows the variability of passengers within each year, which was not possible with Figure 2.

We can assign a grouping semantic (hue, size, or style) to plot a separate line for each group:

fig = plt.figure()
sns.lineplot(data=df, x="year", y="passengers", hue="month")
plt.ylabel("Passengers")
plt.show()
Figure 4: Number of flight passengers by month during the period

Similarly, we can map more than one semantic to the same variable, which makes the plot more appealing and the groups easier to distinguish.

fig = plt.figure()
sns.lineplot(data=df, x="year", y="passengers", hue="month", style="month")
plt.ylabel("Passengers")
plt.show()
Figure 5: Number of flight passengers by month during the period

Cited Materials

Ushey, Kevin, JJ Allaire, and Yuan Tang. 2020. Reticulate: Interface to ’Python’. https://CRAN.R-project.org/package=reticulate.