Exploratory Data Analysis¶

E-Commerce Data Pipeline

In this notebook we will first execute all the components of our ELT pipeline and then plot visualizations with the resulting information.

In [6]:
# Importing the necessary libraries
from pandas import DataFrame
from sqlalchemy import create_engine
from typing import Dict
from pathlib import Path

from src.transform import QueryEnum
from src import config
from src.transform import run_queries
from src.extract import extract
from src.load import load
from src.plots import (
    plot_freight_value_weight_relationship,
    plot_global_amount_order_status,
    plot_real_vs_predicted_delivered_time,
    plot_revenue_by_month_year,
    plot_revenue_per_state,
    plot_top_10_least_revenue_categories,
    plot_top_10_revenue_categories,
    plot_top_10_revenue_categories_ammount,
    plot_delivery_date_difference,
    plot_order_amount_per_day_with_holidays,
)
In [7]:
# Create the database sql file
Path(config.SQLITE_BD_ABSOLUTE_PATH).touch()

# Create the database connection
ENGINE = create_engine(rf"sqlite:///{config.SQLITE_BD_ABSOLUTE_PATH}", echo=False)

1. Extract¶

Here you will be extracting the information from different sources.

In [8]:
csv_folder = config.DATASET_ROOT_PATH
public_holidays_url = config.PUBLIC_HOLIDAYS_URL

# 1. Get the mapping of the csv files to the table names.
csv_table_mapping = config.get_csv_to_table_mapping()

# 2. Extract the data from the csv files, holidays and load them into the dataframes.
csv_dataframes = extract(csv_folder, csv_table_mapping, public_holidays_url)

2. Load¶

Using the diferent datasets provided to you, you will store everything into our Data Warehouse.

In [5]:
load(data_frames=csv_dataframes, database=ENGINE)

3. Transform¶

Here you will be transforming the different queries into tables. There are two sections, one with the revenue related queries, and another with the delivery related queries.

3.1 Revenue queries¶
In [9]:
query_results: Dict[str, DataFrame] = run_queries(database=ENGINE)
In [10]:
# Transforming the revenue_by_month_year query to a table
revenue_by_month_year = query_results[QueryEnum.REVENUE_BY_MONTH_YEAR.value]
revenue_by_month_year
Out[10]:
month_no month Year2016 Year2017 Year2018
0 01 Jan 0.00 37632.57 969967.80
1 02 Feb 0.00 222270.75 853616.82
2 03 Mar 0.00 376833.72 1024851.95
3 04 Apr 0.00 299798.45 1274742.18
4 05 May 0.00 579280.43 1150528.93
5 06 Jun 0.00 489463.42 1141543.85
6 07 Jul 0.00 518115.19 925958.79
7 08 Aug 0.00 609180.34 1319737.66
8 09 Sep 0.00 652576.48 12875.18
9 10 Oct 34116.28 740570.40 347.95
10 11 Nov 10734.64 733047.33 0.00
11 12 Dec 960.85 1082600.69 0.00
In [11]:
# Transforming the top_10_revenue_categories query to a table
top_10_revenue_categories = query_results[QueryEnum.TOP_10_REVENUE_CATEGORIES.value]
top_10_revenue_categories
Out[11]:
Category Num_order Revenue
0 bed_bath_table 9272 1692714.28
1 health_beauty 8646 1620684.04
2 computers_accessories 6529 1549252.47
3 furniture_decor 6307 1394466.93
4 watches_gifts 5493 1387046.31
5 sports_leisure 7529 1349252.93
6 housewares 5743 1069787.97
7 auto 3809 833610.84
8 garden_tools 3448 810614.93
9 cool_stuff 3559 744649.32
In [12]:
# Transforming the top_10_least_revenue_categories query to a table
top_10_least_revenue_categories = query_results[
    QueryEnum.TOP_10_LEAST_REVENUE_CATEGORIES.value
]
top_10_least_revenue_categories
Out[12]:
Category Num_order Revenue
0 security_and_services 2 324.51
1 fashion_childrens_clothes 7 718.98
2 cds_dvds_musicals 12 1199.43
3 home_comfort_2 24 1710.54
4 flowers 29 2213.01
5 arts_and_craftmanship 23 2326.17
6 la_cuisine 13 2913.53
7 fashion_sport 26 3605.83
8 diapers_and_hygiene 25 4126.17
9 fashio_female_clothing 36 4902.67
In [13]:
# Transforming the revenue_per_state query to a table
revenue_per_state = query_results[QueryEnum.REVENUE_PER_STATE.value]
revenue_per_state
Out[13]:
customer_state Revenue
0 SP 5769081.27
1 RJ 2055690.45
2 MG 1819277.61
3 RS 861608.40
4 PR 781919.55
5 SC 595208.40
6 BA 591270.60
7 DF 346146.17
8 GO 334294.22
9 ES 317682.65
3.2 Delivery queries¶
In [14]:
# Transforming the delivery_date_difference query to a table
delivery_date_difference = query_results[QueryEnum.DELIVERY_DATE_DIFFERECE.value]
delivery_date_difference
Out[14]:
State Delivery_Difference
0 AL 8
1 MA 9
2 BA 10
3 CE 10
4 ES 10
5 SE 10
6 MS 11
7 PI 11
8 RJ 11
9 SC 11
10 SP 11
11 DF 12
12 GO 12
13 TO 12
14 MG 13
15 PB 13
16 PE 13
17 PR 13
18 RN 13
19 RS 13
20 MT 14
21 PA 14
22 RR 17
23 AM 19
24 AP 19
25 AC 20
26 RO 20
In [15]:
# Transforming the real vs predicted delivered time query to a table per month and year
real_vs_estimated_delivered_time = query_results[
    QueryEnum.REAL_VS_ESTIMATED_DELIVERED_TIME.value
]
real_vs_estimated_delivered_time
Out[15]:
month_no month Year2016_real_time Year2017_real_time Year2018_real_time Year2016_estimated_time Year2017_estimated_time Year2018_estimated_time
0 01 Jan NaN 12.647044 14.080563 NaN 39.508831 26.302555
1 02 Feb NaN 13.168825 16.945138 NaN 31.848929 24.528996
2 03 Mar NaN 12.951184 16.301410 NaN 24.732386 22.033067
3 04 Apr NaN 14.917913 11.499241 NaN 27.349810 23.679546
4 05 May NaN 11.322363 11.419029 NaN 24.284784 22.891488
5 06 Jun NaN 12.011573 9.239216 NaN 24.021864 27.772740
6 07 Jul NaN 11.592732 8.958636 NaN 23.317316 19.690127
7 08 Aug NaN 11.147125 7.732780 NaN 23.480603 15.186330
8 09 Sep 54.813194 11.851007 NaN 18.488449 22.437366 NaN
9 10 Oct 19.600559 11.856700 NaN 55.676632 23.019536 NaN
10 11 Nov NaN 15.161132 NaN NaN 22.560752 NaN
11 12 Dec 4.693021 15.394919 NaN 26.030012 27.681340 NaN
In [16]:
# Transforming the global ammount of order status query to a table
global_ammount_order_status = query_results[QueryEnum.GLOBAL_AMMOUNT_ORDER_STATUS.value]
global_ammount_order_status
Out[16]:
order_status Ammount
0 approved 2
1 canceled 625
2 created 5
3 delivered 96478
4 invoiced 314
5 processing 301
6 shipped 1107
7 unavailable 609
In [17]:
orders_per_day_and_holidays = query_results[
    QueryEnum.ORDERS_PER_DAY_AND_HOLIDAYS_2017.value
]
orders_per_day_and_holidays
Out[17]:
order_count date holiday
0 32 2017-01-05 False
1 4 2017-01-06 False
2 4 2017-01-07 False
3 6 2017-01-08 False
4 5 2017-01-09 False
... ... ... ...
356 167 2017-12-27 False
357 146 2017-12-28 False
358 135 2017-12-29 False
359 97 2017-12-30 False
360 74 2017-12-31 False

361 rows × 3 columns

In [18]:
freight_value_weight_relationship = query_results[
    QueryEnum.GET_FREIGHT_VALUE_WEIGHT_RELATIONSHIP.value
]
freight_value_weight_relationship.head()
Out[18]:
order_id product_weight_g freight_value
0 00010242fe8c5a6d1ba2dd792cb16214 650.0 13.29
1 00018f77f2f0320c557190d7a144bdd3 30000.0 19.93
2 000229ec398224ef6ca0657da4fc703e 3050.0 17.87
3 00024acbcdf0a6daa1e931b038114c75 200.0 12.79
4 00042b26cf59d7ce69dfabb4e55b4fd9 3750.0 18.14

4. Plotting the query tables¶

Now, we will create plots from the different queries

In [19]:
plot_revenue_by_month_year(revenue_by_month_year, 2017)
In [20]:
plot_top_10_revenue_categories(top_10_revenue_categories)
In [21]:
plot_top_10_least_revenue_categories(top_10_least_revenue_categories)
In [22]:
plot_revenue_per_state(revenue_per_state)
In [23]:
plot_delivery_date_difference(delivery_date_difference)
In [24]:
plot_real_vs_predicted_delivered_time(real_vs_estimated_delivered_time, year=2017)
In [25]:
plot_global_amount_order_status(global_ammount_order_status)
In [26]:
plot_order_amount_per_day_with_holidays(orders_per_day_and_holidays)

The Orders per Day and Holidays plot you will code in src/plots.py has to be similar to this:

The plot you will code will be similar to this:

Relationship between the Brazilian public holidays of 2017 and that year's purchases

In [27]:
plot_freight_value_weight_relationship(freight_value_weight_relationship)

The Freight Value and Weight Relationship plot you will code in src/plots.py has to be similar to this:

The plot you will code will be similar to this:

Correlation between the freight value and the product weight

In [28]:
corr = freight_value_weight_relationship['freight_value'].corr(freight_value_weight_relationship['product_weight_g'])

print("The linear correlation between freight_value and product_weight_g is:", corr)
The linear correlation between freight_value and product_weight_g is: 0.6400928438800259

We can see a positive correlation between this two variables but not extremely strong