Analysis of Purchases and Profits at a Superstore

Brian Wang

Introduction:

This analysis utilizes the Superstore Dataset from Kaggle, which can be accessed from this link: https://www.kaggle.com/datasets/vivek468/superstore-dataset-final. It can also be downloaded directly from this link: https://www.kaggle.com/datasets/vivek468/superstore-dataset-final/download. The objective of this analysis is to uncover trends in profits and losses, discover areas of high sales and high profits, and determine the most and least profitable items.

In this tutorial, I use the following libraries to help process the data:
pandas
numpy
datetime
re
scipy

I also use the following libraries to help visualize the data:
seaborn
matplotlib
folium
pgeocode

And finally, I use these libraries to get supporting data:
BeautifulSoup
requests

This code snippet installs pgeocode, which is a library that helps convert postal codes to geographic coordinates.

In [1]:
debug = False # prints debug output if True
In [2]:
import sys
!{sys.executable} -m pip install pgeocode # install pgeocode using pip, since this environment doesn't have it by default
Requirement already satisfied: pgeocode in /opt/conda/lib/python3.9/site-packages (0.3.0)
Requirement already satisfied: requests in /opt/conda/lib/python3.9/site-packages (from pgeocode) (2.27.1)
Requirement already satisfied: pandas in /opt/conda/lib/python3.9/site-packages (from pgeocode) (1.4.0)
Requirement already satisfied: numpy in /opt/conda/lib/python3.9/site-packages (from pgeocode) (1.21.5)
Requirement already satisfied: python-dateutil>=2.8.1 in /opt/conda/lib/python3.9/site-packages (from pandas->pgeocode) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.9/site-packages (from pandas->pgeocode) (2021.3)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.9/site-packages (from requests->pgeocode) (2021.10.8)
Requirement already satisfied: charset-normalizer~=2.0.0 in /opt/conda/lib/python3.9/site-packages (from requests->pgeocode) (2.0.10)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.9/site-packages (from requests->pgeocode) (3.3)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.9/site-packages (from requests->pgeocode) (1.26.8)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.9/site-packages (from python-dateutil>=2.8.1->pandas->pgeocode) (1.16.0)
In [3]:
import pandas as pd
import numpy as np
import folium
import seaborn as sns
import matplotlib.pyplot as plt # pyplot interface, used for the figures below
from datetime import datetime
import pgeocode
import requests
from bs4 import BeautifulSoup
import re
import scipy

In the code block above, I imported all the libraries I will need. In the code block below, I read the CSV file into a pandas DataFrame.

Data Collection and Parsing:
We start by downloading the data as a CSV (comma-separated values) file. Other formats work too, but they require different code to read into Python (a quick sketch follows). We place this file in the same directory as our code.
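For reference, if the data came in a different format, the matching pandas reader would be used instead. The file names below are hypothetical, since the actual Kaggle download is a CSV:

# Hypothetical alternatives, shown only for comparison -- these files don't exist here:
# data = pd.read_excel('superstore.xlsx')       # Excel workbook
# data = pd.read_json('superstore.json')        # JSON records
# data = pd.read_parquet('superstore.parquet')  # Parquet file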

In [4]:
data = pd.read_csv('superstore.csv', sep = ',', encoding='windows-1252')
data.head()
Out[4]:
Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Country City ... Postal Code Region Product ID Category Sub-Category Product Name Sales Quantity Discount Profit
0 1 CA-2016-152156 11/8/2016 11/11/2016 Second Class CG-12520 Claire Gute Consumer United States Henderson ... 42420 South FUR-BO-10001798 Furniture Bookcases Bush Somerset Collection Bookcase 261.9600 2 0.00 41.9136
1 2 CA-2016-152156 11/8/2016 11/11/2016 Second Class CG-12520 Claire Gute Consumer United States Henderson ... 42420 South FUR-CH-10000454 Furniture Chairs Hon Deluxe Fabric Upholstered Stacking Chairs,... 731.9400 3 0.00 219.5820
2 3 CA-2016-138688 6/12/2016 6/16/2016 Second Class DV-13045 Darrin Van Huff Corporate United States Los Angeles ... 90036 West OFF-LA-10000240 Office Supplies Labels Self-Adhesive Address Labels for Typewriters b... 14.6200 2 0.00 6.8714
3 4 US-2015-108966 10/11/2015 10/18/2015 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale ... 33311 South FUR-TA-10000577 Furniture Tables Bretford CR4500 Series Slim Rectangular Table 957.5775 5 0.45 -383.0310
4 5 US-2015-108966 10/11/2015 10/18/2015 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale ... 33311 South OFF-ST-10000760 Office Supplies Storage Eldon Fold 'N Roll Cart System 22.3680 2 0.20 2.5164

5 rows × 21 columns

Data Management + Representation:
The next step is to process the data. This means handling missing values and converting some datatypes into more useful ones. Here, I will simply drop any row with an NA value, since there isn't really a good default value for any of the columns. In this case, there aren't any missing values. I will also convert the order and ship dates to datetime objects, so they can be sorted and compared if we want to perform a time-based analysis.

In [5]:
data.dropna(inplace = True)
In [6]:
def to_datetime(row, target_col):
    dt = datetime.strptime(row[target_col], '%m/%d/%Y') # expected format
    return dt
data['Order Datetime'] = data.apply(lambda row: to_datetime(row, 'Order Date'), axis=1)
data['Ship Datetime'] = data.apply(lambda row: to_datetime(row, 'Ship Date'), axis=1)
data.drop(['Order Date', 'Ship Date'], axis = 1, inplace = True)
data.head()
Out[6]:
Row ID Order ID Ship Mode Customer ID Customer Name Segment Country City State Postal Code ... Product ID Category Sub-Category Product Name Sales Quantity Discount Profit Order Datetime Ship Datetime
0 1 CA-2016-152156 Second Class CG-12520 Claire Gute Consumer United States Henderson Kentucky 42420 ... FUR-BO-10001798 Furniture Bookcases Bush Somerset Collection Bookcase 261.9600 2 0.00 41.9136 2016-11-08 2016-11-11
1 2 CA-2016-152156 Second Class CG-12520 Claire Gute Consumer United States Henderson Kentucky 42420 ... FUR-CH-10000454 Furniture Chairs Hon Deluxe Fabric Upholstered Stacking Chairs,... 731.9400 3 0.00 219.5820 2016-11-08 2016-11-11
2 3 CA-2016-138688 Second Class DV-13045 Darrin Van Huff Corporate United States Los Angeles California 90036 ... OFF-LA-10000240 Office Supplies Labels Self-Adhesive Address Labels for Typewriters b... 14.6200 2 0.00 6.8714 2016-06-12 2016-06-16
3 4 US-2015-108966 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale Florida 33311 ... FUR-TA-10000577 Furniture Tables Bretford CR4500 Series Slim Rectangular Table 957.5775 5 0.45 -383.0310 2015-10-11 2015-10-18
4 5 US-2015-108966 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale Florida 33311 ... OFF-ST-10000760 Office Supplies Storage Eldon Fold 'N Roll Cart System 22.3680 2 0.20 2.5164 2015-10-11 2015-10-18

5 rows × 21 columns
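As an aside, pandas can do this conversion without a row-wise apply. Here is a sketch of an equivalent vectorized approach, run on a fresh copy of the raw file since the original string columns above have already been dropped:

# Vectorized date parsing -- equivalent to the apply-based conversion above.
raw = pd.read_csv('superstore.csv', sep=',', encoding='windows-1252')
raw['Order Datetime'] = pd.to_datetime(raw['Order Date'], format='%m/%d/%Y')
raw['Ship Datetime'] = pd.to_datetime(raw['Ship Date'], format='%m/%d/%Y')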

With the datetime columns created, we can drop the original date strings, since they are no longer needed. It may also be useful to have each item's base price, discounted sale price, and profit per item sold, so I'll add columns for those, calculated from sales, quantity, and discount.
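As a quick worked example, take row 3 of the preview above (the Bretford table): Sales = 957.5775, Quantity = 5, Discount = 0.45, Profit = -383.031.

# Worked example for a single row (row index 3 in the preview above):
sales, quantity, discount, profit = 957.5775, 5, 0.45, -383.031

base_price = sales / (1 - discount) / quantity  # ≈ 348.21, list price per item before discount
sale_price = sales / quantity                   # ≈ 191.52, discounted price actually paid per item
profit_per_item = profit / quantity             # ≈ -76.61, so each item in this order sold at a loss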

In [7]:
def get_base_price(row):
    return row['Sales'] / (1 - row['Discount']) / row['Quantity'] # if there were ever 100% discounts, we would have a problem here, but there aren't

def get_sale_price(row):
    return row['Sales'] / row['Quantity']

def get_per_item_profit(row):
    return row['Profit'] / row['Quantity']

data['Base Price'] = data.apply(lambda row: get_base_price(row), axis=1)
data['Sale Price'] = data.apply(lambda row: get_sale_price(row), axis=1)
data['Profit Per Item'] = data.apply(lambda row: get_per_item_profit(row), axis=1)

data.sort_values(by = 'Profit Per Item', inplace = True, ascending = False) # we sort for the highest profit per item

data
Out[7]:
Row ID Order ID Ship Mode Customer ID Customer Name Segment Country City State Postal Code ... Product Name Sales Quantity Discount Profit Order Datetime Ship Datetime Base Price Sale Price Profit Per Item
6826 6827 CA-2016-118689 Standard Class TC-20980 Tamara Chand Corporate United States Lafayette Indiana 47905 ... Canon imageCLASS 2200 Advanced Copier 17499.950 5 0.0 8399.9760 2016-10-02 2016-10-09 3499.99 3499.990 1679.9952
4190 4191 CA-2017-166709 Standard Class HL-15040 Hunter Lopez Consumer United States Newark Delaware 19711 ... Canon imageCLASS 2200 Advanced Copier 10499.970 3 0.0 5039.9856 2017-11-17 2017-11-22 3499.99 3499.990 1679.9952
8153 8154 CA-2017-140151 First Class RB-19360 Raymond Buch Consumer United States Seattle Washington 98115 ... Canon imageCLASS 2200 Advanced Copier 13999.960 4 0.0 6719.9808 2017-03-23 2017-03-25 3499.99 3499.990 1679.9952
318 319 CA-2014-164973 Standard Class NM-18445 Nathan Mautz Home Office United States New York City New York 10024 ... Canon imageCLASS MF7460 Monochrome Digital Las... 3991.980 2 0.0 1995.9900 2014-11-04 2014-11-09 1995.99 1995.990 997.9950
2623 2624 CA-2017-127180 First Class TA-21385 Tom Ashbrook Home Office United States New York City New York 10024 ... Canon imageCLASS 2200 Advanced Copier 11199.968 4 0.2 3919.9888 2017-10-22 2017-10-24 3499.99 2799.992 979.9972
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4991 4992 US-2017-122714 Standard Class HG-14965 Henry Goldwyn Corporate United States Chicago Illinois 60653 ... Ibico EPK-21 Electric Binding System 1889.990 5 0.8 -2929.4845 2017-12-07 2017-12-13 1889.99 377.998 -585.8969
3011 3012 CA-2017-134845 Standard Class SR-20425 Sharelle Roach Home Office United States Louisville Colorado 80027 ... Lexmark MX611dhe Monochrome Laser Printer 2549.985 5 0.7 -3399.9800 2017-04-17 2017-04-23 1699.99 509.997 -679.9960
683 684 US-2017-168116 Same Day GT-14635 Grant Thornton Corporate United States Burlington North Carolina 27217 ... Cubify CubeX 3D Printer Triple Head Print 7999.980 4 0.5 -3839.9904 2017-11-04 2017-11-04 3999.99 1999.995 -959.9976
7772 7773 CA-2016-108196 Standard Class CS-12505 Cindy Stewart Consumer United States Lancaster Ohio 43130 ... Cubify CubeX 3D Printer Double Head Print 4499.985 5 0.7 -6599.9780 2016-11-25 2016-12-02 2999.99 899.997 -1319.9956
3151 3152 CA-2015-147830 First Class NF-18385 Natalie Fritzler Consumer United States Newark Ohio 43055 ... Cubify CubeX 3D Printer Double Head Print 1799.994 2 0.7 -2639.9912 2015-12-15 2015-12-18 2999.99 899.997 -1319.9956

9994 rows × 24 columns

Exploratory Data Analysis:
We can now explore the data to find insights. I will start by finding all instances of negative profit, and look for commonalities in the entries.

In [8]:
neg_profit = data[data['Profit'] < 0] # select entries where profit is negative
neg_profit = neg_profit.sort_values(by = 'Profit', ascending = True) # display the biggest losses first
neg_profit
Out[8]:
Row ID Order ID Ship Mode Customer ID Customer Name Segment Country City State Postal Code ... Product Name Sales Quantity Discount Profit Order Datetime Ship Datetime Base Price Sale Price Profit Per Item
7772 7773 CA-2016-108196 Standard Class CS-12505 Cindy Stewart Consumer United States Lancaster Ohio 43130 ... Cubify CubeX 3D Printer Double Head Print 4499.985 5 0.7 -6599.9780 2016-11-25 2016-12-02 2999.99 899.997 -1319.9956
683 684 US-2017-168116 Same Day GT-14635 Grant Thornton Corporate United States Burlington North Carolina 27217 ... Cubify CubeX 3D Printer Triple Head Print 7999.980 4 0.5 -3839.9904 2017-11-04 2017-11-04 3999.99 1999.995 -959.9976
9774 9775 CA-2014-169019 Standard Class LF-17185 Luke Foster Consumer United States San Antonio Texas 78207 ... GBC DocuBind P400 Electric Binding System 2177.584 8 0.8 -3701.8928 2014-07-26 2014-07-30 1360.99 272.198 -462.7366
3011 3012 CA-2017-134845 Standard Class SR-20425 Sharelle Roach Home Office United States Louisville Colorado 80027 ... Lexmark MX611dhe Monochrome Laser Printer 2549.985 5 0.7 -3399.9800 2017-04-17 2017-04-23 1699.99 509.997 -679.9960
4991 4992 US-2017-122714 Standard Class HG-14965 Henry Goldwyn Corporate United States Chicago Illinois 60653 ... Ibico EPK-21 Electric Binding System 1889.990 5 0.8 -2929.4845 2017-12-07 2017-12-13 1889.99 377.998 -585.8969
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4660 4661 CA-2015-112130 Standard Class SV-20785 Stewart Visinsky Consumer United States Philadelphia Pennsylvania 19140 ... SanDisk Cruzer 16 GB USB Flash Drive 27.552 3 0.2 -0.3444 2015-05-03 2015-05-07 11.48 9.184 -0.1148
7413 7414 CA-2017-121125 Standard Class MG-17890 Michael Granlund Home Office United States Tigard Oregon 97224 ... Tensor Brushed Steel Torchiere Floor Lamp 13.592 1 0.2 -0.3398 2017-05-30 2017-06-03 16.99 13.592 -0.3398
1566 1567 CA-2015-129112 First Class AW-10840 Anthony Witt Consumer United States Allen Texas 75002 ... Kingston Digital DataTraveler 16GB USB 2.0 21.480 3 0.2 -0.2685 2015-11-29 2015-11-30 8.95 7.160 -0.0895
1496 1497 CA-2017-152485 Standard Class JD-15790 John Dryer Consumer United States Coppell Texas 75019 ... Acco Perma 3000 Stacking Storage Drawers 16.784 1 0.2 -0.2098 2017-09-04 2017-09-08 20.98 16.784 -0.2098
3228 3229 CA-2014-108189 First Class ES-14080 Erin Smith Corporate United States Tempe Arizona 85281 ... Kingston Digital DataTraveler 16GB USB 2.0 7.160 1 0.2 -0.0895 2014-10-02 2014-10-05 8.95 7.160 -0.0895

1871 rows × 24 columns

We can see that many of the biggest losses involve multiple items sold, which makes sense. To control for this, we can sort by loss per item instead.

In [9]:
neg_profit = neg_profit.sort_values(by = 'Profit Per Item', ascending = True) # display the biggest losses first
neg_profit
Out[9]:
Row ID Order ID Ship Mode Customer ID Customer Name Segment Country City State Postal Code ... Product Name Sales Quantity Discount Profit Order Datetime Ship Datetime Base Price Sale Price Profit Per Item
7772 7773 CA-2016-108196 Standard Class CS-12505 Cindy Stewart Consumer United States Lancaster Ohio 43130 ... Cubify CubeX 3D Printer Double Head Print 4499.985 5 0.7 -6599.9780 2016-11-25 2016-12-02 2999.99 899.997 -1319.9956
3151 3152 CA-2015-147830 First Class NF-18385 Natalie Fritzler Consumer United States Newark Ohio 43055 ... Cubify CubeX 3D Printer Double Head Print 1799.994 2 0.7 -2639.9912 2015-12-15 2015-12-18 2999.99 899.997 -1319.9956
683 684 US-2017-168116 Same Day GT-14635 Grant Thornton Corporate United States Burlington North Carolina 27217 ... Cubify CubeX 3D Printer Triple Head Print 7999.980 4 0.5 -3839.9904 2017-11-04 2017-11-04 3999.99 1999.995 -959.9976
3011 3012 CA-2017-134845 Standard Class SR-20425 Sharelle Roach Home Office United States Louisville Colorado 80027 ... Lexmark MX611dhe Monochrome Laser Printer 2549.985 5 0.7 -3399.9800 2017-04-17 2017-04-23 1699.99 509.997 -679.9960
4991 4992 US-2017-122714 Standard Class HG-14965 Henry Goldwyn Corporate United States Chicago Illinois 60653 ... Ibico EPK-21 Electric Binding System 1889.990 5 0.8 -2929.4845 2017-12-07 2017-12-13 1889.99 377.998 -585.8969
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4660 4661 CA-2015-112130 Standard Class SV-20785 Stewart Visinsky Consumer United States Philadelphia Pennsylvania 19140 ... SanDisk Cruzer 16 GB USB Flash Drive 27.552 3 0.2 -0.3444 2015-05-03 2015-05-07 11.48 9.184 -0.1148
1566 1567 CA-2015-129112 First Class AW-10840 Anthony Witt Consumer United States Allen Texas 75002 ... Kingston Digital DataTraveler 16GB USB 2.0 21.480 3 0.2 -0.2685 2015-11-29 2015-11-30 8.95 7.160 -0.0895
8881 8882 CA-2016-135594 Second Class AH-10120 Adrian Hane Home Office United States Aurora Illinois 60505 ... Kingston Digital DataTraveler 16GB USB 2.0 50.120 7 0.2 -0.6265 2016-06-30 2016-07-03 8.95 7.160 -0.0895
2769 2770 US-2015-122140 Standard Class MO-17950 Michael Oakman Consumer United States Dallas Texas 75220 ... Kingston Digital DataTraveler 16GB USB 2.0 50.120 7 0.2 -0.6265 2015-04-02 2015-04-07 8.95 7.160 -0.0895
3228 3229 CA-2014-108189 First Class ES-14080 Erin Smith Corporate United States Tempe Arizona 85281 ... Kingston Digital DataTraveler 16GB USB 2.0 7.160 1 0.2 -0.0895 2014-10-02 2014-10-05 8.95 7.160 -0.0895

1871 rows × 24 columns

Looking at this, I noticed that the entries at both ends, from the largest losses to the smallest, all have a nonzero discount. I want to check whether any non-discounted items sold at a loss, so I filter the table again.

In [10]:
neg_profit2 = neg_profit[neg_profit['Discount'] == 0] # filter for discount = 0
neg_profit2
Out[10]:
Row ID Order ID Ship Mode Customer ID Customer Name Segment Country City State Postal Code ... Product Name Sales Quantity Discount Profit Order Datetime Ship Datetime Base Price Sale Price Profit Per Item

0 rows × 24 columns

It turns out that no undiscounted item was sold at a loss, which makes sense. Next, I want to look at factors that may affect profits. Since profit per item is the most meaningful measure here, I will graph variables against profit per item rather than overall profit. We can use matplotlib and seaborn for this, since they make it easy to plot different variables against each other.

In [11]:
fig, ax = plt.subplots()

sns.violinplot(x = 'Ship Mode', y = 'Profit Per Item', data=data, ax = ax)
ax.set_xlabel("Ship Mode")
ax.set_ylabel("Profit Per Item")
Out[11]:
Text(0, 0.5, 'Profit Per Item')

This graph looks squished because of outliers. Even so, we can conclude that the vast majority of items across all shipping classes yield very little profit or loss, and the median for each shipping mode is about the same, so shipping mode doesn't seem to affect profits much. To get a closer look at the center of the distribution, we can remove the outliers and replot.

In [12]:
no_outliers = data[data['Profit Per Item'] < 500]
no_outliers = no_outliers[no_outliers['Profit Per Item'] > -500]

fig, ax = plt.subplots()

sns.violinplot(x = 'Ship Mode', y = 'Profit Per Item', data=no_outliers, ax = ax)
ax.set_xlabel('Ship Mode')
ax.set_ylabel('Profit Per Item')
Out[12]:
Text(0, 0.5, 'Profit Per Item')
In [13]:
data[data['Ship Mode'] == 'Standard Class'].hist(column = 'Profit Per Item', bins = 100)
Out[13]:
array([[<AxesSubplot:title={'center':'Profit Per Item'}>]], dtype=object)
In [14]:
data[data['Ship Mode'] == 'First Class'].hist(column = 'Profit Per Item', bins = 100)
Out[14]:
array([[<AxesSubplot:title={'center':'Profit Per Item'}>]], dtype=object)
In [15]:
data[data['Ship Mode'] == 'Second Class'].hist(column = 'Profit Per Item', bins = 100)
Out[15]:
array([[<AxesSubplot:title={'center':'Profit Per Item'}>]], dtype=object)
In [16]:
data[data['Ship Mode'] == 'Same Day'].hist(column = 'Profit Per Item', bins = 100)
Out[16]:
array([[<AxesSubplot:title={'center':'Profit Per Item'}>]], dtype=object)

From these histograms, it looks like shipping mode doesn't have much of an impact on profit per item, and the vast majority of per-item profits sit just above zero, which agrees with the violin plots above.
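To put a rough number on that impression, one option (a quick sketch added here, not a test from the original analysis) is a Kruskal-Wallis test comparing the per-item profit distributions across the four shipping modes; a large p-value would be consistent with ship mode having little effect:

from scipy import stats

# Compare per-item profit distributions across the four shipping modes.
groups = [grp['Profit Per Item'].values for _, grp in data.groupby('Ship Mode')]
stat, p = stats.kruskal(*groups)
print('Kruskal-Wallis H = {:.2f}, p = {:.4f}'.format(stat, p))

This is only a coarse check, since the distributions are heavy-tailed, but it complements the plots.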

It may also be interesting to look at net profit and quantity by product. Since we can't average names, and it's inconvenient and unnecessary to average dates, we drop those columns and anything else we don't need.

In [17]:
pruned_data = data.drop(['Row ID', 'Base Price', 'Sale Price', 'Region', 'Product ID', 'Category', 'Sub-Category', 'Order Datetime', 'Ship Datetime', 'Order ID', 'Ship Mode', 'Customer Name', 'Customer ID', 'Segment', 'Country', 'City', 'State', 'Postal Code'], axis = 1)
grouped_data = pruned_data.groupby('Product Name', as_index = False).agg({'Sales':'sum', 'Quantity':'sum','Profit':'sum', 'Discount':'mean', 'Profit Per Item':'mean'}) # we separate the dataframe into groups by product name and sum the results, but take the mean discount
grouped_data['Profit Per Item'] = grouped_data.apply(lambda row: get_per_item_profit(row), axis=1) # recompute profit per item from the summed totals, since summing the per-row averages doesn't make sense
grouped_data['Profit Percent'] = grouped_data.apply(lambda row: row['Profit'] / row['Sales'] * 100, axis=1)
grouped_data.sort_values(by = 'Profit Percent', inplace = True, ascending = False)

grouped_data
Out[17]:
Product Name Sales Quantity Profit Discount Profit Per Item Profit Percent
109 Adams Telephone Message Book w/Frequently-Call... 223.440 28 111.7200 0.0 3.990000 50.000000
405 Canon imageCLASS MF7460 Monochrome Digital Las... 3991.980 2 1995.9900 0.0 997.995000 50.000000
181 Avery 475 266.400 18 133.2000 0.0 7.400000 50.000000
1673 Xerox 1890 244.700 5 122.3500 0.0 24.470000 50.000000
1567 Tops Green Bar Computer Printout Paper 342.580 7 171.2900 0.0 24.470000 50.000000
... ... ... ... ... ... ... ...
1838 Zebra GK420t Direct Thermal/Thermal Transfer P... 703.710 6 -938.2800 0.7 -156.380000 -133.333333
1215 Okidata B401 Printer 179.991 3 -251.9874 0.7 -83.995800 -140.000000
613 Euro Pro Shark Stick Mini Vacuum 170.744 11 -325.6332 0.6 -29.603018 -190.714286
388 Bush Westfield Collection Bookcases, Dark Cher... 90.882 3 -190.8522 0.7 -63.617400 -210.000000
605 Eureka Disposable Bags for Sanitaire Vibra Gro... 1.624 2 -4.4660 0.8 -2.233000 -275.000000

1850 rows × 7 columns

Here we get aggregate statistics for each product. It is important to note that the Discount column is only approximate, since we average across transactions instead of weighting by the quantity of items sold (a quantity-weighted version is sketched after the next table), but it is a decent approximation. We can immediately see that the highest profit margin is 50%, and most of the items with a 50% margin are sold with no discount.

Next, we sort by quantity; many of the most popular items are sold at a profit, with a few exceptions.

In [18]:
grouped_data.sort_values(by = 'Quantity', inplace = True, ascending = False)

grouped_data
Out[18]:
Product Name Sales Quantity Profit Discount Profit Per Item Profit Percent
1499 Staples 755.470 215 292.2046 0.100000 1.359091 38.678518
1493 Staple envelope 1686.812 170 751.6290 0.075000 4.421347 44.559145
537 Easy-staple paper 2504.192 150 1096.0292 0.060870 7.306861 43.767778
1500 Staples in misc. colors 478.812 86 124.0516 0.063158 1.442460 25.908206
942 KI Adjustable-Height Table 4552.641 74 -248.4822 0.236111 -3.357868 -5.457979
... ... ... ... ... ... ... ...
385 Bush Saratoga Collection 5-Shelf Bookcase, Han... 119.833 1 -12.6882 0.150000 -12.688200 -10.588235
749 Global Enterprise Series Seating Low-Back Swiv... 207.184 1 25.8980 0.200000 25.898000 12.500000
1253 Penpower WorldCard Pro Card Scanner 91.475 1 -1.8295 0.500000 -1.829500 -2.000000
356 Boston 1900 Electric Pencil Sharpener 14.980 1 4.4940 0.000000 4.494000 30.000000
1793 Xerox 20 6.480 1 3.1104 0.000000 3.110400 48.000000

1850 rows × 7 columns
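As noted above, the Discount column in these grouped tables is an average over transactions rather than over items sold. If a quantity-weighted average were needed, it could be computed along these lines (a sketch using the pruned_data frame defined above; the result isn't used elsewhere in this analysis):

# Quantity-weighted average discount per product: weight each transaction's
# discount by the number of items it covers, instead of counting transactions equally.
tmp = pruned_data.copy()
tmp['disc_x_qty'] = tmp['Discount'] * tmp['Quantity']
weighted = tmp.groupby('Product Name')[['disc_x_qty', 'Quantity']].sum()
weighted_discount = weighted['disc_x_qty'] / weighted['Quantity'] # Series indexed by product name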

Finally, we sort by profit to see the most and least profitable items overall.

In [19]:
grouped_data.sort_values(by = 'Profit', inplace = True, ascending = False)

grouped_data
Out[19]:
Product Name Sales Quantity Profit Discount Profit Per Item Profit Percent
404 Canon imageCLASS 2200 Advanced Copier 61599.824 20 25199.9280 0.120000 1259.996400 40.909091
650 Fellowes PB500 Electric Punch Plastic Comb Bin... 27453.384 31 7753.0390 0.240000 250.098032 28.240741
805 Hewlett Packard LaserJet 3310 Copier 18839.686 38 6983.8836 0.200000 183.786411 37.070064
400 Canon PC1060 Personal Laser Copier 11619.834 19 4570.9347 0.150000 240.575511 39.337349
787 HP Designjet T520 Inkjet Large Format Printer ... 18374.895 12 4094.9766 0.166667 341.248050 22.285714
... ... ... ... ... ... ... ...
376 Bush Advantage Collection Racetrack Conference... 9544.725 33 -1934.3976 0.350000 -58.618109 -20.266667
425 Chromcraft Bull-Nose Wood Oval Conference Tabl... 9917.640 27 -2876.1156 0.280000 -106.522800 -29.000000
476 Cubify CubeX 3D Printer Triple Head Print 7999.980 4 -3839.9904 0.500000 -959.997600 -48.000000
985 Lexmark MX611dhe Monochrome Laser Printer 16829.901 18 -4589.9730 0.400000 -254.998500 -27.272727
475 Cubify CubeX 3D Printer Double Head Print 11099.963 9 -8879.9704 0.533333 -986.663378 -80.000000

1850 rows × 7 columns

Next, I will try to visualize where sales are occurring. I will use pgeocode to convert the postal codes in the data to geographic coordinates, and folium to plot sales on a map. Since the data covers only the United States (verified in a code snippet below), I will center and zoom the map on the US from the start.
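Before building the map, here is a quick illustration of what a single pgeocode lookup returns (10024 is one of the postal codes in the data; the exact fields may vary slightly by pgeocode version):

nomi = pgeocode.Nominatim('us')        # lookups against the US postal-code database
loc = nomi.query_postal_code('10024')  # returns a pandas Series with place_name, state_name, latitude, longitude, ...
print(loc['place_name'], loc['latitude'], loc['longitude'])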

In [20]:
pruned_data = data.drop(['Row ID', 'Base Price', 'Sale Price', 'Product Name', 'Region', 'Product ID', 'Category', 'Sub-Category', 'Order Datetime', 'Ship Datetime', 'Order ID', 'Ship Mode', 'Customer Name', 'Customer ID', 'Segment', 'Country', 'City', 'State'], axis = 1)
grouped_data2 = pruned_data.groupby('Postal Code', as_index = False).agg({'Sales':'sum', 'Quantity':'sum','Profit':'sum', 'Discount':'mean', 'Profit Per Item':'mean'}) # we separate the dataframe into groups by postal code
grouped_data2['Profit Per Item'] = grouped_data2.apply(lambda row: get_per_item_profit(row), axis=1) # recompute profit per item from the summed totals, since summing the per-row averages doesn't make sense
grouped_data2['Profit Percent'] = grouped_data2.apply(lambda row: row['Profit'] / row['Sales'] * 100, axis=1)
grouped_data2.sort_values(by = 'Quantity', inplace = True, ascending = False)

grouped_data2
Out[20]:
Postal Code Sales Quantity Profit Discount Profit Per Item Profit Percent
55 10035 77357.885 953 16533.8669 0.047529 17.349283 21.373215
52 10009 54761.496 848 13697.0019 0.054585 16.152125 25.012103
54 10024 78697.182 840 21653.7248 0.064783 25.778244 27.515248
578 94122 52667.467 802 7712.5958 0.072660 9.616703 14.643947
53 10011 45551.598 776 10152.3901 0.059585 13.082977 22.287671
... ... ... ... ... ... ... ...
623 98208 3.856 1 1.3978 0.200000 1.397800 36.250000
462 80122 72.294 1 -98.8018 0.700000 -98.801800 -136.666667
63 11757 55.480 1 26.6304 0.000000 26.630400 48.000000
475 83501 9.584 1 3.3544 0.200000 3.354400 35.000000
612 98002 4.180 1 1.5048 0.000000 1.504800 36.000000

631 rows × 7 columns

In [21]:
print(data['Country'].unique()) # the only country in this dataset is the US
['United States']
In [22]:
def sales_circle_size(sales): # scale the plotted circle radius with the dollar amount of sales (square root keeps large values readable)
    if sales <= 1:
        return 1
    return sales ** 0.5 / 10

def profit_circle_size(profit): # same square-root scaling for profits (and, negated, for losses)
    if profit <= 1:
        return 1
    return profit ** 0.5 / 10
In [23]:
map_osm = folium.Map(location=[36.84, -90.3], zoom_start=4)

code_to_loc = {}
nomi = pgeocode.Nominatim('us') # use us postal codes
for idx, row in grouped_data2.iterrows(): # idx is the dataframe index; row holds the aggregated values for one postal code
    loc = nomi.query_postal_code('{:05}'.format(int(row['Postal Code']))) # zero-pad to five digits before the lookup
    if(np.isnan(loc.latitude)):
        if(debug):
            print(loc)
        continue
    code_to_loc[int(row['Postal Code'])] = [loc.latitude, loc.longitude]
    if row['Sales'] > 0:
        folium.CircleMarker(location = [loc.latitude, loc.longitude], radius = sales_circle_size(row['Sales']), weight = 0.5, color = "blue", fill = True).add_to(map_osm)
    elif debug:
        print("0 sales?")
    if row['Profit'] > 0:
        folium.CircleMarker(location = [loc.latitude, loc.longitude], radius = profit_circle_size(row['Profit']), weight = 2, color = "green", fill = True).add_to(map_osm)
    else:
        folium.CircleMarker(location = [loc.latitude, loc.longitude], radius = profit_circle_size(-1 * row['Profit']), weight = 1, color = "red", fill = True).add_to(map_osm)

map_osm
Out[23]:
(Interactive folium map rendered here: sales, profits, and losses plotted by postal code.)

In this map, blue circles represent sales, green circles represent profits, and red circles represent losses; in each case the circle's area is proportional to the dollar amount. Proximity to major port cities looks like a big factor in whether a location is profitable: locations on the coasts appear more profitable than those in the central US. There are plenty of exceptions, so this certainly doesn't explain everything, but there may still be some correlation, and that is what we will examine more closely.

Hypothesis Testing:
First, we have to decide which ports to count. I found a list of the top 10 US ports here: https://www.icontainers.com/us/2017/05/16/top-10-us-ports/, which is what I decided to use, though the set of ports could be larger or smaller; you could even use proximity to water instead. The latitude and longitude of all US ports can be found here: https://www.gccports.com/ports/latitude-longitude/search. My hypothesis is that there is a correlation between distance from the top 10 ports and profit per item; the null hypothesis is that there is no correlation.

In [24]:
full_port_set = False # toggle for using full set of available US ports
In [25]:
def parse_coord(s):
    s = strip_td(s)
    coords = re.split('°|\'', s) # split into degrees, minutes, and hemisphere letter
    val = float(coords[0]) + float(coords[1]) / 60.0 # convert degrees and minutes to decimal degrees
    if coords[2] == 'W' or coords[2] == 'S': # west and south are negative
        val = val * -1
    return val

def strip_td(s):
    s = str(s)
    s = s[4:len(s) - 5] # remove the surrounding '<td>' and '</td>' tags
    return s

target_ports = ['Miami', 'Oakland', 'Charleston', 'Georgetown', 'Houston', 'Norfolk', 'Seattle', 'Tacoma', 'Savannah', 'Brunswick', 'Newark', 'Brooklyn', 'Long Beach', 'Los Angeles']
target_coords = []
port_data = {}
for page in range(1, 10): # the port table spans 9 pages
    # this website's SSL certificate is expired, so we disable verification rather than handling it manually
    res = requests.get('https://www.gccports.com/ports/latitude-longitude/search/?n=&c=231&cd=&per_page=' + str(page), verify = False)
    root = BeautifulSoup(res.content, 'html.parser') # parse the HTML using Beautiful Soup
    table = root.find('table') # find the table of ports
    ports = table.findAll('td') # grab every data cell in the table
    for i in range(0, len(ports))[::5]: # each row is 5 cells long: position 0 holds the name, positions 3 and 4 the latitude and longitude
        port_data[strip_td(ports[i])] = [parse_coord(ports[i + 3]), parse_coord(ports[i + 4])]
if debug:
    print(port_data)
for port in (port_data.keys() if full_port_set else target_ports):
    if port in port_data.keys():
        target_coords.append(port_data[port])
    else:
        print('Could not find port ' + port)
if debug:
    print(target_coords)
/opt/conda/lib/python3.9/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.gccports.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
(the same warning is repeated once for each of the 9 pages fetched)
Could not find port Oakland
Could not find port Georgetown
Could not find port Brunswick
Could not find port Newark
Could not find port Brooklyn

A few of the ports mentioned in the article aren't in the scraped table, but that's fine. Before computing distances, here is a quick sanity check that the coordinate parser behaves as expected (the cell values below are made up, in the site's degree/minute format):
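# Sanity check of the coordinate parser on made-up table cells.
print(parse_coord("<td>25°46'N</td>"))  # expected 25 + 46/60 ≈ 25.767
print(parse_coord("<td>80°11'W</td>"))  # expected -(80 + 11/60) ≈ -80.183

With the parser behaving as expected, we can now add a column for the minimum distance from each postal code to one of the target ports.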

In [26]:
def distance(a, b):
    # straight-line distance in degrees of latitude/longitude -- a rough proxy, good enough for ranking nearby points
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def best_distance(loc):
    # return the distance from loc to the closest target port
    best_distance = distance(loc, target_coords[0])
    for target in target_coords:
        dis = distance(loc, target)
        if dis < best_distance:
            best_distance = dis
    return best_distance

grouped_data2['Port Distance'] = grouped_data2.apply(lambda row: best_distance(code_to_loc[int(row['Postal Code'])]), axis=1)
grouped_data2.sort_values(by = 'Port Distance', inplace = True)
grouped_data2
Out[26]:
Postal Code Sales Quantity Profit Discount Profit Per Item Profit Percent Port Distance
620 98105 41838.008 625 8734.6042 0.069091 13.975367 20.877199 0.070542
619 98103 36541.826 539 7117.6170 0.058278 13.205226 19.478000 0.073883
166 33142 3619.421 38 -980.0357 0.381818 -25.790413 -27.077140 0.080095
621 98115 41160.908 426 13303.8755 0.067857 31.229755 32.321628 0.092427
165 33134 64.248 6 22.3578 0.200000 3.726300 34.799216 0.104742
... ... ... ... ... ... ... ... ...
315 57401 25.500 3 6.6300 0.000000 2.210000 26.000000 16.029117
316 57701 152.640 4 44.7684 0.000000 11.192100 29.329403 16.403706
317 58103 919.910 30 230.1497 0.000000 7.671657 25.018719 17.170214
313 56560 543.860 9 101.4106 0.000000 11.267844 18.646453 17.176817
299 54880 1299.730 37 514.0822 0.000000 13.894114 39.552999 17.258858

631 rows × 8 columns
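One caveat before interpreting the numbers: the distance above is a straight Euclidean distance in degrees of latitude and longitude, which is fine for ranking which port is closest but is not a physical distance. If actual distances mattered, a haversine version could be swapped in; a minimal sketch, not used for the results below:

import math

def haversine_km(a, b):
    # great-circle distance in kilometers between two (latitude, longitude) pairs
    lat1, lon1, lat2, lon2 = map(math.radians, [a[0], a[1], b[0], b[1]])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def best_distance_km(loc):
    # minimum great-circle distance from loc to any target port
    return min(haversine_km(loc, target) for target in target_coords)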

We can then fit a trendline with numpy and plot it, and perform a linear regression with scipy to test for significance.

In [27]:
m = np.polyfit(grouped_data2["Port Distance"], grouped_data2["Profit Per Item"], 1)
p = np.poly1d(m)
grouped_data2["trendline"] = p(grouped_data2.loc[:, "Port Distance"])

ax = grouped_data2.plot(x = "Port Distance", y = "Profit Per Item", kind = "scatter")
dfm2 = grouped_data2.set_index("Port Distance", inplace=False)
dfm2.trendline.sort_index(ascending=False).plot(ax=ax)
Out[27]:
<AxesSubplot:xlabel='Port Distance', ylabel='Profit Per Item'>
In [28]:
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(grouped_data2['Port Distance'], grouped_data2['Profit Per Item'])
p_value
Out[28]:
0.28374378416163254
In [29]:
slope
Out[29]:
0.20621553633367354

We see a slight positive relationship between port distance and profit per item, but our p-value is 0.28, which is greater than the standard alpha of 0.05, so we fail to reject the null hypothesis. The result could reflect a poor choice of ports, and a larger sample might still reveal something. It's an interesting outcome either way: I expected greater distance to increase shipping costs and reduce profits, but that doesn't appear to be the case.

In [30]:
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(grouped_data2['Port Distance'], grouped_data2['Sales'])
p_value
Out[30]:
0.0010136492863376115
In [31]:
slope
Out[31]:
-246.3898496821685

Here, by contrast, the p-value is about 0.001, so there is a statistically significant negative relationship between distance from the nearest large port and total sales. This could be because people tend to live near ports. We can take a look at the graph here:

In [32]:
ax = grouped_data2.plot(x = "Port Distance", y = "Sales", kind = "scatter")
dfm2 = grouped_data2.set_index("Port Distance", inplace=False)
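Both sales and per-item profits are heavily skewed by a handful of very large orders, so as a robustness check on the linear fits (a sketch added here, not part of the original analysis), a rank-based Spearman correlation can be computed:

from scipy import stats

# Rank-based correlations are less sensitive to the extreme sales/profit outliers.
rho_sales, p_sales = stats.spearmanr(grouped_data2['Port Distance'], grouped_data2['Sales'])
rho_profit, p_profit = stats.spearmanr(grouped_data2['Port Distance'], grouped_data2['Profit Per Item'])
print('Sales vs. distance:           rho = {:.3f}, p = {:.4f}'.format(rho_sales, p_sales))
print('Profit per item vs. distance: rho = {:.3f}, p = {:.4f}'.format(rho_profit, p_profit))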

Insights:
While we failed to reject the null hypothesis that there is no correlation between port distance and average profit per item, this could be due to a variety of factors. I initially thought that proximity to ports would reduce shipping costs, but that may not hold if other factors dominate those costs. My set of ports is also incomplete, and a more complete set of the ports this superstore chain actually uses could make the analysis more accurate. Finally, it is unlikely that this dataset contains every transaction, so there may simply be too little data to demonstrate the relationship.

I've added a toggle in the code (full_port_set) to use every US port from the scraped table. Doing so slightly reduces the p-values for both profit per item and sales, but not enough to make the relationship between port distance and profit per item statistically significant.

Despite this, we did find a statistically significant association between proximity to ports and overall sales, which suggests that locations near ports get more business. This may be because large ports tend to have large cities around them. Conducting further analysis (perhaps with the actual set of ports this unnamed superstore chain uses) could yield further insights.