Tutorial - Data Wrangling, Visualization and Inference with Social Data

Learning how to use Python to answer questions about the effect of Hurricane Maria in Puerto Rico

Posted by Ian Flores Siaca on October 15, 2018

This tutorial is based on the blog post in this webpage: https://www.bayesianpolitik.me/2018/10/16/analysis_hurricane.html. I strongly recommend reading the blog post first to get a sense of the data and the context.

There is a GitHub repository for this tutorial if you want to run it by yourself available here: https://github.com/ian-flores/mortality_analysis_tutorial

DISCLAIMER: I'm trying this new thing of publishing the analyses I make from now on as tutorial blog posts. It would be nice/helpful if I could get feedback. All kinds of feedback are welcome. Feel free to use the Contact tab to contact me.

Notebook Content:

  • #1) Getting and Cleaning Census Data
    • How to get data from the census
    • How to extract and manipulate data coming from a single table
  • #2) Getting and Cleaning Mortality Data
    • Cleaning data
    • Dealing with strings in Spanish
    • Grouping, summarizing, and splitting by different variables
  • #3) Mapping Mortality Rates
    • Creating maps in Python
    • Joining data from different sources
    • Dealing with spatial data in Python
  • #4) Estimating Mortality Rates in Puerto Rico
    • Bayesian Estimation with PyMC3
    • How to draw conclusions from data

01 - Getting and Cleaning Census Data

by Ian Flores Siaca

October 2018

Purpose of this Notebook

Learning

  • Using the os library for basic system functions
  • Standardizing names in Spanish with the unidecode library.

Project

  • Extract the population estimates for each municipality.
  • Clean the name of the municipalities in the census dataset.

Census Data

In 1840 the United States established a central office to oversee the census of the population of its states and territories. This office is known today as the United States Census Bureau. It performs a decennial census, the last one being in 2010. With this information it produces PEPANNRES - Annual Estimates of the Resident Population, which contains population estimates at different geographic levels. In March 2018 the Bureau published the estimates for municipalities for the year 2017. This is exactly what we need for our analysis.

Getting the Data

The Census Bureau makes its data available through several webpages and APIs. Here we are going to use the FactFinder webpage and download the data manually, but it is worth noting that API wrappers for Python are also available.

If you want to get the data manually, it is available at the following LINK. However, it is also included as a zip file in the repository for reproducibility purposes.

Let's code!

Load the libraries

  • pandas is a library for data wrangling and working with data in general.
  • os is a library that allows us to interact with the operating system at a lower level than Python.
  • unidecode is a library that helps us deal with the names of the municipalities in Spanish. (Remember, PR is a Spanish-speaking nation :D)
In [1]:
import pandas as pd
import os
import unidecode
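
As a quick illustration of what unidecode does (an illustrative example, not part of the original notebook):

unidecode.unidecode("Añasco")            # -> 'Anasco'
unidecode.unidecode("Mayagüez").upper()  # -> 'MAYAGUEZ'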

Extract the data

The data is stored in compressed form, so we need to extract it and save it in a place where we can use it.

Steps:

  • Extract the data ---- the unzip command
  • Store it in the data/ folder ---- the -d data/ flag
In [2]:
os.system("unzip ../data/PEP_2017_PEPANNRES.zip -d ../data/")
Out[2]:
0
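
Note that os.system hands the command to the shell, so unzip must be installed on your machine. A more portable alternative is Python's built-in zipfile module (a sketch, not the notebook's original code):

import zipfile

# extract every file in the archive into the data/ directory
with zipfile.ZipFile("../data/PEP_2017_PEPANNRES.zip") as archive:
    archive.extractall("../data/")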

Load the data

pd.read_csv is a function from the pandas package that lets us easily load Comma Separated Values (CSV) files into Python. The head() method lets us see the top rows of the DataFrame.

In [3]:
data = pd.read_csv("../data/PEP_2017_PEPANNRES.csv", encoding='latin-1')
data.head()
Out[3]:
GEO.id GEO.id2 GEO.display-label rescen42010 resbase42010 respop72010 respop72011 respop72012 respop72013 respop72014 respop72015 respop72016 respop72017
0 0500000US72001 72001 Adjuntas Municipio, Puerto Rico 19483 19483 19472 19297 19116 19019 18798 18560 18276 17971
1 0500000US72003 72003 Aguada Municipio, Puerto Rico 41959 41959 41913 41532 41107 40707 40135 39539 38853 38118
2 0500000US72005 72005 Aguadilla Municipio, Puerto Rico 60949 60949 60766 59976 58978 58036 57078 55808 54525 53164
3 0500000US72007 72007 Aguas Buenas Municipio, Puerto Rico 28659 28659 28652 28333 28052 27782 27350 26913 26382 25850
4 0500000US72009 72009 Aibonito Municipio, Puerto Rico 25900 25900 25874 25537 25205 24879 24448 24040 23566 23108

Dealing with names in Spanish

Spanish is a beautiful language; however, names in Spanish can be hard to deal with when programming because of inconsistencies in the data. These come from wrong file encodings, some names being typed in English and some in Spanish, or 'weird' characters simply being skipped. To deal with these inconsistencies, we convert the names to their English-letter representation and to upper case. This way we standardize the names across all the files.

Steps:

  • Extract the names of the municipalities from the strings
  • Convert the names to their English-letter representation
  • Convert the strings to upper case
In [4]:
municipalities = []

# Puerto Rico has 78 municipalities, one per row in this table
for i in range(0, 78):
    muni_raw = data['GEO.display-label'][i].split(" Municipio, Puerto Rico")[0]
    muni_clean = unidecode.unidecode(muni_raw)
    muni_upper = muni_clean.upper()
    municipalities.append(muni_upper)

data['ResidencePlace'] = municipalities
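
The same cleaning can also be expressed with pandas string methods instead of an explicit loop (an equivalent vectorized sketch):

data['ResidencePlace'] = (data['GEO.display-label']
                          .str.split(" Municipio, Puerto Rico").str[0]  # drop the suffix
                          .apply(unidecode.unidecode)                   # strip the accents
                          .str.upper())                                 # upper case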
In [5]:
data_processed = data[['ResidencePlace', 'respop72017']]
data_processed.head()
Out[5]:
ResidencePlace respop72017
0 ADJUNTAS 17971
1 AGUADA 38118
2 AGUADILLA 53164
3 AGUAS BUENAS 25850
4 AIBONITO 23108

Save the progress | Clean our directories

Let's save the population estimates for 2017 so we can use the file later in our pipeline.

In [6]:
data_processed.to_csv("../data/census.csv")

Finally, clean the data directory so that only the files we need to reproduce this analysis and produce its output remain.

In [7]:
os.system("rm ../data/PEP_2017_PEPANNRES.csv ../data/PEP_2017_PEPANNRES_metadata.csv ../data/PEP_2017_PEPANNRES.txt ../data/aff_download_readme.txt")
Out[7]:
0

Optional Questions

1) What is the municipality with the highest population? Which has the lowest?

2) Which municipality changed the most between 2016 and 2017?

3) How about between 2010 and 2017?

4) Can you find the municipality that has stayed the most stable (least variability) across all these years?

5) Are there municipalities consistently losing people and others gaining them?
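
If you want a starting point, question 1 reduces to a sort on the population column (a sketch):

data_processed.sort_values('respop72017', ascending=False).head(1)  # highest population
data_processed.sort_values('respop72017', ascending=True).head(1)   # lowest population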

02 - Getting and Cleaning Mortality Data

by Ian Flores Siaca

October 2018

Purpose of this Notebook

Learning

  • Use of the pandas groupby function.
  • Filtering data with string content.

Project

  • Compute the number of deaths per zone of each municipality.
  • Clean the name of the municipalities in the mortality dataset.

Mortality Data

Hurricane Maria struck Puerto Rico on September 20, 2017 at 6:15 a.m., entering through the municipality of Yabucoa as a Category 4 hurricane. According to the governments of Puerto Rico and the U.S. Virgin Islands, the cost of the damage is estimated at $102 billion USD. However, the impact wasn't only economic. Following a lawsuit presented by the Center for Investigative Journalism (CPI, by its Spanish initials) and CNN, the Government of Puerto Rico, and more specifically the Demographic Registry, was forced to publish individual-level data on all deaths that occurred from September 20, 2017 to June 11, 2018. This data is available in the following Dropbox Link from the CPI.

Let's code!

Load the libraries

In [1]:
import pandas as pd
import os

Here we are downloading the data directly from Dropbox and storing it under the data/ folder in the mortality.xlsx file.

In [8]:
os.system("wget --output-document='../data/mortality.xlsx' 'https://www.dropbox.com/s/k4wrb1ztwu0fwxh/Base%20de%20Datos%20Mortalidad%20en%20PR%20de%20septiembre%2018%20de%202017%20a%2011%20de%20junio%20de%202018%20entregada%20por%20Gobierno%20de%20PR%20al%20CPI.xls?dl=0'")
Out[8]:
0
In [9]:
data = pd.read_excel("../data/mortality.xlsx")
data.head()
Out[9]:
DeathNumber Volumen CertificateNumber InscriptionYear RegistryOffice RegistrationDate RegistrationDate_Year RegistrationDate_Month RegistrationDate_Day Name ... FuneralDate FuneralDate_Year FuneralDate_Month FuneralDate_Day DispositionName DispositionPlace FuneralFacilityName FuneralFacilityLicenseNumber FuneralFacilityPlace FuneralFacilityDirectorName
0 20099 34 186 2017 88 - COAMO 2017-09-26 2017 9 26 GLORIA ... 2017-09-22 00:00:00 2017.0 9.0 22.0 PORTA COELI CREMATION PLUS BAYAMON, PUERTO RICO FUNERARIA COAMEÑA NaN COAMO, PUERTO RICO DANIEL DIAZ AQUINIO
1 20147 34 1286 2017 44 - BAYAMON 2017-10-03 2017 10 3 MARISOL ... 2017-09-30 00:00:00 2017.0 9.0 30.0 MONTE CALVARIO CAGUAS, PUERTO RICO FUNERARIA ALTERNATIVE NaN BAYAMON, PUERTO RICO HECTOR COLON
2 20160 34 1085 2017 1 - NIVEL CENTRAL 2017-10-03 2017 10 3 RAMON ... 2017-09-26 00:00:00 2017.0 9.0 26.0 MUNICIPAL NUEVO ARECIBO, PUERTO RICO GONZALEZ NaN ARECIBO, PUERTO RICO ROLANDO VELEZ MENDEZ
3 20168 34 1089 2017 1 - NIVEL CENTRAL 2017-10-03 2017 10 3 GLORIA ... 2017-09-26 00:00:00 2017.0 9.0 26.0 NUEVO JERUSALEN UTUADO, PUERTO RICO UTUADO MEMORIAL NaN UTUADO, PUERTO RICO OLGA SAVEDA
4 20169 34 1297 2017 44 - BAYAMON 2017-10-03 2017 10 3 LUIS ... 2017-09-30 00:00:00 2017.0 9.0 30.0 NEW HORIZONS BAYAMON, PUERTO RICO FUNERARIA ASENCIO NaN BAYAMON, PUERTO RICO MELVIN PORTALATIN MONTALVO

5 rows × 187 columns

Filtering the data

  • We are using the ResidencePlace column as the column for the municipality.
  • Some people died in Puerto Rico but their residence place is not Puerto Rico; we want to exclude those data points from the analysis.
    • data[data.ResidencePlace.str.contains('PUERTO RICO')]
  • We also want to exclude people whose municipality or zone of residence is unknown (DESCONOCIDO is 'unknown' in Spanish).
    • data[data.ResidencePlace != 'PUERTO RICO, DESCONOCIDO']
    • data[data.ResidenceZone != 'DESCONOCIDO']
In [4]:
data = data[data.ResidencePlace.str.contains('PUERTO RICO')]
data = data[data.ResidencePlace != 'PUERTO RICO, DESCONOCIDO']
data = data[data.ResidenceZone != 'DESCONOCIDO']

Grouping the data

We want to compute the number of deaths in each zone of each municipality:

  • Group by ResidencePlace and ResidenceZone
    • data.groupby(['ResidencePlace', 'ResidenceZone'])
  • Calculate the number of deaths per zone of each municipality
    • data.groupby(['ResidencePlace', 'ResidenceZone']).size().reset_index(name='Deaths')
In [5]:
df_grp = data.groupby(['ResidencePlace', 'ResidenceZone']).size().reset_index(name='Deaths')
df_grp.head()
Out[5]:
ResidencePlace ResidenceZone Deaths
0 PUERTO RICO, ADJUNTAS RURAL 63
1 PUERTO RICO, ADJUNTAS URBANO 50
2 PUERTO RICO, AGUADA RURAL 165
3 PUERTO RICO, AGUADA URBANO 37
4 PUERTO RICO, AGUADILLA RURAL 320

Extract the name of the municipalities from the ResidencePlace column

In [6]:
municipalities = df_grp.ResidencePlace.tolist()

for i in range(len(municipalities)):
    # drop the 'PUERTO RICO, ' prefix, keeping only the municipality name
    municipalities[i] = municipalities[i].split("PUERTO RICO, ")[1]
In [7]:
df_grp.ResidencePlace = pd.Series(municipalities)
df_grp.head()
Out[7]:
ResidencePlace ResidenceZone Deaths
0 ADJUNTAS RURAL 63
1 ADJUNTAS URBANO 50
2 AGUADA RURAL 165
3 AGUADA URBANO 37
4 AGUADILLA RURAL 320
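
The two cells above can also be written as a single vectorized pandas expression (an equivalent sketch):

df_grp['ResidencePlace'] = df_grp['ResidencePlace'].str.replace('PUERTO RICO, ', '', regex=False)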

Save our DataFrame as a CSV for future analysis

In [10]:
df_grp.to_csv("../data/mortality_grouped.csv")

Optional Questions

1) Did more women or men die?

2) Which month had the highest mortality?

3) What was the average age at death? Did more infants or older people die?

4) Do you see any patterns in years of education?

5) Do you notice any pattern in ResidencePlace vs. the Death Registry municipality? Why is this?
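
As a starting point for question 2, the registration-date columns we saw in the output above can serve as a rough proxy (a sketch; the exact death-date column name should be checked against data.columns):

data.groupby(['RegistrationDate_Year', 'RegistrationDate_Month']).size()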

03 - Mapping Mortality Rates

by Ian Flores Siaca

October 2018

Purpose of this Notebook

Learning

  • Handling of Spatial Data in Python with geopandas
  • Visualizing Spatial Data
  • Joining data from different sources

Project

  • Visualize the spatial distribution of the mortality rates
  • Analyze spatial patterns in the data

Let's code

Load the libraries

  • Here we see the %matplotlib inline command; this tells matplotlib to display the plots inside this notebook.
  • We also see the geopandas library for the first time. This is a very useful library for handling spatial data in Python.
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import geopandas as gpd
import pandas as pd
import numpy as np
import unidecode
import os

Get the data

  • We are downloading a GeoJSON file, which is a text file formatted to contain spatial information organized by hierarchies.
    • Hierarchies can vary, but they go from the world, to a region, to a country, to a municipality, etc.
  • This GeoJSON is available from Miguel Rios' GitHub repository.
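
For reference, a single GeoJSON feature looks roughly like this (an illustrative snippet, not copied from pueblos.json):

{"type": "Feature",
 "properties": {"STATE": "72", "COUNTY": "071", "NAME": "Isabela"},
 "geometry": {"type": "Polygon",
              "coordinates": [[[-67.10, 18.51], [-67.05, 18.51], [-67.05, 18.48], [-67.10, 18.51]]]}}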
In [22]:
os.system("wget 'https://raw.githubusercontent.com/miguelrios/atlaspr/master/geotiles/pueblos.json'")
os.system("mv 'pueblos.json' '../data/'")
Out[22]:
0
In [23]:
geo_muni = gpd.read_file('../data/pueblos.json')
geo_muni.head()
Out[23]:
STATE COUNTY NAME geometry
0 72 071 Isabela POLYGON ((-67.10327599999999 18.513426, -67.10...
1 72 005 Aguadilla POLYGON ((-67.10327599999999 18.513426, -67.10...
2 72 013 Arecibo POLYGON ((-66.58677804052712 18.48494807068136...
3 72 065 Hatillo POLYGON ((-66.76483496560731 18.48406974021553...
4 72 115 Quebradillas POLYGON ((-66.90143020590587 18.48455227031116...

In the geometry column we see POLYGON objects, which are spatial objects that define the boundary coordinates of each municipality.
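
Each entry in the geometry column is a shapely geometry, so its usual attributes are available directly (a quick sketch):

first = geo_muni.geometry.iloc[0]
first.bounds    # (min lon, min lat, max lon, max lat) of the polygon
first.centroid  # a shapely Point at the polygon's center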

Plot the municipalities

  • Calling the GeoDataFrame's .plot() method will plot the geometry column, which in this case contains the polygons of the municipalities.
In [5]:
geo_muni.plot()
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc0e4039cf8>

Cleaning the municipality names

  • This is the same procedure as in notebook_01.
In [24]:
municipalities = []

for i in range(0, 78):
    muni_raw = geo_muni['NAME'][i].split(" Municipio, Puerto Rico")[0]
    muni_clean = unidecode.unidecode(muni_raw)
    muni_upper = muni_clean.upper()
    municipalities.append(muni_upper)
    
geo_muni['NAME'] = municipalities

Loading Census Data

In [7]:
census_data = pd.read_csv("../data/census.csv")
census_data = census_data.drop(['Unnamed: 0'], axis = 1)
census_data.head()
Out[7]:
ResidencePlace respop72017
0 ADJUNTAS 17971
1 AGUADA 38118
2 AGUADILLA 53164
3 AGUAS BUENAS 25850
4 AIBONITO 23108

Loading Mortality Data

In [25]:
mortality_data = pd.read_csv("../data/mortality_grouped.csv")
mortality_data = mortality_data.drop(['Unnamed: 0'], axis = 1)
mortality_data.head()
Out[25]:
ResidencePlace ResidenceZone Deaths
0 ADJUNTAS RURAL 63
1 ADJUNTAS URBANO 50
2 AGUADA RURAL 165
3 AGUADA URBANO 37
4 AGUADILLA RURAL 320

Calculating mortality rates

  • We first join the census data to the mortality DataFrame. The order matters because the mortality DataFrame has more rows than the census data (two per municipality, one per zone).
  • We then divide the number of deaths by the number of people in each municipality and multiply by 1,000 to get the mortality rate per 1,000 people.
In [11]:
df = mortality_data.merge(census_data, on='ResidencePlace')
df['death_rate'] = (df['Deaths']/df['respop72017']) * 1000
df.head()
Out[11]:
ResidencePlace ResidenceZone Deaths respop72017 death_rate
0 ADJUNTAS RURAL 63 17971 3.505648
1 ADJUNTAS URBANO 50 17971 2.782260
2 AGUADA RURAL 165 38118 4.328664
3 AGUADA URBANO 37 38118 0.970670
4 AGUADILLA RURAL 320 53164 6.019111

Join the mortality data and the spatial data

In [13]:
# groupby returns the summed rates in alphabetical order of municipality,
# which does not match the row order of geo_muni, so we join on the name
# instead of assigning by position
rates = df.groupby('ResidencePlace')['death_rate'].sum().reset_index()
geo_muni = geo_muni.merge(rates, left_on='NAME', right_on='ResidencePlace')

Let's plot the death rate by municipality and compare it with the trajectory of the hurricane

In [20]:
figure, ax = plt.subplots(1)

geo_muni.plot(column = 'death_rate', 
              scheme = 'quantiles', 
              legend = True, 
              ax = ax,
              edgecolor='1', 
              linewidth = 0.3)

fig = plt.gcf()
fig.set_size_inches(7, 6)
fig.set_dpi(125)
plt.title("Mortality rate per 1,000 individuals\n Timeframe from September 20, 2017 to June 2018")
ax.set_axis_off()

plt.axis('equal')
plt.show()

[Map of Hurricane Maria's trajectory omitted. Source: NYTimes]

Optional Questions

1) Do you see a pattern between the trajectory of the Hurricane and the mortality rates' spatial distribution?

2) Do you see a difference between rural zones and urban zones? Their means? Their variances? You may want to use the groupby function for this step; a sketch follows below. We will go into much more depth in the next notebook.
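
A possible starting point for question 2 (a sketch):

df.groupby('ResidenceZone')['death_rate'].agg(['mean', 'var'])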

04 - Estimating Mortality Rates in Puerto Rico

by Ian Flores Siaca

October 2018

Purpose of this Notebook

Learning

  • Getting to know the PyMC3 library
  • Starting to think about Bayesian methods
  • Think about uncertainty in the estimates instead of significance vs. non-significance (*not a fan of the p-value)

Project

  • Estimate the mortality rate for urban areas as compared to rural areas.
  • Identify which area needs more attention for future disasters.

Bayesian methods

Bayesian Paradigm

What if we have previous knowledge, or expert knowledge, regarding a certain subject that we would like to incorporate? With Bayesian methods we can set what is called a prior, which is a distribution for each parameter of the phenomenon we want to estimate or test. We combine these priors with the data (a.k.a. the likelihood) to obtain a posterior distribution, which allows us to perform the necessary inference.
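
Formally, this is Bayes' rule:

P(theta | data) = P(data | theta) * P(theta) / P(data)

where P(theta) is the prior, P(data | theta) is the likelihood, and P(theta | data) is the posterior we are after.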

DISCLAIMER: Priors are controversial, as some people have suggested that one can skew the posterior distribution by setting the priors to certain values. This might be true with few data points; however, in an open and reproducible science approach this should not be a concern, as everyone else can see your work. This is a big difference between using tools such as R and Python compared to point-and-click software.

[Image omitted. Source: The Conversation]

History

  • Thomas Bayes first formulated the theorem that bears his name.
  • Pierre-Simon Laplace greatly expanded the methods, working with demographic data from France.
    • He didn't know about Bayes' discovery, as British-French relations at the time weren't great, even though Laplace was working only a few decades later.
    • He also proved an early version of the Central Limit Theorem, a crucial mathematical concept for understanding computational Bayesian methods.
    • He developed the mathematical system for inductive reasoning in probability.

Uses

  • Wartime production estimation
    • Estimating the number of tanks Nazi Germany was producing per factory (the 'German tank problem')
  • Cracking encryption
    • Breaking the Lorenz cipher using Bayesian methods
  • Coast Guard searches
    • Searching by quadrants for submarines in the Atlantic Ocean
  • Election prediction

Let's code

Load the libraries

In [1]:
import pandas as pd
import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt

Load the data

In [2]:
census_data = pd.read_csv("../data/census.csv")
census_data = census_data.drop(['Unnamed: 0'], axis = 1)
In [3]:
mortality_data = pd.read_csv("../data/mortality_grouped.csv")
mortality_data = mortality_data.drop(['Unnamed: 0'], axis = 1)

Calculating mortality rates

  • We first join the census data to the mortality DataFrame. The order matters because the mortality DataFrame has more rows than the census data (two per municipality, one per zone).
  • We then divide the number of deaths by the number of people in each municipality and multiply by 1,000 to get the mortality rate per 1,000 people.
In [4]:
df = mortality_data.merge(census_data, on='ResidencePlace')
df['death_rate'] = (df['Deaths']/df['respop72017']) * 1000
df.head()
Out[4]:
ResidencePlace ResidenceZone Deaths respop72017 death_rate
0 ADJUNTAS RURAL 63 17971 3.505648
1 ADJUNTAS URBANO 50 17971 2.782260
2 AGUADA RURAL 165 38118 4.328664
3 AGUADA URBANO 37 38118 0.970670
4 AGUADILLA RURAL 320 53164 6.019111

Here we are making two Series, one with the death rates for the rural zones and one for the urban zones.

In [5]:
rural = df[df['ResidenceZone'] == 'RURAL']['death_rate']
urban = df[df['ResidenceZone'] == 'URBANO']['death_rate']

PyMC3, the package we are using for Bayesian estimation, expects the data in a certain structure, which is what we build in the code cell below (np.r_ simply concatenates the arrays along the first axis).

In [6]:
total = pd.DataFrame(dict(death_rate = np.r_[rural, urban],
                         group = np.r_[['rural']* len(rural), 
                                      ['urban'] * len(urban)]))
total.head()
Out[6]:
death_rate group
0 3.505648 rural
1 4.328664 rural
2 6.019111 rural
3 5.299807 rural
4 4.111130 rural

Priors

As mentioned before, when doing Bayesian estimation we need to set our priors. In this case I'm stating that our priors for the mean mortality rate in each zone should be close to the overall death rate. I am modeling these priors with a TruncatedNormal distribution from the PyMC3 library, bounded between 0 and 1,000; since the rate is per 1,000 people, these bounds correspond to the extremes of no one dying and everyone dying, respectively. As we don't know much about the variability of the data, we model it for both zones with a Uniform distribution between 0 and 10. v is the degrees of freedom of the StudentT likelihood; we give it a Gamma(alpha = 3, beta = 1) prior, whose mode sits at 2.

In [7]:
mu_total = total.death_rate.mean()
std_total = total.death_rate.std()
In [8]:
with pm.Model() as model:
    rural_mean = pm.TruncatedNormal('rural_mean', 
                                    mu = mu_total, 
                                    sd = std_total, 
                                    lower = 0, upper = 1000)
    urban_mean = pm.TruncatedNormal('urban_mean', 
                                    mu = mu_total, 
                                    sd = std_total, 
                                    lower = 0, upper = 1000)
In [9]:
std_low = 0
std_high = 10

with model:
    rural_std = pm.Uniform('rural_std', lower = std_low,
                          upper = std_high)
    urban_std = pm.Uniform('urban_std', lower = std_low, 
                          upper = std_high)
In [10]:
with model:
    v = pm.Gamma('v', alpha = 3, beta = 1)
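
As a quick sanity check on that Gamma prior (a sketch using scipy, which this notebook doesn't otherwise import):

import scipy.stats as st

nu_prior = st.gamma(a = 3, scale = 1)  # beta = 1 corresponds to scale = 1/beta = 1
nu_prior.mean()                        # 3.0
# the mode of a Gamma(alpha, beta) with alpha > 1 is (alpha - 1)/beta = 2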

Likelihood

In this code cell we describe the likelihood: a StudentT distribution for each group, which lets us compare the means of the two groups at the end. Note that PyMC3's StudentT takes lam, the precision, which is the inverse of the variance (hence std**-2). With the observed argument we feed the model the data we have observed.

In [11]:
with model:
    lambda_rural = rural_std**-2
    lambda_urban = urban_std**-2
    
    rural_try = pm.StudentT('rural', nu = v, mu = rural_mean, 
                            lam = lambda_rural, observed = rural)
    urban_try = pm.StudentT('urban', nu = v, mu = urban_mean, 
                           lam = lambda_urban, observed = urban)

Here, we are computing the differences between the resulting distributions to see whether there is a difference between the two zones.

In [12]:
with model:
    diff_of_means = pm.Deterministic('difference of means', 
                                     rural_mean - urban_mean)
    
    diff_of_stds = pm.Deterministic('difference of stds', 
                                    rural_std - urban_std)

Sampling

We are going to draw samples from these distributions: 20,000 samples per chain over 4 cores, discarding the first 15,000 tuning samples in each chain. That is 140,000 total draws across the 4 chains (as the progress bar below shows), of which 80,000 posterior samples are kept.

In [15]:
with model:
    trace = pm.sample(20000, cores = 4, tune = 15000)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [v, urban_std, rural_std, urban_mean, rural_mean]
Sampling 4 chains: 100%|██████████| 140000/140000 [01:21<00:00, 1714.77draws/s]

Let's look at the posterior of the mean of the mortality rate in urban zones. What do you see?

In [17]:
pm.plot_posterior(trace, varnames = ['urban_mean'], color = '#0d98ba')
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fad46d4c588>

We can observe that the mean is centered around 2.142, with a 95% chance that the real value lies between 1.862 and 2.429, which means that on average, for every 1,000 people, about 2 died in urban zones. Below we can see that the standard deviation for the urban zones is around 1.063.

In [18]:
pm.plot_posterior(trace, varnames = ['urban_std'], color = '#0d98ba')
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fad35509978>

Now, let's look at the posterior of the mean of the mortality rate in rural zones. Do you notice a difference?

In [115]:
pm.plot_posterior(trace, varnames = ['rural_mean'], color = '#0d98ba')

We can observe that the mean is centered around 4.126, with a 95% chance that the real value lies between 3.81 and 4.451, which means that on average, for every 1,000 people, about 4 died in rural zones. Below we can see that the standard deviation for the rural zones is around 1.207, which means that the variability in urban and rural zones was similar.

In [124]:
pm.plot_posterior(trace, varnames = ['rural_std'], color = '#0d98ba')

However, how does the rural distribution as a whole compare to the urban distribution?

In [128]:
pm.plot_posterior(trace, varnames = ['difference of means'], color = '#0d98ba', ref_val=0)

Above we can see that, on average, approximately 2 more people per 1,000 died in rural zones compared to urban zones.

In [19]:
pm.forestplot(trace, varnames = ['rural_mean', 'urban_mean'])
Out[19]:
<matplotlib.gridspec.GridSpec at 0x7fad35454be0>

Finally, if we compare the different chains of the sampler visually in the forest plot, we can see that the chains agree with one another, while the estimates for the two zones are clearly different.
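
Beyond the plots, numeric posterior summaries can also be printed directly (a sketch using the PyMC3 3.x API):

pm.summary(trace, varnames = ['rural_mean', 'urban_mean', 'difference of means'])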

Optional Questions

1) Do you think this analysis could incorporate other variables?

2) How do you think the results compare to the frequentist paradigm? Do you prefer to know the uncertainty in your results, or to have a single number to refer to?

3) I encourage you to play more with the PyMC3 package, as you can do very exciting things with it (such as modeling the results of football games).

References

This work was inspired in part by work done by the PyMC3 development team: https://docs.pymc.io/notebooks/BEST.html