# Introduction to Data Analytics

For this session, we are using Pandas, which is a Python library for data analysis. You can [read more about Pandas](https://pandas.pydata.org/about/index.html).




## Part 1: Introducing the Data Frame

In Pandas, data is held in a dataframe, which is like a spreadsheet or database table:
 * Each column has a name
 * Each row contains data from one 'individual'
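
As a minimal sketch of these two points, the following builds a tiny dataframe by hand. The rows are made up for illustration and are not taken from the census file; the variable name `toy` is also just an assumption for this example.

```python
import pandas as pd

# Made-up rows for illustration only; these are not values from the census file.
toy = pd.DataFrame({
    "Area": ["Camden", "Camden", "Sutton"],
    "Sex": ["Males", "Females", "Males"],
    "UsualResidents": [10, 12, 7],
})

print(toy.columns.tolist())  # the column names
print(toy.shape)             # (number of rows, number of columns)
print(toy.iloc[0])           # the first row, i.e. one 'individual'
```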

In this part you will learn how to:

 1. Import the Pandas library
 1. Open a CSV file and create a dataframe
 1. Have a look at the data in the dataframe


### Import the pandas library
We use Python's `import ... as` so that the library has a short name. Since pandas is a very large library, it is probably not a good idea to import everything from it without giving it a name, as in:

`from pandas import *`

In [None]:
import pandas as pd

### Task 1.1 Loading a CSV file
The CSV file can be loaded from the PC you are using. Check you have downloaded it somewhere.

The data is about the country of birth by Borough, age group and sex. It is from the 2011 census, so not up to date.


In [None]:
from google.colab import files
uploaded = files.upload()

Saving LondonCOB.csv to LondonCOB.csv


The uploaded file can now be converted to a dataframe: note the name used here **must be the same as the name** of the uploaded file. 

The variable `df` below holds the dataframe. Writing `df` on a line on its own causes a summary of the dataframe to be printed. Fortunately, the system knows not to print too many lines.
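
The loading step decodes the uploaded bytes into text and wraps the text in a file-like object that `read_csv` can read. The sketch below shows the same pattern on a small made-up CSV; `fake_upload` is a hypothetical stand-in for the dictionary returned by `files.upload()`, so it runs without Colab.

```python
import io
import pandas as pd

# Hypothetical stand-in for the dictionary returned by files.upload():
# file name -> raw bytes of the file. The contents are made up.
fake_upload = {
    "toy.csv": b"Area,Sex,UsualResidents\nCamden,Males,10\nSutton,Females,7\n"
}

text = fake_upload["toy.csv"].decode("utf-8")  # bytes -> str
toy_df = pd.read_csv(io.StringIO(text))        # str -> file-like object -> dataframe
print(toy_df)
```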

In [None]:
import io
df = pd.read_csv(io.StringIO(uploaded['LondonCOB.csv'].decode('utf-8')))
df

This data is an example of ['narrow' or 'tall' data](https://en.wikipedia.org/wiki/Wide_and_narrow_data), with lots of rows (approximately 67,000). 
 * Each column holds one 'variable'.
 * Each row holds data about one instance. 
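
To make the difference between the two layouts concrete, here is a small sketch using made-up counts. It starts from a 'wide' table and uses `melt` to produce the 'narrow' form; the column names mirror those of the census file, but the numbers are invented.

```python
import pandas as pd

# Made-up counts in 'wide' form: one row per area, one column per sex.
wide = pd.DataFrame({
    "Area": ["Camden", "Sutton"],
    "Males": [10, 7],
    "Females": [12, 9],
})

# melt() turns the two count columns into (Sex, UsualResidents) pairs,
# giving the 'narrow' form: one row per (Area, Sex) combination.
narrow = wide.melt(id_vars="Area", var_name="Sex", value_name="UsualResidents")
print(narrow)
```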

However, it is not at all easy to understand the data in this format.

### Task 1.2 Unique Values in a Column
To start exploring the data we have loaded, we can find out how many different values there are in each column. For example, the code below shows the different age values.

In [None]:
# age values
df.Age.unique()

In [None]:
# Add code for looking at the values of Area, BirthCountry and BirthRegion

### Answer some questions about the data

1. Are all countries included?
2. How do people from other countries appear in the data?
3. Are all the 'Area' values London Boroughs?
4. Why do you suppose other Areas are included?

## Part 2: Selecting, Transforming and Viewing Data

In this section, we learn about selecting data from the dataframe. In Excel, this is described as a 'filter'. The concept is the same, but this word is not used here (Pandas has a `filter` function, but it selects by row or column label rather than by the data values).

We then introduce the pivot table, which transforms the data from narrow (tall) to wide.

### Task 2.1 Selecting Data

The data for one Borough can be selected as shown below. Select some other subsets of the data:

* Choose two Boroughs of interest to you (e.g. where you work and where you live)
* Select data for one BirthRegion or one BirthCountry

In [None]:
th = df[(df['Area']=='Tower Hamlets')]
th

**What, no loops?** You might expect that we would need to loop through the rows to accumulate the unique values or the selected rows. However, this is not the case.

In fact, we will not be using the `while` keyword anywhere in this notebook.

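It is possible to use two conditions, as shown below. Note that the `&` operator (or `|` for 'or') is used to combine them. This is not the standard Python `and` operator: it is the `&` operator, overloaded for the types being used here. The arguments are of type `Series` (one `True`/`False` value per row) and the effect is to combine the two series row by row, keeping the rows where both conditions hold. Each condition needs its own parentheses because `&` binds more tightly than `==`.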
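
The following sketch shows what the combined conditions look like on a few made-up rows. The names `is_camden`, `is_male`, `both` and `either` are only illustrative.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Area": ["Camden", "Camden", "Sutton", "Sutton"],
    "Sex": ["Males", "Females", "Males", "Females"],
    "UsualResidents": [10, 12, 7, 9],
})

is_camden = toy["Area"] == "Camden"  # boolean Series, one entry per row
is_male = toy["Sex"] == "Males"

both = is_camden & is_male           # True only where both are True
either = is_camden | is_male         # True where at least one is True

print(both.tolist())
print(either.tolist())
print(toy[both])                     # only the rows where the mask is True
```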

Uncomment the code in the cell below to see the type.


In [None]:
# type(df['Area']=='Tower Hamlets')

In [None]:
th_men = df[(df['Area']=='Tower Hamlets') & (df['Sex'] == 'Males')]
th_men

To understand more about selection, you can look at the [user guide section](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html). However, it is complex!

Note that it is not very easy to extract a single number from the dataframe in this 'narrow' format as it is not 'indexed'. For an example, look at the code below, extracting a single value. 

We will return to this below.
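
For comparison, here is a sketch on made-up data of the two ways of reaching a single number: through a mask on an unindexed dataframe, and through a direct lookup once the dataframe is indexed. Using `set_index` here is one possible approach, assumed for illustration; it is not necessarily the one used later in this notebook.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Area": ["Camden", "Camden", "Sutton", "Sutton"],
    "Sex": ["Males", "Females", "Males", "Females"],
    "UsualResidents": [10, 12, 7, 9],
})

# Without an index: select with a mask, then pull the number out of the
# one-element Series that results.
mask = (toy["Area"] == "Sutton") & (toy["Sex"] == "Males")
value_from_mask = toy.loc[mask, "UsualResidents"].iloc[0]

# With an index on (Area, Sex), the same number is a direct lookup.
indexed = toy.set_index(["Area", "Sex"]).sort_index()
value_from_index = indexed.loc[("Sutton", "Males"), "UsualResidents"]

print(value_from_mask, value_from_index)
```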

In [None]:
singleRow = df[(df['Area']=='Tower Hamlets') & (df['Sex'] == 'Males') & (df['BirthCountry'] == 'Ghana') & (df['Age'] == 'Age 20 to 24')]
singleRow['UsualResidents'] 



### Task 2.2 The Pivot Table

The Pivot table gives a summary of the data in a number of rows and columns:
* Each row holds the values for one case, that is, one distinct value of the column(s) named in the `index` argument.
* The values shown are given by the `values` argument. Here it is the number of residents.
* One or more columns can be created, analysing the values by one of the other columns. This is specified by the `columns` argument.
* The result table has far fewer entries than the original dataframe. Values that are not distinguished by either a row value or a column value must be **aggregated**. Here, we sum the numbers, as specified by the `aggfunc` argument.
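
The arguments above can be tried on a toy table first. In this sketch the numbers are made up; note how the `Age` column is used for neither rows nor columns, so its entries are summed together.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "BirthCountry": ["Ghana", "Ghana", "Ghana", "France", "France", "France"],
    "Sex": ["Males", "Males", "Females", "Males", "Females", "Females"],
    "Age": ["Age 20 to 24", "Age 25 to 29", "Age 20 to 24",
            "Age 20 to 24", "Age 20 to 24", "Age 25 to 29"],
    "UsualResidents": [3, 4, 5, 1, 2, 6],
})

# Rows: one per BirthCountry. Columns: one per Sex.
# Age is not used for rows or columns, so its rows are summed together.
toy_pivot = toy.pivot_table(values="UsualResidents", index="BirthCountry",
                            columns="Sex", aggfunc="sum")
print(toy_pivot)
```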

The following example gives a breakdown of the number of people by Sex (two columns) and Birth Country (rows).

In [None]:
p = th.pivot_table(values=['UsualResidents'], index=['BirthCountry'], columns=['Sex'], aggfunc='sum')
p

Create some other pivot tables. The cell below shows another example; try this and invent some more.

In [None]:
# p1 = th.pivot_table(values=['UsualResidents'], index=['Age'], columns=['BirthRegion'], aggfunc='sum')
# p1

### Task 2.3 Sorting the Pivot Table

We can sort the pivot table using one of the columns. Try the following example and then sort the other ones you have created.

In [None]:
p=p.sort_values(by=('UsualResidents', 'Males'))
p
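
The sort key is a tuple because passing `values` as a list gives the pivot table two-level column labels. The sketch below, on made-up data, prints those labels and then sorts by one of them; `ascending=False` is added here only to show largest-first ordering.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "BirthCountry": ["Ghana", "Ghana", "France", "France"],
    "Sex": ["Males", "Females", "Males", "Females"],
    "UsualResidents": [7, 5, 1, 8],
})

# values given as a list -> two-level column labels such as ('UsualResidents', 'Males')
toy_pivot = toy.pivot_table(values=["UsualResidents"], index=["BirthCountry"],
                            columns=["Sex"], aggfunc="sum")
print(toy_pivot.columns.tolist())

# Sort the rows by the Males column, largest first.
by_males = toy_pivot.sort_values(by=("UsualResidents", "Males"), ascending=False)
print(by_males)
```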

### Task 2.4 Plotting Data

Data can be plotted. This uses the matplotlib library, which is also very complex. However, Pandas provides a simplified plot function so that we do not have to use matplotlib explicitly (*at least at first*).

Plot function arguments include:
* `kind`: different kinds of plot e.g. 'bar', 'barh' or 'pie'
* `figsize`: written as (width, height)
* `stacked`: `True` or `False`
* `logy`: use a logarithmic scale on the y axis
* `subplots`: see example below
* `title`: plot title
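
A minimal sketch of several of these arguments together, on made-up counts. The `matplotlib.use("Agg")` line is an assumption for running outside a notebook (it draws off-screen) and can be dropped in Colab.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed inside a notebook
import pandas as pd

# Made-up counts for illustration only.
counts = pd.DataFrame(
    {"Males": [7, 1, 4], "Females": [5, 8, 2]},
    index=["Ghana", "France", "India"],
)

# Horizontal stacked bars: one bar per row, one coloured segment per column.
ax = counts.plot(kind="barh", stacked=True, figsize=(6, 3),
                 title="Made-up residents by country of birth")
```

The plot function returns a matplotlib axes object (`ax` here), which is the starting point if you later want to adjust the figure with matplotlib directly.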

[The Pandas guide on visualisation is useful](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)


In [None]:
p.plot(kind="bar",figsize=(15,20), logy=True) 


The following plots show more features.

In [None]:
#p.plot(kind="pie",figsize=(20,20), subplots=True) 
#p1.plot(kind='bar', stacked=True, figsize=(15,10))
#p1.plot(kind='bar', subplots=True, figsize=(15,20))

## Part 3: Adding and Deleting Data

It is useful to be able to add (and delete) data. Adding new columns is often needed as part of an analysis to present the data in a new way. 

### Task 3.1 Selecting Rows

Not all the areas are London Boroughs. We will discard the rows that are not for London Boroughs. The following steps are needed:

1. We define a function that returns `True` if an area is a London Borough.
1. We `apply` this function to the `Area` column. 
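
The two steps above can be sketched on made-up data as follows. The sketch also shows `isin`, a built-in membership test that gives the same mask without a hand-written function; the short `boroughs` set and the `is_borough` function are hypothetical stand-ins, not the full list of Boroughs.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Area": ["Sutton", "Wales", "Camden", "North East"],
    "UsualResidents": [7, 3, 10, 4],
})

# Hypothetical short list, standing in for the full set of Boroughs.
boroughs = {"Sutton", "Camden"}

def is_borough(area):
    return area in boroughs

mask_apply = toy["Area"].apply(is_borough)  # calls the function once per value
mask_isin = toy["Area"].isin(boroughs)      # built-in membership test

print(mask_apply.equals(mask_isin))
print(toy[mask_isin])
```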

In [None]:
def isLondon(area):
    # True if the area name is one of the 32 London Boroughs or the City of London
    in_london = area in {
        'Barking and Dagenham', 'Barnet', 'Bexley', 'Brent',
        'Bromley', 'Camden', 'City of London', 'Westminster', 'Croydon',
        'Ealing', 'Enfield', 'Greenwich', 'Hackney', 'Hammersmith and Fulham',
        'Haringey', 'Harrow', 'Havering', 'Hillingdon', 'Hounslow', 'Islington',
        'Kensington and Chelsea', 'Kingston upon Thames', 'Lambeth',
        'Lewisham', 'Merton', 'Newham', 'Redbridge', 'Richmond upon Thames',
        'Southwark', 'Sutton', 'Tower Hamlets', 'Waltham Forest', 'Wandsworth'}
    return in_london

#isLondon('Sutton')
#isLondon('Wales')

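# Keep only the London rows; .copy() gives an independent dataframe,
# so that a column can be added to it later without a warning
df_London = df.loc[df['Area'].apply(isLondon)].copy()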
df_London

### Task 3.2 Calculating Totals

A problem with our approach so far is that we have looked at absolute numbers of people. These cannot be used to compare the population structure of different boroughs, as the boroughs vary in absolute size.

We can calculate the totals using a pivot table. 

In [None]:
totals = df_London.pivot_table(values='UsualResidents', index=['Area'], aggfunc='sum')
totals

**Challenge** Plot these populations in sorted order

We can use the totals dataframe to access the total for one Borough. The following code shows how:

In [None]:
totals.loc['Redbridge']['UsualResidents']

### Task 3.3 Adding a New Column

We will add a new column to the London dataframe with the population proportion. The steps are:

1. Define a new function to do the calculation
1. Apply it to every row
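
As an aside, the same proportions can be computed without a row-by-row function. The sketch below, on made-up data, uses `groupby(...).transform('sum')` to repeat each area's total on every row of that area; this is an alternative technique shown for comparison, not the approach used in this notebook.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Area": ["Camden", "Camden", "Sutton", "Sutton"],
    "BirthRegion": ["Europe", "Africa", "Europe", "Africa"],
    "UsualResidents": [30, 10, 15, 5],
})

# Total residents per area, repeated on every row of that area.
area_total = toy.groupby("Area")["UsualResidents"].transform("sum")

# Element-wise division gives each row's share of its area's total.
toy["Proportion"] = toy["UsualResidents"] / area_total
print(toy)
```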


In [None]:
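def calcProportion(residents, borough):
    # Share of the borough's total population represented by these residents
    total = totals.loc[borough, 'UsualResidents']
    return residents / total

# axis=1 applies the function to each row in turn
df_London['Proportion'] = df_London.apply(lambda row: calcProportion(row['UsualResidents'], row['Area']), axis=1)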
df_London


We can use the new column in a pivot table.

In [None]:
p4 = df_London.pivot_table(values='Proportion', index='Area', columns=['BirthRegion'], aggfunc='sum')
p4

Then we can plot a figure comparing the region of birth in different Boroughs.

In [None]:
p4.sort_values(by=['Europe']).sort_values(by=['Sutton'],axis=1, ascending=False).plot(
 kind='bar', stacked=True, figsize=(15,10))

### Task 3.4 Challenge Problems

Try some further analyses. For example:

1. Add a column distinguishing inner and outer London Boroughs. [They are listed here](https://en.wikipedia.org/wiki/Inner_London) but watch out for different spelling.

1. Look at the age structure of different Boroughs.

1. Change this around, to ask where do people of a certain age live? Now we want to normalise data (i.e. calculate proportions) using totals based on age group.