Add some material for the two initial extras sections
projectgus committed Jun 4, 2013
1 parent 381cdd6 commit c9baaf6
Showing 4 changed files with 112 additions and 7 deletions.
8 changes: 4 additions & 4 deletions _config.yml
@@ -32,9 +32,9 @@ map:
- title: Extras
caption: Additional workshop content
subpages:
- title: The Pandas library
path: /extras/pandas.html
caption: An introduction to using Pandas for data analysis.
- title: Alternative Approaches
path: /extras/alternatives.html
caption: Some other ways to store and process data.
- title: Open Data
path: /extras/opendata.html
caption: The Open Data movement and some places to find open data sets.
caption: Some places to find open data sets.
64 changes: 64 additions & 0 deletions extras/alternatives.md
@@ -0,0 +1,64 @@
---

layout: ots
title: Alternative Approaches

---

In this workshop we've mostly looked at processing unstructured data
from plain text files. This is a very common and simple way to come
across data, but it's not the only way and it's not always the easiest
way to work with data.

# Relational Databases

Relational databases provide a way of storing data in tables with
relationships between them. Many large organisations and websites
store their data in some kind of relational database.

For instance, the OpenFlights data we worked with in the
[CSV](../core/csv.html) chapter is almost certainly exported from a
relational database of some kind. The relational database holds all of
its data in tables: the "airports" table would hold all the airports
and the "routes" table would hold all the airline routes.

However, the relational database also holds *relations* between
different kinds of data - for example it can know that all airline
routes in the routes table contain references to a source and a
destination airport, and that these airports should exist in the
airports table.
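
As a rough sketch of what such a relation looks like in practice, the
following uses Python's built-in `sqlite3` module with an in-memory
database. The table layout, the column names and the sample rows are
made up for illustration and are not the real OpenFlights schema. The
second route refers to an airport that doesn't exist, so the database
should refuse to store it:

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
# SQLite only enforces relations between tables when asked to.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical tables: every route must point at two existing airports.
conn.execute("CREATE TABLE airports (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.execute("""
    CREATE TABLE routes (
        airline TEXT,
        source_id INTEGER NOT NULL REFERENCES airports(id),
        dest_id INTEGER NOT NULL REFERENCES airports(id)
    )
""")

# Made-up sample rows.
conn.executemany(
    "INSERT INTO airports VALUES (?, ?, ?)",
    [(1, "Sydney", "Australia"), (2, "Berlin", "Germany")],
)
conn.execute("INSERT INTO routes VALUES (?, ?, ?)", ("XX", 1, 2))

# Airport 999 is not in the airports table, so this insert is rejected.
try:
    conn.execute("INSERT INTO routes VALUES (?, ?, ?)", ("XX", 1, 999))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

route_count = conn.execute("SELECT COUNT(*) FROM routes").fetchone()[0]
print(rejected, route_count)
```

Other database systems express the same idea with slightly different
syntax, so treat this as one possible illustration rather than the way
any particular database is set up.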

We often use a query language called SQL to retrieve information from
a relational database. For example, here is a made-up SQL query to
count the number of airports in Russia:

    SELECT COUNT(*) FROM airports WHERE country = 'Russia';

You can integrate SQL queries into other general purpose programming
languages like Python.
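
For example, here is a minimal sketch of running the query above from
Python, using the built-in `sqlite3` module. The `airports` table, the
`count_airports` helper and the three sample rows are assumptions made
up for this example, not the real OpenFlights database:

```python
import sqlite3


def count_airports(conn, country):
    """Return how many rows of the airports table match the given country."""
    # The "?" placeholder lets sqlite3 insert the value safely.
    row = conn.execute(
        "SELECT COUNT(*) FROM airports WHERE country = ?", (country,)
    ).fetchone()
    return row[0]


# Build a tiny in-memory database with a few hand-typed sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airports (name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO airports VALUES (?, ?)",
    [
        ("Sheremetyevo", "Russia"),
        ("Pulkovo", "Russia"),
        ("Sydney", "Australia"),
    ],
)

russian_airports = count_airports(conn, "Russia")
print(russian_airports)
```

Passing the country name as a parameter, rather than pasting it into
the query string, avoids problems with quoting when the value comes
from user input.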

OpenTechSchool doesn't have specific workshops about SQL yet, although
"Django 101" uses SQL for its databases. In the meantime you might
want to check out Zed Shaw's book
[Learn SQL The Hard Way](http://sql.learncodethehardway.org/) (free to
read online).


# Pandas

[Pandas](http://pandas.pydata.org/) is a suite of data analysis tools
for Python that lets you do more complex data modelling and analysis.

For this workshop we haven't needed Pandas, but if you're looking to
use Python for a lot of numerical data analysis then you should look
into it - there are tutorials linked from the homepage. Pandas can
make complex tasks much easier to work with.
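
To give a feel for it, here is a small sketch that counts airports per
country with Pandas. It assumes Pandas is installed, and the CSV text
is a made-up sample typed into the program rather than the real
OpenFlights file:

```python
import io

import pandas as pd

# Made-up sample data, standing in for a CSV file on disk.
SAMPLE_CSV = """name,country
Sheremetyevo,Russia
Pulkovo,Russia
Sydney,Australia
"""

# read_csv accepts any file-like object, so StringIO works for a string.
airports = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Count the rows where the country column equals "Russia".
russian_count = int((airports["country"] == "Russia").sum())

# Count the rows for every country in one step.
per_country = airports["country"].value_counts()

print(russian_count)
print(per_country)
```

The same counting could be done with a plain loop and a dictionary.
The Pandas version mostly pays off once the questions get more
involved.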

Pandas also makes it easy to integrate with more complex data sources
than simple text files. For example, here's [an IPython Notebook that
uses Pandas to import data the Guardian published regarding the Gaza-Israel 2012 crisis](https://gistpynb.herokuapp.com/4121857).
The Guardian publishes this data in the format of "Google Fusion
Tables", and Pandas can read this format directly from the web.


37 changes: 37 additions & 0 deletions extras/opendata.md
@@ -0,0 +1,37 @@
---

layout: ots
title: Open Data Sources

---

In this workshop we used a small dataset published by the
[OpenFlights](http://openflights.org/) project. This data is published
under the Open Database License, one of several open data licenses
that grant rights to anyone who wants to use or redistribute the
data.

In recent years there has been a strong movement encouraging
organisations to publish data openly on the web. As a result there are
many public data repositories, both government and non-government,
that you can source data from:


* The
[Google Public Data Explorer](http://www.google.com/publicdata/directory)
indexes many public datasets and features an in-browser data explorer.
You can also download the data to perform more in-depth analysis.

* The UK's Guardian newspaper [Data Store](http://www.guardian.co.uk/data)
provides a wide range of data and data-based analysis.

* The [World Bank](http://data.worldbank.org/) publishes its data
catalog online.

* Numerous governments, including [Australia](http://data.gov.au/),
the [European Union](http://ec.europa.eu/atoz_en.htm) and the
[United States](http://www.data.gov/), have open dataset repositories.

* Some countries are sponsoring open data "hackathons" to raise
awareness and find new uses for their data, for instance
[GovHack](http://www.govhack.org/) in Australia.
10 changes: 7 additions & 3 deletions index.md
@@ -29,10 +29,14 @@ workshop then that will be perfect.

# Extra fun stuff

* [The Pandas library](extras/pandas.html) - An introduction to using Pandas for data analysis.
* [Alternative Approaches](extras/alternatives.html) - Other ways to store and process data (Pandas, SQL databases).

* [Open Data](extras/opendata.html) - About Open Data.
* [Open Data Sources](extras/opendata.html) Data.

# Reference material

* TODO
* [IPython NBViewer home page](http://nbviewer.ipython.org/)

* [IPython Notebook gallery](https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks)
* [matplotlib gallery](http://matplotlib.org/gallery.html)
