The SciPy stack includes:

  • NumPy: Base N-dimensional array package
  • SciPy library: Fundamental library for scientific computing
  • Matplotlib: Comprehensive 2D Plotting (like ggplot2 for Python)
  • IPython: Enhanced interactive console (with a notebook interface, like RStudio or Mathematica)
  • SymPy: Symbolic mathematics
  • pandas: Data structures & analysis (like R's data frames, for Python)

An easy, sane option is to install Anaconda, a large bundle of Python data science packages that manages dependencies and keeps itself reasonably up to date. You can leave it out of your PATH so it stays scoped and doesn't conflict with any other Python install.

cd ~/anaconda/bin
./jupyter notebook   # on older installs: ./ipython notebook

A couple of fun things to try:

The pandas cookbook is a great introduction to the whole stack.

NLTK (natural language toolkit) is fun to try. Skim the book or docs.

E.g., as we were looking at on 8 May:

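import nltk
nltk.download()  # opens the interactive downloader; grab the "book" collection
from nltk.book import *  # loads the sample texts text1 ... text9
# Where each word occurs across text1 (Moby Dick); needs Matplotlib
text1.dispersion_plot(["whale", "sea", "captain", "harpoon"])
text3.generate()  # random text in the style of text3 (Genesis)

def lexical_richness(text):
    # Tokens per distinct token (true division, so run this on Python 3)
    return len(text) / len(set(text))

richness_map = [lexical_richness(x) for x in (text1, text2, text3)]

text7.collocations()  # word pairs that occur together unusually often

# Drop punctuation and numbers, keeping alphabetic tokens only
just_words = nltk.Text([x for x in text1 if x.isalpha()])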
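To see what lexical_richness and the isalpha() filter compute without downloading the book corpora, here is a minimal sketch on a made-up token list. The tokens list is hypothetical and stands in for something like text1; it relies only on the fact that an nltk.Text behaves like a list of token strings for len(), set() and iteration.

```python
# Hypothetical stand-in for an nltk.Text: just a list of token strings.
tokens = ["the", "whale", ",", "the", "sea", ",", "the", "captain", "!"]

def lexical_richness(text):
    # Tokens per distinct token: a higher value means more repetition.
    return len(text) / len(set(text))

# Keep alphabetic tokens only, dropping punctuation such as "," and "!".
just_words = [t for t in tokens if t.isalpha()]

print(lexical_richness(tokens))      # 9 tokens over 6 distinct tokens
print(lexical_richness(just_words))  # 6 tokens over 4 distinct tokens
```

Filtering first matters on real texts, since punctuation tokens are frequent and otherwise inflate the token count.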

Naive Bayes classifiers are hella cool and reasonably easy. (E.g., guess gender from the last letter of a name.)
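A rough sketch of the last-letter idea in plain Python, so it runs without any corpus download. Everything here is an assumption for illustration: the ten training names are made up, and train, classify and last_letter are hypothetical helpers, not NLTK functions. NLTK ships its own nltk.NaiveBayesClassifier, which is what you would use with a real names corpus. With a single feature the "naive" independence assumption is trivial; with more features you would add one log-likelihood term per feature.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data, made up for illustration only.
TRAIN = [
    ("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
    ("Sophie", "female"), ("Emma", "female"),
    ("John", "male"), ("Peter", "male"), ("David", "male"),
    ("Mark", "male"), ("Kevin", "male"),
]

def last_letter(name):
    # The single feature: the lowercased final letter of the name.
    return name[-1].lower()

def train(examples):
    # Count how often each label occurs, and each last letter per label.
    label_counts = Counter(label for _, label in examples)
    letter_counts = defaultdict(Counter)
    for name, label in examples:
        letter_counts[label][last_letter(name)] += 1
    return label_counts, letter_counts

def classify(name, model, alphabet_size=26):
    label_counts, letter_counts = model
    total = sum(label_counts.values())
    letter = last_letter(name)
    scores = {}
    for label, n in label_counts.items():
        # log P(label) + log P(letter | label), with add-one smoothing
        # so an unseen letter does not zero out a label.
        prior = math.log(n / total)
        likelihood = math.log(
            (letter_counts[label][letter] + 1) / (n + alphabet_size)
        )
        scores[label] = prior + likelihood
    # Pick the label with the highest posterior score.
    return max(scores, key=scores.get)

model = train(TRAIN)
print(classify("Olivia", model))
print(classify("Brian", model))
```

On data this small the result only reflects the toy list, so treat it as a demonstration of the mechanics rather than of accuracy.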