
Spelling #19


Open · wants to merge 11 commits into master
2 changes: 1 addition & 1 deletion cost-based-ml/README.md
@@ -1,6 +1,6 @@
# Cost-based Machine Learning

-So you've built an ML model and evaluated it's performance on a testing dataset? In case of binary classification, the evaluation tells you how many mistakes the model made, i.e. the percentage of false positives and false negatives, and the same stats for the correct behavior of the model, namely, true positives and true negatives. Of course, the fewer the errors, the better, but for any realistic application the percentage of errors is substantial and it is often unclear if the model is worth using. Moreover, if you just look at the total error rate (sum of false positives and false negatives) you may convince yourself that the model is useless. For example, suppose that 90% of the data points belong to class 0 and the rest to class 1, and your model gives 15% total error. This means that if you employ the model you will be making a mistake 15% of the time and if you don't use the model at all (and just assume that all data points belong to class 0) you will be making a mistake only in 10% of cases. Seems like the model is useless in this case, doesn'it?
+So you've built an ML model and evaluated it's performance on a testing dataset? In case of binary classification, the evaluation tells you how many mistakes the model made, i.e. the percentage of false positives and false negatives, and the same stats for the correct behavior of the model, namely, true positives and true negatives. Of course, the fewer the errors, the better, but for any realistic application the percentage of errors is substantial and it is often unclear if the model is worth using. Moreover, if you just look at the total error rate (sum of false positives and false negatives) you may convince yourself that the model is useless. For example, suppose that 90% of the data points belong to class 0 and the rest to class 1, and your model gives 15% total error. This means that if you employ the model you will be making a mistake 15% of the time and if you don't use the model at all (and just assume that all data points belong to class 0) you will be making a mistake only in 10% of cases. Seems like the model is useless in this case, doesn't it?

This, however, is a rather simplistic way of looking the model evaluation. The truth is that the different types of mistakes the model makes have different intrinsic costs associated with them, depending on the domain and application. Frequently, even when the total error looks bad, when costs are taken into account, the end result clearly favors the use of ML.

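The rest of this README is collapsed in the diff, but the cost argument it sets up can be made concrete with a quick calculation. Here is a minimal sketch using the 90/10 class split and 15% total error from the paragraph above; the split between false positives and false negatives and the dollar costs are illustrative assumptions, not values from the README:

```python
# Expected cost of using the model vs. the "always predict class 0" baseline.
# The error split and per-mistake costs are illustrative assumptions.

n = 10000                       # records scored
fp_rate, fn_rate = 0.13, 0.02   # one way the 15% total error could split
cost_fp = 1.0                   # say a false alarm costs $1 to follow up
cost_fn = 50.0                  # and a missed class-1 case costs $50

model_cost = n * (fp_rate * cost_fp + fn_rate * cost_fn)      # 11300.0

# the baseline calls everything class 0, so it misses all class-1 cases (10%)
baseline_cost = n * 0.10 * cost_fn                            # 50000.0

print(model_cost < baseline_cost)   # True: the "worse" 15% model is cheaper
```

Despite its higher total error rate, the model's expected cost here is well under the baseline's, which is exactly the point the README goes on to make.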
4 changes: 2 additions & 2 deletions cost-based-ml/cost_based_ml.py
@@ -37,7 +37,7 @@ def batch_prediction_data_bucket_key(output_uri_s3, batch_prediction_id):
key += "batch-prediction/result/{}-{}.gz".format(batch_prediction_id, datasource_filename)
return bucket, key

-# read batch prediction results from S3 and turn them into an numpy array
+# read batch prediction results from S3 and turn them into a numpy array
def read_test_predictions(bucket, key):
s3 = boto3.resource('s3')
obj = s3.Object(bucket, key)
@@ -52,7 +52,7 @@ def read_test_predictions(bucket, key):
data = np.loadtxt(StringIO(predictions_str), dtype = {'names': names, 'formats': formats}, delimiter=',', skiprows=1, usecols=cols)
return data

-# this historgram replicates what the Amazon ML console is showing for model evaluation
+# this histogram replicates what the Amazon ML console is showing for model evaluation
def plot_class_histograms(score_n_true_label):
class_1_scores = [score for (score, true_label) in score_n_true_label if true_label == 1]
class_0_scores = [score for (score, true_label) in score_n_true_label if true_label == 0]
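The body of `plot_class_histograms` continues below the fold. A minimal sketch of what such a plot looks like with matplotlib (the bin count, styling, and `plt.show()` call are assumptions, not necessarily what the script does):

```python
import matplotlib.pyplot as plt

def plot_class_histograms(score_n_true_label):
    # split the predicted scores by each record's true label
    class_1_scores = [s for (s, label) in score_n_true_label if label == 1]
    class_0_scores = [s for (s, label) in score_n_true_label if label == 0]
    # overlay the two score distributions, as the Amazon ML console does
    plt.hist(class_0_scores, bins=50, alpha=0.5, label='class 0')
    plt.hist(class_1_scores, bins=50, alpha=0.5, label='class 1')
    plt.xlabel('predicted score')
    plt.ylabel('count')
    plt.legend()
    plt.show()
```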
2 changes: 1 addition & 1 deletion k-fold-cross-validation/README.md
@@ -33,7 +33,7 @@ If you are a Python 2 developer and do not already have `virtualenv` and `pip` t
sudo apt-get update
sudo apt-get install python-pip python-virtualenv

-Users of other operating systems and package managers can learn more about installing `pip` [here](http://pip.readthedocs.org/en/stable/installing/), and about installing `virtualenv` [here](http://virtualenv.readthedocs.org/en/latest/installation.html).
+Users of other operating systems and package managers can learn more about [installing `pip`](http://pip.readthedocs.org/en/stable/installing/), and about [installing `virtualenv`](http://virtualenv.readthedocs.org/en/latest/installation.html).

After you’ve installed the `virtualenv` and `pip` tools, run:

2 changes: 1 addition & 1 deletion k-fold-cross-validation/collect_perf.py
@@ -119,7 +119,7 @@ def collect_perf(eval_id_list):
kfolds = len(eval_id_list)
eval_auc_map = collect_perf(eval_id_list) # start polling & collect

-# Comput the mean/variance of auc scores. Casting kfolds to float for
+# Compute the mean/variance of auc scores. Casting kfolds to float for
# Python 2 compatibility.
avg_auc = sum([x for x in eval_auc_map.values()]) / float(kfolds)
var_auc = sum([(x - avg_auc) ** 2 for x in eval_auc_map.values()]) / float(
2 changes: 1 addition & 1 deletion ml-tools-python/wait_for_entity.py
@@ -22,7 +22,7 @@
ev = evaluation
bp = batch prediction

-Useage:
+Usage:
python wait_for_entity.py entity_id [entity_type]
"""
import boto
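The polling loop itself is cut off by the diff. A minimal sketch of the pattern the docstring describes, written against boto3 for brevity (the script actually imports the legacy `boto` package, and the function name and 30-second delay here are assumptions):

```python
import time
import boto3

# map the docstring's abbreviations to the matching boto3 calls
GETTERS = {
    'ds': ('get_data_source', 'DataSourceId'),
    'ml': ('get_ml_model', 'MLModelId'),
    'ev': ('get_evaluation', 'EvaluationId'),
    'bp': ('get_batch_prediction', 'BatchPredictionId'),
}

def wait_for_entity(entity_id, entity_type='ml', delay=30):
    # poll the entity's status until it reaches a terminal state
    client = boto3.client('machinelearning')
    method, id_param = GETTERS[entity_type]
    while True:
        status = getattr(client, method)(**{id_param: entity_id})['Status']
        if status in ('COMPLETED', 'FAILED', 'DELETED'):
            return status
        time.sleep(delay)
```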
5 changes: 2 additions & 3 deletions social-media/README.md
@@ -85,8 +85,7 @@ To gather the training data, run the following command:

Substitute your company's twitter handle instead of @awscloud and
configure your Twitter API credentials in config.py. Learn how to
-obtain your credentials
-[here](https://dev.twitter.com/oauth/overview/application-owner-access-tokens).
+[obtain your credentials](https://dev.twitter.com/oauth/overview/application-owner-access-tokens).

This will produce a file called `line_separated_tweets_json.txt` that
other scripts will read later.
@@ -218,7 +217,7 @@ This script requires that `config.py` is present and contains
appropriate values. Description of the configuration required in
`config.py` is as follows:

-* *awsAccountId* : The AWS Account Id corresponding to the credentials being used
+* *awsAccountId* : The AWS Account ID corresponding to the credentials being used
with boto. See [docs](http://docs.aws.amazon.com/general/latest/gr/acct-identifiers.html)
for details.
* *kinesisStream* : The name being given to the Kinesis stream. See
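A minimal `config.py` consistent with this description might look like the sketch below. Only `awsAccountId` and `kinesisStream` are named in the README excerpt; the Twitter key names are assumptions based on the standard OAuth credential set:

```python
# config.py -- settings read by the social-media scripts

# AWS account ID matching the boto credentials in use
awsAccountId = '123456789012'

# name given to the Kinesis stream
kinesisStream = 'tweetStream'

# Twitter API credentials (key names here are illustrative)
consumerKey = 'YOUR_CONSUMER_KEY'
consumerSecret = 'YOUR_CONSUMER_SECRET'
accessToken = 'YOUR_ACCESS_TOKEN'
accessTokenSecret = 'YOUR_ACCESS_TOKEN_SECRET'
```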
2 changes: 1 addition & 1 deletion social-media/push-json-to-kinesis.py
@@ -14,7 +14,7 @@
"""
Utility to call Amazon Kinesis stream using payload from a file that contains line
separated json. This script is used in conjunction with
-create-lambda-function.py, which expectes the Kinesis stream to provide the
+create-lambda-function.py, which expects the Kinesis stream to provide the
input on which predictions are made. All json data being pushed to kinesis is
first converted to string to string key value pairs as that is the expected
format by Amazon Machine Learning.
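The core loop of such a utility can be sketched as follows (boto3 is used here for brevity where the script uses the legacy `boto` package, and the partition-key choice is an assumption):

```python
import json
import boto3

def push_json_to_kinesis(filename, stream_name):
    kinesis = boto3.client('kinesis')
    with open(filename) as f:
        for line in f:
            record = json.loads(line)
            # Amazon Machine Learning expects string-to-string key/value
            # pairs, so stringify every value before pushing
            payload = {key: str(value) for key, value in record.items()}
            kinesis.put_record(
                StreamName=stream_name,
                Data=json.dumps(payload),
                PartitionKey=str(abs(hash(line))),
            )
```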
@@ -54,7 +54,7 @@ public static void main(String[] args) throws IOException {
/**
* @param args command-line arguments:
* mlModelid
-* score threshhold
+* score threshold
* s3:// url where output should go
*/
public UseModel(String[] args) {
2 changes: 1 addition & 1 deletion targeted-marketing-python/use_model.py
@@ -17,7 +17,7 @@
generate predictions on new data. This script needs the id of the
ML Model to use. It also requires the score threshold.

-Useage:
+Usage:
python use_model.py ml_model_id score_threshold s3_output_url

For example:
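The script body is collapsed, but its two documented inputs map naturally onto two Amazon ML API calls: setting the score threshold and starting a batch prediction. A rough sketch using boto3 (the `datasource_id` parameter and the batch-prediction naming are assumptions; the real script also handles the unscored-data setup that the diff hides):

```python
import boto3

def use_model(ml_model_id, score_threshold, s3_output_url, datasource_id):
    client = boto3.client('machinelearning')
    # the score threshold separates positive from negative predictions
    client.update_ml_model(MLModelId=ml_model_id,
                           ScoreThreshold=float(score_threshold))
    # score the datasource in bulk, writing results to S3
    client.create_batch_prediction(
        BatchPredictionId=ml_model_id + '-bp',   # illustrative naming
        MLModelId=ml_model_id,
        BatchPredictionDataSourceId=datasource_id,
        OutputUri=s3_output_url,
    )
```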
@@ -14,13 +14,13 @@ import scala.io.Source
* to make batch predictions.
*
* command-line arguments:
-* mlModelid scoreThreshhold s3://url-where-output-should-go
+* mlModelid scoreThreshold s3://url-where-output-should-go
*/
object UserModel extends App {
val unscoredDataUrl = "s3://aml-sample-data/banking-batch.csv"
val dataSchema = getClass.getResourceAsStream("/banking-batch.csv.schema")

-require(args.length == 3, "command-line arguments: mlModelid scoreThreshhold s3://url-where-output-should-go")
+require(args.length == 3, "command-line arguments: mlModelid scoreThreshold s3://url-where-output-should-go")
val mlModelId = args(0)
val threshold = args(1).toFloat
val s3OutputUrl = args(2)