Multi Output MLDataset #22

Open
wants to merge 46 commits into base: master
Commits (46)
8b48bae
updated tests for multi dataset
raamana Jan 31, 2019
c3ced31
fixing incorrect decorator
raamana Feb 4, 2019
078203b
hiding the feature check method as it's not meant to be for users
raamana Feb 4, 2019
62a05db
faster, more concise and readable way to calc class sizes
raamana Feb 4, 2019
4a68833
tests for label validity - establishing behaviour
raamana Feb 4, 2019
54f018b
class id validity
raamana Feb 4, 2019
a507548
fixing a deprecation warning, clean up
raamana Feb 4, 2019
6c22b2e
None is not a valid label or ID anymore
raamana Feb 4, 2019
f3d9b44
helpers to check label validity
raamana Feb 4, 2019
c1cc4cf
helpers to check class id validity
raamana Feb 4, 2019
51ae329
quick check for 1-to-1 mapping when inputs are all dictionaries
raamana Feb 4, 2019
83533f8
maintaining an internal id to label dict to ensure there is a 1 to 1 …
raamana Feb 4, 2019
bc7269a
id to label map checker
raamana Feb 4, 2019
e28de3a
checking ids and labels as samples are added
raamana Feb 4, 2019
0e72cce
additional checks as the labels/class_ids are changed en masse
raamana Feb 4, 2019
94d29be
succinct
raamana Feb 4, 2019
940cc3d
reusable helper
raamana Feb 4, 2019
a9e62fe
checks on setters for labels/ids
raamana Feb 4, 2019
f4371fc
code style
raamana Feb 4, 2019
396bc0e
moved to right folder
raamana Feb 4, 2019
d927024
generalizing method for label comparison
raamana Feb 5, 2019
2d1f416
more helpful error msg
raamana Feb 5, 2019
7da9371
state and type indicators for non multi output datasets
raamana Feb 5, 2019
0f76e5e
adapting ARFF support in the context of multioutput datasets
raamana Feb 5, 2019
cea9ae6
more centralized handling of class id and label
raamana Feb 5, 2019
11a88bd
simplifying logic for label equality
raamana Feb 5, 2019
61ee3c2
setting num outputs as samples are populated
raamana Feb 5, 2019
b23b155
saving sample features after other integrity constraints are met
raamana Feb 5, 2019
511f4da
check for num outputs
raamana Feb 5, 2019
d8c5a1d
simplifying logic: features are ndarray
raamana Feb 5, 2019
eeb38a0
attr
raamana Feb 5, 2019
52a5a5f
as the methods and attr are growing a lot, removing this for convenience
raamana Feb 5, 2019
afc8e10
basic checks for single output datasets
raamana Feb 5, 2019
28f12ae
generalized data and labels method
raamana Feb 5, 2019
95d8a25
updating tests and init
raamana Feb 5, 2019
09ee792
rough initial implementation for MultiOutputMLDataset
raamana Feb 5, 2019
75a074d
PEP
raamana Feb 5, 2019
76a1f75
bug in private variable assignment
raamana Feb 5, 2019
1e5ce3d
not maintaining multidataset here
raamana Feb 5, 2019
e86a58f
improving repr/str when num_classes is more than 10
raamana Feb 5, 2019
aaa64a9
more direct membership test
raamana Feb 5, 2019
63c41bb
rudimentary check to ensure it's not a multi-output dataset
raamana Feb 5, 2019
f18fc76
setting default value only if not set already while loading
raamana Feb 5, 2019
ec2f904
rudimentary estimation of num outputs, trusting validation prior save…
raamana Feb 5, 2019
3f252d3
more general checks
raamana Feb 5, 2019
df574bd
loosening unnecessarily conservative protection, to potentially ove…
raamana Feb 5, 2019
4 changes: 2 additions & 2 deletions pyradigm/__init__.py
@@ -4,10 +4,10 @@

if version_info.major==2 and version_info.minor==7:
from pyradigm import MLDataset, cli_run, check_compatibility
from multiple import MultiDataset
from multiple import MultiDataset, MultiOutputMLDataset
elif version_info.major > 2:
from pyradigm.pyradigm import MLDataset, cli_run, check_compatibility
from pyradigm.multiple import MultiDataset
from pyradigm.multiple import MultiDataset, MultiOutputMLDataset
else:
raise NotImplementedError('pyradigm supports only 2.7 or 3+. '
'Upgrade to Python 3+ is recommended.')
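
With this change, MultiOutputMLDataset is exported at the package top level alongside MLDataset and MultiDataset on both supported Python versions. A minimal sketch of the resulting import, assuming the package is installed from this branch:

from pyradigm import MLDataset, MultiDataset, MultiOutputMLDataset

# construct an empty multi-output dataset to be populated later via add_sample;
# assumes the MLDataset base class permits construction with no arguments
mds = MultiOutputMLDataset()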
94 changes: 92 additions & 2 deletions pyradigm/multiple.py
@@ -16,6 +16,95 @@
'Upgrade to Python 3+ is recommended.')


class MultiOutputMLDataset(MLDataset):
"""
New class allowing the labels for a sample to be a vector.

The recommended way to construct the dataset is via the add_sample method, one sample
at a time, as it allows for unambiguous identification of each row in the data matrix.

This constructor can be used in 3 ways:
- As a copy constructor to make a copy of the given in_dataset
- Or by specifying the tuple of dictionaries for data, labels and classes.
In this usage, you can provide additional inputs such as description
and feature_names.
- Or by specifying a file path which contains previously saved
MultiOutputMLDataset.

Parameters
----------
filepath : str
path to saved MLDataset on disk, to directly load it.

in_dataset : MLDataset
MLDataset to be copied to create a new one.

data : dict
dict of features (keys are treated to be sample ids)

labels : dict
dict of labels
(keys must match with data/classes, are treated to be sample ids)

classes : dict
dict of class names
(keys must match with data/labels, are treated to be sample ids)

description : str
Arbitrary string to describe the current dataset.

feature_names : list, ndarray
List of names for each feature in the dataset.

encode_nonnumeric : bool
Flag to specify whether to encode non-numeric (categorical,
nominal or string) features to numeric values.
Currently used only when importing ARFF files.
It is usually better to encode your data at the source,
and then import them. Use with caution!

Raises
------
ValueError
If in_dataset is not of type MLDataset or is empty, or
an invalid combination of input args is given.
IOError
If filepath provided does not exist.

"""

_multi_output = True


def __init__(self,
num_outputs=None,
filepath=None,
in_dataset=None,
data=None,
labels=None,
classes=None,
description='',
feature_names=None,
encode_nonnumeric=False):
super().__init__(filepath=filepath,
in_dataset=in_dataset,
data=data, labels=labels, classes=classes,
description=description,
feature_names=feature_names,
encode_nonnumeric=encode_nonnumeric)

self._num_outputs = num_outputs


def _check_labels(self, label_array):
"""Label check for multi-output datasets: label for a subject can be a vector!"""

if any([self._is_label_invalid(lbl) for lbl in label_array]):
raise ValueError('One or more of the labels is not valid!')

return np.array(label_array)


class MultiDataset(object):
"""
Container data structure to hold and manage multiple MLDataset instances.
@@ -137,7 +226,8 @@ def __str__(self):
string = "{}: {} samples, " \
"{} modalities, " \
"dims: {}\nclass sizes: ".format(self._name, self._num_samples,
self._modality_count, self._num_features)
self._modality_count,
self._num_features)

string += ', '.join(['{}: {}'.format(c, n) for c, n in self._class_sizes.items()])

@@ -207,7 +297,7 @@ def _get_data(self, id_list, format='MLDataset'):
# getting container with fake data
subset = self._dataset.get_subset(id_list)
# injecting actual features
subset.data = { id_: data[id_] for id_ in id_list }
subset.data = {id_: data[id_] for id_ in id_list}
else:
raise ValueError('Invalid output format - choose only one of '
'MLDataset or data_matrix')
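
Outside the diff, a rough usage sketch of the new class as described in its docstring: a dataset built one sample at a time via add_sample, with a label vector per sample. The argument names shown for add_sample (sample id, features, label, class_id) are assumptions modelled on the MLDataset interface and may not match the final signature.

import numpy as np

from pyradigm import MultiOutputMLDataset

# hypothetical example: 10 features per sample, 3 outputs per label vector
ds = MultiOutputMLDataset(num_outputs=3,
                          description='toy multi-output dataset')

ds.add_sample('subj01', np.random.rand(10),
              label=np.array([1, 0, 2]), class_id='patient')
ds.add_sample('subj02', np.random.rand(10),
              label=np.array([0, 1, 1]), class_id='control')

print(ds)  # __str__ summarises sample count, dimensionality and class sizes

The overridden _check_labels then validates each element of the label vector via _is_label_invalid, instead of assuming a scalar label per sample.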