
[citadel] Add events (WIP) #5

Open: wants to merge 6 commits into base: master
Conversation


@valeriocos valeriocos commented Jul 5, 2019

This PR proposes an implementation to store items coming from Perceval. In a nutshell, the approach leverages one of the two scenarios ElasticSearch is designed for: time series data. Thus, Perceval items are considered as events and stored in an index (which is aliased). The responsibility to
assign unique identifiers to such items is delegated to ElasticSearch, thus Perceval items with the same uuid are indexed more than once.

Note that this is WIP, refinements on code/comments may be the target of further iterations.

Different approaches have been tried; they are presented in the commit [bin] Share performance tests and results. The commit [collections] Add lookups collection is part of one of the approaches evaluated.
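As a rough illustration of the eventizing described above, the following sketch builds bulk actions in the format used by elasticsearch-py's `helpers.bulk`, deliberately omitting `_id` so that ElasticSearch assigns the identifiers. This is not the PR's code; the function and index names are illustrative.

```python
# Sketch: eventize Perceval items into bulk actions (elasticsearch-py
# `helpers.bulk` format). No "_id" is set, so ElasticSearch assigns a
# unique identifier per document; items sharing a Perceval `uuid`
# therefore become separate event documents.

def eventize(items, index="events"):
    """Build bulk index actions that delegate ID assignment to ES."""
    actions = []
    for item in items:
        actions.append({
            "_op_type": "index",
            "_index": index,
            "_source": item,   # note: no "_id" key on purpose
        })
    return actions

items = [{"uuid": "abc", "data": {"n": 1}}, {"uuid": "abc", "data": {"n": 2}}]
actions = eventize(items)
# Two actions are produced even though both items share the same uuid.
```

In a real setup these actions would be passed to `elasticsearch.helpers.bulk`, against an index reached through its alias.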

This code enhances the write method to include the
item_type and field_id params: the former defines
the type of items stored, while the latter points
to the field used as the unique identifier
when storing items.

Signed-off-by: Valerio Cosentino <[email protected]>
This code includes the method `set_alias`, which sets
an alias on a target index. Tests have been added
accordingly.
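For context, a `set_alias` helper could boil down to building the request body for ElasticSearch's `_aliases` endpoint. The sketch below only constructs that payload; the index and alias names are illustrative, not taken from the PR's code.

```python
# Sketch: build the body for POST /_aliases that points an alias at a
# concrete index. With an alias, readers query a stable name ("events")
# while writes go to the underlying dated index.

def set_alias_body(index_name, alias_name):
    """Request body adding `alias_name` as an alias of `index_name`."""
    return {
        "actions": [
            {"add": {"index": index_name, "alias": alias_name}}
        ]
    }

body = set_alias_body("events_20190705", "events")
```

In elasticsearch-py this body would be sent via the indices client (e.g. `es.indices.update_aliases`), which is where a real `set_alias` would delegate.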
This code eventizes Perceval items, which are stored
in a collection leveraging the ElasticSearch storage
engine.

Tests have been added accordingly.

Signed-off-by: Valerio Cosentino <[email protected]>
Signed-off-by: Valerio Cosentino <[email protected]>

valeriocos commented Jul 5, 2019

Results and discussions are also reported below:

- Results details:

# A) events index write-only
Repo https://github.com/elastic/elasticsearch processed: time 00:04:43, events 85190
Repo https://github.com/elastic/elasticsearch processed: time 00:04:44, events 85190
Repo https://github.com/elastic/elasticsearch processed: time 00:04:28, events 85190

# B) events index write-only, lookups write-and-update
Repo https://github.com/elastic/elasticsearch processed: time 00:05:51, events 85190
Repo https://github.com/elastic/elasticsearch processed: time 00:05:27, events 85190
Repo https://github.com/elastic/elasticsearch processed: time 00:05:23, events 85190

# C) events index write-and-update (no items mod)
Repo https://github.com/elastic/elasticsearch processed: time 00:04:32, events 85190
Repo https://github.com/elastic/elasticsearch processed: time 00:04:30, events 85190
Repo https://github.com/elastic/elasticsearch processed: time 00:04:29, events 85190

# D) events index write-and-update (items mod)
Repo https://github.com/elastic/elasticsearch processed: time 00:05:06, events 85190
Repo https://github.com/elastic/elasticsearch processed: time 00:04:53, events 85190
Repo https://github.com/elastic/elasticsearch processed: time 00:04:46, events 85190


- Discussion:
Four tests have been conducted to evaluate approaches to saving data to ElasticSearch; they are described below.

Approach A consists of writing Perceval items to an index, leaving ElasticSearch the responsibility
to assign unique identifiers; thus, Perceval items with the same `uuid` are indexed more than once.

Approach B extends approach A by keeping the metadata information in a separate index in order
to know the latest time information (i.e., `metadata_timestamp`, `metadata_updated_on`) of a given Perceval item.
As can be seen from the results, this extra step decreases performance by around 20%.

Approach C consists of writing Perceval items to an index, but the unique identifiers are set using
the Perceval `uuid` values. Thus, the operations performed on the index involve both writes and updates, since
the same item may be retrieved several times. In this case performance is better than approach A: since
the documents used to test this approach are not modified, it is possible that Lucene performs some
optimization and does not reindex them.
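Approaches C and D differ from A only in that the Perceval `uuid` is used as the document `_id`, so re-sending an item targets the same document instead of creating a new one. A minimal sketch of those bulk actions (names illustrative, not the PR's code):

```python
# Sketch: approach C/D bulk actions. Setting "_id" to the Perceval
# uuid means re-indexing the same item overwrites (updates) the
# existing document rather than adding a duplicate event.

def actions_with_uuid_ids(items, index="events", field_id="uuid"):
    """Build bulk index actions keyed by the item's uuid."""
    return [
        {
            "_op_type": "index",
            "_index": index,
            "_id": item[field_id],   # same uuid -> same document
            "_source": item,
        }
        for item in items
    ]

acts = actions_with_uuid_ids([{"uuid": "abc", "n": 1}, {"uuid": "abc", "n": 1}])
# Both actions target _id "abc", so a single document results.
```

Whether unmodified documents are cheap to re-send (as the approach C timings suggest) depends on internal ElasticSearch/Lucene behavior, as the discussion above speculates.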

Approach D is similar to approach C, but the documents are modified. As can be seen, this approach
performs worse than approach A (by around 6%).


@sduenas sduenas left a comment


Please check my comments.

@@ -130,12 +104,15 @@ def create_index(self, index_name, mapping):
logger.error(msg)
raise StorageEngineError(cause=msg)

def write(self, resource, data, chunk_size=CHUNK_SIZE):
def write(self, resource, data, item_type=ITEMS, chunk_size=CHUNK_SIZE, field_id=None):
Member:

Why are these new parameters needed? How do you expect to use this method in the future?

Contributor Author:

Why are these new parameters needed?

  • item_type can be used to flag or filter data
  • field_id can be useful to indicate which attribute should be considered the unique identifier

How do you expect to use this method in the future?

  • Events and Lookups classes are already using it
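To make the intended use of the two parameters concrete, here is a hypothetical illustration of how a `write`-style method could consume them. The names mirror the PR's signature, but the body is a sketch, not citadel's actual implementation.

```python
# Hypothetical sketch of how `item_type` and `field_id` could be used
# when building bulk actions: `item_type` tags each document so it can
# be flagged/filtered later; `field_id` selects the attribute used as
# the unique identifier (None leaves ID assignment to ElasticSearch).

ITEMS = "items"  # default item_type, mirroring the PR's constant

def build_actions(index, data, item_type=ITEMS, field_id=None):
    actions = []
    for doc in data:
        action = {
            "_index": index,
            "_source": dict(doc, item_type=item_type),  # tag the document
        }
        if field_id is not None:
            action["_id"] = doc[field_id]  # explicit unique identifier
        actions.append(action)
    return actions

acts = build_actions("events", [{"uuid": "abc"}], field_id="uuid")
```

With `field_id=None` the actions carry no `_id`, matching the default write-only behavior; the Events and Lookups classes would pass the values they need.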

Member:

Not sure about this. I see write as something to write items; if you want to write some kind of metadata, why don't you create specific methods for it? I also think Lookups is something really specific to ES, but maybe I'm wrong.

Member:

I also think that the more parameters we add, the more difficult it is to test something.

Contributor Author:

OK, understood

self.timeframe = timeframe
self.base_index = base_index

def index_name(self):
Member:

This is still ElasticSearch specific. At this level I think we should avoid this. I think it should be an abstract class, and each specific class will implement the methods needed to write and route where the items will be written.

If that creates a lot of levels of abstraction, all of this can be moved to the StorageEngine class. That would work too, but we have to take into account that it should be independent and without public references to ES.

Contributor Author:

...I think it should be an abstract class and each specific class will implement the methods needed to write and route where the items will be written.

You have just described the StorageEngine class.

If that creates a lot of levels of abstraction, all of this can be moved to the StorageEngine class. That would work too, but we have to take into account that it should be independent and without public references to ES.

  • Why can't we build on top of the elasticsearch storage engine (which already hides some specificities of dealing with ES)?

  • Who is going to set the index name, or do you plan to create it in a totally automatic way?

Member:

...I think it should be an abstract class and each specific class will implement the methods needed to write and route where the items will be written.

You have just described the StorageEngine class.

Not really. As StorageEngine is now, you have to explicitly say where you want to store your items. I, as a developer/user, don't want to decide this. The library should decide how to do this and what's the best way to do it.
I just want to store data and retrieve data.

If that creates a lot of levels of abstraction, all of this can be moved to the StorageEngine class. That would work too, but we have to take into account that it should be independent and without public references to ES.

* Why can't we build on top of the elasticsearch storage engine (which already hides some specificities of dealing with ES)?

We can build on it, and we should. My point is that I don't want to expose the concepts of index, name selection and so on. The system should do that for me.

* Who is going to set the index name, or do you plan to create it in a totally automatic way?

The library should do that. The developer/user can define a prefix if we want, but nothing else.

The idea behind all of this is to have different levels of abstraction. The lower level should do what StorageEngine is doing now. An upper level should take the items and decide where and how to store them. This level can be integrated into the StorageEngine, but I see two different levels of abstraction. Maybe the current StorageEngine should be something private, so each developer who wants a new system should create it himself or herself.


def index_name(self):

def __set_timeframe_format():
Member:

This means we are storing data depending on when it was retrieved. Another possibility is to store data depending on when it was updated/created. That would spread the items among indexes.

Contributor Author:

We discussed this offline, before working on the PR.

Do you want to change the way of storing items? Let me know and I'll change the code as you prefer.

Member:

I'm just wondering which solution will be best. I like the idea you proposed of storing data as if they were events. Taking that into account, I'd like to understand which solution is better.

Your current solution is to store everything with the date we're getting it. That means we can have a huge index when we start analyzing a project; later, indexes will be smaller. The good thing is that it's fast to store data, and we can quickly access the information we stored at a certain point in time.

The other solution is to store data according to when it was updated in the origin. The good thing is that it makes searching within a date range faster. You can configure shards to give more resources and make fast searches across all the indexes within a range (for example, the last two years). It also better follows the approach of storing events as they are gathered. The big problem is that we need something to route items to their right index (we can have indexes per year and month, like in gharchive), which makes the system slower when writing data.

Any solution will be fine. I just want to think the pros and cons before implementing it.
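Either routing policy ultimately reduces to deriving an index name from a timestamp and a timeframe granularity. A minimal sketch, assuming hypothetical timeframe constants and format strings (the real `__set_timeframe_format` may differ):

```python
# Sketch: derive a time-bucketed index name from a base name, a
# timeframe granularity, and a timestamp. The timestamp can be the
# retrieval time (current design) or the item's updated/created time
# (the alternative discussed above); only the caller's choice differs.

from datetime import datetime

BY_MINUTE, BY_HOUR, BY_DAY, BY_MONTH = "minute", "hour", "day", "month"
FORMATS = {
    BY_MINUTE: "%Y%m%d%H%M",
    BY_HOUR: "%Y%m%d%H",
    BY_DAY: "%Y%m%d",
    BY_MONTH: "%Y%m",
}

def index_name(base_index, timeframe, when):
    """Route a document to the index for its timeframe bucket."""
    return "{}_{}".format(base_index, when.strftime(FORMATS[timeframe]))

name = index_name("events", BY_MONTH, datetime(2019, 7, 5))
# -> "events_201907"
```

Routing by origin date simply means passing the item's `metadata_updated_on` instead of `datetime.utcnow()` here, at the cost of computing the target index per item.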

TIMEFRAMES = [BY_MINUTE, BY_HOUR, BY_DAY, BY_MONTH]


class Events:
Member:

Why is this called Events?

Contributor Author:

We discussed this offline, before working on the PR.

It is called Events because we eventize the Perceval items. You can find more details in the description of approach A:

Approach A consists of writing Perceval items to an index, leaving ElasticSearch the responsibility
to assign unique identifiers; thus, Perceval items with the same `uuid` are indexed more than once.

Member:

I know, but the name doesn't make sense to me. Is this a list of events? It also shouldn't be plural.

from citadel.storage_engines.elasticsearch import ElasticsearchStorage


class Lookups:
Member:

What's the purpose of this class?

Contributor Author:

We discussed this offline, before working on the PR.

The lookups index keeps the last value of the Perceval items inserted in the events index. It is used in approach B:

Approach B extends approach A by keeping the metadata information in a separate index in order
to know the latest time information (i.e., `metadata_timestamp`, `metadata_updated_on`) of a given Perceval item.
As can be seen from the results, this extra step decreases performance by around 20%.
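A lookups entry per `uuid` could be maintained with an upsert, so the first sighting creates the entry and later sightings refresh it. The sketch below builds such an action in elasticsearch-py's `helpers.bulk` update format; the index and field names are illustrative, taken from the description above rather than from the PR's code.

```python
# Sketch: upsert a lookups entry keyed by the Perceval uuid, keeping
# only the latest time metadata. "doc_as_upsert" creates the document
# if it does not exist and merges "doc" into it if it does.

def lookup_action(item, index="lookups"):
    """Build a bulk update action for the item's lookups entry."""
    return {
        "_op_type": "update",
        "_index": index,
        "_id": item["uuid"],
        "doc": {
            "metadata_timestamp": item["metadata_timestamp"],
            "metadata_updated_on": item["metadata_updated_on"],
        },
        "doc_as_upsert": True,
    }

act = lookup_action({"uuid": "abc",
                     "metadata_timestamp": "2019-07-05T10:00:00",
                     "metadata_updated_on": "2019-07-04T09:00:00"})
```

This write-and-update step per item is what accounts for the ~20% slowdown measured for approach B.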

Member:

I thought we weren't going to implement this at this step. Anyway, why does this need a class, and why not integrate it with the StorageEngine?

Contributor Author:

I put Events and Lookups outside since they were more of a proof of concept to evaluate different approaches.

@valeriocos

I find it pretty difficult to work on this task. Please @sduenas, let me know how you want things implemented and I'll try to address your requirements.


sduenas commented Jul 10, 2019

I find it pretty difficult to work on this task. Please @sduenas, let me know how you want things implemented and I'll try to address your requirements.

No worries. Have a good time the next two days and we talk when you are back!
