
[citadel] Add events (WIP) #5

Open: wants to merge 6 commits into base: master
Conversation


@valeriocos valeriocos commented Jul 5, 2019

This PR proposes an implementation to store items coming from Perceval. In a nutshell, the approach leverages one of the two scenarios ElasticSearch is designed for: time series data. Thus, Perceval items are considered as events and stored in an index (which is aliased). The responsibility to
assign unique identifiers to such items is delegated to ElasticSearch, thus Perceval items with the same uuid are indexed more than once.

Note that this is WIP, refinements on code/comments may be the target of further iterations.

Different approaches have been tried; they are presented in the commit [bin] Share performance tests and results. The commit [collections] Add lookups collection is part of one of the approaches evaluated.
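As a rough illustration of the eventizing described above, the following sketch builds bulk actions in the format used by elasticsearch-py's `helpers.bulk`, deliberately omitting `_id` so that ElasticSearch assigns the identifiers. This is not the PR's code; the function and index names are illustrative.

```python
# Sketch: eventize Perceval items into bulk actions (elasticsearch-py
# `helpers.bulk` format). No "_id" is set, so ElasticSearch assigns a
# unique identifier per document; items sharing a Perceval `uuid`
# therefore become separate event documents.

def eventize(items, index="events"):
    """Build bulk index actions that delegate ID assignment to ES."""
    actions = []
    for item in items:
        actions.append({
            "_op_type": "index",
            "_index": index,
            "_source": item,   # note: no "_id" key on purpose
        })
    return actions

items = [{"uuid": "abc", "data": {"n": 1}}, {"uuid": "abc", "data": {"n": 2}}]
actions = eventize(items)
# Two actions are produced even though both items share the same uuid.
```

In a real setup these actions would be passed to `elasticsearch.helpers.bulk`, against an index reached through its alias.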

This code enhances the write method to include the
item_type and field_id params: the former defines
the type of items stored, while the latter points
to the field used as the unique identifier
when storing items.

Signed-off-by: Valerio Cosentino <[email protected]>
This code includes the method `set_alias`, which sets
an alias on a target index. Tests have been added
accordingly.
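For context, a `set_alias` helper could boil down to building the request body for ElasticSearch's `_aliases` endpoint. The sketch below only constructs that payload; the index and alias names are illustrative, not taken from the PR's code.

```python
# Sketch: build the body for POST /_aliases that points an alias at a
# concrete index. With an alias, readers query a stable name ("events")
# while writes go to the underlying dated index.

def set_alias_body(index_name, alias_name):
    """Request body adding `alias_name` as an alias of `index_name`."""
    return {
        "actions": [
            {"add": {"index": index_name, "alias": alias_name}}
        ]
    }

body = set_alias_body("events_20190705", "events")
```

In elasticsearch-py this body would be sent via the indices client (e.g. `es.indices.update_aliases`), which is where a real `set_alias` would delegate.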
This code eventizes Perceval items, which are stored
in a collection leveraging the ElasticSearch storage
engine.

Tests have been added accordingly.

Signed-off-by: Valerio Cosentino <[email protected]>
Signed-off-by: Valerio Cosentino <[email protected]>

valeriocos commented Jul 5, 2019

Results and discussions are also reported below:

- Results details:

# A) events index write-only
Repo https://github.com/elastic/elasticsearch processed: time 00:04:43, events 85190
Repo https://github.com/elastic/elasticsearch processed: time 00:04:44, events 85190
Repo https://github.com/elastic/elasticsearch processed: time 00:04:28, events 85190

# B) events index write-only, lookups write-and-update
Repo https://github.com/elastic/elasticsearch processed: time 00:05:51, events 85190
Repo https://github.com/elastic/elasticsearch processed: time 00:05:27, events 85190
Repo https://github.com/elastic/elasticsearch processed: time 00:05:23, events 85190

# C) events index write-and-update (no items mod)
Repo https://github.com/elastic/elasticsearch processed: time 00:04:32, events 85190
Repo https://github.com/elastic/elasticsearch processed: time 00:04:30, events 85190
Repo https://github.com/elastic/elasticsearch processed: time 00:04:29, events 85190

# D) events index write-and-update (items mod)
Repo https://github.com/elastic/elasticsearch processed: time 00:05:06, events 85190
Repo https://github.com/elastic/elasticsearch processed: time 00:04:53, events 85190
Repo https://github.com/elastic/elasticsearch processed: time 00:04:46, events 85190


- Discussion:
Four tests have been conducted to evaluate approaches to saving data to ElasticSearch; they are described below.

Approach A consists of writing Perceval items to an index, leaving ElasticSearch the responsibility
to assign unique identifiers; thus, Perceval items with the same `uuid` are indexed more than once.

Approach B extends approach A by keeping the metadata information in a separate index in order
to know the latest time information (i.e., `metadata_timestamp`, `metadata_updated_on`) of a given Perceval item.
As can be seen from the results, this extra step decreases performance by around 20%.

Approach C consists of writing Perceval items to an index, but the unique identifiers are set using
the Perceval `uuid` values. Thus, the operations performed on the index involve both writes and updates, since
the same item may be retrieved several times. In this case performance is better than approach A: since
the documents used to test this approach are not modified, it is possible that Lucene performs some
optimization and does not reindex them.
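Approaches C and D differ from A only in that the Perceval `uuid` is used as the document `_id`, so re-sending an item targets the same document instead of creating a new one. A minimal sketch of those bulk actions (names illustrative, not the PR's code):

```python
# Sketch: approach C/D bulk actions. Setting "_id" to the Perceval
# uuid means re-indexing the same item overwrites (updates) the
# existing document rather than adding a duplicate event.

def actions_with_uuid_ids(items, index="events", field_id="uuid"):
    """Build bulk index actions keyed by the item's uuid."""
    return [
        {
            "_op_type": "index",
            "_index": index,
            "_id": item[field_id],   # same uuid -> same document
            "_source": item,
        }
        for item in items
    ]

acts = actions_with_uuid_ids([{"uuid": "abc", "n": 1}, {"uuid": "abc", "n": 1}])
# Both actions target _id "abc", so a single document results.
```

Whether unmodified documents are cheap to re-send (as the approach C timings suggest) depends on internal ElasticSearch/Lucene behavior, as the discussion above speculates.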

Approach D is similar to approach C, but the documents are modified. As can be seen, this approach
performs worse than approach A (by around 6%).


@sduenas sduenas left a comment


Please check my comments.

@@ -130,12 +104,15 @@ def create_index(self, index_name, mapping):
logger.error(msg)
raise StorageEngineError(cause=msg)

def write(self, resource, data, chunk_size=CHUNK_SIZE):
def write(self, resource, data, item_type=ITEMS, chunk_size=CHUNK_SIZE, field_id=None):
Member:

Why are these new parameters needed? How do you expect to use this method in the future?

Contributor Author:

Why are these new parameters needed?

  • item_type can be used to flag or filter data
  • field_id can be useful to indicate which attribute should be considered the unique identifier

How do you expect to use this method in the future?

  • Events and Lookups classes are already using it
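To make the intended use of the two parameters concrete, here is a hypothetical illustration of how a `write`-style method could consume them. The names mirror the PR's signature, but the body is a sketch, not citadel's actual implementation.

```python
# Hypothetical sketch of how `item_type` and `field_id` could be used
# when building bulk actions: `item_type` tags each document so it can
# be flagged/filtered later; `field_id` selects the attribute used as
# the unique identifier (None leaves ID assignment to ElasticSearch).

ITEMS = "items"  # default item_type, mirroring the PR's constant

def build_actions(index, data, item_type=ITEMS, field_id=None):
    actions = []
    for doc in data:
        action = {
            "_index": index,
            "_source": dict(doc, item_type=item_type),  # tag the document
        }
        if field_id is not None:
            action["_id"] = doc[field_id]  # explicit unique identifier
        actions.append(action)
    return actions

acts = build_actions("events", [{"uuid": "abc"}], field_id="uuid")
```

With `field_id=None` the actions carry no `_id`, matching the default write-only behavior; the Events and Lookups classes would pass the values they need.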

Member:

Not sure about this. I see write as something to write items; if you want to write some kind of metadata, why don't you create specific methods for it? I also think Lookups is something really specific to ES, but maybe I'm wrong.

Member:

I also think that the more parameters we add, the more difficult it is to test something.

Contributor Author:

OK, understood

self.timeframe = timeframe
self.base_index = base_index

def index_name(self):
Member:

This is still ElasticSearch specific. At this level I think we should avoid this. I think it should be an abstract class, and each specific class will implement the methods needed to write and route where the items will be written.

If that creates a lot of levels of abstraction, all of this can be moved to the StorageEngine class. That would work too, but we have to take into account that it should be independent and without public references to ES.

Contributor Author:

...I think it should be an abstract class and each specific class will implement the methods needed to write and route where the items will be written.

You have just described the StorageEngine class.

If that creates a lot of levels of abstraction, all of this can be moved to the StorageEngine class. That would work too, but we have to take into account that it should be independent and without public references to ES.

  • Why can't we build on top of the elasticsearch storage engine (which already hides some specificities of dealing with ES)?

  • Who is going to set the index name, or do you plan to create it in a totally automatic way?

Member:

...I think it should be an abstract class and each specific class will implement the methods needed to write and route where the items will be written.

You have just described the StorageEngine class.

Not really. As StorageEngine is now, you have to explicitly say where you want to store your items. I, as a developer/user, don't want to decide this. The library should decide how to do this and what's the best way to do it.
I just want to store data and retrieve data.

If that creates a lot of levels of abstraction, all of this can be moved to the StorageEngine class. That would work too, but we have to take into account that it should be independent and without public references to ES.

* Why can't we build on top of the elasticsearch storage engine (which already hides some specificities of dealing with ES)?

We can build on it, and we should. My point is that I don't want to expose the concepts of index, name selection and so on. The system should do that for me.

* Who is going to set the index name, or do you plan to create it in a totally automatic way?

The library should do that. The developer/user can define a prefix if we want, but nothing else.

The idea behind all of this is to have different levels of abstraction. The lower level should do what StorageEngine is doing now. An upper level should take the items and decide where and how to store them. This level can be integrated into the StorageEngine, but I see two different levels of abstraction. Maybe the current StorageEngine should be something private, so each developer who wants a new system should create it himself or herself.


def index_name(self):

def __set_timeframe_format():
Member:

This means we are storing data depending on when it was retrieved. Another possibility is to store data depending on when it was updated/created. That would spread the items among indexes.

Contributor Author:

We discussed this offline, before working on the PR.

Do you want to change the way of storing items? Let me know and I'll change the code as you prefer.

Member:

I'm just wondering which solution will be best. I like the idea you proposed of storing data as if they were events. Taking that into account, I'd like to understand which solution is better.

Your current solution is to store everything with the date we're getting it. That means we can have a huge index when we start analyzing a project; later, indexes will be smaller. The good thing is that it's fast to store data, and we can quickly access the information we stored at a certain point in time.

The other solution is to store data according to when it was updated in the origin. The good thing is that it makes searching within a date range faster. You can configure shards to give more resources and make fast searches across all the indexes within a range (for example, the last two years). It also better follows the approach of storing events as they are gathered. The big problem is that we need something to route items to their right index (we can have indexes per year and month, like in gharchive), which makes the system slower when writing data.

Any solution will be fine. I just want to think the pros and cons before implementing it.
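Either routing policy ultimately reduces to deriving an index name from a timestamp and a timeframe granularity. A minimal sketch, assuming hypothetical timeframe constants and format strings (the real `__set_timeframe_format` may differ):

```python
# Sketch: derive a time-bucketed index name from a base name, a
# timeframe granularity, and a timestamp. The timestamp can be the
# retrieval time (current design) or the item's updated/created time
# (the alternative discussed above); only the caller's choice differs.

from datetime import datetime

BY_MINUTE, BY_HOUR, BY_DAY, BY_MONTH = "minute", "hour", "day", "month"
FORMATS = {
    BY_MINUTE: "%Y%m%d%H%M",
    BY_HOUR: "%Y%m%d%H",
    BY_DAY: "%Y%m%d",
    BY_MONTH: "%Y%m",
}

def index_name(base_index, timeframe, when):
    """Route a document to the index for its timeframe bucket."""
    return "{}_{}".format(base_index, when.strftime(FORMATS[timeframe]))

name = index_name("events", BY_MONTH, datetime(2019, 7, 5))
# -> "events_201907"
```

Routing by origin date simply means passing the item's `metadata_updated_on` instead of `datetime.utcnow()` here, at the cost of computing the target index per item.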

TIMEFRAMES = [BY_MINUTE, BY_HOUR, BY_DAY, BY_MONTH]


class Events:
Member:

Why is this called Events?

Contributor Author:

We discussed this offline, before working on the PR.

It is called Events because we eventize the Perceval items. You can find more details in the description of approach A:

Approach A consists of writing Perceval items to an index, leaving ElasticSearch the responsibility
to assign unique identifiers; thus, Perceval items with the same `uuid` are indexed more than once.

Member:

I know, but the name doesn't make sense to me. Is this a list of events? It also shouldn't be plural.

from citadel.storage_engines.elasticsearch import ElasticsearchStorage


class Lookups:
Member:

What's the purpose of this class?

Contributor Author:

We discussed this offline, before working on the PR.

The lookups index keeps the last value of the Perceval items inserted in the events index. It is used in approach B:

Approach B extends approach A by keeping the metadata information in a separate index in order
to know the latest time information (i.e., `metadata_timestamp`, `metadata_updated_on`) of a given Perceval item.
As can be seen from the results, this extra step decreases performance by around 20%.
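A lookups entry per `uuid` could be maintained with an upsert, so the first sighting creates the entry and later sightings refresh it. The sketch below builds such an action in elasticsearch-py's `helpers.bulk` update format; the index and field names are illustrative, taken from the description above rather than from the PR's code.

```python
# Sketch: upsert a lookups entry keyed by the Perceval uuid, keeping
# only the latest time metadata. "doc_as_upsert" creates the document
# if it does not exist and merges "doc" into it if it does.

def lookup_action(item, index="lookups"):
    """Build a bulk update action for the item's lookups entry."""
    return {
        "_op_type": "update",
        "_index": index,
        "_id": item["uuid"],
        "doc": {
            "metadata_timestamp": item["metadata_timestamp"],
            "metadata_updated_on": item["metadata_updated_on"],
        },
        "doc_as_upsert": True,
    }

act = lookup_action({"uuid": "abc",
                     "metadata_timestamp": "2019-07-05T10:00:00",
                     "metadata_updated_on": "2019-07-04T09:00:00"})
```

This write-and-update step per item is what accounts for the ~20% slowdown measured for approach B.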

Member:

I thought we weren't going to implement this at this step. Anyway, why does this need a class, and why not integrate it with the StorageEngine?

Contributor Author:

I put Events and Lookups outside since they were more of a proof of concept to evaluate different approaches.

@valeriocos

I find it pretty difficult to work on this task. Please @sduenas, let me know how you want things implemented and I'll try to address your requirements.


sduenas commented Jul 10, 2019

I find it pretty difficult to work on this task. Please @sduenas, let me know how you want things implemented and I'll try to address your requirements.

No worries. Have a good time the next two days and we talk when you are back!
