[citadel] Add events (WIP) #5
base: master
Conversation
This code enhances the write method to include the item_type and field_id params: the former defines the type of items stored, while the latter points to the field used as the unique identifier when storing items. Signed-off-by: Valerio Cosentino <[email protected]>
This code includes the method `set_alias`, which sets an alias on a target index. Tests have been added accordingly.
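As a rough illustration of what the alias does (this is a sketch, not the actual `set_alias` implementation; a plain dict stands in for the cluster's alias table):

```python
# Hedged sketch: an alias lets readers query a stable name while the
# backing index rotates. A dict stands in for Elasticsearch state here.
def set_alias(aliases, alias_name, index_name):
    """Point `alias_name` at `index_name` (one alias, one target)."""
    aliases[alias_name] = index_name
    return aliases

aliases = {}
set_alias(aliases, 'events', 'events_2020-01')
set_alias(aliases, 'events', 'events_2020-02')  # repoint after index rotation
```

Queries against `events` keep working unchanged while the underlying index name changes over time.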
This code eventizes Perceval items, which are stored in a collection leveraging the ElasticSearch storage engine. Tests have been added accordingly. Signed-off-by: Valerio Cosentino <[email protected]>
Signed-off-by: Valerio Cosentino <[email protected]>
Signed-off-by: Valerio Cosentino <[email protected]>
Results and discussions are also reported below:
Please check my comments.
```diff
@@ -130,12 +104,15 @@ def create_index(self, index_name, mapping):
             logger.error(msg)
             raise StorageEngineError(cause=msg)

-    def write(self, resource, data, chunk_size=CHUNK_SIZE):
+    def write(self, resource, data, item_type=ITEMS, chunk_size=CHUNK_SIZE, field_id=None):
```
Why are these new parameters needed? How do you expect to use this method in the future?
> Why are these new parameters needed?

`item_type` can be used to flag or filter data; `field_id` can be useful to point out which attribute should be considered the unique identifier.

> How do you expect to use this method in the future?

The Events and Lookups classes are already using it.
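To make the intent of the two parameters concrete, here is a sketch of how the bulk actions could be built; only `item_type` and `field_id` come from the diff above, the rest (helper name, action layout) is assumed for illustration:

```python
# Hypothetical helper mirroring the extended write() signature: tag each
# document with `item_type`; when `field_id` is given, use that attribute
# as the unique `_id`, otherwise let the storage engine assign ids.
ITEMS = 'items'

def build_actions(index, data, item_type=ITEMS, field_id=None):
    actions = []
    for item in data:
        action = {'_index': index,
                  '_source': dict(item, type=item_type)}
        if field_id is not None:
            # e.g. field_id='uuid' deduplicates Perceval items by uuid
            action['_id'] = item[field_id]
        actions.append(action)
    return actions
```

With `field_id=None` two items sharing a `uuid` become two documents; with `field_id='uuid'` the second write overwrites the first.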
Not sure about this. I see `write` as something to write items; if you want to write some kind of metadata, why don't you create specific methods for it? I also think `Lookups` is something really specific to ES, but maybe I'm wrong.
I also think that the more parameters we add, the more difficult it is to test anything.
OK, understood
```python
        self.timeframe = timeframe
        self.base_index = base_index

    def index_name(self):
```
This is still ElasticSearch specific. At this level, I think we should avoid this. I think it should be an abstract class, and each specific class will implement the methods needed to write and route where the items will be written.
If that creates a lot of levels of abstraction, all of this can be moved to the StorageEngine class. That would work too, but we have to take into account that it should be independent and without public references to ES.
> ...I think it should be an abstract class and each specific class will implement the methods needed to write and route where the items will be written.

You have just described the StorageEngine class.

> If that creates a lot of levels of abstraction, all of this can be moved to the StorageEngine class. That would work too, but we have to take into account that it should be independent and without public references to ES.

- Why can't we build on top of the elasticsearch storage engine (which already hides some specifics of dealing with ES)?
- Who is going to set the index name, or do you plan to create it in a fully automatic way?
> ...I think it should be an abstract class and each specific class will implement the methods needed to write and route where the items will be written.
>
> You have just described the StorageEngine class.

Not really. As `StorageEngine` is now, you have to be explicit about where you want to store your items. I, as a developer/user, don't want to decide this. The library should decide how to do this and what the best way to do it is. I just want to store data and retrieve data.

> If that creates a lot of levels of abstraction, all of this can be moved to the StorageEngine class. That would work too, but we have to take into account that it should be independent and without public references to ES.
>
> * Why can't we build on top of the elasticsearch storage engine (which already hides some specifics of dealing with ES)?

We can build on it, and we should. My point is I don't want to expose the concepts of index, name selection and so on. The system should do that for me.

> * Who is going to set the index name, or do you plan to create it in a fully automatic way?

The library should do that. The developer/user can define a prefix if we want, but nothing else.

The idea behind all of this is to have different levels of abstraction. The lower level should do what `StorageEngine` is doing now. An upper level should take the items and decide where and how to store them. This level can be integrated in the `StorageEngine`, but I see two different levels of abstraction. Maybe the current `StorageEngine` should be something private, so each developer who wants a new system creates it by himself or herself.
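The two levels described above could be sketched roughly like this (all names besides `StorageEngine` are assumptions, and `InMemoryEngine` is only a toy backend for illustration):

```python
# Sketch of the two abstraction levels: a (possibly private) low-level
# engine, and an upper layer that hides index naming from the caller.
from abc import ABC, abstractmethod

class StorageEngine(ABC):
    """Lower level: knows how to talk to a concrete backend."""
    @abstractmethod
    def write(self, index, items):
        raise NotImplementedError

class ItemStore:
    """Upper level: callers just store items; routing (index names)
    is decided here. The caller sets at most a prefix."""
    def __init__(self, engine, prefix='events'):
        self.engine = engine
        self.prefix = prefix

    def store(self, items):
        # The routing policy lives here; a real one could pick
        # time-framed indexes instead of a single prefix.
        self.engine.write(self.prefix, items)

class InMemoryEngine(StorageEngine):
    """Toy backend used only to exercise the sketch."""
    def __init__(self):
        self.written = {}
    def write(self, index, items):
        self.written.setdefault(index, []).extend(items)
```

The caller never names an index; swapping `InMemoryEngine` for an ES-backed engine would not change the `ItemStore` API.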
```python
    def index_name(self):

    def __set_timeframe_format():
```
This means we are storing data depending on when it was retrieved. Another possibility is to store data depending on when it was updated/created. That would spread the items among indexes.
We discussed this offline, before working on the PR.
Do you want to change the way items are stored? Let me know and I'll change the code as you prefer.
I'm just wondering what solution will be the best. I like the idea you proposed of storing data as they were events. Taking that into account I'd like to understand which solution is better.
Your current solution is to store everything with the date we're getting it. That means we can have a huge index when we start analyzing a project; later, indexes will be smaller. The good thing is that it's fast to store data, and we can quickly access the information we stored at a certain point in time.
The other solution is to store data by when it was updated in the origin. The good thing is that it makes searching within a date range faster. You can configure shards to give more resources and speed up searches across all the indexes within a range (for example, the last two years). It also follows better the approach of storing events as they are gathered. The big problem is that we need something to route items to their right index (we can have indexes per year and month, like in gharchive), which makes the system slower when writing data.
Any solution will be fine. I just want to think the pros and cons before implementing it.
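For reference, time-framed index naming along the lines being discussed could look like this (a sketch only; the format strings and function names are assumptions, while the timeframe constants appear in the diff below):

```python
# Sketch of time-framed index selection: the timeframe picks how items
# are bucketed into indexes, e.g. one index per month as in gharchive.
import datetime

BY_MINUTE = '%Y-%m-%d_%H%M'
BY_HOUR = '%Y-%m-%d_%H'
BY_DAY = '%Y-%m-%d'
BY_MONTH = '%Y-%m'
TIMEFRAMES = [BY_MINUTE, BY_HOUR, BY_DAY, BY_MONTH]

def index_name(base_index, timeframe, when=None):
    """Return the index a given datetime falls into, e.g. 'events_2020-01'.

    `when` is the routing date: retrieval time in the current solution,
    or `metadata_updated_on` in the alternative solution.
    """
    when = when or datetime.datetime.utcnow()
    return '{}_{}'.format(base_index, when.strftime(timeframe))
```

The trade-off above is only about which date feeds `when`: retrieval time needs no routing logic at write time, update time spreads items across indexes.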
```python
TIMEFRAMES = [BY_MINUTE, BY_HOUR, BY_DAY, BY_MONTH]


class Events:
```
Why is this called `Events`?
We discussed this offline, before working on the PR.
It is called `Events` because we eventize the Perceval items. You can find more details in the description of approach A:

> Approach A consists in writing Perceval items to an index, leaving ElasticSearch the responsibility to assign unique identifiers; thus, Perceval items with the same `uuid` are indexed more than once.
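Approach A, as quoted, amounts to emitting bulk actions with no `_id`; a minimal illustrative sketch (the helper name is an assumption):

```python
# Sketch of approach A: no `_id` is set on the actions, so the storage
# engine assigns identifiers and items sharing a `uuid` are indexed
# more than once, each write becoming a separate event.
def eventize(index, items):
    return [{'_index': index, '_source': item} for item in items]

actions = eventize('events', [{'uuid': 'abc', 'n': 1},
                              {'uuid': 'abc', 'n': 2}])  # same uuid, two events
```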
I know, but the name doesn't make sense to me. Is this a list of events? It also shouldn't be plural.
```python
from citadel.storage_engines.elasticsearch import ElasticsearchStorage


class Lookups:
```
What's the purpose of this class?
We discussed this offline, before working on the PR.
The lookups index keeps the last value of the Perceval items inserted in the events index. It is used in approach B:

> Approach B extends approach A by keeping the metadata information in a separate index, in order to know the latest time information (i.e., `metadata_timestamp`, `metadata_updated_on`) of a given Perceval item. As can be seen from the results, this extra step decreases performance by around 20%.
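The lookups bookkeeping described above boils down to an upsert keyed by `uuid`; a sketch with a dict standing in for the lookups index (the field names come from the quoted text, the helper is hypothetical):

```python
# Sketch of approach B's side index: per `uuid`, keep only the latest
# time information of the Perceval item; this extra write per item is
# what costs the ~20% reported above.
def update_lookup(lookups, item):
    uuid = item['uuid']
    current = lookups.get(uuid)
    if current is None or item['metadata_updated_on'] > current['metadata_updated_on']:
        lookups[uuid] = {'metadata_timestamp': item['metadata_timestamp'],
                         'metadata_updated_on': item['metadata_updated_on']}
    return lookups
```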
I thought we weren't going to implement this in this step. Anyway, why does this need a class, and why not integrate it with the `StorageEngine`?
I put Events and Lookups outside since they were more of a POC to evaluate different approaches.
I find it pretty difficult to work on this task. Please @sduenas let me know how you want things implemented and I'll try to address your requirements.
No worries. Have a good time the next two days and we'll talk when you are back!
This PR proposes an implementation to store items coming from Perceval. In a nutshell, the approach leverages one of the two scenarios ElasticSearch is designed for: time series data. Thus, Perceval items are considered events and stored in an index (which is aliased). The responsibility to assign unique identifiers to such items is delegated to ElasticSearch; thus, Perceval items with the same `uuid` are indexed more than once. Note that this is WIP; refinements to code/comments may be the target of further iterations.

Different approaches have been tried; they are presented in the commit [bin] Share performance tests and results. The commit [collections] Add lookups collection is part of one of the approaches evaluated.