forked from rkalla/common-parser-lib
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathREADME
More file actions
176 lines (129 loc) · 7.26 KB
/
README
File metadata and controls
176 lines (129 loc) · 7.26 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
The Buzz Media common-parser-lib
http://www.thebuzzmedia.com/software/common-parser-lib-common-parser-java-utility-library/
Changelog
---------
3.0
* Refactored library under base "parser" package to keep integration with
future Buzz Media "common" libraries cleaner.
* Decoupled source type, input type, delimiter type and Token value types
all from each other. Allows for much more flexible API definitions.
* Generified types were renamed to follow the naming scheme:
<IT>: input-type, the type of the input that is processed by the parser.
<ST>: source-type, type of the "source" that a Token gets its value from.
<TT>: token-type, if needed, the type returned by IToken.getType()
<VT>: value-type, the return type of IToken.getValue()
<DT>: delimiter-type, the type of the delims used by IDelimitedTokenizer.
<ET>: event-type, the type of the event pull parser's return.
* AbstractParser.refillBuffer was refactored and generalized so code is no
longer duplicated into different base parser implementations.
* AbstractParser.createBuffer hook was added to simplify parser initialized
in subclasses.
* AbstractParser.parseToken was added as the universal logic-loop used to
parse the next token from the given bIndex/bEndIndex range of chars in the
buffer. This method also handles refilling and re-trying the parse operation
on first-failure before automatically stopping the parser and returning
null.
This logic is universal and doesn't need to be duplicated in any other sub
classes; the only thing subclasses need is the actual scanning/marking logic
for the types of tokens they parse.
* Added AbstractReusableToken to make creating/using reusable tokens simpler.
* Moved ByteArrayToken and CharArrayToken into their respective ITokenizer
implementations. These are not general-use tokens except for these specific
ITokenizer implementation, so it made more sense to have them defined as
part of the tokenizer itself.
* IContainerToken definition was fixed to extend IToken.
* IContainerToken added the ability to dictate a bounds-growth mode based
on the child tokens added to it.
* Javadoc added to all the core interfaces to help make understanding the API
(starting with core classes) easier.
2.0
* Added parser package containing the fundamentals of a callback-based parser.
* Added IStreamParser to spec an InputStream parser processing bytes
* Added IReaderParser to spec a Reader parser processing chars.
* Added base abstract implementations for all parser types that include
boiler-plate logic like refilling the underlying read buffer from the stream
and passing generated tokens to the callback. Implementors need only add the
actual parsing logic.
* IContainerToken was added and contain 0 or more child IToken instances.
* AbstractContainerToken is a base implementation for IContainerToken and
provides the following behavior:
* Parent's length automatically expands to contain child tokens as they
are added.
* Child token bounds are vetted to ensure they are allowed within the
parent's bounds when added.
* ITokenizer.isReuseToken/setReuseToken was added for performance-minded
use; it instructs the tokenizer impl to update and return the same IToken
instance every call to nextToken instead of creating a new object each time.
This offers a performance boost and smaller memory footprint for implementors
that know the token instance will be ephemeral and don't try and hold on to
it.
* ITokenizer was generalized to be a more flexible base-interface for ANY
kind of tokenizer; not just a delimiter-based tokenizer which it was
previously.
* IDelimitedTokenizer now contained specific delimiter-based specification.
* Added IScanner interface used to define a base scanner implementation.
* IToken was moved to the common base package as it applies to all the
parsers.
1.1
* Initial public release.
License
-------
This library is released under the Apache 2 License. See LICENSE.
Description
-----------
A collection of interfaces and base implementations for classes that deal with
parsing.
The goal of this library is provide a clean API with well-defined behaviors for
different types of parsers to make custom parser implementation quick and easy.
Base abstract implementations are provided, where applicable, to make extension
and implementation straight forward.
The 3 types of parsers and intended use are:
* Scanner: A stateless set of logic used to scan a portion of data,
extracting a series of IToken<T> objects describing the content and returning
them to the caller. Scanners are meant to be thread-safe as they are
stateless.
* Tokenizer: A stateful class used to wrap a source of data and be invoked
by the caller in a while-loop, pulling nextToken() from the tokenizer until
it returns null; indicating the data has been exhausted. Tokenizers are
meant to be reusable, but NOT thread-safe as they maintain state relative to
the content they are parsing.
* Parser: A stateful class used to parse IToken<T> instances from the given
source and notify a callback every time a new token is parsed. Sources can
be InputStreams (byte[]) or Readers (char[]). Parsers are meant to be
reusable, like tokenizers, but NOT thread-safe as they maintain state about
their source content as well.
There are no base implementations for scanners are they are simple yet
highly-specialized; any implementor can implement IScanner and fill in his own
logic.
There are base abstract and concrete implementations for ITokenizer in the form
of a simple delimiter-based tokenizer. Base abstract implementations provide
all the boiler plate necessary to run through the given data source and
implementors only need to provide the parsing logic.
Base abstract implementations for IParser are available as well, providing
optimized boilerplate for tasks like refilling read buffers from the underlying
streams, vetting arguments and providing hooks for the actual parse logic to
implementors.
Runtime Requirements
--------------------
1. The Buzz Media common-lib (tbm-common-lib-<VER>.jar)
http://www.thebuzzmedia.com/software/common-lib-common-java-utility-library/
History
-------
While developing the CloudFront Log Parser (http://www.thebuzzmedia.com/software/cloudfront-log-parser/),
High Performance XML Parser (http://www.thebuzzmedia.com/software/high-performance-java-xml-parser-hpjxp/)
and Redis Client Driver (http://www.thebuzzmedia.com/software/redis-java-client-db-driver/)
I began to see a lot of duplication in parser/scanner/lexer logic emerge.
It took about a week to normalize all the use-cases into what looked like 3
different approaches to parsing: Scanning, tokenizing and parsing to a callback.
In an attempt to normalize my efforts across all these projects and come up with
a clean API that I can follow now and in the future, this general-use project
was born.
My goals for this project were not so much base implementations as they were:
* Well-defined structure to follow
* Solid, performant base implementations for common boilerplate.
So I set out to define this library providing a clean API that can be easily
extended to make custom parser implementations quickly.
Contact
-------
If you have questions, comments or bug reports for this software please contact
us at: [email protected]