The Buzz Media common-parser-lib
http://www.thebuzzmedia.com/software/common-parser-lib-common-parser-java-utility-library/


Changelog
---------
3.0
	* Refactored library under base "parser" package to keep integration with
	future Buzz Media "common" libraries cleaner.

	* Decoupled source type, input type, delimiter type and Token value types
	all from each other. Allows for much more flexible API definitions.

	* Generified types were renamed to follow the naming scheme:
	  <IT>: input-type, the type of the input that is processed by the parser.
	  <ST>: source-type, type of the "source" that a Token gets its value from.
	  <TT>: token-type, if needed, the type returned by IToken.getType()
	  <VT>: value-type, the return type of IToken.getValue()
	  <DT>: delimiter-type, the type of the delims used by IDelimitedTokenizer.
	  <ET>: event-type, the type of the event pull parser's return.

	* AbstractParser.refillBuffer was refactored and generalized so code is no
	longer duplicated into different base parser implementations.

	* AbstractParser.createBuffer hook was added to simplify parser
	initialization in subclasses.

	* AbstractParser.parseToken was added as the universal logic-loop used to
	parse the next token from the given bIndex/bEndIndex range of chars in the
	buffer. This method also handles refilling and re-trying the parse operation
	on first-failure before automatically stopping the parser and returning
	null.

	This logic is universal and doesn't need to be duplicated in any other sub
	classes; the only thing subclasses need is the actual scanning/marking logic
	for the types of tokens they parse.

	* Added AbstractReusableToken to make creating/using reusable tokens simpler.

	* Moved ByteArrayToken and CharArrayToken into their respective ITokenizer
	implementations. These are not general-use tokens outside of these specific
	ITokenizer implementations, so it made more sense to have them defined as
	part of the tokenizer itself.

	* IContainerToken definition was fixed to extend IToken.

	* IContainerToken added the ability to dictate a bounds-growth mode based
	on the child tokens added to it.

	* Javadoc added to all the core interfaces to help make understanding the API
	(starting with core classes) easier.

2.0
	* Added parser package containing the fundamentals of a callback-based parser.

	* Added IStreamParser to spec an InputStream parser processing bytes.

	* Added IReaderParser to spec a Reader parser processing chars.

	* Added base abstract implementations for all parser types that include
	boiler-plate logic like refilling the underlying read buffer from the stream
	and passing generated tokens to the callback. Implementors need only add the
	actual parsing logic.

	* IContainerToken was added; it can contain 0 or more child IToken instances.

	* AbstractContainerToken is a base implementation for IContainerToken and
	provides the following behavior:
		* Parent's length automatically expands to contain child tokens as they
		are added.
		* Child token bounds are vetted to ensure they are allowed within the
		parent's bounds when added.

	* ITokenizer.isReuseToken/setReuseToken was added for performance-minded
	use; it instructs the tokenizer impl to update and return the same IToken
	instance on every call to nextToken instead of creating a new object each
	time. This offers a performance boost and a smaller memory footprint for
	implementors that know the token instance will be ephemeral and do not try
	to hold on to it.

	* ITokenizer was generalized to be a more flexible base-interface for ANY
	kind of tokenizer; not just a delimiter-based tokenizer which it was
	previously.

	* IDelimitedTokenizer now contains the delimiter-specific specification.

	* Added IScanner interface used to define a base scanner implementation.

	* IToken was moved to the common base package as it applies to all the
	parsers.

1.1
	* Initial public release.


License
-------
This library is released under the Apache 2 License. See LICENSE.


Description
-----------
A collection of interfaces and base implementations for classes that deal with
parsing.

The goal of this library is to provide a clean API with well-defined behaviors
for different types of parsers to make custom parser implementation quick and
easy. Base abstract implementations are provided, where applicable, to make
extension and implementation straightforward.

The 3 types of parsers and their intended uses are:
	* Scanner: A stateless set of logic used to scan a portion of data,
	extracting a series of IToken<T> objects describing the content and returning
	them to the caller. Scanners are meant to be thread-safe as they are
	stateless.

	* Tokenizer: A stateful class used to wrap a source of data and be invoked
	by the caller in a while-loop, pulling nextToken() from the tokenizer until
	it returns null, indicating the data has been exhausted. Tokenizers are
	meant to be reusable, but NOT thread-safe as they maintain state relative to
	the content they are parsing.

	* Parser: A stateful class used to parse IToken<T> instances from the given
	source and notify a callback every time a new token is parsed. Sources can
	be InputStreams (byte[]) or Readers (char[]). Parsers are meant to be
	reusable, like tokenizers, but NOT thread-safe as they maintain state about
	their source content as well.

There are no base implementations for scanners as they are simple yet
highly-specialized; any implementor can implement IScanner and fill in their
own logic.
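
To illustrate the idea, here is a minimal sketch of a stateless scanner. The
Token, Scanner and WordScanner types below are hypothetical stand-ins written
for this example; they are NOT the library's IToken<T> or IScanner definitions,
whose exact signatures are not reproduced in this README. The scanner keeps no
fields, which is what makes sharing one instance between threads safe.

```java
import java.util.ArrayList;
import java.util.List;

public class ScannerSketch {
    /** Hypothetical token: a region (index, length) inside the source array. */
    static final class Token {
        final char[] source;
        final int index;
        final int length;

        Token(char[] source, int index, int length) {
            this.source = source;
            this.index = index;
            this.length = length;
        }

        String getValue() {
            return new String(source, index, length);
        }
    }

    /** Hypothetical scanner contract (an assumption, not the real IScanner). */
    interface Scanner {
        List<Token> scan(char[] data, int index, int length);
    }

    /** Stateless: no fields, so a single instance can be shared by threads. */
    static final class WordScanner implements Scanner {
        @Override
        public List<Token> scan(char[] data, int index, int length) {
            List<Token> tokens = new ArrayList<>();
            int end = index + length;
            int start = -1; // start of the word being scanned, -1 if none

            for (int i = index; i < end; i++) {
                boolean whitespace = Character.isWhitespace(data[i]);
                if (!whitespace && start < 0) {
                    start = i; // a new word begins here
                } else if (whitespace && start >= 0) {
                    tokens.add(new Token(data, start, i - start));
                    start = -1;
                }
            }
            if (start >= 0) { // the last word runs to the end of the range
                tokens.add(new Token(data, start, end - start));
            }
            return tokens;
        }
    }

    /** Convenience helper: scan a String and collect the token values. */
    public static List<String> scanWords(String text) {
        char[] data = text.toCharArray();
        List<String> values = new ArrayList<>();
        for (Token token : new WordScanner().scan(data, 0, data.length)) {
            values.add(token.getValue());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(scanWords("GET /index.html 200"));
    }
}
```

Tokens here only record bounds into the original array; the String is created
only when getValue() is called.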

There are base abstract and concrete implementations for ITokenizer in the form
of a simple delimiter-based tokenizer. Base abstract implementations provide
all the boilerplate necessary to run through the given data source and
implementors only need to provide the parsing logic.
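
As a rough sketch of the pull-style usage described above, the example below
wraps a char[] source, hands out tokens from nextToken() until it returns null,
and can then be pointed at a new source. The DelimitedTokenizer class, its
setSource method and the Token type are assumptions made up for this example;
only the nextToken()-until-null contract comes from this README.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerSketch {
    /** Hypothetical token: a region (index, length) inside the source array. */
    static final class Token {
        final char[] source;
        final int index;
        final int length;

        Token(char[] source, int index, int length) {
            this.source = source;
            this.index = index;
            this.length = length;
        }

        String getValue() {
            return new String(source, index, length);
        }
    }

    /** Stateful and reusable, but NOT thread-safe: it tracks a position. */
    static final class DelimitedTokenizer {
        private final char delimiter;
        private char[] source;
        private int position;

        DelimitedTokenizer(char delimiter) {
            this.delimiter = delimiter;
        }

        /** Hypothetical reset hook: point the tokenizer at a new source. */
        void setSource(char[] source) {
            this.source = source;
            this.position = 0;
        }

        /** Returns the next token, or null once the source is exhausted. */
        Token nextToken() {
            if (source == null || position >= source.length) {
                return null;
            }
            int start = position;
            while (position < source.length && source[position] != delimiter) {
                position++;
            }
            Token token = new Token(source, start, position - start);
            position++; // step over the delimiter (or past the end)
            return token;
        }
    }

    /** Tokenizes every input with ONE tokenizer instance to show reuse. */
    public static List<String> tokenizeAll(char delimiter, String... inputs) {
        DelimitedTokenizer tokenizer = new DelimitedTokenizer(delimiter);
        List<String> values = new ArrayList<>();
        for (String input : inputs) {
            tokenizer.setSource(input.toCharArray());
            Token token;
            while ((token = tokenizer.nextToken()) != null) {
                values.add(token.getValue());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(tokenizeAll(',', "a,b,,c", "x"));
    }
}
```

In this sketch an empty field between two delimiters yields an empty token,
while a trailing delimiter does not produce a final empty token. That is a
choice made for the example, not a statement about the library's behavior.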

Base abstract implementations for IParser are available as well, providing
optimized boilerplate for tasks like refilling read buffers from the underlying
streams, vetting arguments and providing hooks for the actual parse logic to
implementors.
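
The division of labor described above can be sketched roughly as follows: a
base class owns the read buffer, refills it from a Reader and notifies a
callback, while a subclass supplies only the scanning hook. Every name here
(CallbackParser, findTokenEnd, LineParser, parseLines) is hypothetical and
chosen for this example; it is not the library's AbstractParser, and it uses a
plain Consumer<String> as the callback for brevity.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ParserSketch {
    /** Base class: owns the buffer, refills it and notifies the callback. */
    abstract static class CallbackParser {
        private final char[] buffer;

        CallbackParser(int bufferSize) {
            if (bufferSize < 1) {
                throw new IllegalArgumentException("bufferSize must be >= 1");
            }
            this.buffer = new char[bufferSize];
        }

        /** Hook: index of the delimiter ending the token, or -1 if none. */
        protected abstract int findTokenEnd(char[] buffer, int index, int end);

        void parse(Reader reader, Consumer<String> callback) throws IOException {
            int end = 0; // number of valid chars currently in the buffer
            boolean eof = false;

            while (!eof) {
                if (end == buffer.length) {
                    // A full buffer with no delimiter: this sketch gives up.
                    throw new IllegalStateException("token longer than buffer");
                }
                int read = reader.read(buffer, end, buffer.length - end);
                if (read < 0) {
                    eof = true;
                } else {
                    end += read;
                }

                int index = 0;
                int tokenEnd;
                while ((tokenEnd = findTokenEnd(buffer, index, end)) >= 0) {
                    callback.accept(new String(buffer, index, tokenEnd - index));
                    index = tokenEnd + 1; // continue after the delimiter
                }

                if (eof) {
                    if (index < end) { // trailing token without a delimiter
                        callback.accept(new String(buffer, index, end - index));
                    }
                } else {
                    // Keep the unfinished tail and move it to the front.
                    System.arraycopy(buffer, index, buffer, 0, end - index);
                    end -= index;
                }
            }
        }
    }

    /** Subclass: only the scanning logic, here "tokens end at a newline". */
    static final class LineParser extends CallbackParser {
        LineParser(int bufferSize) {
            super(bufferSize);
        }

        @Override
        protected int findTokenEnd(char[] buffer, int index, int end) {
            for (int i = index; i < end; i++) {
                if (buffer[i] == '\n') {
                    return i;
                }
            }
            return -1;
        }
    }

    /** Convenience helper: parse a String and collect the callback values. */
    public static List<String> parseLines(String text, int bufferSize) {
        List<String> lines = new ArrayList<>();
        try {
            new LineParser(bufferSize).parse(new StringReader(text), lines::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // A deliberately tiny buffer forces several refills.
        System.out.println(parseLines("ab\ncd\nef", 4));
    }
}
```

The small buffer in main is only there to show that tokens split across a
refill boundary still arrive whole at the callback.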


Runtime Requirements
--------------------
1.	The Buzz Media common-lib (tbm-common-lib-<VER>.jar)
	http://www.thebuzzmedia.com/software/common-lib-common-java-utility-library/


History
-------
While developing the CloudFront Log Parser (http://www.thebuzzmedia.com/software/cloudfront-log-parser/),
High Performance XML Parser (http://www.thebuzzmedia.com/software/high-performance-java-xml-parser-hpjxp/)
and Redis Client Driver (http://www.thebuzzmedia.com/software/redis-java-client-db-driver/)
I began to see a lot of duplication in parser/scanner/lexer logic emerge.

It took about a week to normalize all the use-cases into what looked like 3
different approaches to parsing: scanning, tokenizing and parsing to a callback.

In an attempt to normalize my efforts across all these projects and come up with
a clean API that I can follow now and in the future, this general-use project
was born.

My goals for this project were not so much concrete implementations as they
were:

	* A well-defined structure to follow.
	* Solid, performant base implementations for common boilerplate.

So I set out to define this library, providing a clean API that can be easily
extended to create custom parser implementations quickly.


Contact
-------
If you have questions, comments or bug reports for this software please contact
us at: [email protected]