Skip to content

Commit

Permalink
Improve user-facing docs. (#14251)
Browse files Browse the repository at this point in the history
This improves user-facing docs in Lucene's package javadocs:
 - Make some docs up-to-date, e.g. some of them were still referring to
   oal.index.Fields.
 - More emphasis of structured search and knn search.
 - More guidance on how to index data in `oal.document`.
 - Add API docs for impacts.
 - Discourage less strongly to change the similarity at search time.
 - Recommend using Query factory methods on Field classes instead of
   creating queries manually.
  • Loading branch information
jpountz committed Feb 19, 2025
1 parent 2dedd1d commit 1aa9e75
Show file tree
Hide file tree
Showing 5 changed files with 176 additions and 95 deletions.
61 changes: 56 additions & 5 deletions lucene/core/src/java/org/apache/lucene/document/package-info.java
Original file line number Diff line number Diff line change
Expand Up @@ -46,10 +46,61 @@
* make the process of taking a file and converting it into a Lucene {@link
* org.apache.lucene.document.Document}.
*
* <p>The {@link org.apache.lucene.document.DateTools} is a utility class to make dates and times
* searchable. {@link org.apache.lucene.document.IntPoint}, {@link
* org.apache.lucene.document.LongPoint}, {@link org.apache.lucene.document.FloatPoint} and {@link
* org.apache.lucene.document.DoublePoint} enable indexing of numeric values (and also dates) for
* fast range queries using {@link org.apache.lucene.search.PointRangeQuery}
* <h2>How to index ...</h2>
*
* <h3>Strings</h3>
*
* <p>{@link org.apache.lucene.document.TextField} allows indexing tokens from a String so that one
* can perform full-text search on it. The way that the input is tokenized depends on the {@link
* org.apache.lucene.analysis.Analyzer} that is configured on the {@link
* org.apache.lucene.index.IndexWriterConfig}. TextField can also be optionally stored.
*
* <p>{@link org.apache.lucene.document.KeywordField} indexes whole values as a single term so that
* one can perform exact search on it. It also records doc values to enable sorting or faceting on
* this field. Finally, it also supports optionally storing the value.
*
* <p>If faceting or sorting are not required, {@link org.apache.lucene.document.StringField} is a
* variant of {@link org.apache.lucene.document.KeywordField} that does not index doc values.
*
* <h3>Numbers</h3>
*
* <p>If a numeric field represents an identifier rather than a quantity and is more commonly
* searched on single values than on ranges of values, it is generally recommended to index its
* string representation via {@link org.apache.lucene.document.KeywordField} (or {@link
* org.apache.lucene.document.StringField} if doc values are not necessary).
*
* <p>{@link org.apache.lucene.document.LongField}, {@link org.apache.lucene.document.IntField},
* {@link org.apache.lucene.document.DoubleField} and {@link org.apache.lucene.document.FloatField}
* index values in a points index for efficient range queries, and also create doc-values for these
* fields for efficient sorting and faceting.
*
* <p>If the field is aimed at being used to tune the score, {@link
* org.apache.lucene.document.FeatureField} helps internally store numeric data as term frequencies
* in a way that makes it efficient to influence scoring at search time.
*
* <h3>Other types of structured data</h3>
*
* <p>It is recommended to index dates as a {@link org.apache.lucene.document.LongField} that stores
* the number of milliseconds since Epoch.
*
* <p>IP fields can be indexed via {@link org.apache.lucene.document.InetAddressPoint} in addition
* to a {@link org.apache.lucene.document.SortedDocValuesField} (if the field is single-valued) or
* {@link org.apache.lucene.document.SortedSetDocValuesField} that stores the result of {@link
* org.apache.lucene.document.InetAddressPoint#encode}.
*
* <h3>Dense numeric vectors</h3>
*
* <p>Dense numeric vectors can be indexed with {@link
* org.apache.lucene.document.KnnFloatVectorField} if its dimensions are floating-point numbers or
* {@link org.apache.lucene.document.KnnByteVectorField} if its dimensions are bytes. This allows
* searching for nearest neighbors at search time.
*
* <h3>Sparse numeric vectors</h3>
*
* <p>To perform nearest-neighbor search on sparse vectors rather than dense vectors, each dimension
* of the sparse vector should be indexed as a {@link org.apache.lucene.document.FeatureField}.
* Queries can then be constructed as a {@link org.apache.lucene.search.BooleanQuery} with {@link
* org.apache.lucene.document.FeatureField#newLinearQuery(String, String, float) linear queries} as
* {@link org.apache.lucene.search.BooleanClause.Occur#SHOULD} clauses.
*/
package org.apache.lucene.document;
117 changes: 65 additions & 52 deletions lucene/core/src/java/org/apache/lucene/index/package-info.java
Original file line number Diff line number Diff line change
Expand Up @@ -70,28 +70,33 @@
* <h3>IndexReader</h3>
*
* <p>{@link org.apache.lucene.index.IndexReader} is used to read data from the index, and supports
* searching. Many thread-safe readers may be {@link org.apache.lucene.index.DirectoryReader#open}
* concurrently with a single (or no) writer. Each reader maintains a consistent "point in time"
* view of an index and must be explicitly refreshed (see {@link
* org.apache.lucene.index.DirectoryReader#openIfChanged}) in order to incorporate writes that may
* occur after it is opened. <a id="segments"></a>
* searching. Many thread-safe readers may be {@link org.apache.lucene.index.DirectoryReader#open
* open} concurrently with a single (or no) writer. Each reader maintains a consistent "point in
* time" view of an index and must be explicitly refreshed (see {@link
* org.apache.lucene.index.DirectoryReader#openIfChanged(DirectoryReader, IndexWriter)}) in order to
* incorporate writes that may occur after it is opened. <a id="segments"></a>
*
* <h3>Segments and docids</h3>
*
* <p>Lucene's index is composed of segments, each of which contains a subset of all the documents
* in the index, and is a complete searchable index in itself, over that subset. As documents are
* written to the index, new segments are created and flushed to directory storage. Segments are
* immutable; updates and deletions may only create new segments and do not modify existing ones.
* Over time, the writer merges groups of smaller segments into single larger ones in order to
* composed of an immutable core and per-commit live documents and doc-value updates. Insertions add
* new segments. Deletions and doc-value updates in a given segment create a new segment that shares
* the same core as the previous segment and new live docs for this segment. Updates are implemented
* as an atomic insertion and deletion.
*
* <p>Over time, the writer merges groups of smaller segments into single larger ones in order to
* maintain an index that is efficient to search, and to reclaim dead space left behind by deleted
* (and updated) documents.
*
* <p>Each document is identified by a 32-bit number, its "docid," and is composed of a collection
* of Field values of diverse types (postings, stored fields, doc values, and points). Docids come
* in two flavors: global and per-segment. A document's global docid is just the sum of its
* per-segment docid and that segment's base docid offset. External, high-level APIs only handle
* global docids, but internal APIs that reference a {@link org.apache.lucene.index.LeafReader},
* which is a reader for a single segment, deal in per-segment docids.
* of Field values of diverse types (postings, stored fields, term vectors, doc values, points and
* knn vectors). Docids come in two flavors: global and per-segment. A document's global docid is
* just the sum of its per-segment docid and that segment's base docid offset. External, high-level
* APIs only handle global docids, but internal APIs that reference a {@link
* org.apache.lucene.index.LeafReader}, which is a reader for a single segment, deal in per-segment
* docids.
*
* <p>Docids are assigned sequentially within each segment (starting at 0). Thus the number of
* documents in a segment is the same as its maximum docid; some may be deleted, but their docids
Expand All @@ -117,45 +122,31 @@
* values given a docid. All stored field values for a document are stored together in a block.
* Different types of stored field provide high-level datatypes such as strings and numbers on top
* of the underlying bytes. Stored field values are usually retrieved by the searcher using an
* implementation of {@link org.apache.lucene.index.StoredFieldVisitor}. <a id="docvalues"></a>
* implementation of {@link org.apache.lucene.index.StoredFieldVisitor}. <a id="termvectors"></a>
*
* <p>{@link org.apache.lucene.index.TermVectors} store a per-document inverted index. They are
* useful for finding similar documents, called MoreLikeThis in Lucene. <a id="docvalues"></a>
*
* <p>{@link org.apache.lucene.index.DocValues} fields are what are sometimes referred to as
* columnar, or column-stride fields, by analogy to relational database terminology, in which
* documents are considered as rows, and fields, columns. DocValues fields store values per-field: a
* value for every document is held in a single data structure, providing for rapid, sequential
* lookup of a field-value given a docid. These fields are used for efficient value-based sorting,
* and for faceting, but they are not useful for filtering. <a id="points"></a>
* for faceting, and sometimes for filtering on the least selective clauses of a query. <a
* id="points"></a>
*
* <p>{@link org.apache.lucene.index.PointValues} represent numeric values using a kd-tree data
* structure. Efficient 1- and higher dimensional implementations make these the choice for numeric
* range and interval queries, and geo-spatial queries. <a id="postings"></a>
*
* <h2>Postings APIs</h2>
*
* <a id="fields"></a>
* range and interval queries, and geo-spatial queries. <a id="knnvectors"></a>
*
* <h3>Fields </h3>
* <p>{@link org.apache.lucene.index.KnnVectorValues} represent dense numeric vectors whose
* dimensions may either be bytes or floats. They are indexed in a way that allows searching for
* nearest neighbors. The vectors are typically produced by a machine-learned model, and used to
* perform semantic search.
*
* <p>{@link org.apache.lucene.index.Fields} is the initial entry point into the postings APIs, this
* can be obtained in several ways:
* <p><a id="postings"></a>
*
* <pre class="prettyprint">
* // access indexed fields for an index segment
* Fields fields = reader.fields();
* // access term vector fields for a specified document
* TermVectors vectors = reader.termVectors();
* Fields fields = vectors.get(docid);
* </pre>
*
* Fields implements Java's Iterable interface, so it's easy to enumerate the list of fields:
*
* <pre class="prettyprint">
* // enumerate list of fields
* for (String field : fields) {
* // access the terms for this field
* Terms terms = fields.terms(field);
* }
* </pre>
* <h2>Postings APIs</h2>
*
* <a id="terms"></a>
*
Expand All @@ -165,15 +156,16 @@
* exposes some metadata and <a href="#fieldstats">statistics</a>, and an API for enumeration.
*
* <pre class="prettyprint">
* Terms terms = leafReader.terms("body");
* // metadata about the field
* System.out.println("positions? " + terms.hasPositions());
* System.out.println("offsets? " + terms.hasOffsets());
* System.out.println("payloads? " + terms.hasPayloads());
* // iterate through terms
* TermsEnum termsEnum = terms.iterator(null);
* TermsEnum termsEnum = terms.iterator();
* BytesRef term = null;
* while ((term = termsEnum.next()) != null) {
* doSomethingWith(termsEnum.term());
* doSomethingWith(term);
* }
* </pre>
*
Expand All @@ -188,9 +180,9 @@
* // get the document frequency
* System.out.println(termsEnum.docFreq());
* // enumerate through documents
* PostingsEnum docs = termsEnum.postings(null, null);
* PostingsEnum docs = termsEnum.postings(null);
* // enumerate through documents and positions
* PostingsEnum docsAndPositions = termsEnum.postings(null, null, PostingsEnum.FLAG_POSITIONS);
* PostingsEnum docsAndPositions = termsEnum.postings(null, PostingsEnum.POSITIONS);
* }
* </pre>
*
Expand All @@ -199,40 +191,62 @@
* <h3>Documents </h3>
*
* <p>{@link org.apache.lucene.index.PostingsEnum} is an extension of {@link
* org.apache.lucene.search.DocIdSetIterator}that iterates over the list of documents for a term,
* org.apache.lucene.search.DocIdSetIterator} that iterates over the list of documents for a term,
* along with the term frequency within that document.
*
* <pre class="prettyprint">
* int docid;
* while ((docid = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
* System.out.println(docid);
* System.out.println(docsEnum.freq());
* }
* }
* </pre>
*
* <a id="positions"></a>
*
* <h3>Positions </h3>
* <h3>Positions</h3>
*
* <p>PostingsEnum also allows iteration of the positions a term occurred within the document, and
* any additional per-position information (offsets and payload). The information available is
* controlled by flags passed to TermsEnum#postings
*
* <pre class="prettyprint">
* int docid;
* PostingsEnum postings = termsEnum.postings(null, null, PostingsEnum.FLAG_PAYLOADS | PostingsEnum.FLAG_OFFSETS);
* PostingsEnum postings = termsEnum.postings(null, PostingsEnum.PAYLOADS | PostingsEnum.OFFSETS);
* while ((docid = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
* System.out.println(docid);
* int freq = postings.freq();
* for (int i = 0; i &lt; freq; i++) {
* System.out.println(postings.nextPosition());
* System.out.println(postings.startOffset());
* System.out.println(postings.endOffset());
* System.out.println(postings.getPayload());
* System.out.println(postings.nextPosition());
* System.out.println(postings.startOffset());
* System.out.println(postings.endOffset());
* System.out.println(postings.getPayload());
* }
* }
* </pre>
*
* <h3>Impacts</h3>
*
* <p>TermsEnum also allows returning an {@link org.apache.lucene.index.ImpactsEnum}, an extension
* of PostingsEnum that exposes pareto-optimal tuples of (term frequency, length normalization
* factor) per block of postings. It is typically used to compute the maximum possible score over
* these blocks of postings, so that they can be skipped if they cannot possibly produce a
* competitive hit.
*
* <pre class="prettyprint">
* int docid;
* ImpactsEnum impactsEnum = termsEnum.impacts(PostingsEnum.FREQS);
* int targetDocID = 420;
* impactsEnum.advanceShallow(targetDocID);
* // These impacts expose pareto-optimal tuples of (termFreq, lengthNorm) over various ranges of doc IDs.
* Impacts impacts = impactsEnum.getImpacts();
* for (int level = 0; level &lt; impacts.numLevels(); i++) {
* int docIdUpTo = impacts.getDocIdUpTo(level);
* // List of pareto-optimal (termFreq, lengthNorm) tuples between targetDocID inclusive and docIdUpTo inclusive.
* List&lt;Impact&gt; perLevelImpacts = impacts.getImpacts(level);
* }
* </pre>
*
* <a id="stats"></a>
*
* <h2>Index Statistics</h2>
Expand Down Expand Up @@ -286,7 +300,6 @@
* (excluding deleted documents) in the index.
* <li>{@link org.apache.lucene.index.IndexReader#numDeletedDocs}: Returns the number of deleted
* documents in the index.
* <li>{@link org.apache.lucene.index.Fields#size}: Returns the number of indexed fields.
* </ul>
*
* <a id="documentstats"></a>
Expand Down
17 changes: 14 additions & 3 deletions lucene/core/src/java/org/apache/lucene/search/package-info.java
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,12 @@
* important Query classes. For details on implementing your own Query class, see <a
* href="#customQueriesExpert">Custom Queries -- Expert Level</a> below.
*
* <p>Make sure to look at {@link org.apache.lucene.search.Query} factory methods on {@link
* org.apache.lucene.index.IndexableField}s that you feed into the index writer, they are convenient
* to use and sometimes more efficient than a naively constructed {@link
* org.apache.lucene.search.Query}. See {@link
* org.apache.lucene.document.LongField#newRangeQuery(String, long, long)} for instance.
*
* <p>To perform a search, applications usually call {@link
* org.apache.lucene.search.IndexSearcher#search(Query,int)}.
*
Expand Down Expand Up @@ -204,7 +210,8 @@
* documents that need to be scored based on boolean logic in the Query specification, and then
* ranks this subset of matching documents via the retrieval model. For some valuable references on
* VSM and IR in general refer to <a
* href="http://wiki.apache.org/lucene-java/InformationRetrieval">Lucene Wiki IR references</a>.
* href="https://cwiki.apache.org/confluence/display/LUCENEJAVA/InformationRetrieval">Lucene Wiki IR
* references</a>.
*
* <p>The rest of this document will cover <a href="#scoringBasics">Scoring basics</a> and explain
* how to change your {@link org.apache.lucene.search.similarities.Similarity Similarity}. Next, it
Expand Down Expand Up @@ -253,8 +260,12 @@
* org.apache.lucene.index.IndexWriterConfig#setSimilarity(org.apache.lucene.search.similarities.Similarity)
* IndexWriterConfig.setSimilarity(Similarity)} and at query-time with {@link
* org.apache.lucene.search.IndexSearcher#setSimilarity(org.apache.lucene.search.similarities.Similarity)
* IndexSearcher.setSimilarity(Similarity)}. Be sure to use the same Similarity at query-time as at
* index-time (so that norms are encoded/decoded correctly); Lucene makes no effort to verify this.
* IndexSearcher.setSimilarity(Similarity)}. Be sure to use search-time similarities that encode the
* length normalization factor the same way as the similarity that you used at index time. All
* Lucene built-in similarities use the default encoding so they are compatible, but if you use a
* custom similarity that changes the encoding of the length normalization factor, you are on your
* own: Lucene makes no effort to ensure that the index-time and the search-time similarities are
* compatible.
*
* <p>You can influence scoring by configuring a different built-in Similarity implementation, or by
* tweaking its parameters, subclassing it to override behavior. Some implementations also offer a
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -76,10 +76,14 @@
* org.apache.lucene.search.similarities.BM25Similarity#BM25Similarity(float,float) b} parameter to
* {@code 0}.
*
* <p>To change {@link org.apache.lucene.search.similarities.Similarity}, one must do so for both
* indexing and searching, and the changes must happen before either of these actions take place.
* Although in theory there is nothing stopping you from changing mid-stream, it just isn't
* well-defined what is going to happen.
* <p>To switch to a {@link org.apache.lucene.search.similarities.Similarity} that encodes the
* length normalization differently, one must do so for both indexing and searching, and the changes
* must happen before either of these actions take place. Note that all of Lucene's built-in
* similarities - and more generally all {@link org.apache.lucene.search.similarities.Similarity}
* sub-classes that don't override {@link
* org.apache.lucene.search.similarities.Similarity#computeNorm(org.apache.lucene.index.FieldInvertState)}
* - encode the length normalization factor the same way, so it is fine to change the similarity at
* search-time without recreating the index.
*
* <p>To make this change, implement your own {@link
* org.apache.lucene.search.similarities.Similarity} (likely you'll want to simply subclass {@link
Expand Down
Loading

0 comments on commit 1aa9e75

Please sign in to comment.