Overview

Apache Lucene

Apache Lucene is a high-performance, full-featured text search engine library written in Java.

Online Documentation

This README file only contains basic setup instructions. For more comprehensive documentation, visit the Lucene website: https://lucene.apache.org/core/

Building with Gradle

Basic steps:

  1. Install OpenJDK 11 (or greater)
  2. Download Lucene from Apache and unpack it (or clone the git repository).
  3. Run the Gradle launcher script (gradlew).

Step 0) Set up your development environment (OpenJDK 11 or greater)

We'll assume that you know how to get and set up the JDK - if you don't, then we suggest starting at https://jdk.java.net/ and learning more about Java, before returning to this README. Lucene runs with Java 11 or later.

Lucene uses Gradle for build control.

NOTE: Lucene changed from Ant to Gradle as of release 9.0. Prior releases still use Ant.

Step 1) Checkout/Download Lucene source code

You can clone the source code from GitHub:

https://github.com/apache/lucene

or get Lucene source archives for a particular release from:

https://lucene.apache.org/core/downloads.html

Download either a zip or a tarred/gzipped version of the archive, and uncompress it into a directory of your choice.

Step 2) Run Gradle

Run "./gradlew help", this will show the main tasks that can be executed to show help sub-topics.

If you want to build Lucene, type:

./gradlew assemble

NOTE: DO NOT use the gradle command that may already be installed on your machine (unless you know what you're doing). The Gradle wrapper (gradlew) does the job: it downloads the correct Gradle version and sets up the necessary configuration.

The first time you run Gradle, it will create a file "gradle.properties" that contains machine-specific settings. Normally you can use this file as-is, but it can be modified if necessary.

./gradlew check will assemble Lucene and run all validation tasks (including unit tests).

./gradlew help will print a list of help guides that explain how the build and typical workflows work.

If you want to build the documentation, type:

./gradlew documentation

Gradle build and IDE support

  • IntelliJ - IntelliJ IDEA can import the project out of the box.
  • Eclipse - Basic support (help/IDEs.txt).
  • NetBeans - Not tested.

Contributing

Please review the Contributing to Lucene Guide for information on contributing.

Discussion and Support

Comments
  • Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector [LUCENE-1483]

    This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive. If only a few segments change, the FieldCache is still loaded for all of them.

    This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReaders, making reopen much cheaper. FieldCache loading over multiple segments can be much faster as well - with the old method, all unique terms for every segment are enumerated against each segment; because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the MultiReader are used to score results for each segment.

    When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new field-sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment and, upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

    All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a sizeable loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).
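
    As a rough illustration of this shared, per-segment collection pattern, here is a minimal sketch written against the Collector/SimpleCollector API of current Lucene rather than the original 2.9 HitCollector/MultiReaderHitCollector classes; the counting logic is only illustrative.

    import java.io.IOException;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.ScoreMode;
    import org.apache.lucene.search.SimpleCollector;

    // The searcher drives ONE collector across all segments and notifies it each
    // time it moves to the next segment, so caches and ordinals can be keyed per
    // SegmentReader instead of per MultiReader.
    public class CountingCollector extends SimpleCollector {
      private int docBase;   // doc id offset of the current segment
      private int hits;

      @Override
      protected void doSetNextReader(LeafReaderContext context) throws IOException {
        docBase = context.docBase;   // called once per segment
      }

      @Override
      public void collect(int doc) throws IOException {
        // docBase + doc is the index-wide doc id, if a global id is needed
        hits++;
      }

      @Override
      public ScoreMode scoreMode() {
        return ScoreMode.COMPLETE_NO_SCORES;
      }

      public int hitCount() {
        return hits;
      }
    }

    Such a collector would be used as searcher.search(query, new CountingCollector()), assuming an IndexSearcher named searcher.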

    • Introduces
      • MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
      • TopFieldCollector - a HitCollector that can compare values/ordinals across IndexReaders and sort on fields.
      • FieldValueHitQueue - a Priority queue that is part of the TopFieldCollector implementation.
      • FieldComparator - a new Comparator class that works across IndexReaders. Part of the TopFieldCollector implementation.
      • FieldComparatorSource - new class to allow for custom Comparators.
    • Alters
      • IndexSearcher uses a single HitCollector to collect hits against each individual SegmentReader. All the other changes stem from this ;)
    • Deprecates
      • TopFieldDocCollector
      • FieldSortedHitQueue

    Migrated from LUCENE-1483 by Mark Miller (@markrmiller), 1 vote, resolved Feb 02 2009 Attachments: LUCENE-1483.patch (versions: 35), LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, sortBench.py, sortCollate.py Linked issues:

    • #2381
    • #3793
    type:enhancement legacy-jira-resolution:Fixed legacy-jira-priority:Minor legacy-jira-fix-version:2.9 affects-version:2.9 
    opened by asfimport 319
  • Further steps towards flexible indexing [LUCENE-1458]

    I attached a very rough checkpoint of my current patch, to get early feedback. All tests pass, though back-compat tests don't pass due to changes to package-private APIs plus certain bugs in tests that happened to work (e.g. calling TermPositions.nextPosition() too many times, which the new API asserts against).

    [Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package-private APIs on that branch, then fix the nightly build to use the tip of that branch?]

    There's still plenty to do before this is committable! This is a rather large change:

    • Switches to a new, more efficient terms dict format. This still uses tii/tis files, but the tii only stores term & long offset (not a TermInfo). At seek points, tis encodes term & freq/prox offsets absolutely instead of with deltas. Also, tis/tii are structured by field, so we don't have to record the field number in every term. On the first 1 M docs of Wikipedia, the tii file is 36% smaller (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB). RAM usage when loading the terms dict index is significantly less, since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init too. This part is basically done.

    • Introduces a modular reader codec that strongly decouples the terms dict from the docs/positions readers. EG there is no more TermInfo used when reading the new format. There's nice symmetry now between reading & writing in the codec chain – the current docs/prox format is captured in:

      FormatPostingsTermsDictWriter/Reader
      FormatPostingsDocsWriter/Reader (.frq file) and
      FormatPostingsPositionsWriter/Reader (.prx file).
      
      This part is basically done.
      
    • Introduces a new "flex" API for iterating through the fields, terms, docs and positions:

      FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
      
      This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
      old API on top of the new API to keep back-compat.
      

    Next steps:

    • Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions.

    • Expose new API out of IndexReader, deprecate old API but emulate old API on top of new one, switch all core/contrib users to the new API.

    • Maybe switch to AttributeSources as the base class for TermsEnum, DocsEnum, PostingsEnum – this would give readers API flexibility (not just index-file-format flexibility). EG if someone wanted to store payload at the term-doc level instead of term-doc-position level, you could just add a new attribute.

    • Test performance & iterate.


    Migrated from LUCENE-1458 by Michael McCandless (@mikemccand), 1 vote, resolved Dec 03 2009 Attachments: LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, LUCENE-1458.patch (versions: 13), LUCENE-1458.tar.bz2 (versions: 7), LUCENE-1458-back-compat.patch (versions: 6), LUCENE-1458-DocIdSetIterator.patch (versions: 2), LUCENE-1458-MTQ-BW.patch, LUCENE-1458-NRQ.patch, UnicodeTestCase.patch (versions: 2) Linked issues:

    • #3100
    type:enhancement legacy-jira-resolution:Fixed module:core/index legacy-jira-priority:Minor legacy-jira-fix-version:4.0-ALPHA affects-version:4.0-ALPHA 
    opened by asfimport 256
  • Per thread DocumentsWriters that write their own private segments [LUCENE-2324]

    See #3369 for motivation and more details.

    I'm copying here Mike's summary he posted on 2293:

    Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and "normal" segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU.
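
    To make the benefit concrete, here is a hedged sketch using the current IndexWriter API (the directory path, field name and thread/doc counts are made up): several threads share one IndexWriter, and with per-thread DocumentsWriters each thread fills its own in-memory segment, which can flush without blocking the others.

    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public static void indexConcurrently() throws Exception {
      IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
      try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/tmp/idx")), config)) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) {
          pool.submit(() -> {
            for (int i = 0; i < 10_000; i++) {
              Document doc = new Document();
              doc.add(new TextField("body", "document text " + i, Field.Store.NO));
              writer.addDocument(doc);   // each thread feeds its own in-memory segment
            }
            return null;
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
      }
    }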


    Migrated from LUCENE-2324 by Michael Busch, 1 vote, resolved Apr 28 2011 Attachments: ASF.LICENSE.NOT.GRANTED--lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch (versions: 4), LUCENE-2324-SMALL.patch (versions: 5), test.out (versions: 4) Linked issues:

    • #3388
    • #4102
    • #4030
    • #3955
    • #3647
    • #3369
    type:enhancement legacy-jira-resolution:Fixed module:core/index legacy-jira-priority:Minor legacy-jira-fix-version:Realtime Branch 
    opened by asfimport 241
  • Automaton Query/Filter (scalable regex) [LUCENE-1606]

    Attached is a patch for an AutomatonQuery/Filter (the name can change if it's not suitable).

    Whereas the out-of-the-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where such queries are quite slow (2 minutes, etc.). Additionally, all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon a constant prefix, and runs the same query in 640ms.

    Some use cases I envision:

    1. lexicography/etc on large text corpora
    2. looking for things such as urls where the prefix is not constant (http:// or ftp://)

    The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:

     The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
      
     1. Look at the portion that is OK (did not enter a reject state in the DFA)
     2. Generate the next possible String and seek to that.
    

    The Query simply wraps the filter with ConstantScoreQuery.

    I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.
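
    For reference, a hedged sketch of how this looks with the AutomatonQuery/RegexpQuery that grew out of this work, using the automaton support that was later bundled as org.apache.lucene.util.automaton instead of the external BRICS jar; the field name and pattern are made up:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.AutomatonQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RegexpQuery;
    import org.apache.lucene.util.automaton.Automaton;
    import org.apache.lucene.util.automaton.RegExp;

    static Query urlQuery(String field) {
      String pattern = "(ht|f)tps?://.*";                      // no constant prefix required
      Automaton automaton = new RegExp(pattern).toAutomaton(); // regular expression -> automaton
      // Terms are enumerated against the automaton, seeking past rejected prefixes.
      Query viaAutomaton = new AutomatonQuery(new Term(field, pattern), automaton);
      // RegexpQuery is the convenience subclass that does the same two steps internally.
      Query viaRegexp = new RegexpQuery(new Term(field, pattern));
      return viaAutomaton;
    }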


    Migrated from LUCENE-1606 by Robert Muir (@rmuir), resolved Dec 09 2009 Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606_nodep.patch, LUCENE-1606.patch (versions: 15), LUCENE-1606-flex.patch (versions: 12) Linked issues:

    • #3186
    • #3187
    • #3166
    type:enhancement legacy-jira-resolution:Fixed legacy-jira-priority:Minor module:core/search legacy-jira-fix-version:4.0-ALPHA 
    opened by asfimport 224
  • Integrate lat/lon BKD and spatial3d [LUCENE-6699]

    I'm opening this for discussion, because I'm not yet sure how to do this integration, because of my ignorance about spatial in general and spatial3d in particular :)

    Our BKD tree impl is very fast at doing lat/lon shape intersection (bbox, polygon, soon distance: LUCENE-6698) against previously indexed points.

    I think to integrate with spatial3d, we would first need to record lat/lon/z into doc values. Somewhere I saw discussion about how we could stuff all 3 into a single long value with acceptable precision loss? Or, we could use BinaryDocValues? We need all 3 dims available to do the fast per-hit query time filtering.

    But, second: what do we index into the BKD tree? Can we "just" index earth surface lat/lon, and then at query time is spatial3d able to give me an enclosing "surface lat/lon" bbox for a 3d shape? Or ... must we index all 3 dimensions into the BKD tree (seems like this could be somewhat wasteful)?
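
    On the "stuff the values into a single long" idea: a hedged sketch of the two-ints-in-one-long packing that plain lat/lon doc values later used, quantizing each coordinate to a 32-bit int via GeoEncodingUtils (these method names come from later Lucene releases, not from this issue); a third spatial3d dimension would not fit this particular scheme, which is part of the question above.

    import org.apache.lucene.geo.GeoEncodingUtils;

    static long packLatLon(double lat, double lon) {
      int latEnc = GeoEncodingUtils.encodeLatitude(lat);    // quantized to roughly 1 cm
      int lonEnc = GeoEncodingUtils.encodeLongitude(lon);
      return (((long) latEnc) << 32) | (lonEnc & 0xFFFFFFFFL);
    }

    static double unpackLat(long packed) {
      return GeoEncodingUtils.decodeLatitude((int) (packed >>> 32));
    }

    static double unpackLon(long packed) {
      return GeoEncodingUtils.decodeLongitude((int) (packed & 0xFFFFFFFFL));
    }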


    Migrated from LUCENE-6699 by Michael McCandless (@mikemccand), 1 vote, resolved Sep 02 2015 Attachments: Geo3DPacking.java, LUCENE-6699.patch (versions: 26) Linked issues:

    • #7539
    type:enhancement legacy-jira-priority:Major legacy-jira-resolution:Fixed legacy-jira-fix-version:6.0 legacy-jira-fix-version:5.4 
    opened by asfimport 220
  • if a filter can support random access API, we should use it [LUCENE-1536]

    I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API.

    This was inspired by #2550, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit.

    Some notes on the test:

    • Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.

    • I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. "u s" means "united states" (phrase search).

    • I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.99999 (filter is non-null but all bits are set), 100 (filter=null, control)).

    • Method high means I use the random-access filter API up in IndexSearcher's main loop; method low means I use it down in SegmentTermDocs (just like deleted docs today). A rough sketch of the two access patterns follows this list.

    • Baseline (QPS) is current trunk, where filter is applied as iterator up "high" (ie in IndexSearcher's search loop).
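
    The two access patterns being compared, as a rough sketch (a FixedBitSet stands in for whatever the Filter produces; the method names are illustrative, not from the patch):

    import java.io.IOException;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BitSetIterator;
    import org.apache.lucene.util.FixedBitSet;

    // "Random access": the scorer proposes a doc and the filter answers yes/no.
    static boolean acceptRandomAccess(FixedBitSet filterBits, int doc) {
      return filterBits.get(doc);
    }

    // "Iterator": the filter is walked like another postings list and must be
    // advanced in lock-step with the scorer.
    static int countViaIterator(FixedBitSet filterBits) throws IOException {
      DocIdSetIterator it = new BitSetIterator(filterBits, filterBits.cardinality());
      int count = 0;
      while (it.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        count++;
      }
      return count;
    }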


    Migrated from LUCENE-1536 by Michael McCandless (@mikemccand), 2 votes, resolved Oct 25 2011 Attachments: CachedFilterIndexReader.java, changes-yonik-uwe.patch, LUCENE-1536_hack.patch, LUCENE-1536.patch (versions: 29), LUCENE-1536-rewrite.patch (versions: 8), luceneutil.patch Linked issues:

    type:enhancement legacy-jira-resolution:Fixed legacy-jira-priority:Minor module:core/search legacy-jira-fix-version:4.0-ALPHA affects-version:2.4 legacy-jira-label:mentor legacy-jira-label:gsoc2011 legacy-jira-label:lucene-gsoc-11 
    opened by asfimport 211
  • Allow Scorer to expose positions and payloads aka. nuke spans [LUCENE-2878]

    Currently we have two somewhat separate types of queries: those that can make use of positions (mainly spans) and those that use payloads (spans). Yet Span*Query doesn't really do scoring comparable to what other queries do, and at the end of the day they duplicate a lot of code all over Lucene. Span*Queries are also limited to other Span*Query instances, such that you cannot use a TermQuery or a BooleanQuery with SpanNear or anything like that. Besides the Span*Query limitation, other queries lack a quite interesting feature, since they cannot score based on term proximity: scores don't expose any positional information.

    All those problems bugged me for a while now, so I started working on that using the bulkpostings API. I would have done the first cut on trunk, but TermScorer there works on a BlockReader that does not expose positions, while the one in this branch does. I started by adding a new Positions class which users can pull from a scorer; to prevent unnecessary positions enums I added ScorerContext#needsPositions and eventually Scorer#needsPayloads to create the corresponding enum on demand. Yet, currently only TermQuery / TermScorer implements this API and the others simply return null instead. To show that the API really works, and that our BulkPostings work fine with positions too, I cut TermSpanQuery over to use a TermScorer under the hood and nuked TermSpans entirely. A nice side effect of this was that the Position BulkReading implementation got some exercise, and it now :) works with positions, while payloads for bulk reading are kind of experimental in the patch and only work with the Standard codec.

    So all spans now work on top of TermScorer (I truly hate spans since today), including the ones that need payloads (StandardCodec ONLY)!! I didn't bother to implement the other codecs yet since I want to get feedback on the API and on this first cut before I go on with it. I will upload the corresponding patch in a minute.

    I also had to cut over SpanQuery.getSpans(IR) to SpanQuery.getSpans(AtomicReaderContext) which I should probably do on trunk first but after that pain today I need a break first :).

    The patch passes all core tests (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't look into the MemoryIndex BulkPostings API yet)


    Migrated from LUCENE-2878 by Simon Willnauer (@s1monw), 11 votes, resolved Apr 11 2018 Attachments: LUCENE-2878_trunk.patch (versions: 2), LUCENE-2878.patch (versions: 30), LUCENE-2878-OR.patch, LUCENE-2878-vs-trunk.patch, PosHighlighter.patch (versions: 2) Linked issues:

    • #5590

    Sub-tasks:

    • #4391
    • #4392
    • #4393
    • #4394
    • #5617
    • #5618
    • #5621
    • #5638
    type:enhancement legacy-jira-priority:Major module:core/search legacy-jira-label:gsoc2014 legacy-jira-fix-version:8.0 legacy-jira-resolution:Implemented legacy-jira-fix-version:7.4 affects-version:Positions Branch 
    opened by asfimport 209
  • Separately specify a field's type [LUCENE-2308]

    This came up from discussions on IRC. I'm summarizing here...

    Today when you make a Field to add to a document, you can set things like indexed or not, stored or not, analyzed or not, and details like omitTfAP, omitNorms, indexing term vectors (separately controlling offsets/positions), etc.

    I think we should factor these out into a new class (FieldType?). Then you could re-use this FieldType instance across multiple fields.

    The Field instance would still hold the actual value.

    We could then do per-field analyzers by adding a setAnalyzer on the FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for per-field codecs (with flex), where we now have PerFieldCodecWrapper).

    This would NOT be a schema! It's just refactoring what we already specify today. EG it's not serialized into the index.

    This has been discussed before, and I know Michael Busch opened a more ambitious (I think?) issue. I think this is a good first baby step. We could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold off on that for starters...
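
    A hedged sketch of what this refactoring looks like with the FieldType class that later landed in oal.document; the field name and settings are only examples:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.index.IndexOptions;

    static Document makeDoc(String body) {
      // One reusable FieldType describes how the field is indexed; in real code it
      // would be a shared constant so many Field instances (the values) can use it.
      FieldType bodyType = new FieldType();
      bodyType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
      bodyType.setTokenized(true);
      bodyType.setStored(false);
      bodyType.setStoreTermVectors(true);
      bodyType.freeze();   // immutable from here on - still not a schema

      Document doc = new Document();
      doc.add(new Field("body", body, bodyType));   // the Field holds only the value
      return doc;
    }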


    Migrated from LUCENE-2308 by Michael McCandless (@mikemccand), 2 votes, resolved Mar 18 2013 Attachments: LUCENE-2308.branchdiffs, LUCENE-2308.branchdiffs.moved, LUCENE-2308.patch (versions: 5), LUCENE-2308-10.patch, LUCENE-2308-11.patch, LUCENE-2308-12.patch, LUCENE-2308-13.patch, LUCENE-2308-14.patch, LUCENE-2308-15.patch, LUCENE-2308-16.patch, LUCENE-2308-17.patch, LUCENE-2308-18.patch, LUCENE-2308-19.patch, LUCENE-2308-2.patch, LUCENE-2308-20.patch, LUCENE-2308-21.patch, LUCENE-2308-3.patch, LUCENE-2308-4.patch, LUCENE-2308-5.patch, LUCENE-2308-6.patch, LUCENE-2308-7.patch, LUCENE-2308-8.patch, LUCENE-2308-9.patch, LUCENE-2308-branch.patch, LUCENE-2308-final.patch, LUCENE-2308-FT-interface.patch (versions: 4), LUCENE-2308-ltc.patch, LUCENE-2308-merge-1.patch, LUCENE-2308-merge-2.patch, LUCENE-2308-merge-3.patch Linked issues:

    • #3393
    • #4250

    Sub-tasks:

    • #3386
    type:enhancement legacy-jira-priority:Major legacy-jira-resolution:Fixed module:core/index legacy-jira-label:mentor legacy-jira-label:gsoc2011 legacy-jira-label:lucene-gsoc-11 legacy-jira-fix-version:4.0 
    opened by asfimport 200
  • Make Luke a Lucene/Solr Module [LUCENE-2562]

    see "RE: Luke - in need of maintainer": http://markmail.org/message/m4gsto7giltvrpuf "Web-based Luke": http://markmail.org/message/4xwps7p7ifltme5q

    I think it would be great if there was a version of Luke that always worked with trunk - and it would also be great if it was easier to match Luke jars with Lucene versions.

    While I'd like to get GWT Luke into the mix as well, I think the easiest starting point is to straight port Luke to another UI toolkit before abstracting out DTO objects that both GWT Luke and Pivot Luke could share.

    I've started slowly converting Luke's use of Thinlet to Apache Pivot. I haven't had / don't have a lot of time for this at the moment, but I've plugged away here and there over the past week or two. There is still a lot to do.

    (Screenshots attached: luke1.jpg, luke2.jpg, luke3.jpg, Luke-ALE-1.png through Luke-ALE-5.png, lukeALE-documents.png)


    Migrated from LUCENE-2562 by Mark Miller (@markrmiller), 11 votes, resolved Apr 24 2019 Attachments: LUCENE-2562.patch (versions: 3), LUCENE-2562-ivy.patch, LUCENE-2562-Ivy.patch (versions: 3), luke1.jpg, luke2.jpg, luke3.jpg, Luke-ALE-1.png, Luke-ALE-2.png, Luke-ALE-3.png, Luke-ALE-4.png, Luke-ALE-5.png, lukeALE-documents.png, luke-javafx1.png, luke-javafx2.png, luke-javafx3.png, screenshot-1.png, スクリーンショット 2018-11-05 9.19.47.png Linked issues:

    Pull requests: https://github.com/apache/lucene-solr/pull/420, https://github.com/apache/lucene-solr/pull/490, https://github.com/apache/lucene-solr/pull/512

    type:task legacy-jira-priority:Major legacy-jira-resolution:Fixed legacy-jira-label:gsoc2014 module:luke legacy-jira-fix-version:8.1 legacy-jira-fix-version:9.0 
    opened by asfimport 196
  • Break out StorableField from IndexableField [LUCENE-3312]

    In the field type branch we have strongly decoupled the Document/Field/FieldType impl from the indexer, by having only a narrow API (IndexableField) passed to IndexWriter. This frees apps up to use their own "documents" instead of the "user-space" impls we provide in oal.document.

    Similarly, with #4382, we've done the same thing on the doc/field retrieval side (from IndexReader), with the StoredFieldsVisitor.

    But, maybe we should break out StorableField from IndexableField, such that when you index a doc you provide two Iterables – one for the IndexableFields and one for the StorableFields. Either can be null.

    One downside is possible perf hit for fields that are both indexed & stored (ie, we visit them twice, lookup their name in a hash twice, etc.). But the upside is a cleaner separation of concerns in API....


    Migrated from LUCENE-3312 by Michael McCandless (@mikemccand), 2 votes, resolved Sep 02 2012 Attachments: LUCENE-3312-DocumentIterators-uwe.patch, lucene-3312-patch-01.patch, lucene-3312-patch-02.patch, lucene-3312-patch-03.patch, lucene-3312-patch-04.patch, lucene-3312-patch-05.patch, lucene-3312-patch-06.patch, lucene-3312-patch-07.patch, lucene-3312-patch-08.patch, lucene-3312-patch-09.patch, lucene-3312-patch-10.patch, lucene-3312-patch-11.patch, lucene-3312-patch-12.patch, lucene-3312-patch-12a.patch, lucene-3312-patch-13.patch, lucene-3312-patch-14.patch, LUCENE-3312-reintegration.patch (versions: 2) Linked issues:

    • #5702
    • #5413
    type:enhancement legacy-jira-priority:Major legacy-jira-resolution:Fixed module:core/index legacy-jira-fix-version:6.0 legacy-jira-label:lucene-gsoc-12 legacy-jira-label:gsoc2012 
    opened by asfimport 184
  • AttributeSource/TokenStream API improvements [LUCENE-1693]

    This patch makes the following improvements to AttributeSource and TokenStream/Filter:

    • introduces interfaces for all Attributes. The corresponding implementations have the postfix 'Impl', e.g. TermAttribute and TermAttributeImpl. AttributeSource now has a factory for creating the Attribute instances; the default implementation looks for implementing classes with the postfix 'Impl'. Token now implements all 6 TokenAttribute interfaces.

    • new method added to AttributeSource: addAttributeImpl(AttributeImpl). Using reflection it walks up in the class hierarchy of the passed in object and finds all interfaces that the class or superclasses implement and that extend the Attribute interface. It then adds the interface->instance mappings to the attribute map for each of the found interfaces.

    • removes the set/getUseNewAPI() methods (including the standard ones). Instead it is now enough to only implement the new API; if an old TokenStream still implements the old API (next()/next(Token)), it is wrapped automatically. The delegation path is determined via reflection (the patch determines which of the three methods was overridden).

    • Token is no longer deprecated, instead it implements all 6 standard token interfaces (see above). The wrapper for next() and next(Token) uses this to automatically map all attribute interfaces to one TokenWrapper instance (implementing all 6 interfaces) that contains a Token instance. next() and next(Token) exchange the inner Token instance as needed. For the new incrementToken(), only one TokenWrapper instance is visible, delegating to the correct reusable Token. This API also preserves custom Token subclasses that may be created by very special token streams (see example in Backwards-Test).

    • AttributeImpl now has a default implementation of toString that uses reflection to print out the values of the attributes in a default formatting. This makes it a bit easier to implement AttributeImpl, because toString() was declared abstract before.

    • Cloning is now done much more efficiently in captureState. The method figures out which unique AttributeImpl instances are contained as values in the attributes map, because those are the ones that need to be cloned. It creates a single linked list that supports deep cloning (in the inner class AttributeSource.State). AttributeSource keeps track of when this state changes, i.e. whenever new attributes are added to the AttributeSource. Only in that case will captureState recompute the state, otherwise it will simply clone the precomputed state and return the clone. restoreState(AttributeSource.State) walks the linked list and uses the copyTo() method of AttributeImpl to copy all values over into the attribute that the source stream (e.g. SinkTokenizer) uses.

    • Tee- and SinkTokenizer were deprecated, because they use Token instances for caching. This is not compatible with the new API using AttributeSource.State objects. You can still use the old deprecated ones, but new features provided by new Attribute types may get lost in the chain. A replacement is a new TeeSinkTokenFilter, which has a factory to create new Sink instances that have compatible attributes. Sink instances created by one Tee can also be added to another Tee, as long as the attribute implementations are compatible (it is not possible to add a sink from a tee using one Token instance to a tee using the six separate attribute impls). In this case UOE is thrown.
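
    A short usage sketch of the attribute-based consumption pattern these interfaces enable, shown with the CharTermAttribute name from later releases rather than the 2.9 TermAttribute; the analyzer, field and text are placeholders:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    static void printTokens(String text) throws Exception {
      Analyzer analyzer = new StandardAnalyzer();
      try (TokenStream ts = analyzer.tokenStream("body", text)) {
        // Attributes are requested once; incrementToken() refills them in place,
        // so no Token instance is passed around.
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          System.out.println(term + " [" + offsets.startOffset() + "," + offsets.endOffset() + "]");
        }
        ts.end();
      }
    }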

    Cloning performance can be greatly improved if only a single AttributeImpl instance is used in one TokenStream. A user can e.g. simply add a Token instance to the stream instead of the individual attributes. Or the user could implement a subclass of AttributeImpl that implements exactly the Attribute interfaces needed. I think this should be considered an expert API (addAttributeImpl), as this manual optimization is only needed if cloning performance is crucial. I ran some quick performance tests using Tee/Sink tokenizers (which do cloning) and the performance was roughly 20% faster with the new API. I'll run some more performance tests and post more numbers then.

    Note also that when we add serialization to the Attributes, e.g. for supporting storing serialized TokenStreams in the index, then the serialization should benefit even significantly more from the new API than cloning.

    This issue contains one backwards-compatibility break: TokenStreams/Filters/Tokenizers should normally be final (see #2827 for the explanation). Some of these core classes are not final and so one could override the next() or next(Token) methods. In this case, the backwards-wrapper would automatically use incrementToken(), because it is implemented, so the overridden method is never called. To prevent users from errors not visible during compilation or testing (the streams just behave wrong), this patch makes all implementation methods final (next(), next(Token), incrementToken()) whenever the class itself is not final. This is a BW break, but users will clearly see that they have done something unsupported and should instead create a custom TokenFilter with their additional implementation (rather than extending a core implementation).

    For further changing contrib token streams, the following procedure should be used:

    • rewrite and replace next(Token)/next() implementations by new API

    • if the class is final, no next(Token)/next() methods needed (must be removed!!!)

    • if the class is non-final add the following methods to the class:

      /** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
       *  not be overridden. Delegates to the backwards compatibility layer. */
      public final Token next(final Token reusableToken) throws java.io.IOException {
        return super.next(reusableToken);
      }

      /** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
       *  not be overridden. Delegates to the backwards compatibility layer. */
      public final Token next() throws java.io.IOException {
        return super.next();
      }

      Also the incrementToken() method must be final in this case (and the new method end() of LUCENE-1448)


    Migrated from LUCENE-1693 by Michael Busch, resolved Jul 24 2009 Attachments: lucene-1693.patch (versions: 4), LUCENE-1693.patch (versions: 15), LUCENE-1693-TokenizerAttrFactory.patch, PerfTest3.java, TestAPIBackwardsCompatibility.java, TestCompatibility.java (versions: 4) Linked issues:

    • #2770
    • #2769
    • #2771
    • #2534
    type:enhancement legacy-jira-resolution:Fixed module:analysis legacy-jira-priority:Minor legacy-jira-fix-version:2.9 
    opened by asfimport 172
  • Getting exception on search after upgrading to Lucene 9.4

    Description

    After upgrading from Lucene 9.3.0 to Lucene 9.4.2 the index search with sorting by description throws the following exception:

    Caused by: java.lang.IllegalStateException: Term [77 73 64 66 6a 66 73 67 73 20 61 64 6b 66 64 6a 68 74 67 64 67 20 61 64 6b 66 64 6a 68 74 67 64 67 20 72 65 74 72 65 72 74 65 20] exists in doc values but not in the terms index
        at org.apache.lucene.search.comparators.TermOrdValComparator$CompetitiveIterator.init(TermOrdValComparator.java:582)
        at org.apache.lucene.search.comparators.TermOrdValComparator$CompetitiveIterator.update(TermOrdValComparator.java:553)
        at org.apache.lucene.search.comparators.TermOrdValComparator$TermOrdValLeafComparator.updateCompetitiveIterator(TermOrdValComparator.java:457)
        at org.apache.lucene.search.comparators.TermOrdValComparator$TermOrdValLeafComparator.setHitsThresholdReached(TermOrdValComparator.java:284)
        at org.apache.lucene.search.TopFieldCollector$TopFieldLeafCollector.countHit(TopFieldCollector.java:86)
        at org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector$1.collect(TopFieldCollector.java:202)
        at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:305)
        at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:247)
        at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:38)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:744)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:662)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:656)
        at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:636)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:553)

    I wrote a small java function that reproduces the issue. Increasing maxRows value there to above 200 somehow resolves the problem.

    public static void test() throws Exception {
    	String name = "description";
    	String content = "content";
    	File dir = new File("C:\\test"); 
    	MMapDirectory directory = new MMapDirectory(dir.toPath());	
    	Analyzer analyzer = new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
    	
    	boolean createIndex = true; // must be changed to false after calling this function first time
    	if (createIndex) {
    		String[] values = {
    				"anshduvfv ",
    				"dhisdefihfhg ",
    				"Afasdfasdf ",
    				"Retrerte ",
    				"bnbssdfgfg ",
    				"wrgfhhjg ",
    				"jfhtvg ",
    				"fhdhfdsads ",
    				"Wsdfjfsgs ",
    				"adkfdjhtgdg ",
    		};
    
    		Random random = new Random();
    		IndexWriterConfig config = new IndexWriterConfig(analyzer);
    		config.setOpenMode(OpenMode.CREATE_OR_APPEND);
    		IndexWriter writer = new IndexWriter(directory, config);
    		for (int i = 0; i < 110; i++) {
    			for (int j = 0; j < 10; j++) {
    				String value = values[j] + values[random.nextInt(10)] + values[random.nextInt(10)] + values[random.nextInt(10)];
    				
    				Document doc = new Document();
    				doc.add(new TextField(content, value, Field.Store.NO));
    				doc.add(new StringField(name, value, Field.Store.YES));
    				doc.add(new SortedDocValuesField(name, new BytesRef(value.toLowerCase())));  // case-insensitive sorting
    				writer.addDocument(doc);
    			}
    		}
    		writer.close();
    	}
    	
    	int maxRows = 100;
    	String request = "*:*";
    	DirectoryReader reader = DirectoryReader.open(directory);
    	IndexSearcher searcher = new IndexSearcher(reader);
    	QueryParser parser = new QueryParser(content, analyzer);
    	parser.setSplitOnWhitespace(true);
    	parser.setAllowLeadingWildcard(false);
    	Query query = parser.parse(request);
    	Sort sort = new Sort(new SortField(name, Type.STRING, true));
    	TopDocs docs = searcher.search(query, maxRows, sort);
    	reader.close();
    }
    

    Version and environment details

    Upgraded from Lucene 9.3.0 to Lucene 9.4.2, using lucene-backward-codecs-9.4.2.jar.
    OS: MS Windows 11
    Java: jdk-11.0.14

    type:bug 
    opened by vstrout 0
  • Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String)

    This completes the refactoring as described in: https://github.com/apache/lucene/issues/11963

    This commit:

    • splits out ByteVectorValues from VectorValues.
    • Adds getByteVectorValues(String field) to KnnVectorsReader
    • Adds a new KnnByteVectorField and disallows BytesRef values in the KnnVectorField
    • No longer allows ByteVectorValues to be read from a KnnVectorField.

    These refactors are difficult to split up any further.
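
    A hedged usage sketch, assuming the API shape described above (the exact class and constructor signatures may differ slightly from what was merged):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnByteVectorField;
    import org.apache.lucene.index.VectorSimilarityFunction;

    static Document docWithByteVector(byte[] vector) {
      Document doc = new Document();
      // Byte-valued vectors get their own field class instead of BytesRef values
      // passed through KnnVectorField.
      doc.add(new KnnByteVectorField("embedding", vector, VectorSimilarityFunction.DOT_PRODUCT));
      return doc;
    }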

    opened by benwtrent 2
  • org.apache.lucene.search.TestSimpleExplanationsWithFillerDocs#testMPQ3 fails reproducible

    Description

    Policeman Jenkins failed when executing this test on the OpenJ9 JVM, but it is actually reproducible:

    Java: 64bit/openj9/jdk-17.0.5 -XX:-UseCompressedOops -Xgcpolicy:gencon
    
    1 tests failed.
    FAILED:  org.apache.lucene.search.TestSimpleExplanationsWithFillerDocs.testMPQ3
    
    Error Message:
    java.lang.AssertionError: expected:<0.5828427076339722> but was:<0.5828428268432617>
    
    Stack Trace:
    java.lang.AssertionError: expected:<0.5828427076339722> but was:<0.5828428268432617>
        at __randomizedtesting.SeedInfo.seed([E853C78F22129ACC:708B13FE3EC459FF]:0)
        at app//org.junit.Assert.fail(Assert.java:89)
        at app//org.junit.Assert.failNotEquals(Assert.java:835)
        at app//org.junit.Assert.assertEquals(Assert.java:555)
        at app//org.junit.Assert.assertEquals(Assert.java:685)
        at app//org.apache.lucene.tests.search.CheckHits.verifyExplanation(CheckHits.java:503)
        at app//org.apache.lucene.tests.search.CheckHits.verifyExplanation(CheckHits.java:428)
        at app//org.apache.lucene.tests.search.CheckHits.verifyExplanation(CheckHits.java:428)
        at app//org.apache.lucene.tests.search.CheckHits$ExplanationAsserter.collect(CheckHits.java:623)
        at app//org.apache.lucene.tests.search.AssertingLeafCollector.collect(AssertingLeafCollector.java:52)
        at app//org.apache.lucene.tests.search.AssertingCollector$1.collect(AssertingCollector.java:66)
        at app//org.apache.lucene.tests.search.AssertingLeafCollector.collect(AssertingLeafCollector.java:52)
        at app//org.apache.lucene.tests.search.AssertingLeafCollector.collect(AssertingLeafCollector.java:52)
        at app//org.apache.lucene.tests.search.AssertingLeafCollector.collect(AssertingLeafCollector.java:52)
        at app//org.apache.lucene.tests.search.AssertingLeafCollector.collect(AssertingLeafCollector.java:52)
        at app//org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:282)
        at app//org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:254)
        at app//org.apache.lucene.tests.search.AssertingBulkScorer.score(AssertingBulkScorer.java:101)
        at app//org.apache.lucene.search.ReqExclBulkScorer.score(ReqExclBulkScorer.java:46)
        at app//org.apache.lucene.tests.search.AssertingBulkScorer.score(AssertingBulkScorer.java:101)
        at app//org.apache.lucene.search.TimeLimitingBulkScorer.score(TimeLimitingBulkScorer.java:76)
        at app//org.apache.lucene.search.BulkScorer.score(BulkScorer.java:38)
        at app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:776)
        at app//org.apache.lucene.tests.search.AssertingIndexSearcher.search(AssertingIndexSearcher.java:78)
        at app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:694)
        at app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:688)
        at app//org.apache.lucene.tests.search.CheckHits.checkExplanations(CheckHits.java:336)
        at app//org.apache.lucene.tests.search.QueryUtils.checkExplanations(QueryUtils.java:114)
        at app//org.apache.lucene.tests.search.QueryUtils.check(QueryUtils.java:144)
        at app//org.apache.lucene.tests.search.QueryUtils.check(QueryUtils.java:140)
        at app//org.apache.lucene.tests.search.QueryUtils.check(QueryUtils.java:129)
        at app//org.apache.lucene.tests.search.CheckHits.checkHitCollector(CheckHits.java:105)
        at app//org.apache.lucene.tests.search.BaseExplanationTestCase.qtest(BaseExplanationTestCase.java:110)
        at app//org.apache.lucene.search.TestSimpleExplanationsWithFillerDocs.qtest(TestSimpleExplanationsWithFillerDocs.java:116)
        at app//org.apache.lucene.search.TestSimpleExplanations.testMPQ3(TestSimpleExplanations.java:229)
        at java.base@17.0.5/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base@17.0.5/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base@17.0.5/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base@17.0.5/java.lang.reflect.Method.invoke(Method.java:568)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
        at app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
        at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
        at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
        at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
        at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at app//org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
        at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
        at java.base@17.0.5/java.lang.Thread.run(Thread.java:857)
    

    Reproduce failure on Java 19 (Hotspot):

    org.apache.lucene.search.TestSimpleExplanationsWithFillerDocs > test suite's output saved to C:\Users\Uwe Schindler\Projects\lucene\lucene\lucene\core\build\test-results\test\outputs\OUTPUT-org.apache.lucene.search.TestSimpleExplanationsWithFillerDocs.txt, copied below:
      2> Jan. 03, 2023 6:30:02 PM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
      2> INFORMATION: Using MemorySegmentIndexInput with Java 19
       >     java.lang.AssertionError: expected:<0.5828427076339722> but was:<0.5828428268432617>
       >         at __randomizedtesting.SeedInfo.seed([E853C78F22129ACC:708B13FE3EC459FF]:0)
       >         at org.junit.Assert.fail(Assert.java:89)
       >         at org.junit.Assert.failNotEquals(Assert.java:835)
       >         at org.junit.Assert.assertEquals(Assert.java:555)
       >         at org.junit.Assert.assertEquals(Assert.java:685)
       >         at org.apache.lucene.tests.search.CheckHits.verifyExplanation(CheckHits.java:503)
    

    Version and environment details

    Main and 9.x branches

    Reproduce line

    gradlew test --tests TestSimpleExplanationsWithFillerDocs.testMPQ3 -Dtests.seed=E853C78F22129ACC -Dtests.multiplier=3 -Dtests.locale=wo -Dtests.timezone=EST5EDT -Dtests.asserts=true -Dtests.file.encoding=UTF-8

    type:bug legacy-jira-label:test-failure 
    opened by uschindler 0
  • ban finalizers in the build somehow (worst-case: use error-prone)

    Description

    I was looking at new error-prone checks in #12056 and there's a new check to ban finalizers.

    Because the method is in the built-in JDK deprecated list (e.g. https://github.com/policeman-tools/forbidden-apis/blob/main/src/main/resources/de/thetaphi/forbiddenapis/signatures/jdk-deprecated-11.txt#L195), I would expect the check to fail if I override finalize, but it doesn't, because in most cases a finalizer will not actually CALL Object.finalize.

    Let's ban finalizers completely, one way or another; we don't want them to sneak in. We can always enable the error-prone check for it as one solution.

    type:bug 
    opened by rmuir 6
  • Better skipping for multi-term queries with a FILTER rewrite.

    Currently multi-term queries with a filter rewrite internally rewrite to a disjunction if 16 terms or fewer match the query. Otherwise, the postings lists of matching terms are collected into a DocIdSetBuilder. This change replaces the latter with a mixed approach, where a disjunction is created between the 16 terms that have the highest document frequency and an iterator produced from the DocIdSetBuilder that collects all other terms. On fields that have a Zipfian distribution, it's quite likely that no high-frequency terms make it to the DocIdSetBuilder. This provides two main benefits:

    • Queries are less likely to allocate a FixedBitSet of size maxDoc.
    • Queries are better at skipping or early terminating.

    On the other hand, queries that need to consume most or all matching documents may get a slowdown.

    The slowdown is unfortunate, but my gut feeling is that this change still has more pros than cons.
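
    A rough sketch of the DocIdSetBuilder side of that mixed approach (not the actual patch; just the general pattern of collecting the postings of the remaining low-frequency terms into a builder, with illustrative names):

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.DocIdSetBuilder;

    static DocIdSetIterator collectLowFreqTerms(LeafReader reader, String field, List<BytesRef> lowFreqTerms)
        throws IOException {
      Terms terms = reader.terms(field);
      if (terms == null) {
        return DocIdSetIterator.empty();
      }
      DocIdSetBuilder builder = new DocIdSetBuilder(reader.maxDoc());
      TermsEnum termsEnum = terms.iterator();
      PostingsEnum postings = null;
      for (BytesRef term : lowFreqTerms) {
        if (termsEnum.seekExact(term)) {
          postings = termsEnum.postings(postings, PostingsEnum.NONE);
          builder.add(postings);           // union of all matching docs
        }
      }
      // This iterator would then be searched alongside the disjunction of the
      // 16 highest-frequency terms.
      return builder.build().iterator();
    }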

    opened by jpountz 6