Apache Lucene and Solr open-source search software

Overview

Apache Lucene and Solr have separate repositories now!

Solr has become a top-level Apache project, and main-line development for Lucene and Solr now happens in each project's own git repository:

• Lucene: https://gitbox.apache.org/repos/asf/lucene.git
• Solr: https://gitbox.apache.org/repos/asf/solr.git

Development for branch 8x remains in the shared repository:

https://gitbox.apache.org/repos/asf/lucene-solr.git

GitHub forks?

If you are using GitHub, make a clone of the corresponding repository mirror and create your pull requests against the main branch:

• Lucene: https://github.com/apache/lucene
• Solr: https://github.com/apache/solr

Comments
  • LUCENE-8982:  Make NativeUnixDirectory pure java with FileChannel direct IO flag, and rename to DirectIODirectory

    Description

    Make NativeUnixDirectory pure java with FileChannel direct IO flag, and rename to DirectIODirectory

    Solution

    Use ExtendedOpenOption.DIRECT with FileChannel for direct IO. Reference code samples:

    • http://hg.openjdk.java.net/jdk10/master/rev/d72d7d55c765
    • https://bugs.openjdk.java.net/browse/JDK-8189192
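
    A minimal standalone sketch of the JDK mechanism referenced above (hypothetical class name; direct IO needs block-aligned positions, sizes, and buffer addresses, hence the alignedSlice):

        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.nio.channels.FileChannel;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.StandardOpenOption;

        import com.sun.nio.file.ExtendedOpenOption;

        public class DirectIoReadSketch {
            public static void main(String[] args) throws IOException {
                Path path = Path.of(args[0]);
                // Direct IO requires alignment to the file store's block size.
                int blockSize = (int) Files.getFileStore(path).getBlockSize();
                try (FileChannel ch = FileChannel.open(path,
                        StandardOpenOption.READ, ExtendedOpenOption.DIRECT)) {
                    // Over-allocate, carve out a block-aligned slice, and read
                    // exactly one block.
                    ByteBuffer buf = ByteBuffer.allocateDirect(blockSize * 2)
                            .alignedSlice(blockSize);
                    buf.limit(blockSize);
                    int n = ch.read(buf, 0);
                    System.out.println("read " + n + " bytes via direct IO");
                }
            }
        }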

    Tests

    • Pass existing tests
    • Need new benchmarking tests

    Checklist

    Please review the following and check all that apply:

    • [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
    • [x] I have created a Jira issue and added the issue ID to my pull request title.
    • [x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
    • [x] I have developed this patch against the master branch.
    • [x] I have run ./gradlew check.
    • [x] I have added tests for my changes.
    • [ ] I have added documentation for the Ref Guide (for Solr changes only).
    enhancement cleanup 
    opened by zacharymorn 70
  • SOLR-14613: strongly typed placement plugin interface and implementation

    SOLR-14613: strongly typed initial proposal for plugin interface to replace Autoscaling

    This is not meant to be merged; it's shared early for feedback. Still WIP; any feedback is most welcome before I invest more time.

    autoscaling 
    opened by murblanc 62
  • LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search

    This commit introduces a shared-counter-based CollectorManager which allows accurate early termination across all its collectors once enough hits have been collected globally.
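
    A minimal sketch of the idea (illustrative names and simplifications, not the PR's actual code; a real top-k search would keep per-leaf state as well):

        import java.io.IOException;
        import java.util.Collection;
        import java.util.concurrent.atomic.AtomicLong;

        import org.apache.lucene.index.LeafReaderContext;
        import org.apache.lucene.search.CollectionTerminatedException;
        import org.apache.lucene.search.Collector;
        import org.apache.lucene.search.CollectorManager;
        import org.apache.lucene.search.LeafCollector;
        import org.apache.lucene.search.Scorable;
        import org.apache.lucene.search.ScoreMode;

        // Each parallel slice gets its own collector; all collectors bump one
        // shared counter and stop their leaf once the global budget is spent.
        final class SharedCountManager implements CollectorManager<Collector, Long> {
            private final AtomicLong globalCount = new AtomicLong();
            private final long maxHits;

            SharedCountManager(long maxHits) {
                this.maxHits = maxHits;
            }

            @Override
            public Collector newCollector() {
                return new Collector() {
                    @Override
                    public LeafCollector getLeafCollector(LeafReaderContext ctx) {
                        return new LeafCollector() {
                            @Override
                            public void setScorer(Scorable scorer) {}

                            @Override
                            public void collect(int doc) {
                                if (globalCount.incrementAndGet() > maxHits) {
                                    // Lucene's sanctioned way to stop a leaf early.
                                    throw new CollectionTerminatedException();
                                }
                            }
                        };
                    }

                    @Override
                    public ScoreMode scoreMode() {
                        return ScoreMode.COMPLETE_NO_SCORES;
                    }
                };
            }

            @Override
            public Long reduce(Collection<Collector> collectors) throws IOException {
                return Math.min(globalCount.get(), maxHits);
            }
        }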

    opened by atris 43
  • LUCENE-9047: Move the Directory APIs to be little endian

    The Directory API is now little endian. Note that codecs still work in big endian for backwards compatibility; therefore they reverse the bytes whenever they write / read shorts, ints, and longs.

    CodecUtil headers and footers have been modified to be little endian. The version and checksum are still written / read with reversed bytes for backwards compatibility.

    SegmentInfos is read / written in little endian; for previous versions, the IndexInput is wrapped for backwards compatibility.
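
    For illustration, the backwards-compatibility shim boils down to a byte swap at the codec boundary, which Java provides directly (a standalone sketch, not the PR's code):

        import java.nio.ByteOrder;

        public class EndianSketch {
            public static void main(String[] args) {
                long v = 0x0102030405060708L;
                // A big-endian codec writing through the now little-endian
                // Directory API simply reverses the bytes of each primitive.
                System.out.printf("%016x -> %016x%n", v, Long.reverseBytes(v));
                System.out.println("native order: " + ByteOrder.nativeOrder());
            }
        }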

    opened by iverase 34
  • LUCENE-8962: Merge segments on getReader

    Add an IndexWriter merge-on-refresh feature to selectively merge small segments on getReader, subject to a configurable timeout, improving search performance by reducing the number of small segments to search.
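
    A sketch of how an application would opt in, assuming the configuration knob this work shipped with (IndexWriterConfig.setMaxFullFlushMergeWaitMillis, Lucene 8.7+). Note the default MergePolicy selects no merges on full flush, so a policy overriding findFullFlushMerges is assumed:

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.index.IndexWriterConfig;

        public class MergeOnRefreshSketch {
            static IndexWriterConfig config() {
                IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
                // Upper bound on how long getReader() waits for the merges chosen
                // by MergePolicy.findFullFlushMerges(); 0 disables the feature.
                cfg.setMaxFullFlushMergeWaitMillis(500);
                return cfg;
            }
        }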

    opened by s1monw 33
  • LUCENE-9280: Collectors to skip noncompetitive documents

    Similar to how scorers can update their iterators to skip non-competitive documents, collectors and comparators should also provide and update iterators that allow them to skip non-competitive documents.

    This could be useful when sorting by a field.
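
    A sketch of the shape this eventually took, the competitiveIterator() hook on LeafCollector (illustrative subclass, not the PR's code):

        import java.io.IOException;

        import org.apache.lucene.search.DocIdSetIterator;
        import org.apache.lucene.search.LeafCollector;
        import org.apache.lucene.search.Scorable;

        abstract class SkippingLeafCollector implements LeafCollector {
            // Starts as "all documents" and is shrunk by the comparator as the
            // top-N fills up; the search loop intersects the scorer with this,
            // so non-competitive documents are never even scored.
            protected DocIdSetIterator competitive =
                    DocIdSetIterator.all(Integer.MAX_VALUE);

            @Override
            public void setScorer(Scorable scorer) throws IOException {}

            @Override
            public DocIdSetIterator competitiveIterator() {
                return competitive;
            }
        }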

    opened by mayya-sharipova 31
  • LUCENE-9317: Clean up split package in analyzers-common

    This is a draft pull request for review that tries to clean up the package name conflicts between analyzers-common and core. I also tried to keep the necessary changes as small as possible.

    See https://issues.apache.org/jira/browse/LUCENE-9317 for more background.

    The main changes are:

    • Move analysis base classes to lucene-core (o.a.l.a) from analyzers-common (o.a.l.a.util)
    • Rename all service provider files (META-INF/services/...); see the example after this list.
    • Move o.a.l.a.standard.StandardTokenizer to lucene-core
    • Split o.a.l.a.standard in analyzers-common into o.a.l.a.classic and o.a.l.a.email
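
    For example (illustrative; TokenizerFactory is one of the affected factories), moving the factory base classes from o.a.l.a.util to o.a.l.a means the corresponding service provider file is renamed as well:

        META-INF/services/org.apache.lucene.analysis.util.TokenizerFactory
          -> META-INF/services/org.apache.lucene.analysis.TokenizerFactory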

    With the above changes, there are no package name conflicts:

    • o.a.l.a.util and newly created o.a.l.a.classic and o.a.l.a.email only exist in analyzers-common
    • o.a.l.a.standard only exists in lucene-core
    • other packages are not touched.

    Compiling all of the Lucene/Solr main classes works fine, thanks to the IDE's refactoring features.

    Tasks to be done:

    • Create fake factory base classes in o.a.l.a.util for backward compatibility (?)
    • Fix tests
    • Fix gradle scripts (?)
    opened by mocobeta 29
  • SOLR-14680: Provide simple interfaces to our cloud classes  (only API)

    A few notes before anyone who starts reviewing this

    • This was created after I saw a similar attempt as part of #1684. I believe this has to receive wider input and review, irrespective of whether devs are interested in autoscaling or not
    • This is a WIP PR
    • The concrete implementations are for demo purposes and can be omitted if required. Anything outside the o.a.s.cluster.api package is optional and will be removed
    • The interfaces are designed to be minimal to avoid overload. We can and will add more methods later; let's not add a lot now. A sketch of what a minimal, interface-only shape could look like follows below.
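
    For illustration only (NOT the PR's actual interfaces), one purely hypothetical minimal shape:

        import java.util.Set;

        // Callers only ever see interfaces; concrete implementations live
        // outside the API package and can change freely.
        interface ClusterView {
            Set<String> nodeNames();

            CollectionView collection(String name);

            interface CollectionView {
                String name();
                Set<String> shardNames();
            }
        }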
    clean-api 
    opened by noblepaul 29
  • SOLR-14789: Absorb the docker-solr repo.

    WIP - https://issues.apache.org/jira/browse/SOLR-14789

    This is intended for 9.0 only, and will not be backported to 8.x.
    We can continue using the docker-solr repo for all 8.x.y releases.

    Goals

    The goal of this PR is to move all of the functionality for building the current Solr images into lucene-solr. However, there are clearly a lot of other things we want to accomplish too:

    • Migrate functionality from the solr/docker/include/scripts into solr/bin or Solr itself
    • Support the "official" docker releases (https://hub.docker.com/_/solr not just https://hub.docker.com/apache/solr)
    • Make the solr docker file more "cloud native"

    Since this docker image will not be officially used until 9.0 at the earliest, this gives us time to iterate and accomplish all of these goals. However merging in this initial functionality that works like the current docker-solr setup first lets us more easily make and test these changes.

    Setup of /solr/docker

    gradle assemble will now create a docker image built from the contents of :solr:packaging.

    The "Solr image" is broken up into two Dockerfiles:

    1. The package Dockerfile, which contains the Solr release under /opt/solr-build/solr-${VERSION}. This image requires no other content in the end.
    2. The runtime Dockerfile, which takes the package image name as an input and generates everything currently set up in the Solr docker image, copying the package contents from the package image.

    The runtime docker image is the one that will eventually be pushed to Dockerhub, and it is the one users run.

    The idea is that there will also be a task to generate the release docker image (the Dockerfile can be found under /solr/docker/package/Dockerfile.release-package) that is then passed to the runtime docker image. But this is a future goal and not necessary in order to merge this PR, IMO.

    I basically learned gradle to write this, so please give any advice on how I could make the build process better. I'm sure it is not optimal.

    Testing

    Once you run gradle assemble, you can test the new docker image in two ways:

    • gradle :solr:docker:test, which runs the docker-solr integration tests using the newly assembled Solr image. This will eventually fail when it reaches the test that needs root permissions, since gradle doesn't run with root permissions itself. This is a TODO.
    • docker run -d -p 8983:8983 -e SOLR_JETTY_HOST="0.0.0.0" apache/solr:9.0.0-SNAPSHOT, then visit the admin screen at http://localhost:8983 .

    TODO:

    • [ ] Add ability to build with "release" package
    • [ ] Remove ability to push the package image.
    • [ ] Integrate gradle :solr:docker:tests into the overall gradle tests
    • [x] Make all of the tests work with gradle.
    opened by HoustonPutman 28
  • SOLR-14702: Remove oppressive language (part1)

    Description

    Please refer to this PR: https://github.com/apache/lucene-solr/pull/1711

    Solution

    Replaces master with primary and slave with secondary.

    Tests

    Run standard tests.

    Checklist

    Please review the following and check all that apply:

    • [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
    • [x] I have created a Jira issue and added the issue ID to my pull request title.
    • [x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
    • [x] I have developed this patch against the master branch.
    • [x] I have run ant precommit and the appropriate test suite.
    • [ ] I have added tests for my changes.
    • [ ] I have added documentation for the Ref Guide (for Solr changes only).
    opened by MarcusSorealheis 28
  • Initial rewrite of MMapDirectory for JDK-16 preview (incubating) Panama APIs (>= JDK-16-ea-b32)

    This is just a draft PR for a first insight on memory mapping improvements in JDK 16+.

    Some background information: starting with JDK 14, there is a new incubating module, jdk.incubator.foreign, with a new, not yet stable API for accessing off-heap memory (later it will also support calling functions in native libraries like .so or .dll files using classical MethodHandles). This incubator module has had several versions:

    • first version: https://openjdk.java.net/jeps/370 (slow, very buggy, and thread-confined, making it unusable with Lucene)
    • second version: https://openjdk.java.net/jeps/383 (still thread-confined, but now allows transfer of "ownership" to other threads; this is still impossible to use with Lucene)
    • third version, in JDK 16: https://openjdk.java.net/jeps/393 (this version includes "Support for shared segments"). This now allows us to safely use the same mmapped memory from different threads and also unmap it!

    This module more or less overcomes several problems:

    • The ByteBuffer API is limited to 32-bit addressing (in fact MMapDirectory has to chunk files into 1 GiB portions)
    • There is no official way to unmap ByteBuffers when the file is no longer used. There is a way to use sun.misc.Unsafe and forcefully unmap segments, but any IndexInput accessing the file from another thread will crash the JVM with SIGSEGV or SIGBUS. We learned to live with that, and we happily apply the unsafe unmapping, but that's the main issue.

    @uschindler had many discussions with the team at OpenJDK, and finally, with the third incubator, we have an API that works with Lucene. These were very fruitful discussions (thanks to @mcimadamore!)

    With the third incubator we are now finally able to do some tests (especially performance). As this is an incubating module, this PR first changes the build system a bit:

    • disable -Werror for :lucene:core
    • add the incubating module to the compiler setup of :lucene:core and enable it for all test builds. This is important, as you have to pass --add-modules jdk.incubator.foreign at runtime, too!

    The code basically just modifies MMapDirectory to use LONG instead of INT for the chunk size parameter. In addition it adds MemorySegmentIndexInput, a copy of our ByteBufferIndexInput (which is still there, but unused) that uses MemorySegment instead of ByteBuffer behind the scenes. It works in exactly the same way; just the try/catch blocks for supporting EOFException or moving to another segment were rewritten.
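
    A toy sketch of the underlying JDK mechanism, written against the jdk.incubator.foreign API as it stood in JDK-16-ea (hypothetical class name; the incubating API has changed in later JDKs, and this is not the PR's code):

        // Run with: --add-modules jdk.incubator.foreign
        import java.nio.channels.FileChannel;
        import java.nio.file.Files;
        import java.nio.file.Path;

        import jdk.incubator.foreign.MemoryAccess;
        import jdk.incubator.foreign.MemorySegment;

        public class MMapSegmentSketch {
            public static void main(String[] args) throws Exception {
                Path file = Path.of(args[0]);
                long size = Files.size(file);
                // One mapping for the whole file: no 1 GiB ByteBuffer chunking.
                MemorySegment seg = MemorySegment
                        .mapFile(file, 0, size, FileChannel.MapMode.READ_ONLY)
                        .share(); // JEP 393: readable from any thread
                long first = MemoryAccess.getLongAtOffset(seg, 0);
                System.out.printf("first long: %016x%n", first);
                seg.close(); // deterministic unmap, no sun.misc.Unsafe hacks
            }
        }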

    The openInput code uses MemorySegment.mapFile() to get a memory mapping. This method is unfortunately a bit buggy in JDK-16-ea-b30, so I added some workarounds. See JDK issues: https://bugs.openjdk.java.net/browse/JDK-8259027, https://bugs.openjdk.java.net/browse/JDK-8259028, https://bugs.openjdk.java.net/browse/JDK-8259032, https://bugs.openjdk.java.net/browse/JDK-8259034. The bugs with alignment and zero byte mmaps are fixed in b32, this PR was adapted (hacks removed).

    It passes all tests and it looks like you can use it to read indexes. The default chunk size is now 16 GiB (but you can raise or lower it as you like; tests are doing this). Of course you can set it to Long.MAX_VALUE, in which case every index file is always mapped into one big memory mapping. My testing on Windows 10 has shown that this is not a good idea! Huge mappings fragment the address space over time, and as we can only use something like 43 or 46 bits (depending on the OS), the fragmentation will at some point kill you. So 16 GiB looks like a good compromise: most files will be smaller than 6 GiB anyway (unless you optimize your index down to one huge segment), so for most Lucene installations the number of memory segments will equal the number of open files. Heavy Elasticsearch users will be very happy; the sysctl max_map_count may no longer need to be touched.
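
    For context, a usage sketch of the chunk-size knob against the stock Lucene 8.x API, where the parameter is still an int capped at 1 GiB (hypothetical class name; this PR widens it to a long with a 16 GiB default):

        import java.io.IOException;
        import java.nio.file.Path;

        import org.apache.lucene.store.MMapDirectory;

        public class ChunkSizeSketch {
            public static void main(String[] args) throws IOException {
                // 256 MiB chunks; stock 8.x caps maxChunkSize at 1 << 30.
                MMapDirectory dir = new MMapDirectory(Path.of(args[0]), 1 << 28);
                System.out.println("max chunk size: " + dir.getMaxChunkSize());
                dir.close();
            }
        }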

    In addition, this implements readLELongs in a better way than @jpountz did (no caching of arbitrary objects). Nevertheless, as the new MemorySegment API relies on final, unmodifiable classes, and copying memory from a MemorySegment to an on-heap Java array requires us to wrap each such array in a MemorySegment every time (e.g. in readBytes() or readLELongs), there may be some overhead due to short-lived object allocations (those are NOT reusable!). In short: in the future we should stop copying/loading our stuff onto the heap, maybe throw away IndexInput completely, and base our code fully on random access. The new foreign-vector APIs will in the future also be written with MemorySegment in focus, so you can allocate a vector view on a MemorySegment and let the vectorizer work fully outside the Java heap, inside our mmapped files! :-)

    It would be good if you could check out this branch and try it in production.

    But be aware:

    • You need JDK 11 to run Gradle (set JAVA_HOME to it)
    • You need JDK 16-ea-b32 (set RUNTIME_JAVA_HOME to it)
    • The lucene-core.jar will contain JDK 16 class files and requires JDK 16 to execute.
    • Also you need to add --add-modules jdk.incubator.foreign to the command line of your Java program/Solr server/Elasticsearch server

    It would be good to get some benchmarks, especially by @rmuir or @mikemccand. Take your time and enjoy the complexity of setting this up! ;-)

    My plan is the following:

    • report any bugs or slowness, especially with Hotspot optimizations. The last time I talked to Maurizio, he talked about Hotspot not being able to fully optimize for-loops with long instead of int, so it may take some time until the full performance is there.
    • wait until the final version of project PANAMA-foreign goes into Java's Core Library (no module needed anymore)
    • add a MR-JAR for lucene-core.jar and compile the MemorySegmentIndexInput and maybe some helper classes with JDK 17/18/19 (hopefully?).

    ~~In addition there are some comments in the code talking about safety (e.g., we need IOUtils.close() taking AutoCloseable instead of just Closeable, so we can also enforce that all memory segments are closed after usage).~~ In addition, by default all VarHandles are aligned: they refuse to read a LONG from an address which is not a multiple of 8. I had to disable this feature, as all our index files are heavily unaligned. We should in the meantime not only convert our files to little endian, but also align all non-compressed types (like long[] arrays or non-encoded integers) to the correct boundaries in files. The most horrible thing I have seen is that our CFS file format starts the "inner" files totally unaligned. We should fix the CFSWriter to always start new files at multiples of 8 bytes. I will open an issue about this.

    enhancement optimization 
    opened by uschindler 27
  • Added bulkclose feature to the githubPRs script

    Example use:

    ./githubPRs.py \
      --bulkclose "Lucene and Solr development has moved to separate git repositories and this PR is being bulk-closed. Please open a new PR against https://github.com/apache/solr or https://github.com/apache/lucene if your contribution is still relevant to the project." \
      --token XXXXXXXXXXXXX
    

    Result of such an action can be seen in #1364 which I used for testing. You can then easily query GitHub for a list of the stale-closed PRs: https://github.com/apache/lucene-solr/pulls?q=label%3Astale-closed+is%3Aclosed

    opened by janhoy 2
  • SOLR-14726: Initial draft of a new quickstart guide

    A new quickstart guide that can potentially replace (or live side by side with) the Solr tutorial. This is WIP at the moment, but I would appreciate early feedback and thoughts.

    opened by chatman 0
  • SOLR-15215: SolrJ: Remove Netty dependency

    Netty is an optional dependency of Zookeeper; one must opt in to use it. Netty remains a solr-core dependency transitively via Hadoop/HDFS.

    https://issues.apache.org/jira/browse/SOLR-15215

    opened by dsmiley 1
Owner
The Apache Software Foundation
A proof-of-concept serverless full-text search solution built with Apache Lucene and Quarkus framework.

Lucene Serverless This project demonstrates a proof-of-concept serverless full-text search solution built with Apache Lucene and Quarkus framework. ✔️

Arseny Yankovsky 38 Oct 29, 2022
Apache Lucene is a high-performance, full featured text search engine library written in Java.

The Apache Software Foundation 1.4k Jan 5, 2023
Apache Lucene.NET

Apache Lucene.NET Full-text search for .NET Apache Lucene.NET is a .NET full-text search engine framework, a C# port of the popular Apache Lucene proj

The Apache Software Foundation 1.9k Jan 4, 2023
OpenSearch is an open source distributed and RESTful search engine.

OpenSearch is an open source search and analytics engine derived from Elasticsearch

null 6.2k Jan 1, 2023
🔍 An open source GitLab/Gitee/Gitea code search tool. (Kooder is an open-source code search tool developed for Gitee/GitLab; this is a mirror repository, the main repository is on Gitee.)

Kooder is an open source code search project, offering code, repository and issue search services for code hosting platforms including Gitee, GitLab and Gitea.

开源中国 350 Dec 30, 2022
filehunter - Simple, fast, open source file search engine

Simple, fast, open source file search engine. Designed to be a local file search engine for places where multiple documents are stored on multiple hosts in multiple directories.

null 32 Sep 14, 2022
Free and Open, Distributed, RESTful Search Engine

Elasticsearch A Distributed RESTful Search Engine https://www.elastic.co/products/elasticsearch Elasticsearch is a distributed RESTful search engine b

elastic 62.3k Dec 31, 2022
GitHub Search Engine: Web Application used to retrieve, store and present projects from GitHub, as well as any statistics related to them.

GHSearch Platform This project is made of two subprojects: application: The main application has two main responsibilities: Crawling GitHub and retrie

SEART - SoftwarE Analytics Research Team 68 Nov 25, 2022
A simple fast search engine written in java with the help of the Collection API which takes in multiple queries and outputs results accordingly.

Adnan Hossain 6 Oct 24, 2022
Simple full text indexing and searching library for Java

indexer4j Simple full text indexing and searching library for Java Install Gradle repositories { jcenter() } dependencies { compile 'com.haeun

Haeun Kim 47 May 18, 2022
Apache Solr is an enterprise search platform written in Java and using Apache Lucene.

Apache Solr is an enterprise search platform written in Java and using Apache Lucene. Major features include full-text search, index replication and sharding, and result faceting and highlighting.

The Apache Software Foundation 630 Dec 28, 2022
Path Finding Visualizer for Breadth first search, Depth first search, Best first search and A* search made with java swing

Path-Finding-Visualizer Purpose This is a tool to visualize search algorithms Algorithms featured Breadth First Search Depth First Search Greedy Best

Leonard 11 Oct 20, 2022
The Chronix Server implementation that is based on Apache Solr.

Chronix Server The Chronix Server is an implementation of the Chronix API that stores time series in Apache Solr. Chronix uses several techniques to o

Chronix 262 Jul 3, 2022
Search API with spelling correction using ngram-index algorithm: implementation using Java Spring-boot and MySQL ngram full text search index

Search API to handle Spelling-Corrections Based on N-gram index algorithm: using MySQL Ngram Full-Text Parser Sample Screen-Recording Screen.Recording

Hardik Singh Behl 5 Dec 4, 2021
A project for converting simplified and traditional Chinese characters to pinyin, solving the polyphonic-character problem. A pinyin analysis tool for ElasticSearch and Solr.

pinyin-plus, a Chinese-characters-to-pinyin library with these features: pinyin data based on the open-source cc-cedict and kaifangcidian dictionaries; segmentation driven by a pinyin dictionary, with high accuracy, solving the polyphonic-character problem; support for traditional characters; support for custom dictionaries in the cc-cedict format; a simple API, split into normal mode and index mo

TapTap 103 Dec 25, 2022
Apache Cayenne is an open source persistence framework licensed under the Apache License

Apache Cayenne is an open source persistence framework licensed under the Apache License, providing object-relational mapping (ORM) and remoting services.

The Apache Software Foundation 284 Dec 31, 2022