:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop

Overview

Elasticsearch Hadoop

Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Apache Hive, Apache Pig, Apache Spark and Apache Storm.

See project page and documentation for detailed information.

Requirements

Elasticsearch (1.x or higher; 2.x highly recommended) cluster accessible through REST. That's it! Significant effort has been invested to create a small, dependency-free, self-contained jar that can be downloaded and put to use without any extra setup. Simply make it available to your job classpath and you're set. For library-specific details, see the dedicated chapter.

ES-Hadoop 6.x and higher are compatible with Elasticsearch 1.X, 2.X, 5.X, and 6.X

ES-Hadoop 5.x and higher are compatible with Elasticsearch 1.X, 2.X and 5.X

ES-Hadoop 2.2.x and higher are compatible with Elasticsearch 1.X and 2.X

ES-Hadoop 2.0.x and 2.1.x are compatible with Elasticsearch 1.X only

Installation

Stable Release (currently 7.12.0)

Available through any Maven-compatible tool:

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <version>7.12.0</version>
</dependency>

or as a stand-alone ZIP.

Development Snapshot

Grab the latest nightly build from the snapshot repository, again through Maven:

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <version>8.0.0-SNAPSHOT</version>
</dependency>
<repositories>
  <repository>
    <id>sonatype-oss</id>
    <url>http://oss.sonatype.org/content/repositories/snapshots</url>
    <snapshots><enabled>true</enabled></snapshots>
  </repository>
</repositories>

or build the project yourself.

We do build and test the code on each commit.

Supported Hadoop Versions

Running against Hadoop 1.x is deprecated in 5.5 and will no longer be tested against in 6.0. ES-Hadoop is developed for and tested against Hadoop 2.x and YARN. More information in this section.

Feedback / Q&A

We're interested in your feedback! You can find us on the User mailing list - please append [Hadoop] to the post subject to filter it out. For more details, see the community page.

Online Documentation

The latest reference documentation is available online on the project home page. The README below provides basic usage instructions at a glance.

Usage

Configuration Properties

All configuration properties start with the es prefix. Note that the es.internal namespace is reserved for the library's internal use and should not be used by the user at any point. The properties are read mainly from the Hadoop configuration, but the user can specify (some of) them directly, depending on the library used.

Required

es.resource=<ES resource location, relative to the host/port specified above>

Essential

es.query=<uri or query dsl query>              # defaults to {"query":{"match_all":{}}}
es.nodes=<ES host address>                     # defaults to localhost
es.port=<ES REST port>                         # defaults to 9200

The full list is available here
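
As a quick illustration (a minimal sketch; the host value is illustrative), the essential properties can be set on the Hadoop Configuration, as the Map/Reduce examples below also do; libraries such as Hive and Pig accept the same es.* keys directly (through TBLPROPERTIES or the EsStorage constructor, as shown in their sections):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.set("es.nodes", "es-host");           // defaults to localhost
conf.set("es.port", "9200");               // defaults to 9200
conf.set("es.resource", "radio/artists");  // index (and type) to read from / write to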

Map/Reduce

For basic, low-level or performance-sensitive environments, ES-Hadoop provides dedicated InputFormat and OutputFormat implementations that read and write data to Elasticsearch. To use them, add the es-hadoop jar to your job classpath (either by bundling the library along - it's ~300kB and has no dependencies - by using the DistributedCache, or by provisioning the cluster manually). See the documentation for more information.
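
For instance, a minimal sketch of the DistributedCache route using the new Map/Reduce API (the HDFS path is hypothetical and assumes the jar has already been uploaded there):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance();
// ship the es-hadoop jar with the job instead of bundling it into the job jar
job.addFileToClassPath(new Path("/libs/elasticsearch-hadoop-<version>.jar"));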

Note that es-hadoop supports both the so-called 'old' and the 'new' API through its EsInputFormat and EsOutputFormat classes.

'Old' (org.apache.hadoop.mapred) API

Reading

To read data from ES, configure the EsInputFormat on your job configuration along with the relevant properties:

JobConf conf = new JobConf();
conf.setInputFormat(EsInputFormat.class);
conf.set("es.resource", "radio/artists");
conf.set("es.query", "?q=me*");             // replace this with the relevant query
...
JobClient.runJob(conf);

Writing

The same configuration template can be used for writing, but using EsOutputFormat:

JobConf conf = new JobConf();
conf.setOutputFormat(EsOutputFormat.class);
conf.set("es.resource", "radio/artists"); // index or indices used for storing data
...
JobClient.runJob(conf);

'New' (org.apache.hadoop.mapreduce) API

Reading

Configuration conf = new Configuration();
conf.set("es.resource", "radio/artists");
conf.set("es.query", "?q=me*");             // replace this with the relevant query
Job job = new Job(conf);
job.setInputFormatClass(EsInputFormat.class);
...
job.waitForCompletion(true);

Writing

Configuration conf = new Configuration();
conf.set("es.resource", "radio/artists"); // index or indices used for storing data
Job job = new Job(conf);
job.setOutputFormatClass(EsOutputFormat.class);
...
job.waitForCompletion(true);

Apache Hive

ES-Hadoop provides a Hive storage handler for Elasticsearch, meaning one can define an external table on top of ES.

Add es-hadoop-<version>.jar to hive.aux.jars.path or register it manually in your Hive script (recommended):

ADD JAR /path_to_jar/es-hadoop-<version>.jar;

Reading

To read data from ES, define a table backed by the desired index:

CREATE EXTERNAL TABLE artists (
    id      BIGINT,
    name    STRING,
    links   STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists', 'es.query' = '?q=me*');

The fields defined in the table are mapped to the JSON document when communicating with Elasticsearch. Notice the use of TBLPROPERTIES to define the location (the target index) and the query used for reading from this table.

Once defined, the table can be used just like any other:

SELECT * FROM artists;

Writing

To write data, a similar definition is used but with a different es.resource:

CREATE EXTERNAL TABLE artists (
    id      BIGINT,
    name    STRING,
    links   STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists');

Any data passed to the table is then passed down to Elasticsearch; for example, considering a table source (aliased as s below), mapped to a TSV/CSV file, one can index it to Elasticsearch like this:

INSERT OVERWRITE TABLE artists
    SELECT NULL, s.name, named_struct('url', s.url, 'picture', s.picture) FROM source s;

As one can note, currently the reading and writing are treated separately but we're working on unifying the two and automatically translating HiveQL to Elasticsearch queries.

Apache Pig

ES-Hadoop provides both read and write functions for Pig so you can access Elasticsearch from Pig scripts.

Register ES-Hadoop jar into your script or add it to your Pig classpath:

REGISTER /path_to_jar/es-hadoop-<version>.jar;

Additionally one can define an alias to save some chars:

%define ESSTORAGE org.elasticsearch.hadoop.pig.EsStorage()

and use $ESSTORAGE for storage definition.

Reading

To read data from ES, use EsStorage and specify the query through the LOAD function:

A = LOAD 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?q=me*');
DUMP A;

Writing

Use the same Storage to write data to Elasticsearch:

A = LOAD 'src/artists.dat' USING PigStorage() AS (id:long, name, url:chararray, picture: chararray);
B = FOREACH A GENERATE name, TOTUPLE(url, picture) AS links;
STORE B INTO 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage();

Apache Spark

ES-Hadoop provides native (Java and Scala) integration with Spark: a dedicated RDD for reading and, for writing, methods that work on any RDD. Spark SQL is also supported.

Scala

Reading

To read data from ES, create a dedicated RDD and specify the query as an argument:

import org.elasticsearch.spark._

..
val conf = ...
val sc = new SparkContext(conf)
sc.esRDD("radio/artists", "?q=me*")

Spark SQL

import org.elasticsearch.spark.sql._

// DataFrame schema automatically inferred
val df = sqlContext.read.format("es").load("buckethead/albums")

// operations get pushed down and translated at runtime to Elasticsearch QueryDSL
val playlist = df.filter(df("category").equalTo("pikes").and(df("year").geq(2016)))

Writing

Import the org.elasticsearch.spark._ package to gain saveToEs methods on your RDDs:

import org.elasticsearch.spark._

val conf = ...
val sc = new SparkContext(conf)

val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("OTP" -> "Otopeni", "SFO" -> "San Fran")

sc.makeRDD(Seq(numbers, airports)).saveToEs("spark/docs")

Spark SQL

import org.elasticsearch.spark.sql._

val df = sqlContext.read.json("examples/people.json")
df.saveToEs("spark/people")

Java

In a Java environment, use the org.elasticsearch.spark.rdd.api.java package, in particular the JavaEsSpark class.

Reading

To read data from ES, create a dedicated RDD and specify the query as an argument.

import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf);

JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(jsc, "radio/artists");

Spark SQL

SQLContext sql = new SQLContext(sc);
DataFrame df = sql.read().format("es").load("buckethead/albums");
DataFrame playlist = df.filter(df.col("category").equalTo("pikes").and(df.col("year").geq(2016)));

Writing

Use JavaEsSpark to index any RDD to Elasticsearch:

import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf);

Map<String, ?> numbers = ImmutableMap.of("one", 1, "two", 2);
Map<String, ?> airports = ImmutableMap.of("OTP", "Otopeni", "SFO", "San Fran");

JavaRDD<Map<String, ?>> javaRDD = jsc.parallelize(ImmutableList.of(numbers, airports));
JavaEsSpark.saveToEs(javaRDD, "spark/docs");

Spark SQL

import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;

DataFrame df = sqlContext.read().json("examples/people.json");
JavaEsSparkSQL.saveToEs(df, "spark/docs");

Apache Storm

ES-Hadoop provides native integration with Storm: a dedicated Spout for reading and a specialized Bolt for writing.

Reading

To read data from ES, use EsSpout:

import org.elasticsearch.storm.EsSpout;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("es-spout", new EsSpout("storm/docs", "?q=me*"), 5);
builder.setBolt("bolt", new PrinterBolt()).shuffleGrouping("es-spout");

Writing

To index data to ES, use EsBolt:

import org.elasticsearch.storm.EsBolt;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 10);
builder.setBolt("es-bolt", new EsBolt("storm/docs"), 5).shuffleGrouping("spout");
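
To actually run either topology, submit it as usual. A minimal sketch (assuming Storm's org.apache.storm package layout and an illustrative topology name), with es.* settings passed through the topology configuration:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;

Config conf = new Config();
conf.put("es.nodes", "localhost");   // illustrative; see the configuration properties above
StormSubmitter.submitTopology("es-topology", conf, builder.createTopology());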

Building the source

Elasticsearch Hadoop uses Gradle for its build system and does not require it to be installed on your machine: the bundled wrapper (gradlew) automatically builds the package and runs the unit tests. For integration testing, use the integrationTests task. See gradlew tasks for more information.

To create a distributable zip, run gradlew distZip from the command line; once completed you will find the jar in build/libs.

To build the project, JVM 8 (Oracle one is recommended) or higher is required.

License

This project is released under version 2.0 of the Apache License.

Licensed to Elasticsearch under one or more contributor
license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright
ownership. Elasticsearch licenses this file to you under
the Apache License, Version 2.0 (the "License"); you may
not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
Comments
  • [Feature] Spark3.0 support

    Feature description

    Spark 3 is currently in RC. Will there be support for Spark 3 in the next release version (v8) or will we have to wait for v9? More precisely, do you plan to start supporting Spark 3 only when the official Spark 3 release is made?

    • Spark 3 now supports only Scala 2.12
    enhancement :Spark v8.0.0-alpha1 v7.12.0 
    opened by jonathanyuechun 62
  • EsHadoopIllegalStateException reading Geo-Shape into DataFrame - SparkSQL

    1. Create an index type with a mapping consisting of a field of type geo_shape.
    2. Create an RDD[String] containing a polygon as GeoJSON, as the value of a field whose name matches the mapping: """{"rect":{"type":"Polygon","coordinates":[[[50,32],[69,32],[69,50],[50,50],[50,32]]],"crs":null}}"""
    3. Write to an index type in Elasticsearch: rdd1.saveJsonToEs(indexName+"/"+indexType, connectorConfig)
    4. Read into SparkSQL DataFrame with either esDF or read-format-load:
      • sqlContext.esDF(indexName+"/"+indexType, connectorConfig)
      • sqlContext.read.format("org.elasticsearch.spark.sql").options(connectorConfig).load(indexName+"/"+indexType)

    Result is: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Field 'rect' not found; typically this occurs with arrays which are not mapped as single value. Full stack trace in gist. Elasticsearch Hadoop v2.1.2.

    bug :Spark v2.2.0 
    opened by randallwhitman 46
  • Document Count in ES Different from Number of Entries Pushed

    We are seeing an issue where the number of documents successfully created/indexed in ES is different (more--very odd, or less) from the number of documents which we are giving ES to index.

    Specifically, we are using spark to grab specific correctly formatted json from a repository and simply write it to ES into a new index. The difference in documents only seems to occur if there is a large number of documents being inserted (specifically we tested with 929,660) and too few shards (specifically 1). We also did a test at a larger scale (roughly 9 million documents and over 100 shards) which also gave incorrect document counts. We have ruled out this being a spark issue, but there are no apparent issues or exceptions in the logs (ES or Spark/ES-Hadoop), including when they are upped to debug. We do notice that ES does throttle the input. The fact that this only happens when large numbers of documents are created/indexed and that it seems to alleviate when more shards are added to balance the load points it to a load issue. I believe this to be an issue with ES core, but I was interested in seeing if anyone else here had seen this due to ES-hadoop being able to create a load issue more easily than the normal api. We have tried ES-hadoop 2.1.0-beta and 2.0.1 release with the same results.

    I can provide more details as needed, but as there is nothing in the logs, there are not many other observables that I can detail. It is very similar to: http://elasticsearch-users.115913.n3.nabble.com/missing-documents-after-bulk-indexing-td4056100.html, however I am not seeing any exceptions. Please let me know if anyone has seen this issue or has any resolution.

    bug :Rest v2.0.3 v2.1.0.Beta4 
    opened by mparker4 31
  • "Cannot find node with id" exception even when the node is alive and cluster is green.

    I am getting the following exception when pushing data from hadoop M/R job. When this happens, the node in question is responding and cluster is also healthy (green). Also, plenty of resources on the box. CPU usage is less than 30%, free memory is over 50G. With this exception, the hadoop map task is failing and getting restarted and eventually succeeding (may be by connecting to a different ES node). These errors are not consistent. They are very intermittent.

    org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find node with id [Q4pQkOIJSSi2oXRXGUVs8w]
        at org.elasticsearch.hadoop.util.Assert.notNull(Assert.java:40)
        at org.elasticsearch.hadoop.rest.RestRepository.getWriteTargetPrimaryShards(RestRepository.java:251)
        at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.initSingleIndex(EsOutputFormat.java:218)
        at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:201)
        at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:159)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:638)
        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
        at afi.search.hadoop.es.ESMapper1.map(ESMapper1.java:227)
        at afi.search.hadoop.es.ESMapper1.map(ESMapper1.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
    
    bug :Rest 
    opened by skkolli 29
  • Upgrade to Spark 2.4 and support Scala 2.12

    Feature description

    Spark 2.4.0 has recently been released, along with Scala 2.12 support. For those of us wanting to use Spark 2.4 features or even just enjoy Scala 2.12 binary compatibility, it would be great to update elasticsearch-spark-20 to use Spark 2.4 and support Scala 2.12 builds.

    I can offer a PR (currently based off the 6.5 branch) with changes that I've been using in production without issue (and which build against Scala 2.10, 2.11 and 2.12 successfully). If there is appetite for this, please confirm which branch to target.

    enhancement :Spark v8.0.0-alpha1 v7.12.0 
    opened by bpiper 28
  • EsSpark.esJsonRDD error

    Hi dude... I try to put a lot of documents and get them back. Result:

    15/03/02 23:49:41 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
    java.lang.IllegalArgumentException: Invalid position given=154612 -1
        at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:234)
        at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:165)
        at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:403)
        at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:76)
        at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:46)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:782)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:782)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    15/03/02 23:49:41 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Invalid position given=154612 -1
        at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:234)
        at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:165)
        at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:403)
        at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:76)
        at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:46)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:782)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:782)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    

    jsonPointers: doc = [154447,159056], fragments = 0. If you want I can send you the input; the input contains the following data: https://gist.githubusercontent.com/b0c1/486a7bea069e26d54b43/raw/a9a23bdbbbaf1a9b86918d302c3a838839b3a566/gistfile1.txt

    bug :Rest v2.1.0.Beta4 
    opened by b0c1 27
  • Insert slowdowns on Master branch

    Hello

    I'm playing with the master branch to test the unique ID or parent-child features. Testing it out, I was surprised to see huge slowdowns in Hive insertion queries. In my test I have a table with 30 columns and I want to insert 4M entries. I run the same insert overwrite query but change the deployed jar: CREATE EXTERNAL TABLE es_test( values String..STORED BY 'org.elasticsearch.hadoop.hive.ESStorageHandler' TBLPROPERTIES('es.resource' = 'test/test','es.host' = 'myhost')

    When I insert with the current release I get a throughput of 12k/s, so 5 minutes for insertion, with one or two map tasks failing on:

    java.lang.IllegalStateException: Cannot get response body for [POST]test/test/_bulk] at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:189)

    When I insert with the master branch, the same data takes about 57 minutes for insertion and I get around 50-60% failed map tasks with a lot of:

    java.lang.IllegalStateException: Cannot get response body for [PUT][test] at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:221)

    I/O exception (java.net.ConnectException) caught when processing request: Connection timed out and some EsRejectedExecutionExceptions bulk thread pool 50.

    I'll keep investigating to find a more useful error or a more definite reason. Hope this post helps.

    bug :Hive :Rest 
    opened by nmaillard 27
  • Unable to index JSON from HDFS using SchemaRDD.saveToEs()

    Elasticsearch-hadoop: elasticsearch-hadoop-2.1.0.BUILD-20150224.023403-309.zip Spark: 1.2.0

    I'm attempting to take a very basic JSON document on HDFS and index it in ES using SchemaRDD.saveToEs(). According to the documentation under "writing existing JSON to elasticsearch" it should be as easy as creating a SchemaRDD via SQLContext.jsonFile() and then indexing it using .saveToEs(), but I'm getting an error.

    Replicating the problem:

    1. Create JSON file on hdfs with the content:
    {"key":"value"}
    
    2. Execute code in spark-shell
    import org.apache.spark.SparkContext._
    import org.elasticsearch.spark._
    
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._
    
    val input = sqlContext.jsonFile("hdfs://nameservice1/user/mshirley/test.json")
    input.saveToEs("mshirley_spark_test/test")
    

    Error:

     org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [Bad Request(400) - Invalid JSON fragment received[["value"]][MapperParsingException[failed to parse]; nested: ElasticsearchParseException[Failed to derive xcontent from (offset=13, length=9): [123, 34, 105, 110, 100, 101, 120, 34, 58, 123, 125, 125, 10, 91, 34, 118, 97, 108, 117, 101, 34, 93, 10]]; ]];
    

    input object:

    res1: org.apache.spark.sql.SchemaRDD = 
    SchemaRDD[6] at RDD at SchemaRDD.scala:108
    == Query Plan ==
    == Physical Plan ==
    PhysicalRDD [key#0], MappedRDD[5] at map at JsonRDD.scala:47
    

    input.printSchema():

    root
     |-- key: string (nullable = true)
    

    Expected result: I expect to be able to read a file from HDFS that contains a JSON document per line and submit that data to ES for indexing.

    bug invalid :Spark v2.1.0.Beta4 
    opened by mshirley 26
  • ES-Hadoop Pig

    • [ ] Bug report. If you’ve found a bug, please provide a code snippet or test to reproduce it below.
      The easier it is to track down the bug, the faster it is solved.

    Issue description

    I'm trying to load a JSON document into Elasticsearch and run it, but it pops out an error.

    Description

    Used the below command to load data:

    grunt> REGISTER hdfs://localhost:9000/lib/elasticsearch-hadoop-2.1.1.jar;
    
    grunt> JSON_DATA = LOAD '/ch07/crimes.json' USING PigStorage() AS (json:chararray);
    
    grunt> STORE JSON_DATA INTO 'esh_pig/crimes_json' USING org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true');
    

    Steps to reproduce

    Below is the error:

    2017-07-25 12:21:17,674 [Thread-301] WARN  org.apache.hadoop.mapred.LocalJobRunner - job_local1500679296_0008
    java.lang.Exception: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [Bad Request(400) - [MapperParsingException[failed to parse]; nested: ElasticsearchParseException[Failed to derive xcontent]; ]]; Bailing out..
    	at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
    Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [Bad Request(400) - [MapperParsingException[failed to parse]; nested: ElasticsearchParseException[Failed to derive xcontent]; ]]; Bailing out..
    	at org.elasticsearch.hadoop.rest.RestClient.retryFailedEntries(RestClient.java:203)
    	at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:167)
    	at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:209)
    	at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:232)
    	at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:250)
    	at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.doClose(EsOutputFormat.java:214)
    	at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.close(EsOutputFormat.java:196)
    	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.close(PigOutputFormat.java:146)
    	at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:670)
    	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
    	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
    	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:748)
    2017-07-25 12:21:22,664 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
    

    Version Info

    OS: CentOS 7, JVM: 1.8.0_131, Hadoop: 2.7.3, ES: 1.7.1

    invalid :Pig :Serialization 
    opened by splikhita 25
  • Unable to insert data into ES from Hive - Cannot determine write shards

    Hello,

    I'm having a problem inserting data from Hive into ES using ES-Hadoop and I can't for the life of me figure out what is going wrong.

    I've created a simple Hive table in a database called dev:

    CREATE external TABLE IF NOT EXISTS dev.carltest (
      TimeTaken                      string
    )
    row format serde 'com.bizo.hive.serde.csv.CSVSerde'
    with serdeproperties (
    "separatorChar" = ","
    )   
    LOCATION '/user/cloudera/carltest'
    

    When I run a select query on this table it pulls back the single value absolutely fine. I've also created another simple Hive table in a database called elasticsearchtables:

    CREATE EXTERNAL TABLE IF NOT EXISTS elasticsearchtables.carltest (
    TimeTaken string
    )
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES ('es.resource' = 'proxylog-carltest/event',
                   'es.index.auto.create' = 'true',
                   'es.nodes' = <servername here>,
                   'es.port' = '9200',
                   'es.field.read.empty.as.null' ='true',
                   'es.mapping.names' = 'TimeTaken:TimeTaken'
    )
    

    When I run the following Hive query:

    INSERT OVERWRITE TABLE elasticsearchtables.carltest
    SELECT TimeTaken
    FROM dev.carltest
    

    I get a MapReduce error, and looking in the Hive log it states: Task with the most failures(4):

    Diagnostic Messages for this Task:

    Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"timetaken":"1"}
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:175)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
    Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"timetaken":"1"}
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:529)
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
        ... 8 more
    Caused by: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot determine write shards for [proxylog-carltest/event]; likely its format is incorrect (maybe it contains illegal characters?)
        at org.elasticsearch.hadoop.util.Assert.isTrue(Assert.java:50)
        at org.elasticsearch.hadoop.rest.RestService.initSingleIndex(RestService.java:427)
        at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:392)
        at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
        at org.elasticsearch.hadoop.hive.EsHiveOutputFormat$EsHiveRecordWriter.write(EsHiveOutputFormat.java:58)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:638)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:847)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:87)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:847)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:847)
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:519)
        ... 9 more
    

    My ES cluster is running 1.5.2, I'm using the latest beta release of ES-Hadoop, and I'm running CDH 5.1.3.

    Any help with this would be greatly appreciated.

    question :Hive v2.2.0-m1 
    opened by pricecarl 25
  • Out of nodes

    Me again with another, probably silly, question. I'm trying to index many files which are already in HDFS. Indexing single files is no problem; now I'm passing all the files which should be indexed at once through the command line using the script below. For a few minutes it works fine, until it gives this error:

    ERROR 2997: Encountered IOException. Out of nodes and retries; caught exception
    
    
    register /home/kolbe/elasticsearch-hadoop-1.3.0.M2/dist/elasticsearch-hadoop-1.3.0.M2-yarn.jar
    a = load '$files' using PigStorage() as (json:chararray);
    store a into '$index' using org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true','es.nodes=<adress>:9200');
    
    bug :Pig v2.0.0.RC1 
    opened by Foolius 25
  • Spark connector's implementation of "explode" does not work on nested fields

    Hi everyone,

    What kind of issue is this?

    • [X] Bug report.
    • [ ] Feature Request

    Issue description

    We use Spark to manipulate an array of distinct objects in an Elasticsearch index. The Elasticsearch index's field is mapped as:

    "array_field": {
            "type": "nested",
            "properties": {
              "property1": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword"
                  }
                }
              },
              "property2": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword"
                  }
                }
              },
              "property3": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword"
                  }
                }
              },
              "property4": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword"
                  }
                }
              },
              "property5": {
                "type": "date"
              }
            }
          }
    

    When we use the explode Spark function on a dataset created from reading from ElasticSearch the connector generates the following query :

    "query": {
        "bool": {
          "must": [
            {
              "match_all": {}
            }
          ],
          "filter": [
            {
              "exists": {
                "field": "array_field"
              }
            }
          ]
        }
      } 
    

    The "exists" part of the query is generated to differentiate calls of explode and explode_outer, because explode drops null elements whereas explode_outer keeps them. But since the field is nested, the query never gets any match because it is not a nested query; therefore the dataset is always empty.

    Steps to reproduce

    1. Create an index with a nested mapped field
    2. Put a document with a valued nested field
    3. Read the index from Spark into a dataset
    4. Call Spark explode(field) on the nested field on the dataset
    5. The dataset is empty because the generated query does not match any document

    Version Info

    OS: Linux, JVM: 1.8, Hadoop/Spark: Spark 3.3.0, ES-Hadoop: elasticsearch-spark-30_2.12:8.2.2

    opened by ThibSCH 0
  • Casting SparkSQL - scala.collection.convert.Wrappers$JListWrapper is not a valid external type for schema

    What kind of issue is this?

    • [x] Bug report. If you’ve found a bug, please provide a code snippet or test to reproduce it below.
      The easier it is to track down the bug, the faster it is solved.
    • [ ] Feature Request. Start by telling us what problem you’re trying to solve.
      Often a solution already exists! Don’t send pull requests to implement new features without first getting our support. Sometimes we leave features out on purpose to keep the project small.

    Issue description

    Hello, I tried reading data from an instance of Elasticsearch via SparkSQL and it throws a casting exception.

    The provided schema is the following:

    root
     |-- categories: struct (nullable = true)
     |    |-- id: string (nullable = true)
     |    |-- name: string (nullable = true)
    

    Steps to reproduce

    Code:

      val indice: String = "movies"
    
      lazy val customSparkOptions: Map[String, String] = Map(
        "es.nodes" -> "localhost",
        "es.net.http.auth.user" -> "elastic",
        "es.net.http.auth.pass" -> "admin",
        "es.resource" -> indice,
        "es.nodes.wan.only" -> "true",
    
      )
      lazy val sparkConf = new SparkConf().setAll(customSparkOptions)
    
      lazy val spark: SparkSession =
        SparkSession
          .builder()
          .master("local[2]")
          .config(sparkConf)
          .getOrCreate()
    
      val elasticDF: DataFrame = spark.read.format("org.elasticsearch.spark.sql").load(indice).select("categories")
    
      elasticDF.printSchema()
      elasticDF.show(false)
    

    Stack trace:

    java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.collection.convert.Wrappers$JListWrapper is not a valid external type for schema of struct<id:string,name:string>
    if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else named_struct(id, if (validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, categories), StructField(id,StringType,true), StructField(name,StringType,true)).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, categories), StructField(id,StringType,true), StructField(name,StringType,true)), 0, id), StringType), true, false), name, if (validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, categories), StructField(id,StringType,true), StructField(name,StringType,true)).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, categories), StructField(id,StringType,true), StructField(name,StringType,true)), 1, name), StringType), true, false)) AS categories#28
    	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:215)
    	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:197)
    	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
    	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
    	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
    	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:127)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:750)
    Caused by: java.lang.RuntimeException: scala.collection.convert.Wrappers$JListWrapper is not a valid external type for schema of struct<id:string,name:string>
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.Invoke_0$(Unknown Source)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.CreateNamedStruct_0$(Unknown Source)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:211)
    	... 19 more
    

    Version Info

    OS: Ubuntu 20.04, JVM: 1.8.0_342, Hadoop/Spark: 3.0.1, ES-Hadoop: 7.12.0, ES: 7.12.0, Scala: 2.12.8

    opened by rafafrdz 3
  • Update configuration.adoc

    Thank you for submitting a pull request!

    Please make sure you have signed our Contributor License Agreement (CLA).
    We are not asking you to assign copyright to us, but to give us the right to distribute your code without restriction. We ask this of all contributors in order to assure our users of the origin and continuing existence of the code.
    You only need to sign the CLA once.

    >docs 
    opened by cbc666 2
  • Support for ArrayType in es.update.script.params

    What kind of issue is this?

    • [x] Bug report. If you’ve found a bug, please provide a code snippet or test to reproduce it below.
      The easier it is to track down the bug, the faster it is solved.
    • [ ] Feature Request. Start by telling us what problem you’re trying to solve.
      Often a solution already exists! Don’t send pull requests to implement new features without first getting our support. Sometimes we leave features out on purpose to keep the project small.

    Issue description

    Array fields are not working with the ES update script. tags is an array in the shared example and it fails in the transformation stage. Is there a reference for using ArrayType with es.update.script.params?

    Steps to reproduce

    Code:

            List<String> data = Collections.singletonList("{\"Context\":\"129\",\"MessageType\":{\"id\":\"1013\",\"content\":\"Hello World\"},\"Module\":\"1203\",\"time\":3249,\"critical\":0,\"id\":1, \"tags\":[\"user\",\"device\"]}");
    
            SparkConf sparkConf = new SparkConf();
    
            sparkConf.set("es.nodes", "localhost");
            sparkConf.set("es.port", "9200");
            sparkConf.set("es.net.ssl", "false");
            sparkConf.set("es.nodes.wan.only", "true");
    
            SparkSession session = SparkSession.builder().appName("SparkElasticSearchTest").master("local[*]").config(sparkConf).getOrCreate();
    
            Dataset<Row> df = session.createDataset(data, Encoders.STRING()).toDF();
            Dataset<String> df1 = df.as(Encoders.STRING());
            Dataset<Row> df2 = session.read().json(df1.javaRDD());
    
            df2.printSchema();
            df2.show(false);
    
            String script = "ctx._source.Context = params.Context; ctx._source.Module = params.Module; ctx._source.critical = params.critical; ctx._source.id = params.id; if (ctx._source.time == null) {ctx._source.time = params.time} ctx._source.MessageType = new HashMap(); ctx._source.MessageType.put('id', params.MessageTypeId); ctx._source.MessageType.put('content', params.MessageTypeContent); ctx._source.tags = params.tags";
    
            String ja = "MessageTypeId:MessageType.id, MessageTypeContent:MessageType.content, Context:Context, Module:Module, time:time, critical:critical, id:id, tags:tags";
    
            DataFrameWriter<Row> dsWriter = df2.write()
                    .format("org.elasticsearch.spark.sql")
                    .option(ConfigurationOptions.ES_NODES, "localhost")
                    .option(ConfigurationOptions.ES_PORT, "9200")
                    .option(ConfigurationOptions.ES_NET_USE_SSL, false)
                    .option(ConfigurationOptions.ES_NODES_WAN_ONLY, "true")
                    .option(ConfigurationOptions.ES_MAPPING_ID, "id")
                    .option(ConfigurationOptions.ES_WRITE_OPERATION, ConfigurationOptions.ES_OPERATION_UPSERT)
                    .option(ConfigurationOptions.ES_UPDATE_SCRIPT_UPSERT, "true")
                    .option(ConfigurationOptions.ES_UPDATE_SCRIPT_INLINE, script)
                    .option(ConfigurationOptions.ES_UPDATE_SCRIPT_LANG, "painless")
                    .option(ConfigurationOptions.ES_UPDATE_SCRIPT_PARAMS, ja);
    
    
            dsWriter.mode("append");
            dsWriter.save("user-details");
    
    

    Stack trace:

    org.elasticsearch.hadoop.serialization.EsHadoopSerializationException: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.Row
    	at org.elasticsearch.hadoop.serialization.bulk.BulkEntryWriter.writeBulkEntry(BulkEntryWriter.java:136)
    	at org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:170)
    	at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:83)
    	at org.elasticsearch.spark.sql.EsSparkSQL$.$anonfun$saveToEs$1(EsSparkSQL.scala:103)
    	at org.elasticsearch.spark.sql.EsSparkSQL$.$anonfun$saveToEs$1$adapted(EsSparkSQL.scala:103)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:131)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.Row
    	at org.elasticsearch.spark.sql.DataFrameValueWriter.writeArray(DataFrameValueWriter.scala:75)
    	at org.elasticsearch.spark.sql.DataFrameValueWriter.write(DataFrameValueWriter.scala:69)
    	at org.elasticsearch.hadoop.serialization.bulk.AbstractBulkFactory$FieldWriter.doWrite(AbstractBulkFactory.java:153)
    	at org.elasticsearch.hadoop.serialization.bulk.AbstractBulkFactory$FieldWriter.write(AbstractBulkFactory.java:123)
    	at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk.writeTemplate(TemplatedBulk.java:80)
    	at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk.write(TemplatedBulk.java:56)
    	at org.elasticsearch.hadoop.serialization.bulk.BulkEntryWriter.writeBulkEntry(BulkEntryWriter.java:68)
    	... 12 more
    
    

    Version Info

    OS: Any, JVM: 1.8, Hadoop/Spark: 3.2.1, ES-Hadoop: 8.2.2, ES: 7.10.2

    opened by vararo27 2
  • Add an option to pass a truststore in a format of a base64 string

    What kind of issue is this?

    • [ ] Bug report. If you’ve found a bug, please provide a code snippet or test to reproduce it below.
      The easier it is to track down the bug, the faster it is solved.
    • [x] Feature Request. Start by telling us what problem you’re trying to solve.
      Often a solution already exists! Don’t send pull requests to implement new features without first getting our support. Sometimes we leave features out on purpose to keep the project small.

    Feature description

    The current implementation of passing a truststore is limited to a path to some storage - either HDFS or some other location elasticsearch-hadoop can access. It burdens the user with uploading and maintaining the file somewhere, which can be quite a challenge in the case of many connections.

    To simplify the process, the truststore could be loaded from a base64 string, decoded, and loaded inside the elasticsearch-hadoop package. Of course, such functionality could be allowed on top of the existing solution as an additional parameter.

    opened by M0arcin 1
Releases (v8.5.3)

Downloads for all releases: https://elastic.co/downloads/hadoop

  • v8.5.3 (Dec 8, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.5/eshadoop-8.5.3.html
  • v7.17.8 (Dec 8, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/7.17/eshadoop-7.17.8.html
  • v8.5.2 (Nov 22, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.5/eshadoop-8.5.2.html
  • v8.5.1 (Nov 15, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.5/eshadoop-8.5.1.html
  • v8.5.0 (Nov 1, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.5/eshadoop-8.5.0.html
  • v7.17.7 (Oct 25, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/7.17/eshadoop-7.17.7.html
  • v8.4.3 (Oct 5, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.4/eshadoop-8.4.3.html
  • v8.4.2 (Sep 20, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.4/eshadoop-8.4.2.html
  • v8.4.1 (Aug 30, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.4/eshadoop-8.4.1.html
  • v8.4.0 (Aug 24, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.4/eshadoop-8.4.0.html
  • v7.17.6 (Aug 24, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/7.17/eshadoop-7.17.6.html
  • v8.3.3 (Jul 28, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.3/eshadoop-8.3.3.html
  • v8.3.2 (Jul 7, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.3/eshadoop-8.3.2.html
  • v8.3.1 (Jun 30, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.3/eshadoop-8.3.1.html
  • v8.3.0 (Jun 28, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.3/eshadoop-8.3.0.html
  • v7.17.5 (Jun 28, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/7.17/eshadoop-7.17.5.html
  • v8.2.3 (Jun 14, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.2/eshadoop-8.2.3.html
  • v8.2.2 (May 26, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.2/eshadoop-8.2.2.html
  • v8.2.1 (May 24, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.2/eshadoop-8.2.1.html
  • v7.17.4 (May 24, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/7.17/eshadoop-7.17.4.html
  • v8.2.0 (May 3, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.2/eshadoop-8.2.0.html
  • v8.1.3 (Apr 20, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.1/eshadoop-8.1.3.html
  • v7.17.3 (Apr 20, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/7.17/eshadoop-7.17.3.html
  • v8.1.2 (Mar 31, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.1/eshadoop-8.1.2.html
  • v7.17.2 (Mar 31, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/7.17/eshadoop-7.17.2.html
  • v8.1.1 (Mar 22, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.1/eshadoop-8.1.1.html
  • v8.1.0 (Mar 8, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.1/eshadoop-8.1.0.html
  • v8.0.1 (Mar 1, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.0/eshadoop-8.0.1.html
  • v7.17.1 (Feb 28, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/7.17/eshadoop-7.17.1.html
  • v8.0.0 (Feb 10, 2022) - Release notes: https://www.elastic.co/guide/en/elasticsearch/hadoop/8.0/eshadoop-8.0.0.html