Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

Overview

Elephant Bird

About

Elephant Bird is Twitter's open source library of LZO, Thrift, and Protocol Buffer-related Hadoop InputFormats, OutputFormats, Writables, Pig LoadFuncs, Hive SerDes, HBase miscellanea, etc. The majority of these are in production at Twitter, running over data every day.

Join the conversation about Elephant-Bird on the developer mailing list.

License

Apache License, Version 2.0.

Quickstart

  1. Make sure you have Protocol Buffers installed. Please see the Version compatibility section below.
  2. Make sure you have Apache Thrift installed. Please see the Version compatibility section below.
  3. Get the code: git clone git://github.com/twitter/elephant-bird.git
  4. Build the jar: mvn package
  5. Explore what's available: mvn javadoc:javadoc

Note: For any of the LZO-based code, make sure that the native LZO libraries are on your java.library.path. Generally this is done by setting JAVA_LIBRARY_PATH in pig-env.sh or hadoop-env.sh. You can also add lines like

PIG_OPTS=-Djava.library.path=/path/to/my/libgplcompression/dir

to pig-env.sh. See the instructions for Hadoop-LZO for more details.

There are a few simple examples that use the input formats. Note how the Protocol Buffer and Thrift classes are passed to input formats through configuration.
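
As a rough sketch of that pattern (not taken verbatim from the bundled examples), the snippet below configures a job to read protobuf records; MyProto.Person stands in for any generated protobuf class, and the MultiInputFormat.setInputFormatClass helper reflects common usage but should be verified against ProtobufMRExample.java.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import com.twitter.elephantbird.mapreduce.input.MultiInputFormat;

public class ProtobufReadSketch {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "protobuf-read-sketch");
    // The protobuf class is passed to the input format through the job
    // configuration rather than being baked into a format subclass.
    // MyProto.Person is a hypothetical generated class.
    MultiInputFormat.setInputFormatClass(MyProto.Person.class, job);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    job.waitForCompletion(true);
  }
}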

Maven repository

Elephant Bird release artifacts are published to the Sonatype OSS releases repository and promoted from there to Maven Central. From time to time we may also deploy snapshot releases to the Sonatype OSS snapshots repository.

Version compatibility

  1. Hadoop 20.2x, 1.x, 2.x
  2. Pig 0.8+
  3. Protocol Buffers 2.5.0, 2.4.1, 2.3.0 (the default build version is 2.4.1; it can be changed with -Dprotobuf.version=2.3.0)
  4. Hive 0.7 (with HIVE-1616)
  5. Thrift 0.5.0, 0.6.0, 0.7.0; versions greater than 0.9 are supported via the thrift9 Maven profile
  6. Mahout 0.6
  7. Cascading2 (as the API is evolving, see libraries.properties for the currently supported version)
  8. Crunch 0.8.1+

Runtime Dependencies

Elephant-Bird defines the majority of its dependencies in Maven's provided scope. As a result, these dependencies are not pulled in transitively by Elephant-Bird modules. Please see the wiki page for more information.

Contents

Hadoop Input and Output Formats

Elephant-Bird provides input and output formats for working with a variety of plaintext formats stored in LZO compressed files.

  • JSON data
  • Line-based data (TextInputFormat but for LZO; see the sketch after this list)
  • W3C logs
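
For the line-based case, a minimal sketch of wiring the LZO-aware text input format into a job is shown below; the input and output paths are placeholders, not paths from the repo.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat;

public class LzoTextReadSketch {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "lzo-text-read-sketch");
    // Keys are byte offsets (LongWritable), values are lines (Text),
    // just like TextInputFormat, but splits are LZO-aware.
    job.setInputFormatClass(LzoTextInputFormat.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job, new Path("/data/logs.lzo"));
    FileOutputFormat.setOutputPath(job, new Path("/data/out"));
    job.waitForCompletion(true);
  }
}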

Additionally, protocol buffers and thrift messages can be stored in a variety of file formats.

  • Block-based, into generic bytes
  • Line-based, base64 encoded
  • SequenceFile
  • RCFile

Hadoop API wrappers

Hadoop provides two API implementations: the old-style org.apache.hadoop.mapred and new-style org.apache.hadoop.mapreduce packages. Elephant-Bird provides wrapper classes that allow unmodified usage of mapreduce input and output formats in contexts where the mapred interface is required.

For more information, see DeprecatedInputFormatWrapper.java and DeprecatedOutputFormatWrapper.java
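
A minimal sketch of the wrapper pattern is below, assuming the static setInputFormat helper; the exact signature should be verified in DeprecatedInputFormatWrapper.java.

import org.apache.hadoop.mapred.JobConf;
import com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper;
import com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat;

public class WrapperSketch {
  public static void main(String[] args) {
    JobConf jobConf = new JobConf();
    // The old-style (mapred) job sees DeprecatedInputFormatWrapper, which
    // delegates to the wrapped new-style (mapreduce) input format at runtime.
    DeprecatedInputFormatWrapper.setInputFormat(LzoTextInputFormat.class, jobConf);
  }
}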

Hadoop 2.x Support

Elephant-bird published packages are tested with both Hadoop 1.x and 2.x.

Hadoop Writables

  • Elephant-Bird provides protocol buffer and thrift writables for directly working with these formats in map-reduce jobs (see the sketch below).
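
As an illustration, the sketch below wraps a hypothetical generated protobuf class (MyProto.Person) in a ProtobufWritable so it can be used as a map-reduce value; the newInstance/set/get calls follow common usage and should be checked against the Javadocs.

import com.twitter.elephantbird.mapreduce.io.ProtobufWritable;

public class ProtobufWritableSketch {
  public static void main(String[] args) {
    // MyProto.Person is a hypothetical generated protobuf class.
    MyProto.Person person = MyProto.Person.newBuilder().setName("Ada").build();

    // newInstance binds the writable to the protobuf class so it can
    // deserialize records on the other side of the shuffle.
    ProtobufWritable<MyProto.Person> writable =
        ProtobufWritable.newInstance(MyProto.Person.class);
    writable.set(person);

    MyProto.Person roundTripped = writable.get();
  }
}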

Pig Support

Loaders and storers are available for the input and output formats listed above. Additionally, pig-specific features include:

  • JSON loader (including nested structures)
  • Regex-based loader
  • Includes converter interface for turning Tuples into Writables and vice versa
  • Provides implementations to convert generic Writables, Thrift, Protobufs, and other specialized classes, such as Apache Mahout's VectorWritable.

Hive Support

Elephant-Bird provides Hive support for reading thrift and protocol buffers. For more information, see How to use Elephant Bird with Hive.

Lucene Integration

Elephant-Bird provides Hadoop Input/Output Formats and Pig Load/Store Funcs for creating and searching Lucene indexes. See Elephant Bird Lucene.

Utilities

  • Counters in Pig
  • Protocol Buffer utilities
  • Thrift utilities
  • Conversions from Protocol Buffers and Thrift messages to Pig tuples
  • Conversions from Thrift to Protocol Buffer's DynamicMessage
  • Reading and writing block-based Protocol Buffer format (see ProtobufBlockWriter and the sketch below)
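
To make the last item concrete, here is a rough sketch of writing a block-encoded protobuf file; MyProto.Person is a hypothetical generated class, and the constructor and method names should be verified against the ProtobufBlockWriter Javadocs.

import java.io.FileOutputStream;
import com.twitter.elephantbird.mapreduce.io.ProtobufBlockWriter;

public class BlockWriterSketch {
  public static void main(String[] args) throws Exception {
    // Writes protobufs in the block format understood by the
    // block-based input formats and readers.
    ProtobufBlockWriter<MyProto.Person> writer =
        new ProtobufBlockWriter<MyProto.Person>(
            new FileOutputStream("person_data"), MyProto.Person.class);
    writer.write(MyProto.Person.newBuilder().setName("Ada").build());
    writer.finish();  // flush the final (possibly partial) block
    writer.close();
  }
}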

Protocol Buffer and Thrift compiler dependencies

Elephant Bird requires the Protocol Buffers compiler at build time, as generated classes are used internally. The Thrift compiler is required to generate classes used in tests. As these are native-code tools, they must be installed on the build machine (Java library dependencies are pulled from Maven repositories during the build).

Working with Thrift and Protocol Buffers in Hadoop

We provide InputFormats, OutputFormats, Pig Load / Store functions, Hive SerDes, and Writables for working with Thrift and Google Protocol Buffers. We haven't written up the docs yet, but look at ProtobufMRExample.java, ThriftMRExample.java, people_phone_number_count.pig, and people_phone_number_count_thrift.pig under the examples directory for reflection-based dynamic usage. We also provide utilities for generating Protobuf-specific Loaders, Input/Output Formats, etc., if for some reason you want to avoid the dynamic bits.

Hadoop SequenceFiles and Pig

Reading and writing Hadoop SequenceFiles with Pig is supported via classes SequenceFileLoader and SequenceFileStorage. These classes make use of a WritableConverter interface, allowing pluggable conversion of key and value instances to and from Pig data types.

Here's a short example: Suppose you have SequenceFile<Text, LongWritable> data sitting beneath path input. We can load that data with the following Pig script:

REGISTER '/path/to/elephant-bird.jar';

%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';

pairs = LOAD 'input' USING $SEQFILE_LOADER (
  '-c $TEXT_CONVERTER', '-c $LONG_CONVERTER'
) AS (key: chararray, value: long);

To store {key: chararray, value: long} data as SequenceFile<Text, LongWritable>, the following may be used:

%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';

STORE pairs INTO 'output' USING $SEQFILE_STORAGE (
  '-c $TEXT_CONVERTER', '-c $LONG_CONVERTER'
);

For details, please see the Javadocs of SequenceFileLoader, SequenceFileStorage, and the WritableConverter implementations.

How To Contribute

Bug fixes, features, and documentation improvements are welcome! Please fork the project and send us a pull request on github.

Each new release since 2.1.3 has a tag. The latest version on master is what we are actively running on Twitter's hadoop clusters daily, over hundreds of terabytes of data.

Contributors

Major contributors are listed below. Lots of others have helped too, thanks to all of them! See git logs for credits.

Comments
  • Split elephant-bird into modules

    Split elephant-bird into modules

    I split the project into core, pig, hive, cascading2 and rcfile modules. This will make it easier to depend on only the parts you need without getting a metric ton of extra jar files. The rcfile module is separate only because it needs both hive and pig.

    I switched the project to Maven in order to modularize it more easily. I don't really care what is used though, this just seemed easier. Added two plugins for thrift and protobuf code gen from Maven.

    This branch is almost fully baked, but there are a couple of nits. I probably won't have time to get this all the way to the finish line, but it's almost there. Hopefully someone else can help out with these.

    • testStableHashcodeAcrossJVMs in TestProtobufWritable is commented out. This was using some Ant dependent way of launching another JVM. Should be doable to make this work without Ant though.
    • Maven complains about directly using the artifacts in lib/, but most of those have been discussed in another issue (frankenjar etc). It seems to work anyway.
    • All tests pass, but I have not tested any of the artifacts with a real Hadoop job to see if it actually works
    opened by johanoskarsson 47
  • Close cloned HDFSIndexInput instances

    Close cloned HDFSIndexInput instances

    When used inside a webapp, there were a lot of open files (in CLOSED_WAIT state) for the index files on HDFS for every request. A lot of open files eventually chokes the host. Also, the FileSystem should not be closed, as it is passed in from outside and could be used by others.

    opened by ankurbarua 23
  • Signature is null in SequenceFileLoader

    Signature is null in SequenceFileLoader

    I have a similar problem as in #112, but with the SequenceFileLoader in the mapreduce mode.

    The signature is null and a command fails because of that. The error is:

    java.lang.NullPointerException: Signature is null
        at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
        at com.twitter.elephantbird.pig.load.SequenceFileLoader.getContextProperties(SequenceFileLoader.java:296)
        at com.twitter.elephantbird.pig.load.SequenceFileLoader.getContextProperty(SequenceFileLoader.java:306)
        at com.twitter.elephantbird.pig.load.SequenceFileLoader.setLocation(SequenceFileLoader.java:411)
    

    The commands leading to this error are:

    A  = load 'hdfs://XXX'  USING com.twitter.elephantbird.pig.load.SequenceFileLoader (
        '-c     com.twitter.elephantbird.pig.util.TextConverter',                                                              
        '-c com.twitter.elephantbird.pig.util.ProtobufWritableConverter de.pc2.dedup.fschunk.pig.PigProtocol.File');       
    ILLUSTRATE A;
    
    opened by dmeister 21
  • API incompatibility with Hadoop 2.x JobContext

    API incompatibility with Hadoop 2.x JobContext

    I'm trying to use ElephantBird (trunk and/or 2.2.3) with CDH 4.0.1 and am getting an exception using the LZO input formats (I copied the EB core JAR to /usr/lib/hadoop/lib prior to running the test):

    # EB compiled, and JAR copied to /usr/lib/hadoop/lib
    hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-2.0.0-mr1-cdh4.0.1-examples.jar \
      sort \
      -inFormat com.twitter.elephantbird.mapred.input.DeprecatedLzoTextInputFormat \
      -outFormat org.apache.hadoop.mapred.TextOutputFormat \
      -outKey org.apache.hadoop.io.LongWritable \
      -outValue org.apache.hadoop.io.Text \
      /user/aholmes/small-traffic-file.txt.lzo  output
    
    java.lang.InstantiationError: org.apache.hadoop.mapreduce.JobContext
    at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper.getSplits(DeprecatedInputFormatWrapper.java:100)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
    at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
    ...
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
    

    This exception is a result of the following code in DeprecatedInputFormatWrapper:

    List<org.apache.hadoop.mapreduce.InputSplit> splits =
        realInputFormat.getSplits(new JobContext(job, null));
    

    With Hadoop 2.x the JobContext class changed from a concrete class to an interface:

    Hadoop 1.0.3: JobContext is a class
    Hadoop 2.0.0: JobContext is an interface
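
    A common way to bridge this (and roughly what Elephant Bird's hadoop-compat module later provided) is to construct the JobContext reflectively, choosing JobContextImpl when JobContext is an interface. The sketch below is illustrative only, not the actual patch:

    import java.lang.reflect.Constructor;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.JobID;

    public final class JobContextFactory {
      // Works on Hadoop 1.x (JobContext is a concrete class) and
      // Hadoop 2.x (JobContext is an interface, implemented by JobContextImpl).
      public static JobContext newJobContext(Configuration conf, JobID jobId) throws Exception {
        String className = JobContext.class.isInterface()
            ? "org.apache.hadoop.mapreduce.task.JobContextImpl"   // Hadoop 2.x
            : "org.apache.hadoop.mapreduce.JobContext";           // Hadoop 1.x
        Constructor<?> ctor =
            Class.forName(className).getConstructor(Configuration.class, JobID.class);
        return (JobContext) ctor.newInstance(conf, jobId);
      }
    }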

    opened by alexholmes 19
  • Getting error in EB LzoProtobufB64LinePigStorage storefunc.

    Getting error in EB LzoProtobufB64LinePigStorage storefunc.

    Hi Raghu,

    Facing error in EB storefunc,

    at com.test.sample.data.Sample$Testing.getSerializedSize(Sample.java:19967)
    at com.google.protobuf.AbstractMessageLite.toByteArray(AbstractMessageLite.java:62)
    at com.twitter.elephantbird.mapreduce.io.ProtobufConverter.toBytes(ProtobufConverter.java:73)
    at com.twitter.elephantbird.mapreduce.io.ProtobufConverter.toBytes(ProtobufConverter.java:15)
    at com.twitter.elephantbird.mapreduce.output.LzoBinaryB64LineRecordWriter.write(LzoBinaryB64LineRecordWriter.java:42)
    at com.twitter.elephantbird.mapreduce.output.LzoBinaryB64LineRecordWriter.write(LzoBinaryB64LineRecordWriter.java:26)
    at com.twitter.elephantbird.pig.store.LzoProtobufB64LinePigStorage.putNext(LzoProtobufB64LinePigStorage.java:55)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:576)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)

    I have a proto with multiple message blocks. How do I access them in the latest EB in the storefunc LzoProtobufB64LinePigStorage? I'm referring to it as follows:

    com.twitter.elephantbird.pig.store.LzoProtobufB64LinePigStorage('com.test.sample.data.Sample.Testing')

    Earlier I used com.twitter.elephantbird.pig.proto.LzoProtobuffB64LinePigStore and it worked well (i.e., com.twitter.elephantbird.pig.proto.LzoProtobuffB64LinePigStore('Testing')), but in the latest version this class is not found, even though I loaded the jars below:

    elephant-bird-hadoop-compat-4.4.jar elephant-bird-pig-4.4.jar elephant-bird-core-4.4.jar

    Pls help.

    opened by viswaj 18
  • upgrade common-codec from 1.3 to 1.4 to explicitly eliminate misuse.

    upgrade common-codec from 1.3 to 1.4 to explicitly eliminate misuse.

    Hi Guys,

    Just as the title, the change tries to prevent some silent errors if misuse happens.

    The Base64 default constructor has different behavior between commons-codec-1.3 and commons-codec-1.4. In version 1.4, the default constructor enables the "chunking" ability, which was introduced in that version for the first time and does not exist in commons-codec-1.3.

    When commons-codec-1.4 precedes commons-codec-1.3 on the Java classpath, which is often the case because 1.4 is on Hadoop's classpath, and some line's length exceeds the CHUNK_SIZE (default 76), that line produces an unexpected multi-line output.
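
    For illustration only (not the actual patch): once commons-codec 1.4 is pinned, the intent can be made explicit by constructing the codec with chunking disabled rather than relying on the default constructor.

    import org.apache.commons.codec.binary.Base64;

    public class Base64Sketch {
      public static void main(String[] args) {
        // A line length of 0 disables chunking (commons-codec 1.4+),
        // so long records are never broken across multiple lines.
        Base64 codec = new Base64(0);
        byte[] encoded = codec.encode("some very long record ...".getBytes());
        System.out.println(new String(encoded));
      }
    }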

    opened by angushe 18
  • Prevent NPE in TStructDescriptor when an unexpected enum is encountered ...

    Prevent NPE in TStructDescriptor when an unexpected enum is encountered ...

    ...by returning an empty map.

    When an out-of-date thrift definition is used, and a new enum field is encountered, currently the enum map can be null, which results in an NPE. It seems like it is better to return an empty map and let the caller deal with the fact that the map doesn't contain all the enums the caller will encounter (the caller is then free to increment error counters, etc).

    opened by dvryaboy 17
  • upgrade protobuf compiler and library to 2.5.0

    upgrade protobuf compiler and library to 2.5.0

    It seems the current elephant-bird still uses the 2.4.1 protobuf compiler and library, which is 2.5 years old. We are currently using the 2.5.0 protobuf library in our project. Unfortunately, the 2.5.0 library does not fully support protobuf Java code auto-generated by the 2.4.1 compiler.

    This caused an issue when we tried to use SerializedBlock class in block_storage.proto. Error messages look like:

    java.lang.UnsupportedOperationException: This is supposed to be overridden by subclasses
        at com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
        at com.twitter.data.proto.BlockStorage$SerializedBlock.getSerializedSize(BlockStorage.java:164)
        at com.google.protobuf.AbstractMessageLite.toByteArray(AbstractMessageLite.java:62)

    Will you consider upgrading the protobuf compiler and library in elephant-bird?

    opened by simonandluna 17
  • Replace Preconditions.checkNotNull with warning in SequenceFileStorage.putNext

    Replace Preconditions.checkNotNull with warning in SequenceFileStorage.putNext

    Replaces checkNotNull in SequenceFileStorage.putNext(...) with warnings to skip null inputs

    This allows null inputs to be skipped at runtime instead of running into an NPE and possibly failing an entire Pig pipeline. Requested by Jake Mannix.
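
    The pattern being proposed is roughly the following (illustrative only, not the merged code): log and skip the null record instead of failing fast.

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    public class NullSkippingWriterSketch {
      private static final Log LOG = LogFactory.getLog(NullSkippingWriterSketch.class);

      public void putNext(Object value) {
        if (value == null) {
          // Warn and skip rather than throwing, so one bad tuple does not
          // fail the entire Pig pipeline.
          LOG.warn("Skipping null input record");
          return;
        }
        // ... convert and write the record as before ...
      }
    }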

    opened by sagemintblue 17
  • Fixes #452 - supporting thrift 0.7 and 0.9

    Fixes #452 - supporting thrift 0.7 and 0.9

    An attempt at #452 - support thrift 0.7 and 0.9+ by using separate source trees enabled via Maven profiles. By default the thrift7 profile is enabled, to remain compatible with the current behaviour/use. Building against thrift 0.9 can be done by enabling the thrift9 profile.

    The code that is incompatible has been isolated so that most future changes can benefit both versions. I have tested it against thrift 0.9.1 and 0.7.0.

    What this doesn't handle yet is a dual release; that could be achieved using classifiers. But first I want to have these changes validated before making sure the release works properly.

    @ianoc can you please have a look at this PR.

    opened by EugenCepoi 15
  • ThriftBinaryDeserializer incompatible with thrift 0.9.1+

    ThriftBinaryDeserializer incompatible with thrift 0.9.1+

    Since thrift 0.9.1, the setReadLength method has been removed from TBinaryProtocol, but ThriftBinaryDeserializer still uses it.

    I couldn't find why it was removed or any replacement for it. This issue happens when using EMR 4 and Spark 1.5, as they use thrift 0.9.2. If "the fix" is to just remove the calls to that method and make sure it works with the latest thrift releases, I can open a PR.

    opened by EugenCepoi 12
  • Bump protobuf-java from 2.4.1 to 3.16.3

    Bump protobuf-java from 2.4.1 to 3.16.3

    Bumps protobuf-java from 2.4.1 to 3.16.3.

    Release notes

    Sourced from protobuf-java's releases.

    Protobuf Release v3.16.3

    Java

    • Refactoring java full runtime to reuse sub-message builders and prepare to migrate parsing logic from parse constructor to builder.
    • Move proto wireformat parsing functionality from the private "parsing constructor" to the Builder class.
    • Change the Lite runtime to prefer merging from the wireformat into mutable messages rather than building up a new immutable object before merging. This way results in fewer allocations and copy operations.
    • Make message-type extensions merge from wire-format instead of building up instances and merging afterwards. This has much better performance.
    • Fix TextFormat parser to build up recurring (but supposedly not repeated) sub-messages directly from text rather than building a new sub-message and merging the fully formed message into the existing field.
    • This release addresses a Security Advisory for Java users

    Protocol Buffers v3.16.1

    Java

    • Improve performance characteristics of UnknownFieldSet parsing (#9371)

    Protocol Buffers v3.16.0

    C++

    • Fix compiler warnings issue found in conformance_test_runner #8189 (#8190)
    • Fix MinGW-w64 build issues. (#8286)
    • [Protoc] C++ Resolved an issue where NO_DESTROY and CONSTINIT are in incorrect order (#8296)
    • Fix PROTOBUF_CONSTINIT macro redefinition (#8323)
    • Delete StringPiecePod (#8353)
    • Fix gcc error: comparison of unsigned expression in '>= 0' is always … (#8309)
    • Fix cmake install on iOS (#8301)
    • Create a CMake option to control whether or not RTTI is enabled (#8347)
    • Fix endian.h location on FreeBSD (#8351)
    • Refactor util::Status (#8354)
    • Make util::Status more similar to absl::Status (#8405)
    • Fix -Wsuggest-destructor-override for generated C++ proto classes. (#8408)
    • Refactor StatusOr and StringPiece (#8406)
    • Refactor uint128 (#8416)
    • The ::pb namespace is no longer exposed due to conflicts.
    • Allow MessageDifferencer::TreatAsSet() (and friends) to override previous calls instead of crashing.
    • Reduce the size of generated proto headers for protos with string or bytes fields.
    • Move arena() operation on uncommon path to out-of-line routine
    • For iterator-pair function parameter types, take both iterators by value.
    • Code-space savings and perhaps some modest performance improvements in RepeatedPtrField.
    • Eliminate nullptr check from every tag parse.
    • Remove unused _$name$cached_byte_size fields.
    • Serialize extension ranges together when not broken by a proto field in the middle.
    • Do out-of-line allocation and deallocation of string object in ArenaString.

    ... (truncated)

    Commits
    • b8c2488 Updating version.json and repo version numbers to: 16.3
    • 42e47e5 Refactoring Java parsing (3.16.x) (#10668)
    • 98884a8 Merge pull request #10556 from deannagarcia/3.16.x
    • 450b648 Cherrypick ruby fixes for monterey
    • b17bb39 Merge pull request #10548 from protocolbuffers/3.16.x-202209131829
    • c18f5e7 Updating changelog
    • 6f4e817 Updating version.json and repo version numbers to: 16.2
    • a7d4e94 Merge pull request #10547 from deannagarcia/3.16.x
    • 55815e4 Apply patch
    • 152d7bf Update version.json with "lts": true (#10535)
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Why using /2/users/by/username/:username for getting user tweets

    Why using /2/users/by/username/:username for getting user tweets

    I created a question in the Twitter forum about a discrepancy between the rate limit in the documentation and the rate limit in the API. See here.

    But here, my question is for you.

    My code for getting tweets, the same as in the documentation:

    $twitter_name = "elonmusk";
    $twitter_user = $twitter->user($twitter_name)->tweets($params);
    

    And the called API URL is: GET https://api.twitter.com/2/users/by/username/elonmusk. So why are you using GET /2/users/by/username/:username for getting a user's tweets instead of GET /2/users/:id/tweets? I'm asking because maybe the discrepancy noted above applies only to GET /2/users/by/username/:username and not to GET /2/users/:id/tweets.

    Thanks for your feedback.

    opened by bastienuh 0
  • Bump commons-io from 2.4 to 2.7

    Bump commons-io from 2.4 to 2.7

    Bumps commons-io from 2.4 to 2.7.


    dependencies 
    opened by dependabot[bot] 1
  • Bump libthrift from 0.10.0 to 0.14.0

    Bump libthrift from 0.10.0 to 0.14.0

    Bumps libthrift from 0.10.0 to 0.14.0.

    Release notes

    Sourced from libthrift's releases.

    Version 0.14.0

    For release 0.14.0 head over to the official release download source: http://thrift.apache.org/download

    The assets below are added by Github based on the release tag and they may therefore not match the checksums.

    Version 0.13.0

    For release 0.13.0 head over to the official release download source: http://thrift.apache.org/download

    The assets below are added by Github based on the release tag and they may therefore not match the checksums.

    Version 0.12.0

    Apache Thrift Release 0.12.0

    Changelog

    Sourced from libthrift's changelog.

    0.14.0

    Deprecated Languages

    Removed Languages

    • THRIFT-4980 - Remove deprecated C# and netcore bindings from the code base
    • THRIFT-4981 - Remove deprecated netcore bindings from the code base
    • THRIFT-4982 - Remove deprecated C# bindings from the code base

    Breaking Changes

    • THRIFT-4981 - Remove deprecated netcore bindings from the code base
    • THRIFT-4982 - Remove deprecated csharp bindings from the code base
    • THRIFT-4990 - Upgrade to .NET Core 3.1 (LTS)
    • THRIFT-5006 - Implement DEFAULT_MAX_LENGTH at TFramedTransport
    • THRIFT-5069 - In Go library TDeserializer.Transport is now typed *TMemoryBuffer instead of TTransport
    • THRIFT-5072 - Haskell generator fails to distinguish between multiple enum types with conflicting enum identifiers
    • THRIFT-5116 - Upgrade NodeJS to 10.x
    • THRIFT-5138 - Swift generator does not escape keywords properly
    • THRIFT-5164 - In Go library TProcessor interface now includes ProcessorMap and AddToProcessorMap functions.
    • THRIFT-5186 - cpp: use all getaddrinfo() results when retrying failed bind() in T{Nonblocking,}ServerSocket
    • THRIFT-5233 - go: Now all Read*, Write* and Skip functions in TProtocol accept context arg
    • THRIFT-5152 - go: TSocket and TSSLSocket now have separated connect timeout and socket timeout
    • c++: dropped support for Windows XP
    • THRIFT-5326 - go: TException interface now has a new function: TExceptionType
    • THRIFT-4914 - go: TClient.Call now returns ResponseMeta in addition to error

    Known Open Issues (Blocker or Critical)

    • THRIFT-3877 - C++: library don't work with HTTP (csharp server, cpp client; need cross test enhancement)
    • THRIFT-5098 - Deprecated: "The high level Network interface is no longer supported. Please use Network.Socket." and other Haskell issues
    • THRIFT-5245 - NPE when the value of map's key is null
    • THRIFT-4687 - Add thrift 0.12.0 to pypi and/or enable more maintainers

    Build Process

    • THRIFT-4976 - Docker build: Test failure for StalenessCheckTest on MacOS
    • THRIFT-5087 - test/test.py fails with "AssertionError: Python 3.3 or later is required for proper operation."
    • THRIFT-5097 - Incorrect THRIFT_VERSION in ThriftConfig.cmake
    • THRIFT-5109 - Misc CMake improvements
    • THRIFT-5147 - Add uninstall function
    • THRIFT-5218 - Automated Github release artifacts do not match checksums provided
    • THRIFT-5249 - travis-ci : Failed to run FastbinaryTest.py

    C glib

    ... (truncated)

    Commits
    • 8411e18 Version 0.14.0
    • 0be1b7d Version 0.14.0
    • 705f377 Version 0.14.0
    • ebfa771 THRIFT-5274: Enforce Java 8 compatibility
    • 518163a Update README.md
    • de523c7 Updated CHANGES to reflect Version 0.14.0
    • 7ae1ec3 THRIFT-5297: Improve TThreadPoolServer Handling of Incoming Connections
    • ebc2ab5 THRIFT-5345: Allow the ServerContext to be Unwrapped Programmatically
    • 55016bf THRIFT-5343: TTlsSocketTransport does not resolve IPv4 addresses or validate ...
    • 4aaef75 THRIFT-5337 Go set fields write improvement
    • Additional commits viewable in compare view


    dependencies 
    opened by dependabot[bot] 1
  • Bump junit from 4.8.2 to 4.13.1

    Bump junit from 4.8.2 to 4.13.1

    Bumps junit from 4.8.2 to 4.13.1.

    Release notes

    Sourced from junit's releases.

    JUnit 4.13.1

    Please refer to the release notes for details.

    JUnit 4.13

    Please refer to the release notes for details.

    JUnit 4.13 RC 2

    Please refer to the release notes for details.

    JUnit 4.13 RC 1

    Please refer to the release notes for details.

    JUnit 4.13 Beta 3

    Please refer to the release notes for details.

    JUnit 4.13 Beta 2

    Please refer to the release notes for details.

    JUnit 4.13 Beta 1

    Please refer to the release notes for details.

    JUnit 4.12

    Please refer to the release notes for details.

    JUnit 4.12 Beta 3

    Please refer to the release notes for details.

    JUnit 4.12 Beta 2

    No release notes provided.

    JUnit 4.12 Beta 1

    No release notes provided.

    JUnit 4.11

    No release notes provided.

    Changelog

    Sourced from junit's changelog.

    Summary of changes in version 4.13.1

    Rules

    Security fix: TemporaryFolder now limits access to temporary folders on Java 1.7 or later

    A local information disclosure vulnerability in TemporaryFolder has been fixed. See the published security advisory for details.

    Test Runners

    Pull request #1669 (junit-team/junit#1669): Make FrameworkField constructor public

    Prior to this change, custom runners could make FrameworkMethod instances, but not FrameworkField instances. This small change allows for both now, because FrameworkField's constructor has been promoted from package-private to public.

    Commits
    • 1b683f4 [maven-release-plugin] prepare release r4.13.1
    • ce6ce3a Draft 4.13.1 release notes
    • c29dd82 Change version to 4.13.1-SNAPSHOT
    • 1d17486 Add a link to assertThrows in exception testing
    • 543905d Use separate line for annotation in Javadoc
    • 510e906 Add sub headlines to class Javadoc
    • 610155b Merge pull request from GHSA-269g-pwp5-87pp
    • b6cfd1e Explicitly wrap float parameter for consistency (#1671)
    • a5d205c Fix GitHub link in FAQ (#1672)
    • 3a5c6b4 Deprecated since jdk9 replacing constructor instance of Double and Float (#1660)
    • Additional commits viewable in compare view


    dependencies 
    opened by dependabot[bot] 1
  • Bump apache.lucene.version from 4.0.0 to 8.6.0

    Bump apache.lucene.version from 4.0.0 to 8.6.0

    Bumps apache.lucene.version from 4.0.0 to 8.6.0. Updates lucene-core from 4.0.0 to 8.6.0

    Updates lucene-analyzers-common from 4.0.0 to 8.6.0

    Updates lucene-queryparser from 4.0.0 to 8.6.0


    dependencies 
    opened by dependabot[bot] 1
Releases (elephant-bird-4.15)
  • elephant-bird-4.15(Mar 9, 2017)

  • elephant-bird-4.14(Jun 21, 2016)

  • elephant-bird-4.13(Feb 8, 2016)

    • Adds Cascading 3 support #463
    • Restores API compatibility for ThriftBinaryProtocol/ThriftBinaryDeserializer #461

    Upgrade notes:

    1. The following have been moved from the elephant-bird-cascading2 module to the elephant-bird-cascading-protobuf module: ProtobufComparator, ProtobufDeserializer, ProtobufReflectionUtil, ProtobufSerialization, ProtobufSerializer. Namespace change: com.twitter.elephantbird.cascading2.io.protobuf => com.twitter.elephantbird.cascading.protobuf
    2. cascading-hadoop is now marked as provided, so if you depend on elephant-bird-cascading2, you should explicitly add it to your build deps.
  • elephant-bird-4.12(Jan 6, 2016)

  • elephant-bird-4.10(Sep 2, 2015)

    This release contains a single bugfix:

    • Add container size check in ThriftBinaryProtocol https://github.com/twitter/elephant-bird/pull/448
  • elephant-bird-4.9(May 27, 2015)

    Change log:

    Issue 444. Throw DecodeException in Base64Codec (Ruban Monu)
    Issue 442. Add some options so we can influence the compression options of intermediate data written by cascading (Ian O'Connell)

  • elephant-bird-4.8(May 13, 2015)

    This release contains an important bug fix in the SerializedBlock and BinaryRecordReader.

    Change log:
    Issue 441. Use CodedInputStream in SerializedBlock to fix the max size limit check (Ruban Monu)

  • elephant-bird-4.7(May 11, 2015)

    This release includes new Generic block record readers for Lzo compressed protobuf data. It also contains a change to make minimum indexable file size configurable for Lzo output and performance improvements for reading Lzo indexes and splits.

    Note: BinaryConverter now throws DecodeException if deserializing a record fails, instead of returning null.

    Change log:
    Issue 440. LzoGenericBlockOutputFormat (Ruban Monu)
    Issue 439. Adds generic block record readers (Ruban Monu)
    Issue 435. Faster working with LzoBinary data (Ian O'Connell)
    Issue 434. Speed up getSplits by reusing FileStatus'es from the very first listStatus (Gera Shegalov)
    Issue 430. Configurable minimum indexable file size (Gera Shegalov)

  • elephant-bird-4.6(Mar 5, 2015)

    This release includes a critical fix to avoid double-reading the first block of a split in the block format.

    It also uses dynamic protobuf instead of code-generating a protobuf for the block-format, has some performance improvements in base64 codepaths, and more!

    Here's the full change log:
    Issue 429. Avoid double reading first block in split in block-format (Raghu Angadi)
    Issue 421. Expose FileDescriptor and FieldDescriptor (Brian Ramos)
    Issue 423. Don't copy the array before handing it for base64 decode (Ian O'Connell)
    Issue 422. Pulls in the source of a BSD licenced base64 implementation that is 5x faster than the Apache one for our usage (Ian O'Connell)
    Issue 418. Use dynamic protobufs (Remove protobufs) (Raghu Angadi)
    Issue 417. A Cascading scheme for combining intermediate sequence files (Akihiro Matsukawa)
    Issue 414. Fix typo in docs (thrift, not thrist) (gstaubli)
    Issue 413. Trivial Javadocs for LuceneIndexInputFormat (Lewis John McGibbney)
    Issue 412. Adding support for Map, Sets and Lists to ThriftToDynamicProto (Brian Ramos)
    Issue 411. Fix NPE in CompositeRecordReader due to improper delegate initialization (Jonathan Coveney)
    Issue 409. Gzip objects before storing them in the job conf (Alex Levenson)
    Issue 407. Make dependencies explicit in Readme quickstart (fixes #406) (Lewis John McGibbney)
    Issue 405. Fix bug in CompositeRecordReader (Jonathan Coveney)
    Issue 403. Refactor CompositeRecordReader to only make a record reader when necessary (Jonathan Coveney)
    Issue 398. Add CombineFileInputFormat support (esp. for lzo) (Jonathan Coveney)
