GitHub Search Engine: Web Application used to retrieve, store and present projects from GitHub, as well as any statistics related to them.

Overview

GHSearch Platform

This project is made of two subprojects:

  1. application: The main application has two main responsibilities:
    1. Crawling GitHub and retrieving repository information. This can be disabled with the app.crawl.enabled argument.
    2. Serving as the backend server for the website/frontend
  2. front-end: A frontend for searching the database, which is available at http://seart-ghs.si.usi.ch
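
As a rough illustration of how a switch like app.crawl.enabled might be read, here is a minimal sketch in plain Java. It assumes the argument is passed in `--app.crawl.enabled=<true|false>` form and that crawling is on when the argument is absent; both are assumptions, since the README only names the argument. The class and method names are hypothetical and are not taken from the project.

```java
public class CrawlFlagSketch {
    // Assumed argument form: --app.crawl.enabled=<true|false> (not confirmed by the README).
    private static final String PREFIX = "--app.crawl.enabled=";

    // Returns false only when the argument is present and is not "true".
    // Crawling is assumed to be enabled by default.
    static boolean crawlEnabled(String[] args) {
        for (String arg : args) {
            if (arg.startsWith(PREFIX)) {
                return Boolean.parseBoolean(arg.substring(PREFIX.length()));
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(crawlEnabled(args) ? "crawler on" : "crawler off");
    }
}
```

With this sketch, running the program with `--app.crawl.enabled=false` would leave only the backend-server responsibility active.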

Setup & Run Project Locally (for development)

The detailed instructions can be found here.

Dockerisation 🐳

The instructions for deploying the project via Docker are available here.

More Info on Flyway and Database Migration

To learn more about Flyway, you can read on here.


FAQ

How can I report a bug, request a feature, or ask a question?

Please add a new issue and we will get back to you very soon.

How do I add a new programming language to the platform?

  1. See the "Adding C#" commit on December 17th 2020.
  2. Create a new Flyway migration file to insert a new language row into the supported_languages table.
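
The second step can be sketched as follows. This is only an illustration of what such a migration might contain: the file name follows Flyway's default pattern for versioned SQL migrations (`V<version>__<description>.sql`), while the version number 12, the description `add_csharp`, and the column name `name` are hypothetical and should be checked against the existing migrations and the actual supported_languages schema.

```java
public class LanguageMigrationSketch {
    // Builds a file name in Flyway's default versioned-migration pattern:
    // prefix "V", version, separator "__", description, suffix ".sql".
    static String migrationFileName(int version, String description) {
        return "V" + version + "__" + description + ".sql";
    }

    // Builds the INSERT statement the migration file would contain.
    // The column name "name" is an assumption about the table layout.
    static String insertStatement(String language) {
        String escaped = language.replace("'", "''"); // escape single quotes for SQL
        return "INSERT INTO supported_languages (name) VALUES ('" + escaped + "');";
    }

    public static void main(String[] args) {
        System.out.println(migrationFileName(12, "add_csharp"));
        System.out.println(insertStatement("C#"));
    }
}
```

The printed statement would be saved under the printed file name in the directory where the project keeps its other migrations.
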
Comments
  • Mined repositories languages


    Is there a way to see if some important languages are excluded from the mining?

    I have seen the language stats report in the 'Mined Projects' link. Are these the 13 most widespread languages, with everything else 'below Kotlin', or are there gaps where widespread languages are missing in between?

    opened by wolfenmark 7
  • Gateway Time-Out


    What's happening? Hi, when I use the web interface without any filter to save the result, it gives me a "504 Gateway Time-Out" error.

    How to see the problem again? Open the web interface and, without choosing anything, just click the Search button and then click Download CSV.

    Output result: generated URL:

    https://seart-ghs.si.usi.ch/api/r/download/csv?hasWiki=false&onlyForks=false&hasLicense=false&nameEquals=false&hasPulls=false&excludeForks=false&hasIssues=false
    

    By the way, can you share the whole up-to-date dataset? I found a dataset on Zenodo, but it is not up to date.

    bug 
    opened by kargaranamir 3
  • Total Issues and Total Pull Reqs null instead of 0


    Values for 'Total Issues' and 'Total Pull Reqs' are sometimes 'null' instead of 0.

    For example: [image not preserved]

    Link to GitHub repo: https://github.com/0xc0000054/pdn-gmic

    Here are some correct examples: [three images not preserved]

    bug 
    opened by marodev 3
  • Number of contributors overreported


    I recently pulled some data and noticed that there were a few projects that had a reported number of contributors that is quite odd.

    I noticed this for three projects in the data set I am using. The projects in question are: radeonopencompute/rock-kernel-driver, youling257/android-mainline, and xanmod/linux.

    In the generated CSV output, the number of contributors for these projects is 9.22337203685477E+018, or at least it is after I import it into a spreadsheet.

    On the website search results the number of contributors for these projects is 9223372036854776000.

    Based on the number it looks like this is a JavaScript error causing the maximum integer value to be displayed. Obviously the number of contributors to the projects in question cannot be close to these reported values.

    bug 
    opened by jhs507 2
  • Remove id from fields if it's an internal id


    Hi, I found the 'id' field misleading, interpreting it as an id to retrieve the project from the GH API. Can it be removed if it's only a GHSearch internal identifier?

    opened by wolfenmark 2
  • Export 'forked from' and 'fork source' fields


    Hi, I was wondering if it could be useful to have forked parameters like 'forked from' and 'fork source' (specifically for repos that are forks).

    I know that every possible detail might be useful in one or more cases. Just wondering if these two fields are worth having in the output.

    Example: if you need to analyze forks and fork chains starting from a subset of projects, GHS could be used directly for that kind of research, provided fork chains could be reconstructed without scraping GH.

    opened by wolfenmark 2
  • Include search parameters in the exported JSON


    Hi, I would find it very useful to have the search parameters stored in the exported JSON. I am not sure about other export formats, but I suppose a similar approach would still apply.

    At the top level of the JSON there is 'items'. Adding something like 'search parameters' at the same level could help keep the specifics close to the data (instead of relying on file renaming or manual annotation). It would help in keeping track of the search criteria months later, when a dataset turns up in a folder and I can't remember exactly how it was produced.

    feature 
    opened by wolfenmark 2
  • Statistics breakdown about current search


    Hi, maybe it would be useful to have a statistics breakdown after returning a successful search (i.e., how many different languages, average contributors, min-max-average commits just to name a few examples).

    An overview of the selected bunch of repositories could help in refining search criteria or have a first glimpse of what's inside a potential dataset even before exporting the output.

    In general min, max, average, median, unique values (and/or other usual descriptive statistics) for each 'applicable' output field could be useful for recap/early analysis.

    feature 
    opened by wolfenmark 2
  • Back button keeps previous search parameters


    Hi, I was wondering if it's possible to keep the search parameters after returning from a search ('Back' button).

    Example use case:

    • Set some filtering criteria
    • Search
    • Get results
    • Want to refine/fine-tune/adjust search criteria
    • Hit the back button
    • Modify the criteria I want to adjust (the others are "saved/restored" automatically)

    Maybe this can also be done by supporting the browser's back button (actually ignoring 'in site' navigation) and removing the custom one.

    feature 
    opened by wolfenmark 2
  • Super slow performance (even when searching for one repository)


    Consider searching for apache/commons-math (in exact match mode): it takes the web app up to 1 minute (!!) to perform the search. The same goes for downloading the result (i.e., a CSV with one row).

    enhancement 
    opened by emadpres 2
  • Mismatched results for C++ project


    Hi, when I search for projects in C++, the result on the webpage is correct but the result in the downloaded file is not. The file seems to show only projects written in C.

    opened by hsrain3 1
  • Unique resilient identifier


    Hi, I had problems in finding a unique identifier resilient to repo name changes, ownership changes, etc.

    Can you think of any such identifier that can be exposed in the export to have a more reliable (time-invariant) way of retrieving a repository?

    I see from the GH REST API that there is indeed an id returned, but I'm not sure if this is usable and/or if this is already what you export as id (possibly related to #16).

    enhancement 
    opened by wolfenmark 0
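
The figures quoted in the "Number of contributors overreported" issue above are consistent with the largest signed 64-bit integer (9223372036854775807) being passed through a double-precision number, which is how JavaScript represents numbers. The sketch below only demonstrates that arithmetic; whether the platform actually stores that value for these projects is an assumption the issue does not confirm, and the class and method names are hypothetical.

```java
import java.math.BigDecimal;

public class ContributorOverflowSketch {
    // Exact decimal digits of the double nearest to the given long.
    static String plainDigits(long value) {
        return new BigDecimal((double) value).toPlainString();
    }

    // True when the decimal text parses to the same double as the given long.
    static boolean sameDouble(String decimalText, long value) {
        return Double.parseDouble(decimalText) == (double) value;
    }

    public static void main(String[] args) {
        long max = Long.MAX_VALUE;
        System.out.println(max);                 // 9223372036854775807
        System.out.println((double) max);        // 9.223372036854776E18
        System.out.println(plainDigits(max));    // 9223372036854775808
        // The value shown on the website maps to the very same double.
        System.out.println(sameDouble("9223372036854776000", max)); // true
    }
}
```

A double cannot hold 9223372036854775807 exactly, so it rounds to 2^63, whose shortest decimal rendering is the 9223372036854776000 shown in the search results.
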
Owner
SEART - SoftwarE Analytics Research Team
The SEART group is part of the Software Institute at the Università della Svizzera italiana, located in Lugano, Switzerland.
OpenSearch is an open source distributed and RESTful search engine.

OpenSearch is an open source search and analytics engine derived from Elasticsearch

6.2k Jan 1, 2023
A simple fast search engine written in java with the help of the Collection API which takes in multiple queries and outputs results accordingly.


Adnan Hossain 6 Oct 24, 2022
Apache Lucene is a high-performance, full featured text search engine library written in Java.


The Apache Software Foundation 1.4k Jan 5, 2023
filehunter - Simple, fast, open source file search engine

Simple, fast, open source file search engine. Designed to be local file search engine for places where multiple documents are stored on multiple hosts with multiple directories.

32 Sep 14, 2022
Apache Lucene and Solr open-source search software

Apache Lucene and Solr have separate repositories now! Solr has become a top-level Apache project and main line development for Lucene and Solr is hap

The Apache Software Foundation 4.3k Jan 7, 2023
Apache Solr is an enterprise search platform written in Java and using Apache Lucene.

Apache Solr is an enterprise search platform written in Java and using Apache Lucene. Major features include full-text search, index replication and sharding, and result faceting and highlighting.

The Apache Software Foundation 630 Dec 28, 2022
A proof-of-concept serverless full-text search solution built with Apache Lucene and Quarkus framework.

Lucene Serverless This project demonstrates a proof-of-concept serverless full-text search solution built with Apache Lucene and Quarkus framework. ✔️

Arseny Yankovsky 38 Oct 29, 2022
🔍 An open source GitLab/Gitee/Gitea code search tool. Kooder is an open source code search tool developed for Gitee/GitLab; this is a mirror repository, and the main repository is on Gitee.

Kooder is an open source code search project, offering code, repository, and issue search services for code hosting platforms including Gitee, GitLab and Gitea.

开源中国 350 Dec 30, 2022
Simple full text indexing and searching library for Java

indexer4j Simple full text indexing and searching library for Java Install Gradle repositories { jcenter() } dependencies { compile 'com.haeun

Haeun Kim 47 May 18, 2022
Path Finding Visualizer for Breadth first search, Depth first search, Best first search and A* search made with java swing

Path-Finding-Visualizer Purpose This is a tool to visualize search algorithms Algorithms featured Breadth First Search Depth First Search Greedy Best

Leonard 11 Oct 20, 2022
Java-related projects, including beginner-level projects


Akshit Sijwali 3 Dec 15, 2022
This repository contains Java programs to go from zero to hero in Java. Programs related to each and every concept are present, from easy to intermediate level.

Learn Java Programming In this repository you will find topic wise programs of java from basics to intermediate. This follows topic wise approach that

Sahil Batra 15 Oct 9, 2022
Facsimile - Copy Your Most Used Text to Clipboard Easily with Facsimile! It Helps You to Store Your Most Used Text as a Key, Value Pair and Copy it to Clipboard with a Shortcut.

Facsimile An exact copy of Your Information ! Report Bug · Request Feature Table of Contents About The Project Built With Getting Started Installation

Sri lakshmi kanthan P 1 Sep 12, 2022
A fairly simple game made in Java. You can adopt pets, name them, and take care of them for XpPoints and level up!

Introducing PetGame! A simple console based game made by @denzven in Java ☕ About the Game PetGame is my first big project in Java, the rules are simp

Denzven 11 Jun 7, 2022
Android Auto Apps Downloader (AAAD) is an app for Android Phones that downloads popular Android Auto 3rd party apps and installs them in the correct way to have them in Android Auto.


Gabriele Rizzo 865 Jan 2, 2023
This repository is related to the Java Web Developer (ND035), Course - Web Services and APIs

About this Repository This repository is related to the Java Web Developer (ND035), Course - Web Services and APIs It contains the following folders:

Rasha Omran 1 Jan 28, 2022