Running compute-intense parts of BigStitcher distributed



For now we support fusion with affine transformation models (including translations, of course). It should scale very well to large datasets because, for each block that is written, it tests which images overlap that block. You simply need to specify the XML of a BigStitcher project and decide which channels, timepoints, etc. to fuse. Warning: not tested on 2D yet.
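To make the per-block overlap test concrete, here is a minimal Python sketch. It is not the BigStitcher-Spark code: the function names are hypothetical, and it assumes each image has already been reduced to an axis-aligned bounding box (inclusive min/max corners) in output space.

```python
from itertools import product


def block_grid(dims, block=(128, 128, 128)):
    """Yield (min_corner, max_corner) for every block tiling a volume.

    Corners are inclusive; blocks at the far edge are clipped to `dims`.
    """
    ranges = [range(0, d, b) for d, b in zip(dims, block)]
    for origin in product(*ranges):
        upper = tuple(min(o + b, d) - 1 for o, b, d in zip(origin, block, dims))
        yield origin, upper


def overlaps(a_min, a_max, b_min, b_max):
    """True if two axis-aligned boxes (inclusive corners) intersect."""
    return all(
        lo_a <= hi_b and lo_b <= hi_a
        for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max)
    )


def views_for_block(block_min, block_max, view_boxes):
    """Return the names of all views whose bounding box touches the block.

    `view_boxes` maps a (hypothetical) view name to its (min, max) corners.
    """
    return [
        name
        for name, (v_min, v_max) in view_boxes.items()
        if overlaps(block_min, block_max, v_min, v_max)
    ]
```

With this kind of filter, a block only has to load the images that can contribute to it, so the work per block depends on the local overlap rather than on the total number of images in the project.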

Sharing this early as it might be useful ...

Here is my example config for the example dataset, using the main class net.preibisch.bigstitcher.spark.AffineFusion:

-x '~/test/dataset.xml'
-o '~/test/test-spark.n5'
-d '/ch488/s0'
--minIntensity 1
--maxIntensity 254
--channelId 0

Note: here I save it as UINT8 [0..255] and scale all intensities between 1 and 254 to that range (so it is more obvious what happens). If you omit UINT8, it'll save as FLOAT32 and no minIntensity and maxIntensity are required. UINT16 [0..65535] is also supported.
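As a rough illustration of the intensity mapping described above, here is a small Python sketch of a linear rescale from [minIntensity, maxIntensity] to [0..255]. The function name is hypothetical, and the rounding and clamping shown here are assumptions; the actual conversion in BigStitcher-Spark may differ in those details.

```python
def scale_to_uint8(value, min_intensity, max_intensity):
    """Linearly map [min_intensity, max_intensity] onto [0, 255].

    Values outside the input range are clamped (assumed behaviour).
    """
    span = max_intensity - min_intensity
    if span <= 0:
        raise ValueError("max_intensity must be larger than min_intensity")
    scaled = (value - min_intensity) / span * 255.0
    return int(min(255, max(0, round(scaled))))


# With the example config (--minIntensity 1 --maxIntensity 254):
darkest = scale_to_uint8(1, 1, 254)      # lower bound maps to 0
brightest = scale_to_uint8(254, 1, 254)  # upper bound maps to 255
```

The same idea applies to UINT16 with an output range of [0..65535].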

Importantly: since we have more than one channel, I specified to use channel 0, otherwise the channels are fused together, which is most likely not desired. Same applies if multiple timepoints are present.

The blocksize is currently hardcoded to 128x128x128, but can easily be added as another parameter (pull requests welcome :).

And for local Spark you need JVM parameters (8 cores, 50GB RAM):

-Dspark.master=local[8] -Xmx50G

Ask your sysadmin for help with running it on your cluster. mvn clean package builds target/BigStitcher-Spark-jar-with-dependencies.jar for distribution.

You can open the N5 in Fiji (File > Import > N5) or by using n5-view from the n5-utils package.

You can create a multiresolution pyramid of this data using

Update: now there is support for non-rigid distributed fusion using net.preibisch.bigstitcher.spark.NonRigidFusionSpark. To run it, you additionally need to define the corresponding interest points, e.g. -ip beads, that will be used to compute the non-rigid transformation.

  • could not find XmlIoBasicImgLoader implementation for format bdv.n5

    Hi @StephanPreibisch,

    I tried running this on the Janelia cluster and got this error: could not find XmlIoBasicImgLoader implementation for format bdv.n5

    I am using the latest spark-janelia from here

  • Fusion fails on local Spark instance

    Hi @trautmane,

    As requested, here are the details on what we are running into trying to fuse a BDV file using BigStitcher-Spark. The plugin was built with the code changes on main, but not fix_bdv_n5 as I got a conflict when I tried to merge the two branches. I've attached the XML as well.

    It wasn't totally clear whether the extraJavaOptions should be passed to the driver or the executors in local mode, so we tried both. The same error message as pasted here pops up either way. We also tried allocating more RAM to the executors, and the same error message appears.

    The error usually occurs once >6,000 files within the N5 have been written. In this particular case, ~7,200 files were written.

    Please let me know what other information I can provide.

    Thanks! Doug

    XML file:

    opened by dpshepherd 18
  • Error Could not initialize class ch.systemsx.cisd.hdf5.CharacterEncoding on AffineExport on h5 file

    Hi @StephanPreibisch,

    I am trying to do an AffineExport with spark:

    ~/spark-janelia/ 4 \
    /groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark-0.0.2-SNAPSHOT.jar \
    net.preibisch.bigstitcher.spark.AffineFusion \
    -x '/groups/mousebrainmicro/mousebrainmicro/data/Lightsheet/20210812_AG/ML_Rendering-test/aligned_data.xml' \
    -o '/nrs/svoboda/moharb/test_ML.n5' -d '/s0'

    And get this error:

    2022-04-21 15:45:37,731 [task-result-getter-0] ERROR [TaskSetManager]: Task 1 in stage 0.0 failed 4 times; aborting job
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 78,, executor 0): java.lang.NoClassDefFoundError: Could not initialize class ch.systemsx.cisd.hdf5.CharacterEncoding
    	at ch.systemsx.cisd.hdf5.HDF5BaseReader.<init>(
    	at ch.systemsx.cisd.hdf5.HDF5BaseReader.<init>(
    	at ch.systemsx.cisd.hdf5.HDF5ReaderConfigurator.reader(
    	at ch.systemsx.cisd.hdf5.HDF5FactoryProvider$HDF5Factory.openForReading(
    	at ch.systemsx.cisd.hdf5.HDF5Factory.openForReading(
    	at bdv.img.hdf5.Hdf5ImageLoader.getSetupImgLoader(
    	at bdv.img.hdf5.Hdf5ImageLoader.getSetupImgLoader(
    	at net.preibisch.bigstitcher.spark.util.ViewUtil.getTransformedBoundingBox(
    	at net.preibisch.bigstitcher.spark.AffineFusion.lambda$call$7b7a6284$1(
    	at scala.collection.Iterator.foreach(Iterator.scala:941)
    	at scala.collection.Iterator.foreach$(Iterator.scala:941)
    	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    	at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:986)
    	at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:986)
    	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2139)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    	at org.apache.spark.executor.Executor$
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(
    	at java.util.concurrent.ThreadPoolExecutor$

    I can open it in Fiji and look at the data with BigStitcher without an issue. The XML is at /groups/mousebrainmicro/mousebrainmicro/data/Lightsheet/20210812_AG/ML_Rendering-test/aligned_data.xml. Any idea what to do? I found this, which might be related.

    Thanks, Boaz

    opened by boazmohar 14
  • OutOfMemoryError caused by creation of too many N5ImageLoader fetcher threads

    While working through issue 2 with a larger data set, @boazmohar discovered many OutOfMemoryError: unable to create new native thread exceptions in the worker logs. These exceptions are raised because parallelized RDDs create many N5ImageLoader instances like this one and each N5ImageLoader instance in turn creates Runtime.getRuntime().availableProcessors() fetcher threads.
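As a rough illustration of the multiplication described above, the Python sketch below estimates how many fetcher threads a single worker JVM would try to create. The function name and the example numbers are hypothetical and are not measurements from this issue; the only fact taken from the text is that each loader instance creates one fetcher thread per available processor.

```python
def estimated_fetcher_threads(loader_instances, available_processors):
    """Threads created if every loader spawns one fetcher per processor."""
    if loader_instances < 0 or available_processors < 1:
        raise ValueError("need loader_instances >= 0 and available_processors >= 1")
    return loader_instances * available_processors


# Hypothetical worker: 500 loader instances on a JVM that reports 48 processors.
many = estimated_fetcher_threads(500, 48)
# Same number of loaders on a JVM that reports a single processor.
few = estimated_fetcher_threads(500, 1)
```

The thread count grows with the product of the two factors, so lowering either one (fewer loader instances, or fewer processors reported to the JVM) reduces it proportionally.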

    I reduced some of the fetcher thread creation by reusing loaders in this commit. However, reusing loaders did not completely solve the problem.

    I think the best solution is to parameterize the number of fetcher threads in the N5ImageLoader and then explicitly set fetcher thread counts in spark clients. This issue can remain open until that happens or until another long term solution is developed.

    In the meantime, as a workaround, overriding the default availableProcessors value with a -XX:ActiveProcessorCount=1 JVM directive seems to fix the problem.

    More specifically, here are the spark-janelia environment parameters I used to successfully process @boazmohar's larger data set:

    # --------------------------------------------------------------------
    # Default Spark Setup (11 cores per worker)
    # --------------------------------------------------------------------
    # To distribute work evenly, recommended number of tasks/partitions is 3 times the number of cores.
    export N_CORES_DRIVER=1
    # setting ActiveProcessorCount to 1 ensures Runtime.availableProcessors() returns 1
    export SUBMIT_ARGS="--conf spark.executor.extraJavaOptions=-XX:ActiveProcessorCount=1"

    With the limited active processor count and reusing loaders, no OutOfMemory exceptions occur and processing completes much faster. @boazmohar noted that with his original setup, it took 3.5 hours using a Spark cluster with 2011 cores. My run with the parameters above took 7 minutes using 2200 cores (on 200 11-core worker nodes). Boaz's original run might have had other configuration issues, so this isn't necessarily apples-to-apples. Nevertheless, my guess is that his performance was adversely affected by the fetcher thread problem.

    Finally, @StephanPreibisch may want to revisit the getTransformedBoundingBox code and any other loading/reading to see if there are other options for reducing/reusing loaded data within the parallelized RDD loops. Broadcast variables might be suitable/helpful for this use case - but I'm not sure.

    opened by trautmane 10
  • Build failure issue on local server

    Hi all,

    I just pulled the most recent release (about 10 minutes ago) and tried to build this project on our Linux Mint 19 server.

    Running (added flags to get debugging): mvn -e -X clean package -P fatjar

    I get a build error. I have attached the build log at the end of this message.

    It looks like it might be a Java version mismatch? Any suggestions on how to correctly build the project? I can change the Java JDK if I know which one to install.

    Relevant versions:

    mvn -version
    Apache Maven 3.6.0
    Maven home: /usr/share/maven
    Java version: 11.0.13, vendor: Ubuntu, runtime: /usr/lib/jvm/java-11-openjdk-amd64
    Default locale: en_US, platform encoding: UTF-8
    OS name: "linux", version: "5.4.0-74-generic", arch: "amd64", family: "unix"



    opened by dpshepherd 3
  • NoSuchMethodError for Gson library

    Hi @kgabor,

    You mentioned that after adding zarr export support to AffineFusion, you ran into the following error when running a distributed spark instance:

    Exception in thread "main" java.lang.NoSuchMethodError:;[Ljava/lang/reflect/Type;)Lcom/google/gson/reflect/TypeToken;
            at org.janelia.saalfeldlab.n5.zarr.N5ZarrReader.getZArraryAttributes(

    I think this problem occurs because the Hadoop libraries used by Spark pull in an ancient version of Gson (likely 2.2.4 or similar) and the n5 zarr library (currently) relies upon Gson 2.8.6. Even though Gson 2.8.6 is bundled in the big-stitcher fat jar, the Hadoop stuff is higher in the classpath when running a Spark cluster - so you end up running with ancient Gson. This post describes the issue very nicely.

    As the post mentions, the best way to fix this issue is to force Spark to use a newer Gson library by specifying additional --conf arguments when launching spark-submit like this:

      --deploy-mode client 
      --master spark://... 
      --conf spark.driver.extraClassPath=/groups/scicompsoft/home/trautmane/bigstitcher/gabor/gson-2.8.6.jar
      --conf spark.executor.extraClassPath=/groups/scicompsoft/home/trautmane/bigstitcher/gabor/gson-2.8.6.jar

    You'll need to:

    • find a gson-2.8.6.jar file - I pulled it from my local maven repo: ${HOME}/.m2/repository/com/google/code/gson/gson/2.8.6/gson-2.8.6.jar,
    • copy it to a network filesystem location that your spark driver and workers can access, and
    • then add the path to the spark.driver.extraClassPath and spark.executor.extraClassPath configuration as I did above.

    Give this a shot and let me know if it solves the errors you were getting. I ran a small test case at Janelia and was able to successfully produce a zarr result - so, I'm hopeful this will work for you.

    Finally, while debugging this problem, I made a few minor tweaks to your commits here. As a result, you will need to specify -s ZARR (capitalized) instead of your original lower-case version if you pull and run with the latest code.

    Let me know how it goes, Eric

    opened by trautmane 2
  • Add section about building the executable to README

    This adds a short section pointing out the install script and its options.

    Background: When trying to install this on my local machine, I had various issues trying to build/install this as a non-Java-person. I managed to build with maven eventually but had problems with dependencies. Only then did I notice the install script that handled all of this gracefully. I think it deserves a prominent placement in the README :)

    opened by VolkerH 0
  • Output in a way that BigStitcher can open the fused data

    This should support downsampling (@trautmane) and an XML and maybe integration of several channels in one XML (@boazmohar).

    Maybe we should adjust how we load N5's in BDV @tpietzsch?

    opened by StephanPreibisch 2
  • Preserve original data anisotropy?

    Hi all,

    We've got this working fairly well locally. Still struggling with our school Hadoop cluster, which I think is a config issue.

    A lot of our data has anisotropic xy pixel size vs z steps. What is the best way to get the BigStitcher-Spark affine fusion to act the same way as the "preserve original data anisotropy" setting in BigStitcher?

    One thought I had was to edit the XML to change the calibration before fusion.


    opened by dpshepherd 28
  • computational complexity of BigStitcher fusion shows undesirable scaling behaviour with number of tiles

    Hi @StephanPreibisch ,

    thanks for this new project. Need to set up Spark first, but keen to give this a try. Saw the announcement on twitter but as I don't have a twitter account I'll ask a related question here:

    We ran into issues running fusion with a large number of 2D tiles (not using the Spark version). The fusion step would take many hours when fusing around 700 individual 2D tiles (mosaic scan of a whole slide). We observed that the runtime scaled very unfavourably (polynomially) with the number of tiles, whereas I would expect it to grow only approximately linearly with the number of output pixels.

    As I had the impression that BigStitcher was primarily developed for light-sheet data (fewer but much larger volume tiles, not many 2D tiles), this scaling behaviour with the number of tiles might have gone unnoticed?

    EDIT to add:

    The above behaviour was noticed on the non-Spark version of affine fusion, any chance this has already been fixed with this code?

    opened by VolkerH 2
