Jython Processor

Sparkflows has a Jython Processor.

The Jython Processor lets you write Jython code that processes the incoming DataFrame and produces a resulting DataFrame. In the examples below, the result is assigned to outDF.

In the Jython node, the following variables are available:

  • inDF : Incoming Spark DataFrame
  • spark : The Spark Session object
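As a rough sketch of how these variables fit together, a script can be thought of as a function of inDF and spark that yields outDF. The wrapper function process below is hypothetical and exists only to make the sketch self-contained; in the node itself, inDF and spark are supplied by Sparkflows and only the body would be written. The column name c2 is taken from the examples in this document.

```python
# Hypothetical wrapper (not part of Sparkflows): in the Jython node,
# inDF and spark are already defined and only the body is written.
def process(inDF, spark):
    # Keep a single column from the incoming DataFrame.
    selected = inDF.select("c2")
    # Count the rows for each distinct value of that column.
    outDF = selected.groupBy("c2").count()
    return outDF
```

The spark argument is unused here; it is kept in the signature only to mirror the two variables listed above.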

Example Jython Code

Below are some examples of Jython code that can be used in the processor.

Select a specific column from the DataFrame

  • outDF = inDF.select("c2")

Count the number of records after grouping them

  • outDF = inDF.groupBy("c2").count()

Run a SQL query on the input DataFrame

The Jython Processor registers the incoming DataFrame as a temporary table with a configurable name.

The SQL statement in the Jython script below performs a SELECT on the registered temporary table, here named fire_temp_table.

  • outDF = spark.sql("SELECT c1, c2 FROM fire_temp_table")
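Conceptually, the registration step resembles calling createOrReplaceTempView, a standard Spark DataFrame method, before running the query. Whether Sparkflows uses this exact call internally is an assumption, and the helper name register_and_query is hypothetical. A minimal sketch:

```python
# Hypothetical helper (not part of Sparkflows). It imitates what the
# processor is assumed to do before the script runs: expose the incoming
# DataFrame to SQL under a table name, then query it.
def register_and_query(inDF, spark, table_name="fire_temp_table"):
    # Standard Spark DataFrame method that registers a temporary view.
    inDF.createOrReplaceTempView(table_name)
    # Query the view by the same name that was just registered.
    return spark.sql("SELECT c1, c2 FROM " + table_name)
```

In the node itself this step is not needed, because the processor performs the registration; the script only has to use the configured table name in its SQL.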

Run a SQL query followed by further grouping and a count

  • outDF = spark.sql("SELECT c1, c2 FROM fire_temp_table")
  • outDF = outDF.groupBy("c2").count()

Read from HDFS and create a new DataFrame

The Jython script below reads a JSON file from HDFS.

  • outDF = spark.read().json("data/people.json")
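The read can also be wrapped so that the path becomes a parameter. The helper read_json below and its default path are illustrative only and not part of Sparkflows. It calls spark.read() as a method, following the example above; in PySpark, by contrast, read is an attribute (spark.read.json(...)), so the form to use depends on which API the spark object exposes.

```python
# Hypothetical helper (not part of Sparkflows): read a JSON file into a
# new DataFrame using the Spark Session available in the node.
def read_json(spark, path="data/people.json"):
    # read() is called as a method, as in the example above.
    reader = spark.read()
    # json(path) loads the file at the given path as a DataFrame.
    return reader.json(path)
```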