Creating a Dataset for CSV Files

When working with data in Fire Insights, the first step is to create a dataset for the data you plan to process. A dataset is a wrapper around your data that makes it easy to handle in the Sparkflows workbench.

When a dataset is created, Fire Insights automatically infers its schema using the Spark-CSV library from Databricks.
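To make the idea of schema inference concrete, here is a minimal sketch in plain Python. It is not how Fire Insights or Spark-CSV implements inference. It only illustrates the general approach of sampling values and picking the narrowest type that fits each column. The sample data, column names, type names, and the `infer_type` and `infer_schema` helpers are all hypothetical.

```python
import csv
import io

# Hypothetical sample data; the column names are illustrative only.
SAMPLE = """id,price,city
1,250000.5,Austin
2,310000,Boston
3,,Chicago
"""

def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    # Try the narrowest type first, then widen.
    for name, cast in (("integer", int), ("double", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"

def infer_schema(text, delimiter=","):
    """Infer a {column: type} mapping from CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    # Transpose rows into columns so each column is inspected as a whole.
    columns = zip(*data) if data else [[] for _ in header]
    return dict(zip(header, (infer_type(col) for col in columns)))

schema = infer_schema(SAMPLE)
print(schema)
```

In this sketch, empty values are ignored during inference, so a column with missing entries can still be typed as numeric.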

Datasets List

When you open any application, all existing Datasets specific to the application are displayed in the Datasets tab.


Dataset Creation

Choose the Type of Dataset to Create

Navigate to the “Datasets” tab in your application. Click the “Create” button and choose “Dataset”. In the pop-up, choose “CSV” and then click “OK”.


Dataset Details

Clicking “OK” takes you to the Dataset Details page, where you can enter information about your dataset. In this example, we create a dataset from a housing.csv file. It is a comma-separated file with a header row that specifies the column names.


For the housing.csv file, we fill in the required fields as follows.


We specify a name for the dataset, set ‘Header’ to true to indicate that the file has a header row, set the field delimiter to a comma, and provide the path to the file.
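The effect of these settings can be sketched in plain Python. This is only an illustration of what the name, header, delimiter, and path fields mean when a CSV file is read. The `CsvDatasetSettings` class, the `read_sample` helper, the generated `C0`, `C1` column names, and the stand-in file contents are all hypothetical and are not part of Fire Insights.

```python
import csv
import os
import tempfile
from dataclasses import dataclass

@dataclass
class CsvDatasetSettings:
    """Hypothetical container mirroring the fields described above."""
    name: str
    path: str
    header: bool = True
    delimiter: str = ","

def read_sample(settings, limit=5):
    """Return the column names and up to `limit` data rows from the file."""
    with open(settings.path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter=settings.delimiter))
    if not rows:
        return [], []
    if settings.header:
        # The first row supplies the column names.
        columns, data = rows[0], rows[1:]
    else:
        # Without a header row, fall back to generated column names.
        columns = [f"C{i}" for i in range(len(rows[0]))]
        data = rows
    return columns, data[:limit]

# Write a tiny stand-in file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("id,price\n1,100\n2,200\n")
    sample_path = tmp.name

settings = CsvDatasetSettings(name="housing", path=sample_path)
columns, rows = read_sample(settings)
print(columns, rows)
os.remove(sample_path)
```

If the header setting were false for the same file, the first line would be treated as data, which is why it matters to set it correctly before sampling.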

Update Sample data/schema

Once the fields above are filled in, we click the ‘Update Sample data/schema’ button. This loads the sample data, infers the schema, and displays both. We can then change the column names and the data types. The Format column is used to specify the format of date/time fields.
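As a rough illustration of why a date/time column needs a format, the sketch below parses the same kind of value with an explicit pattern in plain Python. The `parse_with_format` helper and the sample values are hypothetical. The `%Y-%m-%d` directives are Python's `strptime` syntax and are an assumption for this example; the pattern syntax that the Format column expects may differ.

```python
from datetime import datetime

def parse_with_format(value, fmt):
    """Parse a date/time string with an explicit format; return None on mismatch."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None

# A value that matches the declared format parses cleanly.
parsed = parse_with_format("2017-03-15", "%Y-%m-%d")
# A value in a different layout does not match and yields None.
mismatch = parse_with_format("15/03/2017", "%Y-%m-%d")
print(parsed, mismatch)
```

A value that does not match the declared format cannot be interpreted as a date, so the format has to describe the layout actually used in the file.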


Save the Dataset

Clicking the ‘Save’ button creates the new dataset. The dataset is now ready for use in any workflow within the specific application.
