Linear Regression with EXASolution
EXASolution is an in-memory relational database management system. It’s a product of EXASOL. If you are asking yourself the question “Why EXASOL?” then may be this link might be able to satisfy your curiosity. We will move on to the steps to execute Linear Regression operation on EXASolution.
EXASolution allows us to write UDF (user defined function) in any of the following languages:
- R
- Python
- Java
- Lua
Using this feature we can leverage machine learning capabilities of R.
Prerequisites:
-
EXASolution
You can download EXASolution free version VM from here and start the VM to bring on the EXASolution server. Once the VM is started, you can see all necessary information to connect to the EXASolution server on the VM screen.
-
EXAPlus
EXAplus provides user interface to manage your EXASolution database. EXAPlus can be downloaded from this link.
-
R
You can download R for windows from here. R is a statistical language which bring machine learning capabilities.
-
RStudio
RStudio is an IDE for R. Free version of RStudio can be downloaded from here.
Note that this is an optional step. You can run R scripts directly from R console too.
-
r-exasol package
For windows following steps needs to be followed for this:
-
Install RTools. Can be downloaded from here. RTools is R’s developer extension which allows you to build R package from source. This is need to build
r-exasol
plugin from it’s source. -
Install
devtools
package for R which containsinstall_github()
method which allow us to buildr-exasol
package from github sources.You need to fire following commands in your R console:
-
After successful execution of above mentioned commands, we can use r-exasol
package in R.
-
EXASOL ODBC Driver
This can be downloaded from here. We need ODBC Driver in order to fetch data to R from EXASolution.
Dataset:
For this tutorial I’ll be using healthcare dataset available here.
The dataset contains following columns:
- X1 = death rate per 1000 residents
- X2 = doctor availability per 100,000 residents
- X3 = hospital availability per 100,000 residents
- X4 = annual per capita income in thousands of dollars
- X5 = population density people per square mile
We will use X2, X3, X4 and X5 as features to predict X1.
I have converted the dataset in CSV format and added an id column. It can be downloaded from here.
- Importing the data into EXASolution:
Save the downloaded data file as CSV file as EXASolution has supports data import from CSV file.
Open EXAPlus client and connect to the EXASolution server. Again the details required to establish the connection should be available on EXASolution VM screen.
Here is the script to import data in EXASolution:
- Configure EXABucket
EXABucket is a distributed file-system which EXASolution can access. We need to upload the generated regression model to EXABucket so that using it we can perform predictions on EXASolution.
In order to upload the files to EXABucket, I had to manually set the bucket service port and write user password from EXASolution admin panel.
- Generate regression model and upload to EXABucket:
Open RStudio or R terminal and execute following script (explanation inline):
Note: More details regarding EXABucket can be found here.
- Predict result on EXASolution:
Now since we have regression model available on EXABucket, we can perform predictions by using it. For predictions we will write an UDF in R.
If you know better approach or if you face any queries then fill free to start a discussion in the comments section.
References: