Big Data (Hadoop) with Azure (HDInsight) with R (for free)

Angela W.
4 min read · May 24, 2020
Hadoop with Azure: HDInsight

Congrats, you are a step closer to joining the Hadoop family on the Azure cloud. I recently started exploring the data technologies Microsoft offers, using their free $200 Azure credit (you can do a lot with it! Just create a new Outlook email account and you will be eligible for the free credit).

I love to connect my Jupyter notebook or RStudio to everything in the world, and naturally I did the same for HDInsight.

Doing this in Python is relatively easy. In R it is still easy, just not as straightforward (why, why, why).

If you have been googling this for a while (like me), you will love me for this.

I’m not going to write every function, just enough to get you started.

Connecting to the Data Lake Gen2 (ADLS) File System

You will need to install the AzureStor package, developed by Azure:

devtools::install_github("Azure/AzureStor")
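AzureStor and the companion packages used in this post (AzureRMR, AzureGraph, AzureAuth) are also published on CRAN, so devtools is optional. Here is a small sketch that reports which of them are not installed yet. The helper name missing_packages is hypothetical, and the install line is left commented out so nothing is installed by accident.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this post
needed <- c("AzureRMR", "AzureGraph", "AzureStor", "AzureAuth")
to_install <- missing_packages(needed)

# Uncomment to install whatever is missing from CRAN:
# if (length(to_install) > 0) install.packages(to_install)
```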

Load the libraries:

library(AzureRMR)
library(AzureGraph)
library(AzureStor)

There are three ways to authenticate:

AzureGraph

AzureGraph uses a similar authentication procedure to AzureRMR and the Azure CLI. The first time you authenticate with a given Azure Active Directory tenant, you call create_graph_login() and supply your credentials. AzureGraph will prompt you for permission to create a special data directory in which to cache the obtained authentication token and AD Graph login. Once this information is saved on your machine, it can be retrieved in subsequent R sessions with get_graph_login(). Your credentials will be automatically refreshed so you don’t have to reauthenticate.

library(AzureGraph)
# set your Azure organization and subscription details here
tenant <- "mytenant"
sub_id <- "12345678-aaaa-bbbb-cccc-0123456789ab"
# authenticate with AAD and create a Graph client
# - on first login, call create_graph_login()
# - on subsequent logins, call get_graph_login()
gr <- AzureGraph::create_graph_login(tenant)
# account of the logged-in user (if you authenticated via the default method)
me <- gr$get_user()
# alternative: supply an email address or GUID
me2 <- gr$get_user("xx@hotmail.com")
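The first-login and later-login steps can be folded into one helper so the same script works in every session. This is a minimal sketch, assuming get_graph_login() signals an error when no cached login exists for the tenant. The name graph_login_or_create is hypothetical and not part of AzureGraph.

```r
# Hypothetical helper: reuse a cached Graph login if there is one,
# otherwise fall back to an interactive first-time login.
graph_login_or_create <- function(tenant) {
  tryCatch(
    AzureGraph::get_graph_login(tenant),
    # assumption: a missing cached login is reported as an R error
    error = function(e) AzureGraph::create_graph_login(tenant)
  )
}

# Usage (needs the AzureGraph package and a real tenant):
# gr <- graph_login_or_create("mytenant")
# me <- gr$get_user()
```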

Via App Access (troublesome to set up)

Instead of using the code below, you can also create the app manually by following the instructions stated here. If you get an error when authenticating the app with your storage account, go to https://docs.microsoft.com/en-us/azure/storage/common/storage-auth-aad-app to set up the app's access rights and configuration.

library(AzureRMR)
library(AzureGraph)
library(AzureStor)
# set your Azure organization and subscription details here
tenant <- "mytenant"
sub_id <- "12345678-aaaa-bbbb-cccc-0123456789ab"
# create a Graph client
gr <- AzureGraph::create_graph_login(tenant)
# create an app (associated service principal will also
# be created automatically)
# to retrieve an existing app, call gr$get_app("app id") with the ID from Azure AD
app <- gr$create_app("AzureRapp")
# create a Resource Manager client
az <- AzureRMR::create_azure_login(tenant)
# retrieve the resource group and storage account
rg <- az$get_subscription(sub_id)$get_resource_group("NAME OF YOUR RESOURCE GROUP")
stor <- rg$get_storage_account("NAME OF YOUR STORAGE ACCT")
# give storage contributor rights to the app
# check the Azure documentation for the different access rights
stor$add_role_assignment(app, "Storage account contributor")
# authenticate with the app
# if you get an error, set up the app's access rights by following the link above
token <- AzureAuth::get_azure_token(
resource="https://storage.azure.com/",
tenant=tenant,
app=app$properties$appId,
password=app$password
)
# blob endpoint object
stor_client <- storage_endpoint("https://<yourstoragename>.blob.core.windows.net", token=token)
# create a blob container --
# authentication details passed down from endpoint
stor_container <- create_storage_container(stor_client, "mycontainer")
# upload a file
storage_upload(stor_container, "/path/to/mybigfile.txt", "mybigfile.txt")

How to retrieve your tenant id:

Where to create or set up your app access manually:

Direct Access (easiest for a quick connection; not recommended for production environments or shared access)

Please note that the endpoint is different for every storage type. In the following example I will use dfs, which is for Azure Data Lake Storage Gen2 (a.k.a. ADLS Gen2). The types of endpoints are:

dfs: https://<your storage account name>.dfs.core.windows.net/

web: https://<your storage account name>.z23.web.core.windows.net/

blob: https://<your storage account name>.blob.core.windows.net/

queue: https://<your storage account name>.queue.core.windows.net/

table: https://<your storage account name>.table.core.windows.net/

file: https://<your storage account name>.file.core.windows.net/

Kindly note that every storage type has its own function for creating the endpoint object: adls_endpoint() is used for ADLS, and so on.
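Since the endpoint URLs above differ only in the account name and the service type, they can be generated rather than typed by hand. A minimal sketch: endpoint_url is a hypothetical helper (not part of AzureStor) and "mystorage" is a placeholder account name. The web endpoint is left out because its zone part (z23 in the list above) is not the same for every account.

```r
# Hypothetical helper: build the endpoint URL for a storage account.
# `type` must be one of the service types listed above (except web).
endpoint_url <- function(account,
                         type = c("dfs", "blob", "file", "queue", "table")) {
  type <- match.arg(type)
  sprintf("https://%s.%s.core.windows.net/", account, type)
}

adls_url <- endpoint_url("mystorage")          # ADLS Gen2 (dfs) endpoint
blob_url <- endpoint_url("mystorage", "blob")  # blob endpoint
```

The resulting string can then be passed to the matching AzureStor function, for example adls_endpoint() for a dfs URL or blob_endpoint() for a blob URL. The generic storage_endpoint() used earlier in this post infers the type from the URL.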

Demo code for accessing ADLS files with R:

library(AzureStor)
# ADLS Gen2 endpoint access
adls_endp <- adls_endpoint(
  "https://<storagename>.dfs.core.windows.net/",
  key="access key")
# an existing container
cont <- adls_filesystem(adls_endp, "<Name of Container>")
# list all the files in the container
list_adls_files(cont)

There are many other functions you may wish to explore, e.g. downloading or uploading files one by one or in parallel.
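As a starting point for those functions, here is a hedged sketch of a single-file upload, a single-file download, and a batch upload using AzureStor's ADLS functions. The wrapper adls_transfer_demo, its arguments, and the "uploads" destination directory are hypothetical names chosen for illustration. It needs a real filesystem object such as the one returned by adls_filesystem().

```r
# Hypothetical wrapper around AzureStor's ADLS transfer functions.
# cont:        an ADLS filesystem object, e.g. from adls_filesystem()
# local_file:  path of a local file to upload
# remote_file: destination path inside the filesystem
# csv_dir:     local directory whose *.csv files are uploaded as a batch
adls_transfer_demo <- function(cont, local_file, remote_file, csv_dir) {
  # upload one file
  AzureStor::upload_adls_file(cont, src = local_file, dest = remote_file)

  # download it again into the session's temporary directory
  copy <- file.path(tempdir(), basename(remote_file))
  AzureStor::download_adls_file(cont, src = remote_file, dest = copy,
                                overwrite = TRUE)

  # upload every CSV in csv_dir in one call (transfers run in parallel);
  # "uploads" is an assumed destination directory name
  AzureStor::multiupload_adls_file(cont, src = file.path(csv_dir, "*.csv"),
                                   dest = "uploads")

  # return the path of the downloaded copy
  copy
}

# Usage (needs a live storage account):
# adls_transfer_demo(cont, "/path/to/mybigfile.txt", "mybigfile.txt", "/path/to/csvs")
```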

ENJOY! And please share if you like the article.

Thank you.
