The entry point to the ARCTIC infrastructure is the Hemera interactive web interface. Log in to https://hemera.rs.gsu.edu with your campus credentials. If you see an error stating that your home directory is not available, it means you do not have access to the cluster. Please inform your class instructor, who will be able to add you to the class allocation.
Once you log in, you will see the following web interface.
Navigate to ACIDS Interactive Apps -> Jupyter Lab - PySpark.
We will start a JupyterLab session on the ARCTIC infrastructure. We will only use the JupyterLab instance to submit jobs to the remote Spark cluster, so the session itself will not need many resources. However, you can also use the JupyterLab session for downstream analysis, such as creating plots; in that case feel free to request adequate resources. In all other cases, please request the minimal resources you need.
Usually, 2 cores and 2 GB of memory will work perfectly. Select the duration as appropriate. The account should populate automatically; if you do not see account information, please contact your classroom instructor.
Once launched, your job will wait for resources.
Wait time will depend on how many resources you requested and on the current cluster load. If your job does not start within a reasonable time, consider lowering your resource request. Once it starts, you will see the following screen.
Now you are ready to connect to your own JupyterLab instance, running on the ARCTIC infrastructure.
Now we are ready to connect to a remote Spark cluster. We will connect to a remote Kubernetes cluster and establish the connection in the background.
Please execute the following lines in your JupyterLab session:
import sparkmagic.utils.configuration as conf

# Clean up any leftover Spark sessions when this notebook exits
conf.override('cleanup_all_sessions_on_exit', True)

# Load the sparkmagic extension and open the session management widget
%load_ext sparkmagic.magics
%manage_spark
If everything works according to plan, you will see that an endpoint is ready and you should be able to create a Spark session on the remote Kubernetes cluster. Please make sure your endpoint reads https://lighter.rs.gsu.edu/lighter/api
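If you want to confirm that the endpoint is reachable from your notebook before creating a session, you can run a quick HTTP check. This sketch is optional; the /sessions path is an assumption about the Lighter REST API, so adjust it if your deployment differs.

# Optional: check that the Lighter endpoint answers from inside JupyterLab.
# The /sessions path is an assumption about the Lighter REST API; adjust if needed.
import requests

resp = requests.get("https://lighter.rs.gsu.edu/lighter/api/sessions")
print(resp.status_code)  # any HTTP response means the endpoint is reachable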
We will create a Spark cluster in Kubernetes. For this we will need a Spark container image. You can use an image of your choice here; however, the image needs to be properly configured to run in a cluster environment. We will use the Docker image hosted in the ARCTIC Docker repository, harbor.rs.gsu.edu/jupyter/spark, as both the executor image and the driver image.
The remote Kubernetes cluster configuration can be given as a JSON configuration file. In the following example we create an app named App-1 with pyspark as the entry point. We request two executors and a driver; each executor gets 1 core and 1 GB of RAM, and the driver gets 1 core and 1 GB of memory. You can request any amount here, and right now we do not limit it. However, remember that you are using a shared cluster: if you take too many resources, there will not be enough left for your classmates. Be a good neighbor.
{
  "name": "App-1",
  "file": "pyspark",
  "numExecutors": 2,
  "executorCores": 1,
  "executorMemory": "1G",
  "driverCores": 1,
  "driverMemory": "1G",
  "deploy-mode": "cluster",
  "conf": {
    "spark.kubernetes.authenticate.driver.serviceAccountName": "lighter",
    "spark.kubernetes.driver.service.deleteOnTermination": "True",
    "spark.kubernetes.container.image": "harbor.rs.gsu.edu/jupyter/spark",
    "spark.kubernetes.authenticate.driver.imagePullSecrets": "regcred",
    "spark.hadoop.fs.s3a.path.style.access": "True",
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
  }
}
You can pass additional configuration parameters, but please keep the settings already shown in the conf section exactly as they are; they are required to connect to the backend.
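For illustration only, an additional parameter goes alongside the required entries in the conf block. The spark.sql.shuffle.partitions key below is just an example setting, not something this class requires; everything else is unchanged from the configuration above.

"conf": {
  "spark.sql.shuffle.partitions": "8",
  "spark.kubernetes.authenticate.driver.serviceAccountName": "lighter",
  "spark.kubernetes.driver.service.deleteOnTermination": "True",
  "spark.kubernetes.container.image": "harbor.rs.gsu.edu/jupyter/spark",
  "spark.kubernetes.authenticate.driver.imagePullSecrets": "regcred",
  "spark.hadoop.fs.s3a.path.style.access": "True",
  "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
}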
Select Python as your session type, copy and paste the above JSON into the Properties section, and hit Create Session.
You will see a log message stating that the session is starting. Wait until it becomes idle; this may take some time.
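If you prefer to check from a code cell instead of the widget, sparkmagic's info subcommand lists the sessions it currently knows about; this step is optional.

# Optional: list the sessions sparkmagic is tracking and their state
%spark info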
If session creation is successful, you will see the following message.
Congratulations, you now have a Spark session that you can submit workloads against.
You can get the Spark context from the session.
Test your session with the following code:
%%spark -s session-name -c spark
sample_data = [
    {"name": "John D.", "age": 30},
    {"name": "Alice G.", "age": 25},
    {"name": "Bob T.", "age": 35},
    {"name": "Eve A.", "age": 28},
]
df = spark.createDataFrame(sample_data)

from pyspark.sql.functions import col, regexp_replace

# Remove additional spaces in name
def remove_extra_spaces(df, column_name):
    # Remove extra spaces from the specified column
    df_transformed = df.withColumn(column_name, regexp_replace(col(column_name), "\\s+", " "))
    return df_transformed

transformed_df = remove_extra_spaces(df, "name")
transformed_df.show()
The Jupyter notebook will execute your code remotely and bring back the results. You should see the following output:
+---+--------+
|age| name|
+---+--------+
| 30| John D.|
| 25|Alice G.|
| 35| Bob T.|
| 28| Eve A.|
+---+--------+
As mentioned before, we are running in a container. This gives us great flexibility; however, one drawback is that there is no persistent storage attached to the container. You can configure local storage, but that requires custom configuration for each use case. For this class we will get the data through S3 object storage.
ARCTIC maintains an S3 object storage service. The first step is to create a service account attached to your GSU account and obtain credentials so that we can connect to the storage system remotely. Please log in to https://osiris.rs.gsu.edu with your GSU credentials.
Once you are logged in, navigate to Access Keys and create an access key.
Hit Create and save the credentials; you will not be able to retrieve them later unless you save them now. Keep them safe.
Now you have access credentials that you can use to access the object storage system. Let's see how to use them.
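One way to avoid pasting keys directly into notebook cells, offered here only as a suggestion, is to keep them in environment variables and read them at runtime. The variable names below are hypothetical.

import os

# Hypothetical variable names; export them before launching your JupyterLab session
access_key = os.environ.get("OSIRIS_ACCESS_KEY")
secret_key = os.environ.get("OSIRIS_SECRET_KEY")
print("credentials loaded" if access_key and secret_key else "credentials missing")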
Now we can use the access credentials to create a new bucket in the object storage. Bucket names need to be unique, so please choose a unique name.
from minio import Minio
from minio.error import S3Error

client = Minio(
    "osiris.rs.gsu.edu",
    access_key="xxxxxxxxxxxxx",
    secret_key="xxxxxxxxxxxxxxxxx",
    secure=True,
)

found = client.bucket_exists("test")
if not found:
    client.make_bucket("test")
else:
    print("Bucket 'test' already exists")
If bucket creation is successful, we can use it for the remainder of the class.
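As an optional sanity check, a short sketch that reuses the client object created above can list the buckets visible to your service account and confirm the new bucket is there.

# List the buckets this service account can see
for bucket in client.list_buckets():
    print(bucket.name, bucket.creation_date)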
We will be working with three types of storage here. The first is the source storage where your files currently reside. We will need to move these files to ARCTIC storage; once they are in ARCTIC storage, you can move them to the object storage. Technically you should be able to move files directly to the object storage, but we will not use that approach in this class.
You can use Hemera's file transfer capabilities to transfer small files to the ARCTIC infrastructure.
You can use the upload button to upload files.
Once the files are in the ARCTIC infrastructure, you can use JupyterLab to transfer them into the object storage bucket you just created.
client.fput_object(
    "test",               # bucket name
    "remote-file-name",   # object name to create in the bucket
    "./local-file-name",  # path to the local file in your JupyterLab session
)
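To confirm the upload went through, one option (again a sketch reusing the same client) is to list the objects now stored in the bucket.

# List the objects in the bucket and their sizes
for obj in client.list_objects("test", recursive=True):
    print(obj.object_name, obj.size)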
Now that we have a file in the object storage, we can use it inside PySpark. However, we need to tell the Spark session how to communicate with the S3 object storage, so update the Spark context with the credentials.
%%spark -s session-name -c spark
# Point the S3A connector at the Osiris object store using your access credentials
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "xxxxxxxxxx")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "xxxxxxxxxx")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "https://osiris.rs.gsu.edu")
Once that is done, you can access files as you normally would, but through the s3a:// scheme. Here, testfio-2 is the bucket where final_data_new.csv resides.
%%spark
df = spark.read.csv("s3a://testfio-2/final_data_new.csv")
df.show()
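As a follow-up sketch, you can read the CSV with a header row and let Spark infer the column types, then write a Parquet copy back to the same bucket. The header and inferSchema options are standard Spark reader settings, and the output prefix below is hypothetical.

%%spark
# Treat the first row as a header and infer column types while reading
df = spark.read.csv("s3a://testfio-2/final_data_new.csv", header=True, inferSchema=True)
df.printSchema()
# Write a Parquet copy back to the bucket (hypothetical output prefix)
df.write.mode("overwrite").parquet("s3a://testfio-2/final_data_new_parquet")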