Commit d834c44: Implement joint rucio and wmarchive data for user access time of datasets

mrceyhun committed Sep 11, 2022
1 parent 53f140f
Showing 2 changed files with 359 additions and 99 deletions.
29 changes: 19 additions & 10 deletions doc/pyspark_shell.md
## How to run PySpark shell for tests in Kubernetes pods or VMs

If SWAN.cern.ch is not working, you can use the PySpark shell to run your PySpark code. It gives a nice IPython shell,
depending on your Python environment.

- Kerberos authentication:

```
kinit $USER@CERN.CH
```

- You need to be in LxPlus7
- If you use additional Python repositories, please make sure that they are in `PYTHONPATH`
- `--py-files` is optional; it is there just to show how you can add extra Python packages
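For the `PYTHONPATH` point above, a minimal sketch; the zip path is illustrative (it reuses the one later shipped via `--py-files`), so adjust it to your own packages:

```shell
# Append an extra package (zip file or directory) to PYTHONPATH before
# starting pyspark; /data/CMSMonitoring.zip is an illustrative path.
export PYTHONPATH="${PYTHONPATH}:/data/CMSMonitoring.zip"
echo "${PYTHONPATH}"
```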

> Attention: do not use `LCG102` for now; it produces an `ImportError: libffi.so.8` error in LxPlus7.
>
> For that reason, you need to provide the Avro package with a version matching your Spark, e.g.
> `org.apache.spark:spark-avro_2.12:3.1.2` with version `3.1.2`, whereas LCG102 ships Spark `3.2.1`.
>
> In any case, please set the Avro version according to `spark-submit --version`.
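The version match described in the note above can be expressed as a small sketch. Here the Spark version string is hard-coded for illustration; in practice you would take it from `spark-submit --version`:

```shell
# Build the spark-avro coordinate so its version matches the Spark version;
# "3.1.2" stands in for the version reported by `spark-submit --version`.
spark_version="3.1.2"
avro_package="org.apache.spark:spark-avro_2.12:${spark_version}"
echo "${avro_package}"   # org.apache.spark:spark-avro_2.12:3.1.2
```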
###### Run in LxPlus7

```
# Setup Analytix connection
source /cvmfs/sft.cern.ch/lcg/views/LCG_101/x86_64-centos7-gcc8-opt/setup.sh
source /cvmfs/sft.cern.ch/lcg/etc/hadoop-confext/hadoop-swan-setconf.sh analytix 3.2 spark3
export PATH="${PATH}:/usr/hdp/hadoop/bin/hadoop:/usr/hdp/spark3/bin:/usr/hdp/sqoop/bin"
# Required Spark confs
spark_submit_args=(
    --master yarn
    --conf spark.ui.showConsoleProgress=false
    --driver-memory=8g --executor-memory=8g
    --packages org.apache.spark:spark-avro_2.12:3.1.2
    --py-files "/data/CMSMonitoring.zip,/data/stomp-v700.zip"
)
# Set ipython as driver python
export PYSPARK_DRIVER_PYTHON=ipython
# Run
pyspark "${spark_submit_args[@]}"
```
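As a side note on the `spark_submit_args` array used above: a bash array keeps each option intact as a single argument, even the comma-separated `--py-files` value, when expanded with quotes. A minimal, self-contained sketch:

```shell
# Each array element stays one argument when expanded as "${args[@]}",
# including the comma-separated --py-files value.
args=(
    --master yarn
    --py-files "/data/CMSMonitoring.zip,/data/stomp-v700.zip"
)
printf '%s\n' "${args[@]}"
```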
- You need to define: `spark.driver.bindAddress`, `spark.driver.host`, `spark.driver.port`, `spark.driver.blockManager.port`
- Kubernetes ports should be open in both directions (ingress and egress), e.g. via a NodePort service
- If you use additional Python repositories, please make sure that they are in `PYTHONPATH`
- `--py-files` is optional; it is there just to show how you can add extra Python packages
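The driver settings listed above can be passed as `--conf` flags. A sketch with illustrative values only: the host default and ports `32000`/`32001` are assumptions and must match your pod's NodePort setup:

```shell
# Illustrative spark.driver.* confs for running the driver inside a pod;
# the host and port values below are placeholders, not real settings.
node_host="${SPARK_DRIVER_HOST:-127.0.0.1}"   # your pod's reachable address
driver_confs=(
    --conf spark.driver.bindAddress=0.0.0.0
    --conf "spark.driver.host=${node_host}"
    --conf spark.driver.port=32000
    --conf spark.driver.blockManager.port=32001
)
printf '%s\n' "${driver_confs[@]}"
```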

###### Run in Kubernetes Pod

```
# Set ipython as driver python
export PYSPARK_DRIVER_PYTHON=ipython
# Required Spark confs
spark_submit_args=(
    --master yarn
    # ... (the spark.driver.* --conf options listed above go here)
    --py-files "/data/CMSMonitoring.zip,/data/stomp-v700.zip"
)
# Run
pyspark "${spark_submit_args[@]}"
```
