Commit d834c44: Implement joint rucio and wmarchive data for user access time of datasets

mrceyhun committed Sep 11, 2022
1 parent 53f140f
Showing 2 changed files with 359 additions and 99 deletions.
29 changes: 19 additions & 10 deletions doc/pyspark_shell.md
## How to run PySpark shell for tests in Kubernetes pods or VMs

If SWAN.cern.ch is not working, you can use the PySpark shell to run your PySpark code. It gives a nice IPython shell,
depending on your Python environment.

- Kerberos authentication:

```
kinit $USER@CERN.CH
```

- You need to be in LxPlus7
- If you use additional Python repositories, please make sure that they are in `PYTHONPATH`
- `--py-files` is optional; it is there just to show how you can add extra Python packages
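For the `PYTHONPATH` point above, a minimal sketch; the zip path is illustrative (it reuses the one later shipped via `--py-files`), so adjust it to your own packages:

```shell
# Append an extra package (zip file or directory) to PYTHONPATH before
# starting pyspark; /data/CMSMonitoring.zip is an illustrative path.
export PYTHONPATH="${PYTHONPATH}:/data/CMSMonitoring.zip"
echo "${PYTHONPATH}"
```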

> Attention: do not use `LCG102` for now; it produces an `ImportError: libffi.so.8` error in LxPlus7.
>
> For that reason, you need to provide the Avro package with a version matching your Spark, e.g.
> `org.apache.spark:spark-avro_2.12:3.1.2` with version `3.1.2`, whereas LCG102 ships Spark `3.2.1`.
>
> In any case, please set the Avro version according to `spark-submit --version`.
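The version match described in the note above can be expressed as a small sketch. Here the Spark version string is hard-coded for illustration; in practice you would take it from `spark-submit --version`:

```shell
# Build the spark-avro coordinate so its version matches the Spark version;
# "3.1.2" stands in for the version reported by `spark-submit --version`.
spark_version="3.1.2"
avro_package="org.apache.spark:spark-avro_2.12:${spark_version}"
echo "${avro_package}"   # org.apache.spark:spark-avro_2.12:3.1.2
```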
###### Run in LxPlus7

```
# Setup Analytix connection
source /cvmfs/sft.cern.ch/lcg/views/LCG_101/x86_64-centos7-gcc8-opt/setup.sh
source /cvmfs/sft.cern.ch/lcg/etc/hadoop-confext/hadoop-swan-setconf.sh analytix 3.2 spark3
export PATH="${PATH}:/usr/hdp/hadoop/bin/hadoop:/usr/hdp/spark3/bin:/usr/hdp/sqoop/bin"
# Required Spark confs
spark_submit_args=(
    --master yarn
    --conf spark.ui.showConsoleProgress=false
    --driver-memory=8g --executor-memory=8g
    --packages org.apache.spark:spark-avro_2.12:3.1.2
    --py-files "/data/CMSMonitoring.zip,/data/stomp-v700.zip"
)
# Set ipython as driver python
export PYSPARK_DRIVER_PYTHON=ipython
# Run
pyspark "${spark_submit_args[@]}"
```
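As a side note on the `spark_submit_args` array used above: a bash array keeps each option intact as a single argument, even the comma-separated `--py-files` value, when expanded with quotes. A minimal, self-contained sketch:

```shell
# Each array element stays one argument when expanded as "${args[@]}",
# including the comma-separated --py-files value.
args=(
    --master yarn
    --py-files "/data/CMSMonitoring.zip,/data/stomp-v700.zip"
)
printf '%s\n' "${args[@]}"
```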
- You need to define: `spark.driver.bindAddress`, `spark.driver.host`, `spark.driver.port`, `spark.driver.blockManager.port`
- Kubernetes ports should be open in both directions (ingress and egress), e.g. via a NodePort service
- If you use additional Python repositories, please make sure that they are in `PYTHONPATH`
- `--py-files` is optional; it is there just to show how you can add extra Python packages
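The driver settings listed above can be passed as `--conf` flags. A sketch with illustrative values only: the host default and ports `32000`/`32001` are assumptions and must match your pod's NodePort setup:

```shell
# Illustrative spark.driver.* confs for running the driver inside a pod;
# the host and port values below are placeholders, not real settings.
node_host="${SPARK_DRIVER_HOST:-127.0.0.1}"   # your pod's reachable address
driver_confs=(
    --conf spark.driver.bindAddress=0.0.0.0
    --conf "spark.driver.host=${node_host}"
    --conf spark.driver.port=32000
    --conf spark.driver.blockManager.port=32001
)
printf '%s\n' "${driver_confs[@]}"
```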

###### Run in Kubernetes Pod

```
# Set ipython as driver python
export PYSPARK_DRIVER_PYTHON=ipython
# Required Spark confs
spark_submit_args=(
    --master yarn
    # ... (the spark.driver.* --conf options listed above go here)
    --py-files "/data/CMSMonitoring.zip,/data/stomp-v700.zip"
)
# Run
pyspark "${spark_submit_args[@]}"
```
