Apache Spark has revolutionized big data processing, becoming the go-to solution for data engineers and analysts worldwide. Its lightning-fast processing speed and powerful features make it an essential tool for handling large volumes of data efficiently. If you want to harness the power of Apache Spark on your Debian 12 (Bookworm) system, you have come to the right place.
Install Apache Spark on Debian 12 Bookworm
Step 1. Before we install any software, it is important to make sure your system is up to date by running the following apt commands in the terminal:
sudo apt update
sudo apt upgrade
These commands refresh the package repositories, allowing you to install the latest versions of available packages.
Step 2. Install Java.
Now let's install the required dependency, OpenJDK (the Java Development Kit), which is a prerequisite for Apache Spark:
sudo apt install default-jdk
Confirm that Java was installed correctly by checking its version:
java -version
Step 3. Install Apache Spark on Debian 12.
With your system ready, it is time to fetch the latest release of Apache Spark and lay the groundwork for your big data journey:
wget https://www.apache.org/dyn/closer.lua/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
Extract the downloaded archive:
tar xvf spark-3.4.1-bin-hadoop3.tgz
Move the extracted directory to /opt/ so Spark lives in a standard location:
sudo mv spark-3.4.1-bin-hadoop3 /opt/spark
Next, add Spark's environment variables to your .bashrc file:
nano ~/.bashrc
Append the following lines at the end of the file:
# Apache Spark
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
Save the file, then reload your shell configuration:
source ~/.bashrc
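Before moving on, you can optionally sanity-check that the variables took effect in the current shell:
echo $SPARK_HOME        # should print /opt/spark
spark-submit --version  # prints the installed Spark version banner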
Step 4. Using Apache Spark.
With the environment configured, launch the interactive Spark shell:
spark-shell
After a few moments you should be greeted by the Scala prompt:
scala>
From here you can work with Spark interactively. A few examples:
- Reading a CSV file:
val data = spark.read.format("csv")
  .option("header", "true")
  .load("/path/to/your/csv/file.csv")
- Displaying the data:
To display the DataFrame's contents, call show() on it and press Enter:
data.show()
- Performing operations:
You can apply a variety of transformations to DataFrames with Spark's functional API, such as filtering, grouping, and aggregation.
For example, let's calculate the average of a column named "price":
val avgPrice = data.agg(avg("price")).collect()(0)(0)
println(s"The average price is: $avgPrice")
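The same API covers the filtering and grouping mentioned above. Here is a minimal sketch to paste at the scala> prompt, assuming the loaded CSV has "price" and "category" columns (hypothetical names used only for illustration):
// spark-shell auto-imports org.apache.spark.sql.functions._;
// standalone applications need this import explicitly
import org.apache.spark.sql.functions.{avg, col}

// keep only rows whose price exceeds a (hypothetical) threshold
val expensive = data.filter(col("price") > 100)
expensive.show()

// average price per category
val avgByCategory = data.groupBy("category").agg(avg("price").alias("avg_price"))
avgByCategory.show()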
Step 5. Set Up a Spark Cluster (Optional).
While Spark can run locally, its true power shines when it is deployed on a cluster. Setting up a Spark cluster lets you distribute data processing tasks across multiple nodes, significantly improving performance and scalability.
- Prepare the nodes: Make sure every node in the cluster has the same versions of Java and Spark installed, and copy the Spark installation directory to each node.
- Configure Spark on the master node: On the master node, navigate to the Spark configuration directory:
<span class="pln">cd </span><span class="pun">/</span><span class="pln">opt</span><span class="pun">/</span><span class="pln">spark</span><span class="pun">/</span><span class="pln">conf</span>
Copy the spark-env.sh.template file to spark-env.sh:
cp spark-env.sh.template spark-env.sh
Edit spark-env.sh to configure the master node and other settings:
nano spark-env.sh
Add the following lines to specify the master node's IP address and to set how much memory each worker may use:
export SPARK_MASTER_HOST=<master-ip>
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_MEMORY=2g
Save the changes and exit the text editor.
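If you would like the bundled scripts to start every worker for you over SSH, recent Spark 3.x releases also read a conf/workers file (one worker hostname per line). A minimal sketch, with placeholder hostnames:
cp workers.template workers
nano workers
# list one worker per line, for example:
#   worker1.example.com
#   worker2.example.com
With this file in place, $SPARK_HOME/sbin/start-all.sh can start the master and all listed workers in one go.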
Step 6. Start the Master Node.
Start the Spark master process by running the script from Spark's sbin directory (note that only bin, not sbin, was added to the PATH earlier):
$SPARK_HOME/sbin/start-master.sh
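The master by itself does not execute any tasks; each worker node must start a worker process that registers with it. Assuming the same /opt/spark layout on the workers and the default master port of 7077, run on each worker node:
# connect this node to the cluster as a worker
$SPARK_HOME/sbin/start-worker.sh spark://<master-ip>:7077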
Access the Spark Web UI by opening a web browser and navigating to:
http://<master-ip>:8080
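Registered workers appear on this page. To confirm the cluster actually accepts jobs, you can submit the SparkPi example that ships with the distribution. This is a sketch; the exact examples jar name depends on your Spark and Scala versions, hence the wildcard:
# run the bundled SparkPi example against the standalone master
spark-submit \
  --master spark://<master-ip>:7077 \
  --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME"/examples/jars/spark-examples_*.jar 100
A line beginning with "Pi is roughly" near the end of the output indicates the job completed on the cluster.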
Step 7. Troubleshooting Tips.
Installing and configuring Apache Spark can come with a few challenges. Here are some common issues and troubleshooting tips; a few quick commands to check each one follow the list:
- Java version conflicts: If you run into Java version issues, make sure OpenJDK (Java Development Kit) version 8 or later is installed and that the JAVA_HOME environment variable points to it.
- Spark shell failures: If the Spark shell fails to start, check your environment variables and make sure Spark's bin directory is included in the system's PATH.
- Port conflicts: If the Spark Web UI does not load or reports port-related errors, verify that no other services on the system are using the ports involved (e.g. 8080 and 7077).
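The following quick checks, using tools available on a stock Debian 12 install, cover the three issues above:
# which Java binary is actually used, and what JAVA_HOME points to
readlink -f "$(which java)"
echo $JAVA_HOME

# confirm Spark's bin directory is on the PATH
echo $PATH | tr ':' '\n' | grep -i spark

# see whether anything is already listening on the default Spark ports
ss -ltn | grep -E ':(8080|7077)'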
Thank you for using this tutorial to install Apache Spark on Debian 12 Bookworm. For additional help or useful information, we recommend checking the official Apache Spark website.