Spark
This page summarizes Spark, an in-memory cluster computing framework.
- Homepage : http://spark-project.org/, https://spark.apache.org/
- Download :
- License :
- Platform : Scala
- API : Java, Scala, Python, R
Spark Overview
Apache Spark is an open-source engine for large-scale distributed data processing and analytics, originally developed at UC Berkeley's AMPLab. It became a top-level Apache Software Foundation project in February 2014.
- Ships with companion components such as an interactive query engine (Shark), a large-scale graph processing library (Bagel), and a real-time stream processor (Spark Streaming)
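The "in-memory" part means Spark keeps working data cached in cluster memory across operations instead of writing intermediate results to disk after every step, which is what makes iterative and interactive jobs fast. A minimal sketch of this idea in the pyspark shell (assuming sc, the SparkContext the shell creates, is available):
# distribute a small dataset across the cluster as an RDD
nums = sc.parallelize(range(1, 1001))
# cache() keeps the transformed data in memory so later actions reuse it
squares = nums.map(lambda n: n * n).cache()
squares.count()   # first action computes the RDD and caches it
squares.sum()     # second action reads from the in-memory cache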
Spark Components
Spark Installation
Installing Spark
cd /
mkdir install
mkdir appl
cd /install
wget http://apache.mirror.cdnetworks.com/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
cd /appl
tar -xvzf /install/spark-2.3.0-bin-hadoop2.7.tgz
mv spark-2.3.0-bin-hadoop2.7 spark
cd /appl/spark
cd conf
cp spark-env.sh.template spark-env.sh
cp log4j.properties.template log4j.properties
vi spark-env.sh
vi log4j.properties
    # in log4j.properties, lower console logging to WARN:
    log4j.rootCategory=WARN, console
# To start a standalone cluster and connect pyspark to it:
# cd /appl/spark
# sbin/start-master.sh
# sbin/start-slave.sh spark://localhost:7077
# bin/pyspark --master spark://localhost:7077
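As a quick check that pyspark is really talking to the standalone cluster started by the commented-out commands above, a short session can be run in the shell launched with bin/pyspark --master spark://localhost:7077 (a sketch; it assumes the master and worker are running):
sc.master                   # should print 'spark://localhost:7077'
rdd = sc.parallelize(range(100))
rdd.count()                 # should return 100, executed on the standalone worker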
Installing Scala
cd /install
wget https://downloads.lightbend.com/scala/2.12.3/scala-2.12.3.tgz
cd /appl
tar -xvzf /install/scala-2.12.3.tgz
mv scala-2.12.3 scala        # rename the extracted directory, not the .tgz file
# export PATH=${PATH}:/appl/scala/bin
Directory Layout
- R/ : SparkR package
- bin/ : user-facing launch scripts (spark-submit, spark-shell, pyspark, ...)
- conf/ : configuration templates (spark-env.sh, log4j.properties, ...)
- data/ : sample data used by the examples
- examples/ : example programs (source and jars)
- jars/ : Spark and dependency jars
- kubernetes/ : Dockerfiles for running Spark on Kubernetes
- licenses/ : third-party license texts
- python/ : PySpark sources
- sbin/ : scripts for starting/stopping cluster daemons
- yarn/ : YARN shuffle service jar
K-ICT Training
Installing Spark
cd ~
mkdir install
cd ~/install
wget http://apache.mirror.cdnetworks.com/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
cd ~
tar -xvzf ~/install/spark-2.3.0-bin-hadoop2.7.tgz
mv spark-2.3.0-bin-hadoop2.7 spark
cd ~/spark
cd conf
cp spark-env.sh.template spark-env.sh
cp log4j.properties.template log4j.properties
vi spark-env.sh
    # environment added to spark-env.sh:
    export LANG=ko_KR.UTF-8
    export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64
    export PATH=$PATH:$JAVA_HOME/bin
    export HADOOP_INSTALL=/usr/local/hadoop
    export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
    export HADOOP_COMMON_HOME=$HADOOP_INSTALL
    export HADOOP_HDFS_HOME=$HADOOP_INSTALL
    export YARN_HOME=$HADOOP_INSTALL
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
    export PATH=$PATH:$HADOOP_INSTALL/sbin
    export PATH=$PATH:$HADOOP_INSTALL/bin
    export SPARK_DIST_CLASSPATH=$(hadoop classpath)
vi log4j.properties
    # in log4j.properties, lower console logging to WARN:
    log4j.rootCategory=WARN, console
cd ~/spark
sbin/start-all.sh
bin/pyspark
- Hadoop Resource Manager : http://localhost:8088/
- Hadoop Node Manager : http://localhost:8042/
- Pyspark : http://localhost:4040/
- Spark : http://localhost:8080/
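Once sbin/start-all.sh has completed and the web UIs above respond, a short smoke test inside bin/pyspark confirms that jobs actually run (a sketch; the HDFS path in the commented line is hypothetical and only works if such a file exists):
data = sc.parallelize(range(1000))
data.filter(lambda n: n % 2 == 0).count()   # expect 500
# if HDFS is up, reading a file from it also exercises the Hadoop classpath set in spark-env.sh:
# sc.textFile("hdfs://localhost:9000/user/eduuser/sample.txt").count()   # hypothetical path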
Running pyspark
Sample 1
help(sqlContext)
q      #--- press q to quit the help pager
df = sqlContext.read.json("file:///home/eduuser/spark/examples/src/main/resources/people.json")
df.show()
df.printSchema()
df.select("name").show()
df.select(df['name'], df['age'] + 1).show()
df.filter(df['age'] > 21).show()
df.groupBy("age").count().show()
quit()
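Since the installation above is Spark 2.3.0, the same example can also be written against the SparkSession entry point, which the pyspark shell exposes as the variable spark alongside sqlContext; a minimal equivalent sketch:
df = spark.read.json("file:///home/eduuser/spark/examples/src/main/resources/people.json")
df.filter(df['age'] > 21).show()
df.groupBy("age").count().show()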
Sample 2
from pyspark.sql import Row
lines = sc.textFile("file:///home/eduuser/spark/examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.show()
teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name)   # DataFrames in Spark 2.x have no map(); go through .rdd
for teenName in teenNames.collect():
    print(teenName)
quit()
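The temp-table-plus-SQL step in Sample 2 can also be expressed directly with the DataFrame API; a short equivalent sketch reusing the schemaPeople DataFrame from above:
teenagers2 = schemaPeople.filter((schemaPeople.age >= 13) & (schemaPeople.age <= 19)).select("name")
teenagers2.show()
for row in teenagers2.collect():
    print("Name: " + row.name)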
Sample 3
from pyspark.sql.types import *
lines = sc.textFile("file:///home/eduuser/spark/examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: (p[0], p[1].strip()))
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
schemaPeople = sqlContext.createDataFrame(people, schema)
schemaPeople.registerTempTable("people")
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.show()
teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name)   # same .rdd workaround as in Sample 2
for teenName in teenNames.collect():
    print(teenName)
quit()
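In Sample 3 both columns are declared as StringType, so the age comparison in the SQL relies on implicit casting; a variant with an explicit IntegerType column avoids that (a sketch reusing parts from above):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
typed_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
typed_people = parts.map(lambda p: (p[0], int(p[1].strip())))
typed_df = sqlContext.createDataFrame(typed_people, typed_schema)
typed_df.printSchema()
typed_df.filter(typed_df.age.between(13, 19)).select("name").show()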