Bigdata
Managing Spark dataframes in Python
Below is a quick sample of using Apache Spark (2.0) dataframes to manipulate data. The sample data is a file of JSON lines like:
```
{"description": "255/40 ZR17 94W", "ean": "EAN: 4981910401193", "season": "tires_season summer", "price": "203,98", "model": "Michelin Pilot Sport PS2 255/40 R17", "id": "MPN: 2351610"}
{"description": "225/55 R17 101V XL", "ean": "EAN: 5452000438744", "season": "tires_season summer", "price": "120,98", "model": "Pirelli P Zero 205/45 R17", "id": "MPN: 530155"}
```
```
from pyspark.sql import SparkSession
from pyspark.
```
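The excerpt above is cut off, so what follows is only a minimal sketch of the kind of cleanup such data invites, not the original code. It assumes the goal is to strip the `EAN: `, `MPN: ` and `tires_season ` prefixes and turn the comma-decimal price into a number. The function names `clean_record` and `clean_dataframe` are hypothetical. The first is a pure-Python statement of the rules, the second applies the same rules with Spark column functions.

```python
import re


def clean_record(rec):
    """Pure-Python version of the cleanup rules, handy for checking them on one record."""
    cleaned = dict(rec)
    cleaned["id"] = re.sub(r"^MPN:\s*", "", rec["id"])
    cleaned["ean"] = re.sub(r"^EAN:\s*", "", rec["ean"])
    cleaned["season"] = re.sub(r"^tires_season\s*", "", rec["season"])
    # Prices use a decimal comma ("203,98"), so swap it before converting.
    cleaned["price"] = float(rec["price"].replace(",", "."))
    return cleaned


def clean_dataframe(spark, path):
    """Apply the same rules to a JSON-lines file with Spark dataframe functions.

    `spark` is an existing SparkSession; `path` is an assumed input location.
    """
    # Imported lazily so the pure-Python helper works without Spark installed.
    from pyspark.sql import functions as F

    df = spark.read.json(path)  # one JSON object per line
    return (df
            .withColumn("id", F.regexp_replace("id", r"^MPN:\s*", ""))
            .withColumn("ean", F.regexp_replace("ean", r"^EAN:\s*", ""))
            .withColumn("season", F.regexp_replace("season", r"^tires_season\s*", ""))
            .withColumn("price", F.regexp_replace("price", ",", ".").cast("double")))
```

With a running session, `clean_dataframe(spark, "tires.jsonl").show()` would display the cleaned rows; the file name here is made up.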
Bigdata
Any faster alternative to #Hadoop HDFS?
I’d like an alternative to Hadoop HDFS, a faster, non-Java filesystem:
- S3: S3 Support in Apache Hadoop, if your servers are hosted at Amazon AWS
- Ceph: using Hadoop with Ceph
- GlusterFS: managing Hadoop compatible storage
- Lustre: running Hadoop with Lustre
- OpenStack Swift: Hadoop OpenStack Support: Swift Object Store
- XtreemFS: there is a Hadoop client

Which is better? Any suggestions?

References:
[1] https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems
Bigdata
Apache Spark howto import data from a jdbc database using python
Using Apache Spark 2.0 and Python, I’ll show how to import a table from a relational database (using its JDBC driver) into a Spark dataframe and save it as a Parquet file. In this demo the database is Oracle 12.x. File jdbc-to-parquet.py:
```
from pyspark.sql import SparkSession

# Create (or reuse) the Spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

# Read the table through the Oracle JDBC driver
df = spark.read.format("jdbc").options(
    url="jdbc:oracle:thin:ro/ro@mydboracle.redaelli.org:1521:MYSID",
    dbtable="myuser.dim_country",
    driver="oracle.jdbc.OracleDriver").load()

# Save the dataframe in Parquet format
df.write.parquet("country.parquet")
```
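If several tables need importing, the connection settings can be kept apart from the read call. The sketch below is one possible way to do that, not part of the original script: `oracle_jdbc_options` and `table_to_parquet` are hypothetical helper names, and passing the credentials as the separate `user` and `password` JDBC options instead of embedding them in the URL is a choice made here.

```python
def oracle_jdbc_options(host, port, sid, table, user, password):
    """Build the options for spark.read.format("jdbc") for an Oracle thin connection."""
    return {
        "url": "jdbc:oracle:thin:@{}:{}:{}".format(host, port, sid),
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "oracle.jdbc.OracleDriver",
    }


def table_to_parquet(spark, options, out_path):
    """Read one table over JDBC and save it as Parquet.

    `spark` is an existing SparkSession; the Oracle JDBC jar must be on
    Spark's classpath for the read to succeed.
    """
    df = spark.read.format("jdbc").options(**options).load()
    # Overwrite so that the import can be re-run without failing on existing output.
    df.write.mode("overwrite").parquet(out_path)
    return df
```

For the table in this demo the call would look like `table_to_parquet(spark, oracle_jdbc_options("mydboracle.redaelli.org", 1521, "MYSID", "myuser.dim_country", "ro", "ro"), "country.parquet")`.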
Bigdata
Analyzing huge sensor data in near realtime with Apache Spark Streaming
For this demo I downloaded and installed Apache Spark 1.5.1. Suppose you have a stream of data from several (industrial) machines like:
```
MACHINE,TIMESTAMP,SIGNAL1,SIGNAL2,SIGNAL3,...
1,2015-01-01 11:00:01,1.0,1.1,1.2,1.3,...
2,2015-01-01 11:00:01,2.2,2.1,2.6,2.8,...
3,2015-01-01 11:00:01,1.1,1.2,1.3,1.3,...
1,2015-01-01 11:00:02,1.0,1.1,1.2,1.4,...
1,2015-01-01 11:00:02,1.3,1.2,3.2,3.3,...
...
```
Below is a system, written in Python, that reads data from a stream (use the command "nc -lk 9999" to send data to the stream) and every 10 seconds collects alerts from signals. An alert means at least 4 suspicious values of a specific signal of the same machine.
```
from pyspark import SparkContext
from pyspark.
```
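Since the excerpt stops before the actual code, here is a minimal sketch of how such an alert pipeline could be put together with the DStream API. It is not the original program. The excerpt does not say what makes a value "suspicious", so the threshold `SUSPICIOUS_ABOVE` is purely an assumption, as are the names `suspicious_signals` and `run`.

```python
SUSPICIOUS_ABOVE = 3.0  # hypothetical threshold; the real rule is not given
MIN_HITS = 4            # alert when a signal is suspicious at least 4 times per batch


def suspicious_signals(line):
    """Return [((machine, signal_index), 1), ...] for each suspicious value in one CSV line."""
    fields = line.strip().split(",")
    if len(fields) < 3:
        return []
    machine = fields[0]
    hits = []
    # fields[1] is the timestamp; the signals start at fields[2] and are numbered from 1.
    for index, raw in enumerate(fields[2:], start=1):
        try:
            value = float(raw)
        except ValueError:
            continue  # header cells or malformed values
        if value > SUSPICIOUS_ABOVE:
            hits.append(((machine, index), 1))
    return hits


def run(host="localhost", port=9999, batch_seconds=10):
    """Print, for every batch, the (machine, signal) pairs that reached MIN_HITS."""
    # Imported lazily so the parsing helper can be tried without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="SensorAlerts")
    ssc = StreamingContext(sc, batch_seconds)
    alerts = (ssc.socketTextStream(host, port)
              .flatMap(suspicious_signals)
              .reduceByKey(lambda a, b: a + b)
              .filter(lambda kv: kv[1] >= MIN_HITS))
    alerts.pprint()
    ssc.start()
    ssc.awaitTermination()
```

Keeping the per-line logic in a plain function means it can be checked on sample lines before it is handed to `flatMap`.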
Bigdata
TwitterPopularTags.scala example of Apache Spark Streaming in a standalone project
This is an easy tutorial on using Apache Spark Streaming with Scala, taking the official TwitterPopularTags.scala example and putting it in a standalone sbt project. In a few minutes you will be able to receive streams of tweets and manipulate them in real time with Apache Spark Streaming.
- Install Apache Spark (I used 1.5.1)
- Install sbt
- git clone https://github.com/matteoredaelli/TwitterPopularTags
- cd TwitterPopularTags
- cp twitter4j.properties.sample twitter4j.properties
- edit twitter4j.properties
- sbt package
- spark-submit --master local --packages "org.
Bigdata
A comparison of Hadoop distributions
Read the full article at http://www.hadoop360.com/blog/hadoop-whose-to-choose
Bigdata
Apache Spark news from a Spark Summit 2015
GOAL: unified engine across data sources, workloads and environments.
Highlights: dataframes (1.3), SparkR (1.4), …
See all videos and slides at http://spark-summit.org
Bigdata
Hortonworks, IBM and Pivotal begin shipping standardized Hadoop
“Hortonworks, IBM and Pivotal begin shipping standardized Hadoop. The standardization effort is part of the Open Data Platform initiative, which is an industry effort to ensure all versions of Hadoop are based on the same Apache core...”. Read the full zdnet.com article. This is thanks to the Apache Ambari project!
Cloudera still goes it alone with its old custom solution, and MapR bets on Apache Mesos (see https://lnkd.in/e_ZwNn5). I also suggest installing Ambari/Hadoop in a Docker container.
Bigdata
Hadoop on Twitter 2015-02
Below are the top media (and other statistics) for the “Hadoop” search on Twitter in February 2015.
Bigdata
Statistics of #Hadoop #Opensource #Expo2015 tweets in January 2015
Analyzing JSON Twitter data is easy with Apache Spark and Apache Hadoop. Below are some examples:
- Tweets about Expo2015
- Tweets about Hadoop
- Tweets about opensource
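The post does not show how the statistics were computed, so the following is only a sketch of one way to count hashtags with Spark. It assumes the tweets are stored one JSON object per line in the standard Twitter layout, where hashtags sit under `entities.hashtags[].text`. The function names `hashtags` and `top_hashtags` and the input path are hypothetical.

```python
import json


def hashtags(line):
    """Return the lowercased hashtags of one tweet given as a JSON line."""
    try:
        tweet = json.loads(line)
    except ValueError:
        return []  # skip malformed lines
    if not isinstance(tweet, dict):
        return []
    entities = tweet.get("entities") or {}
    return [tag["text"].lower() for tag in entities.get("hashtags", []) if "text" in tag]


def top_hashtags(sc, path, n=10):
    """Return the n most frequent hashtags as (hashtag, count) pairs.

    `sc` is an existing SparkContext; `path` may point to a local or HDFS location.
    """
    counts = (sc.textFile(path)
              .flatMap(hashtags)
              .map(lambda tag: (tag, 1))
              .reduceByKey(lambda a, b: a + b))
    # Order by descending count and bring only the top n back to the driver.
    return counts.takeOrdered(n, key=lambda kv: -kv[1])
```

A call such as `top_hashtags(sc, "hdfs:///tweets/2015-01/*.json")` would then return the most used hashtags for that month; the path is made up.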