This repository was archived by the owner on May 3, 2019. It is now read-only.

Commit 561d749

Added README
1 parent bfa06e8 commit 561d749

1 file changed: +119 -0 lines changed

README.md
@@ -0,0 +1,119 @@
# JSON Serde for Hive

## Features

* Full support for arrays, maps and structures
* Automatic column to field mapping using table DDL
* Map keys are case-insensitive for convenience
* Optional ignoring of bad records

## Setup

Compile using `mvn clean package`, or download the release JAR:

    curl -L http://bit.ly/mRYaNB > hive-serde-1.0.jar

Register the JAR with Hive:

    add jar hive-serde-1.0.jar;

## Examples

### Simple Table

Create the table:

    CREATE EXTERNAL TABLE message (
      messageid string,
      messagesize int
    )
    ROW FORMAT SERDE 'com.proofpoint.hive.serde.JsonSerde'
    LOCATION '/tmp/json';

Corresponding JSON record:

    {
      "messageId": "34dd0d3c-f53b-11e0-ac12-d3e782dff199",
      "messageSize": 12345
    }

Notice that the JSON field names can contain upper case characters.
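
As a quick check of the mapping, queries along the following lines should return the values from the record above. This is only a sketch: it assumes the JSON file has already been placed under `/tmp/json`, and the second statement relies on Hive treating column names case-insensitively.

```sql
-- Columns declared in lowercase are filled from the camelCase JSON fields.
SELECT messageid, messagesize FROM message;

-- Hive column names are case-insensitive, so this refers to the same columns.
SELECT messageId, messageSize FROM message;
```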

### Ignoring Errors

Create a table and set the `errors.ignore` serde property:

    CREATE EXTERNAL TABLE message (
      messageid string,
      messagesize int
    )
    ROW FORMAT SERDE 'com.proofpoint.hive.serde.JsonSerde'
    WITH SERDEPROPERTIES ('errors.ignore' = 'true')
    LOCATION '/tmp/json';

With the default `errors.ignore` value of `false`, an error in any record
will cause the entire query to fail.

When set to `true`, if a record has errors, then every column for that
record will be `NULL`. This is a limitation of the Hive serde API.
Unfortunately, it is not possible for the serde to cause Hive to skip the
record entirely. However, if you have a column that is never `NULL`, such
as the primary key, you can use this column to filter out bad records:

    SELECT * FROM message WHERE messageid IS NOT NULL;

This logic can be encapsulated into a view:

    CREATE VIEW v_message AS
    SELECT * FROM message WHERE messageid IS NOT NULL;

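Because a failed record surfaces as a row of `NULL`s, the same column can also be used to gauge how much data is being dropped. A minimal sketch, assuming `messageid` is present in every valid record (the alias `bad_records` is just an illustrative name):

```sql
-- Count rows where the key column is NULL. With 'errors.ignore' = 'true'
-- these are the records the serde could not parse, plus any valid records
-- that genuinely lack a messageid.
SELECT COUNT(*) AS bad_records
FROM message
WHERE messageid IS NULL;
```
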
### Nested Structures

Create the table:

    CREATE EXTERNAL TABLE message (
      messageid string,
      messagesize int,
      sender string,
      recipients array<string>,
      messageparts array<struct<
        extension: string,
        size: int
      >>,
      headers map<string,string>
    )
    ROW FORMAT SERDE 'com.proofpoint.hive.serde.JsonSerde'
    LOCATION '/tmp/json';

Corresponding JSON record:

    {
      "messageId": "34dd0d3c-f53b-11e0-ac12-d3e782dff199",
      "messageSize": 12345,
      "sender": "[email protected]",
      "recipients": ["[email protected]", "[email protected]"],
      "messageParts": [
        {
          "extension": "pdf",
          "size": 4567
        },
        {
          "extension": "jpg",
          "size": 9451
        }
      ],
      "headers": {
        "Received-SPF": "pass",
        "X-Broadcast-Id": "9876"
      }
    }

Query the table:

    SELECT
      messageid,
      recipients[0],
      SIZE(recipients) AS recipient_count,
      messageParts[0].extension,
      headers['received-spf']
    FROM message;

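Arrays can also be flattened into one row per element with Hive's built-in `explode` function and `LATERAL VIEW`. The following is a sketch against the table above, not something taken from the serde itself; the aliases `r`, `p`, `recipient`, and `part` are arbitrary names chosen for illustration.

```sql
-- One output row per recipient of each message.
SELECT messageid, recipient
FROM message
LATERAL VIEW explode(recipients) r AS recipient;

-- One output row per message part, with a struct field unpacked.
SELECT messageid, part.extension
FROM message
LATERAL VIEW explode(messageparts) p AS part;
```
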