Add sink timestamp section into file names #173

chuwy · 2020-04-22T14:04:53Z

Currently, file names are generated here: https://github.com/snowplow/snowplow-s3-loader/blob/master/src/main/scala/com.snowplowanalytics.s3/loader/KinesisS3Emitter.scala#L150. They contain only year/month/day and the Kinesis sequence numbers, which are mostly useless if we want to replay some set of data from S3.

/cc @istreeter

colmsnowplow · 2020-04-22T17:04:46Z

There is another purpose to this requirement, and an important one. If the path doesn't follow a particular convention, then it is not possible to partition the data and limit queries via Athena.

Ideally, the S3 loader would create files in a way that makes it easy to create partitioned Athena tables and load only certain partitions. (The old run= convention served this purpose well.)

Edit for clarity: the convention to follow is key=value in the name, e.g. run=2020-04-23. I'm unsure at present whether this must be a directory name, or whether using this convention in the filename is sufficient.
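
To make the key=value idea concrete, here is a minimal sketch in Python (not the loader's actual Scala code). The function name `partitioned_key`, the `enriched/` prefix, and the choice of UTC are assumptions made for illustration. It places the `run=` segment in the directory part of the key, which is the layout Hive-style partition discovery generally expects; whether a key=value token inside the file name alone would be enough is not settled in this thread.

```python
from datetime import datetime, timezone


def partitioned_key(prefix: str, sink_time: datetime, file_name: str) -> str:
    """Build an S3 key with a Hive-style run=YYYY-MM-DD directory segment.

    Hypothetical helper: the name and the UTC choice are assumptions,
    not taken from the S3 loader code base.
    """
    # Normalise to UTC so the partition does not depend on the host time zone.
    run = sink_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"{prefix.rstrip('/')}/run={run}/{file_name}"


if __name__ == "__main__":
    sink_time = datetime(2020, 4, 23, 10, 0, 0, tzinfo=timezone.utc)
    print(partitioned_key("enriched/", sink_time, "part-0001.gz"))
```

With a layout like this, a partitioned Athena table could declare `run` as a partition column and queries could filter on it to read only the matching prefixes.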

benjben · 2020-05-01T12:44:16Z

We could also use this parameter to partition data by date.

benjben · 2020-12-22T09:59:36Z

2020-12-22-49613548169053493378838656625866917741098839399456571394-49613548169053493378838657029012246511267725570343960578.gz becomes 2020-12-22-125000-49613548169053493378838656625866917741098839399456571394-49613548169053493378838657029012246511267725570343960578.gz, where 125000 stands for 12:50:00.
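
The new naming scheme can be sketched as follows, in Python rather than the loader's Scala. The function name `timestamped_file_name`, its parameters, and the use of UTC for the sink time are assumptions made for illustration; only the resulting name format comes from the example above.

```python
from datetime import datetime, timezone


def timestamped_file_name(sink_time: datetime, first_seq: str, last_seq: str,
                          extension: str = "gz") -> str:
    """Return <yyyy-MM-dd>-<HHmmss>-<first seq>-<last seq>.<extension>.

    Hypothetical helper: the real loader builds this name in Scala, and
    UTC is an assumption here.
    """
    # Date and time of the sink, with the time packed as HHmmss.
    stamp = sink_time.astimezone(timezone.utc).strftime("%Y-%m-%d-%H%M%S")
    # Sequence numbers stay strings: they are too long to treat as machine integers
    # in most languages and are only ever concatenated.
    return f"{stamp}-{first_seq}-{last_seq}.{extension}"


if __name__ == "__main__":
    name = timestamped_file_name(
        datetime(2020, 12, 22, 12, 50, 0, tzinfo=timezone.utc),
        "49613548169053493378838656625866917741098839399456571394",
        "49613548169053493378838657029012246511267725570343960578",
    )
    print(name)
```

Because the date and time are zero-padded and lead the name, files written through the same prefix sort lexicographically in sink order, which helps when selecting a time window to replay.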

benjben added this to the Version 0.7.1 milestone Dec 4, 2020

benjben added a commit that referenced this issue Dec 11, 2020

Add sink timestamp section into file names (close #173)

32b0801

benjben mentioned this issue Dec 11, 2020

Bumps and fix partitioning for bad rows #191

Merged

benjben added a commit that referenced this issue Dec 15, 2020

Add sink timestamp section into file names (close #173)

9b2a00f

benjben added a commit that referenced this issue Dec 22, 2020

Add sink timestamp section into file names (close #173)

13101b0

benjben closed this as completed in c908112 Dec 22, 2020
