
Add sink timestamp section into file names #173

Closed
chuwy opened this issue Apr 22, 2020 · 3 comments · Fixed by #191

@chuwy
Contributor

chuwy commented Apr 22, 2020

Currently file names are generated here: https://github.com/snowplow/snowplow-s3-loader/blob/master/src/main/scala/com.snowplowanalytics.s3/loader/KinesisS3Emitter.scala#L150 with only the year/month/day and the Kinesis sequence numbers, which are mostly useless if we want to replay some set of data from S3.

/cc @istreeter

@colmsnowplow

colmsnowplow commented Apr 22, 2020

There is another purpose to this requirement, and an important one. If the path doesn't follow a particular convention, then it is not possible to partition the data and limit queries via Athena.

Ideally the S3 loader will be able to create files in a manner which allows one to easily create partitioned Athena tables and load only certain partitions (the old-format run= convention served this purpose well).

Edit for clarity: The convention to follow is key=value in the name, e.g. run=2020-04-23. I'm unsure at present whether this must be a directory name, or whether following this convention in the filename is sufficient.
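
For illustration only, here is a minimal sketch of what a key=value object key could look like. It assumes the key=value segment is used as a directory-style prefix, which is one of the two options raised above and is not confirmed in this thread. PartitionedKeys, partitionPrefix and objectKey are hypothetical names and are not part of the S3 loader.

```scala
import java.time.LocalDate

// Hypothetical helper, not part of snowplow-s3-loader.
object PartitionedKeys {

  // Builds a prefix such as "run=2020-04-23/".
  // LocalDate.toString renders the date as ISO yyyy-MM-dd.
  def partitionPrefix(key: String, day: LocalDate): String =
    s"$key=$day/"

  // Prepends the partition prefix to an already generated file name.
  // Assumption: the partition is expressed as a directory-style prefix
  // rather than inside the file name itself.
  def objectKey(key: String, day: LocalDate, fileName: String): String =
    partitionPrefix(key, day) + fileName
}
```

With these assumptions, PartitionedKeys.objectKey("run", LocalDate.of(2020, 4, 23), "part.gz") yields run=2020-04-23/part.gz.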

@benjben
Contributor

benjben commented May 1, 2020

We could also use this parameter to partition data by date.

@benjben
Contributor

benjben commented Dec 22, 2020

With this change, a file name such as

2020-12-22-49613548169053493378838656625866917741098839399456571394-49613548169053493378838657029012246511267725570343960578.gz

becomes

2020-12-22-125000-49613548169053493378838656625866917741098839399456571394-49613548169053493378838657029012246511267725570343960578.gz

where 125000 stands for 12:50:00.
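
The naming scheme described above could be sketched roughly as follows. This is not the code from the actual fix; SinkFileNames and fileName are hypothetical names, and rendering the time in UTC is an assumption, since the thread does not state a time zone.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical helper, not the implementation used by snowplow-s3-loader.
object SinkFileNames {

  // Date followed by the time of day as HHmmss, e.g. "2020-12-22-125000".
  // Assumption: the sink timestamp is rendered in UTC.
  private val formatter: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd-HHmmss").withZone(ZoneOffset.UTC)

  // Joins the formatted sink time, the first and last Kinesis sequence
  // numbers of the batch, and the file extension.
  def fileName(sinkTime: Instant, firstSeq: String, lastSeq: String, extension: String): String =
    s"${formatter.format(sinkTime)}-$firstSeq-$lastSeq.$extension"
}
```

Called with Instant.parse("2020-12-22T12:50:00Z"), the two sequence numbers from the example and "gz", this produces the second file name shown above.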
