Skip to content

adds base64 decoding (fixes #5656) #9157

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Dec 31, 2014

Conversation

amartgon
Copy link

fixes #5656

#############################################################################

# Based on code by Stefan Karpinski from https://github.com/hackerschool/WebSockets.jl (distributed under the same MIT license as Julia)

const b64chars = ['A':'Z','a':'z','0':'9','+','/']

const revb64chars = Dict('A'=> 0, 'B'=> 1, 'C'=> 2, 'D'=> 3, 'E'=> 4, 'F'=> 5, 'G'=> 6, 'H'=> 7, 'I'=> 8, 'J'=> 9,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not really looked at the code, but wouldn't using an array be much more efficient here? You could convert the character to an Int and directly index into the array, with a few if to transform the codepoint of the character into a position in the array (i.e. one condition for upper-case letters, one for lower-case letters, one for digits, and one for + and /).

@timholy
Copy link
Member

timholy commented Nov 26, 2014

There's an implementation in the Codecs.jl package. Out of curiosity, why is it necessary to have this in base given that Codecs exists?

'e'=> 30, 'f'=> 31, 'g'=> 32, 'h'=> 33, 'i'=> 34, 'j'=> 35, 'k'=> 36, 'l'=> 37, 'm'=> 38, 'n'=> 39,
'o'=> 40, 'p'=> 41, 'q'=> 42, 'r'=> 43, 's'=> 44, 't'=> 45, 'u'=> 46, 'v'=> 47, 'w'=> 48, 'x'=> 49,
'y'=> 50, 'z'=> 51, '0'=> 52, '1'=> 53, '2'=> 54, '3'=> 55, '4'=> 56, '5'=> 57, '6'=> 58, '7'=> 59,
'8'=> 60, '9'=> 61, '+'=> 62, '/'=> 63)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'll get better performance if you use an array and the integer value of the character code to lookup the values. (Be sure to use an unchecked conversion from Char to Int, so that you only check for valid range once.)

EDIT: As the values are sequential, you'll might get better performance using @nalimilan's trick with if 'A' <= x <= 'Z' y = x - 'A'

@amartgon
Copy link
Author

El 26/11/2014 10:05, Tim Holy escribió:

There's an implementation in the Codecs.jl package. Out of curiosity,
why is it necessary to have this in base given that Codecs exists?

I think the issue was created because Multimedia module (in base) uses
the base64 encoder. An then it was confusing for users not having the
corresponding decoder.

# someday reads?) into base64 encoded (decoded) data send to a stream.
# (You must close the pipe to complete the encode, separate from
# closing the target stream). We also have a function base64(f,
# Base64EncodePipe is a pipe-like IO object, which converts into base64 data sent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a trailing whitespace on this line (and a few other lines), and it makes our CI fail (on purpose, because many developers sees trailing whitespace as a poor sign of code quality).

If you type mv .git/hooks/pre-commit.sample .git/hooks/pre-commit at the root of your repository, you'll get a warning when trying to commit trailing whitespace in that local repository.

@timholy
Copy link
Member

timholy commented Nov 26, 2014

Sounds reasonable, but I bet the implementation in Codecs solves the performance issues noticed here.

@amartgon
Copy link
Author

Hi again.
Thanks for the comments. I've updated the code in the following ways:

  • Improve efficiency by replacing the dictionary by an array, in he same way as is done in the Codecs module
  • removed trailing whitespaces
  • Moved the old aliases as const to deprecated.jl
  • Replaced references to the old aliases in other files.

end
@inbounds u = revb64chars[encvec[1]]
@inbounds v = revb64chars[encvec[2]]
decvec = [(u << 2) | (v >> 4)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that a whole array needs to be allocated for every 4 bytes? That seems unlikely to have good performance.

@timholy
Copy link
Member

timholy commented Dec 14, 2014

Sorry this sat so long. I restarted the failed Travis job, we'll see if it has more luck this time.

@amartgon
Copy link
Author

After this modification, the arrays are created only once and reused. According to my (very basic and incomplete) performance tests the code is now about 40% faster than last version.

@amartgon
Copy link
Author

Hi! Any idea why the AppVeyor build failed? Is it related to the changes in the base64 module?

@ivarne
Copy link
Member

ivarne commented Dec 26, 2014

Hard to say. Usually Julia only changes that causes a crash is considered a bug, but as you use @inbounds that might not be the case. We have some random issues on AppVeyor so I restarted the build, to see if it fails a second time.

timholy added a commit that referenced this pull request Dec 31, 2014
@timholy timholy merged commit 3109867 into JuliaLang:master Dec 31, 2014
@timholy
Copy link
Member

timholy commented Dec 31, 2014

Sorry it took so long to merge this. Great stuff, many thanks!

@@ -246,6 +246,10 @@ const Uint128 = UInt128
@deprecate iround(x) round(Integer,x)
@deprecate iround{T}(::Type{T},x) round(T,x)

export Base64Pipe, base64
const Base64Pipe = Base64EncodePipe
const base64 = base64encode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be a @deprecate base64 ..., not a const? That way people will get a warning when they call base64.

@amartgon amartgon deleted the base64_decode_#5656 branch January 20, 2015 17:08
@ScottPJones
Copy link
Contributor

Would there be any interest in adding RFC4648 compliant "URL and Filename safe" base64 encoding to this (i.e. base64url)? This is a fairly common use case...
It is only using a different array for the encoding... (i.e. -_ instead of +/).
I had to write my own, but it would be nice if it were part of Julia instead.

@tkelman
Copy link
Contributor

tkelman commented May 3, 2015

Not sure base64 encoding is really fundamental enough to need in base at all. If Codecs.jl has some existing base64 encoding functionality, that would be a good place to extend it with additional variants.

@ScottPJones
Copy link
Contributor

@tkelman Frankly, with all the other stuff in Base, (multimedia.jl?), and with base64 already in Base, that's where I thought it would have to be... I'll look at Codecs.jl [but to be honest... how many people looking for base64 encoding are going to think about codecs? ;-) ] Thanks!

@tkelman
Copy link
Contributor

tkelman commented May 3, 2015

Package discoverability is a problem. There are multiple issues open on that.

base64 is in base because no one has tried to remove stringmime which appears to be the only place in base that it gets used. The MIME types are used by IJulia and a few other places for rich output display, but I imagine they could be adapted to get their base64 encoding from an external package if they needed to. #5155 is a very slow and arduous issue to make progress on, since removing code from base is harder to do in a painless way than adding code to base. But going further in the direction of "anything anyone ever found useful enough to implement in a PR to JuliaLang/julia" is probably not ideal for long-term maintainability.

@ScottPJones
Copy link
Contributor

@tkelman I agree totally... I think there's way too much stuff there as it is! ;-) I'm totally fine with adding RFC4648 base64url to Codecs.jl, if people think it would be useful (besides me, who's already using it in my own ocde), and that is the right place.

@timholy
Copy link
Member

timholy commented May 3, 2015

Agreed with the package idea.

# Decodes a base64-encoded string
function base64decode(s)
b = IOBuffer(s)
decoded = readall(Base64DecodePipe(b))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is wrong. It should be readbytes, not readall. The Base64-encoded data need not produce a valid UTF-8 string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's definitely wrong... esp. since most uses of base64 encoding are to store binary data, not Unicode text...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add base64 decode/reading
7 participants