-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Web types #21
Web types #21
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take a more in depth look at it this evening. I'm not sure DOS is going to be a big issue, I guess if the user passes in a giant buffer then it could be!? We may want to enforce the 8192 bytes limit on Infer::get
then, since there could be other matchers that are vulnerable to it?
If performance is really an issue here (I haven't benchmarked it yet), we may want to play with something like https://rust.godbolt.org to see if we can make the compiler elide bounds checks, optimize out unreachable branches or even get it to optimize the loops by making it generate SIMD instructions? Here's what the situation is like at the moment, if you feel like it's not fast enough and want to play with it https://rust.godbolt.org/z/o7rz1W
Anyway, I'll test this later, but looking at the current implementation it should be fine. We don't need to get extreme performance here, and I certainly wouldn't want to add any unsafe code...
src/map.rs
Outdated
( | ||
MatcherType::APP, | ||
"application/javascript", | ||
"js", | ||
matchers::no_match | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want this. I feel like this would be something the mime_guess crate should do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am using Infer
in a web context where files have both a name (with extension) and streamed content (from network or local disk). When sniffing mime the priority is given to the content, and I fallback to file extension if the match fails. For some types, such Javascript or CSS, the sniffing can only be done using the file's extension.
I was thinking that infer would address both cases. If i have to use mime_guess
in addition this is ok, but it seems to be unmaintained. I have a PR to fix a critical panic opened there for a month by the way.
Also if we decided that file extension matching is out of Infer scope then we should probably remove the Infer::get_from_ext
method that I introduced in #16. On the other side If we decide that file extension matching is within Infer
scope then I have several enhancements planned. For instance we could handle mime type aliases as well as file extension aliases.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think functionality like mime_guess
's could fit into the infer crate, or at least we could argue it would have a similar api. Maybe the builtin extension matchers would be in a giant match
instead of an Iterator but still...
I also use both infer
and mime_guess
, and I could benefit by having something like custom matchers for extensions, since my infer function currently looks like something like this:
pub fn infer<P: AsRef<Path>>(data: &[u8], path: P) -> String {
// using infer from the master branch
if let Some(kind) = infer::get(&data) {
// look at the comment at the bottom on why I have to use String
return kind.mime_type().into();
}
let path = path.as_ref();
// this is a bit ugly
let ext = path
.extension()
.and_then(|ext| ext.to_str())
.unwrap_or_default()
.to_lowercase();
match ext.as_str() {
// some RAW Image formats
"arw" => "image/x-sony-arw".into(),
"cr2" => "image/x-canon-cr2".into(),
"crw" => "image/x-canon-crw".into(),
"dng" => "image/x-adobe-dng".into(),
"erf" => "image/x-epson-erf".into(),
"nef" => "image/x-nikon-nef".into(),
"orf" => "image/x-olympus-orf".into(),
"rw2" => "image/x-panasonic-rw2".into(),
// unfortunately I have to return a String, even though everything else is a &'static str
// because mime_guess returns a lifetime that doesn't last enough, even though they
// could return a &'static str if they didn't depend on the `mime` crate
ext => mime_guess::from_ext(ext)
.first_or_octet_stream()
.essence_str()
.into(),
}
}
I think we only need to think if this is the best solution. We already have the mime_type and the extension for formats who also have a matcher, probably the biggest issue here is the pub(crate) fn no_match(_buf: &[u8]) -> bool { false }
"hack".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we only need to think if this is the best solution. We already have the mime_type and the extension for formats who also have a matcher, probably the biggest issue here is the
pub(crate) fn no_match(_buf: &[u8]) -> bool { false }
"hack".
I share your concern about the "hack", but it does the job without the need for further internal refactoring at this stage.
Maybe we can keep it for now until we go further into a mime_guess & infer merge ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about it. What we could do is have the matcher inside infer::Type be an Option that takes None for the extension only matchers. Then we could put those matchers in a separate file from the magic bytes matchers, so that we could code generate them like mime_guess does without having it be a huge mess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But still I'm not sure about it, it feels like one crate is doing the work two different crates should do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am totally open to have mime_guess
address extensions sniffing while infer
would address content sniffing. In such case we should backout #16 which does not make much sense if we do not merge mime_guess into infer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would propose then that we move this discussion to a dedicated issue and that I remove the CSS and Javascript support with the no_match
hack for now, so that we can merge this PR.
In the new issue we will discuss what features should Infer
provide regarding file extensions and eventually decide to remove get_from_ext
.
Is that ok for you ?
It makes sense indeed to have this addressed globally. I opened #22 to track this outside of this PR.
Very interesting, thank you for sharing. I do not have performance issues related to |
Personally I run it on a background worker, so I don't mind having it be a bit slower. In some cases I myself pass the entire file to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggested an optimization
Before: https://rust.godbolt.org/z/jGv46j
After: https://rust.godbolt.org/z/dfaocx
I think this is ready, but as discussed I don't think We should be able to get the |
Regarding my comment above in the code review I understand that:
Is my understanding correct ? |
Ops sorry didn't see your second comment. Right, let's open an issue about that so we can further discuss support for extension matching. For the moment let's leave #16 as is and decide before the release what to do with it |
A 30x optimization making the compiler optimize away a panic were &buf[i..] could be an invalid index Co-authored-by: Paolo Barbolini <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @ririsoft for all of the work. Merging now
Here we go, hopping I did not misunderstood your points. |
I gave another look at the mime_guess repo and it's a bit sad seeing how he's ignoring every PR and not doing anything with it. Anyway, it looks like there are already some alternatives to it https://crates.io/crates/mime-db, but very few crates are using it |
I just opened #24 to continue the discussion on the file extension topic. |
Add some common web types support, using the whatwg specification.
Since the algorithm is stripping white spaces I am wondering wether we should limit the buffer size to 256 or 512 bytes (like other implementations do, Go http lib among others) or live this to the responsibility of the caller. Not doing this make this crate vulnerable to DOS attacks with very long space strings. Doing this limits the possibilities for the calling application. Maybe we should leave that to the caller but add a small comment about the danger. Comments welcome.
Cheers