Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cgi_mode to parse_header to support LF in CGI headers #166

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

paulownia
Copy link
Contributor

fixes #165

Adds cgi_mode option to parse_header method to allow bare LF line breaks in CGI headers.

Copy link
Contributor

@jeremyevans jeremyevans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good. I have a few suggestions, please let me know what you think.

Comment on lines 176 to 177
header_line = Regexp.new(/^([A-Za-z0-9!\#$%&'*+\-.^_`|~]+):([^\r\n\0]*?)#{line_break}\z/m)
continued_header_lines = Regexp.new(/^[ \t]+([^\r\n\0]*?)#{line_break}/m)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allocates 2 regexps per call. Can we instead add 4 Regexp constants (2 for CGI mode and 2 for non-CGI mode), and use those constants?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. That makes a lot of sense, especially for a frequently called function like this. I'll update it to use constants.

@@ -26,7 +26,7 @@ module WEBrick

HEADER_CLASSES: Hash[String, untyped]

def self?.parse_header: (String raw) -> Hash[String, Array[String]]
def self?.parse_header: (String raw, ?bool cgi_mode) -> Hash[String, Array[String]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would also need updating for the keyword argument, but I've never written RBS before, so I'm not sure how.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have never written RBS either, but when I run rbs, it outputs the following:

 def self?.parse_header: (untyped raw, ?cgi_mode: bool) -> untyped

So I think it would be written like this:

 def self?.parse_header: (String raw, ?cgi_mode: bool) -> Hash[String, Array[String]]

paulownia and others added 4 commits March 25, 2025 11:40
@paulownia
Copy link
Contributor Author

I've applied the suggested changes.

Copy link
Contributor

@jeremyevans jeremyevans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you!

@jeremyevans jeremyevans requested a review from ioquatix March 25, 2025 14:12
REGEXP_CONTINUED_HEADER_LINE = /^[ \t]+([^\r\n\0]*?)\r\n/m
REGEXP_CONTINUED_CGI_HEADER_LINE = /^[ \t]+([^\r\n\0]*?)\r?\n/m

def parse_header(raw, cgi_mode: false)
Copy link
Member

@ioquatix ioquatix Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a public interface change? or is it internal to CGIHandler?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a public interface change. However, adding an optional keyword argument should be a backwards compatible change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case, I'd like to set the bar a little higher on the naming of cgi_mode & related documentation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering the general state of WEBrick's documentation, lack of documentation hardly seems like a blocker (though documentation improvements are obviously welcomed). If you don't like the argument name, please pick a new one (allow_bare_lf?) and I'm sure we can switch to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I've updated the comment for parse_header.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not strongly against cgi_mode: true/false but.. bear with me and I'll explain my thinking.

The only difference between the two patterns is the line terminator right?

Correct.

I feel like it's a bit convoluted to have two completely different regex, and then use a boolean to select which ones to use. It's not very "DRY" although that isn't a strict requirement, just a general feeling.

The separate regexes are a performance optimization, so we don't need to allocate 2 regex per call.

There is almost certainly never going to be code that dynamically selects between one or the other, in other words, cgi_mode: true or cgi_mode: false is likely going to be hard coded by the caller.

Yes, but that's also true of many methods in Ruby, so I don't see why it should be a blocker.

Maybe it's not possible to change the interpretation of raw but IMHO, if we are expected to provide a list of lines, I'd change that to be an Array of lines without line termination, in other words, the CGI parser should be responsible for reading lines, not the header parser. That way, it would be the responsibility of the CGI parser to determine "what is a line" rather than trying to shoe-horn that into parse_header.

I don't want to make structural changes when they aren't necessary to fix a bug.

In addition, are there any other places (e.g. chunked encoding) where we need to handle this distinction? Maybe if we know that, we can better decide the appropriate design.

AFAIK, CGI mode is the only place. Chunked encoding requires CRLF I believe.

If this is too pedantic we can just accept cgi_mode but I hope this gives better context around my thinking and feedback.

I think it's too pendantic. IMO, we should accept this fix, and if you want to make structural changes before the next release, then do so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The separate regexes are a performance optimization, so we don't need to allocate 2 regex per call.

If these are an implementation detail, can we make them private?

Yes, but that's also true of many methods in Ruby, so I don't see why it should be a blocker.

It's not, it's an observation to explain my position.

I don't want to make structural changes when they aren't necessary to fix a bug.

Sometimes the shortest path from A to B is not the best one.

As this is a CGI specific code path, my preference is for this code not to leak outside CGIHandler. I'd like to hear back from @paulownia but Jeremy I don't mind if you merge this after that. I am not planning on fixing WEBRick's design issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the detailed explanation. I understand the point about separating line reading from header parsing, and keeping CGI-specific code within CGIHandler. I agree that a cleaner design would be ideal if possible.

Since no major design changes are required, I will move the CGI-related code. But would simply moving the two constants to CGIHandler be sufficient? I'm not sure this is the best approach—any suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and marking them as private would also be a good idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the code, but it seems better to pass the Regexp itself instead of cgi_mode. This way, we can use private_constant to make the constants completely private.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

CGI script using LF line breaks in headers causes error
3 participants