New manpage format #1921

jnavila · 2024-11-17T21:54:48Z

Changes

This PR changes the way the AsciiDoc source of the manpages is processed, by adding the "synopsis" paragraph style and reworking the backtick format.

Context

The style change has been pushed to master and will be applied to git-clone and git-init in the next version.

dscho · 2024-11-18T18:28:43Z

I just triggered a pair of workflow runs to update the manual pages and to update the translated manual pages, fetched the result and rendered it locally. Here are two examples:

[Screenshots: English and French renderings, before and after]

Personally, I cannot spot any difference, apart from the version number (because this here PR branch is based on v2.46.2 while the updated manual pages include v2.47.0) and the incorrect =<regexp> on the "before" side of the French version (fixed on the "after" side).

Even looking at the HTML of the synopses (taking the French version, so that there is a known difference), I only see this:

diff --git a/before b/after
index 1a87d1348..6185fb72b 100644
--- a/before
+++ b/after
@@ -1,5 +1,5 @@
 <pre class="content"><em>git config list</em> [&lt;option-de-fichier&gt;] [&lt;option-d-affichage&gt;] [--includes]
-<em>git config get</em> [&lt;option-de-fichier&gt;] [&lt;option-d-affichage&gt;] [--includes] [--all] [--regexp=&lt;regexp&gt;] [--value=&lt;valeur&gt;] [--fixed-value] [--default=&lt;default&gt;] &lt;nom&gt;
+<em>git config get</em> [&lt;option-de-fichier&gt;] [&lt;option-d-affichage&gt;] [--includes] [--all] [--regexp] [--value=&lt;valeur&gt;] [--fixed-value] [--default=&lt;default&gt;] &lt;nom&gt;
 <em>git config set</em> [&lt;option-de-fichier&gt;] [--type=&lt;type&gt;] [--all] [--value=&lt;valeur&gt;] [--fixed-value] &lt;nom&gt; &lt;valeur&gt;
 <em>git config unset</em> [&lt;option-de-fichier&gt;] [--all] [--value=&lt;valeur&gt;] [--fixed-value] &lt;nom&gt; &lt;valeur&gt;
 <em>git config rename-section</em> [&lt;option-de-fichier&gt;] &lt;ancien-name&gt; &lt;nouveau-name&gt;

@jnavila what am I missing?

jnavila · 2024-11-18T21:40:44Z

The manpage of git-config has not been converted yet.
I pushed a branch "test-refactor" on git-html-l10n, where I hand-edited fr/git-add.txt.

After importing, here is the result:

I'm not satisfied with the styles, particularly when dealing with inline formats:

You can test it yourself locally and tell me what you think.

The new style makes the code spans lighter and more integrated into the text. The new style also makes the code spans more readable and less intrusive.

Signed-off-by: Jean-Noël Avila <[email protected]>

jnavila · 2024-12-22T18:02:12Z

@dscho I updated the CSS, so it is ready for review.

To1ne

@jnavila I've added a few questions. Thanks for this contribution.

To1ne · 2024-12-26T09:26:48Z

script/asciidoctor-extensions.rb

+
+      def process parent, reader, attrs
+        outlines = reader.lines.map do |l|
+          l.gsub(/(\.\.\.?)([^\]$.])/, '`\1`\2')


I think a line of comment wouldn't hurt with these regexes. Maybe best with an example:

Suggested change

l.gsub(/(\.\.\.?)([^\]$.])/, '`\1`\2')

l.gsub(/(\.\.\.?)([^\]$.])/, '`\1`\2') # wrap ellipsis in backticks: ...something => `...`something

I think the intended use is for [...<more>]? Should we include the [ and ] in the regex?

This line is trying to differentiate the three dots in different contexts, where they have a different meaning and require different formatting.

First there is the form <commit1>...<commit2> when describing a range of commits, where the three dots are a "keyword" understood by git and must be formatted as code.

Then there are the forms used in the grammar to express repetition, such as "<path> ...", optionally with square brackets, such as "[<path>...]", which usually appear at the end of the command line. These three dots must not be formatted as code, but left as is.

This line matches the former case and forces the corresponding format. I'll add a comment in the same line as yours.
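To make the distinction concrete, here is a small sketch that applies the substitution quoted above to a few sample strings. The helper name `mark_range_dots` and the sample inputs are only for illustration; the regex itself is copied from the diff.

```ruby
# The ellipsis substitution from the diff, wrapped in a hypothetical helper.
ELLIPSIS = /(\.\.\.?)([^\]$.])/

def mark_range_dots(line)
  # Wrap two or three dots in backticks only when another token follows,
  # i.e. when the next character is not "]", "$" or ".".
  line.gsub(ELLIPSIS, '`\1`\2')
end

# Range notation: the dots are a keyword and get formatted as code.
puts mark_range_dots('<commit1>...<commit2>')
# Repetition markers at the end of a group or line are left alone.
puts mark_range_dots('[<path>...]')
puts mark_range_dots('<path> ...')
```

The character class after the dots is what separates the two cases: a repetition marker is followed by a closing bracket or by nothing, so the second group never matches there.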

To1ne · 2024-12-26T09:30:19Z

script/asciidoctor-extensions.rb

+      def process parent, reader, attrs
+        outlines = reader.lines.map do |l|
+          l.gsub(/(\.\.\.?)([^\]$.])/, '`\1`\2')
+           .gsub(%r{([\[\] |()>]|^)([-a-zA-Z0-9:+=~@,/_^\$]+)}, '\1{empty}`\2`{empty}')


To be honest, I don't know what this one is for.

This is the line that matches all the words that are neither placeholders nor grammar signs, and formats them as code. These words (in the general sense here) are keywords (option names, enum strings, two- or three-dot notation, etc.).
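As an illustration of what this second substitution does, the sketch below runs it on one sample synopsis line. The helper name `mark_keywords` and the input are assumptions made for the example; the regex and the replacement string are taken from the diff.

```ruby
# The keyword substitution from the diff, wrapped in a hypothetical helper.
KEYWORD = %r{([\[\] |()>]|^)([-a-zA-Z0-9:+=~@,/_^\$]+)}

def mark_keywords(line)
  # A keyword is a run of "word-like" characters that starts a line or
  # follows a grammar sign; it gets wrapped in backticks. The {empty}
  # attributes are kept exactly as in the replacement string of the diff.
  line.gsub(KEYWORD, '\1{empty}`\2`{empty}')
end

puts mark_keywords('git add [--all] <path>')
```

Note that `<path>` is untouched here: `<` is not in the list of characters that may precede a keyword, so the placeholder is left for the third substitution.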

To1ne · 2024-12-26T09:43:11Z

script/asciidoctor-extensions.rb

+        outlines = reader.lines.map do |l|
+          l.gsub(/(\.\.\.?)([^\]$.])/, '`\1`\2')
+           .gsub(%r{([\[\] |()>]|^)([-a-zA-Z0-9:+=~@,/_^\$]+)}, '\1{empty}`\2`{empty}')
+           .gsub(/(<([[:word:]]|[-0-9.])+>)/, '__\\1__')


I had to dig deep to find what [[:word:]] does, but it seems to be a Ruby non-POSIX bracket expression: https://docs.ruby-lang.org/en/master/Regexp.html#class-Regexp-label-POSIX+Bracket+Expressions. Personally I'm not a fan, what's the advantage over \w?

Also why are the inner brackets round brackets?

I just wonder if we can simplify to:

Suggested change

.gsub(/(<([[:word:]]|[-0-9.])+>)/, '__\\1__')

.gsub(/(<[^>]+>)/, '__\\1__')

And one more question, why the double backslash in the replacement string?

`\w` only matches ASCII, but here we are going to process internationalized texts (because placeholders are translated), and this processing requires the special form with double brackets. I'm not an expert in Ruby regexes; this is the form I have found to work well with the translations.

As for using a more generic regex (treating everything between angle brackets as the placeholder's name): placeholder names are not supposed to contain spaces, so the stricter pattern avoids false matches on something like:

$ git foo < in-file > out-file
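The two points in this exchange can be checked with a short sketch. The constant and helper names are made up for the example, and `<dépôt>` is only an assumed stand-in for a translated placeholder name.

```ruby
STRICT  = /(<([[:word:]]|[-0-9.])+>)/   # pattern from the diff
ASCII   = /(<(\w|[-0-9.])+>)/           # same pattern, with \w instead
GENERIC = /(<[^>]+>)/                   # the suggested simplification

# Hypothetical helper: wrap whatever the pattern captures in double underscores.
def italicize(line, pattern)
  line.gsub(pattern, '__\1__')
end

# A translated placeholder with accented letters.
puts italicize('git clone <dépôt>', STRICT)
puts italicize('git clone <dépôt>', ASCII)

# Shell redirections are not placeholders.
puts italicize('$ git foo < in-file > out-file', GENERIC)
puts italicize('$ git foo < in-file > out-file', STRICT)

# In a single-quoted Ruby string, '\\1' and '\1' are the same three characters.
puts '__\\1__' == '__\1__'
```

In Ruby, `\w` is limited to `[a-zA-Z0-9_]`, while `[[:word:]]` also covers non-ASCII letters, which is why only the first pattern catches the accented name. The last line suggests that the double backslash in the replacement string is harmless but not required.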

To1ne · 2024-12-27T19:23:00Z

script/asciidoctor-extensions.rb

+        if node.type == :monospaced
+          node.text.gsub(/(\.\.\.?)([^\]$.])/, '<code>\1</code>\2')
+              .gsub(%r{([\[\s|()>.]|^|\]|&gt;)(\.?([-a-zA-Z0-9:+=~@,/_^\$]+\.{0,2})+)}, '\1<code>\2</code>')
+              .gsub(/(&lt;([[:word:]]|[-0-9.])+&gt;)/, '<em>\1</em>')


So we more or less need to repeat the regexes here?

That's unfortunate, but the two sets of regexes are very similar, except that this one processes the text after some pre-processing steps, and the transformations need to be in final result form (with tags and escaped characters).

I looked into factoring them out, but that makes the code messier than it already is.

dscho

I am really uneasy with this large amount of hard-to-understand regular expressions. Not only does this make bugs easy to hide, it also inadvertently opens the door to DoS attacks. Here is an example where something like this has had a really high impact.

It would probably make much more sense to implement a StringScanner-based parser that is much easier to reason about and whose performance is well-understood, e.g. following this tutorial.
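For readers who have not used `StringScanner`, here is a minimal sketch of what such a parser could look like. It is not the code from this PR: the token names, the character sets and the output markup are assumptions chosen to mirror the three substitutions discussed above.

```ruby
require 'strscan'

# Hypothetical tokenizer: splits a synopsis line into typed tokens.
def tokenize_synopsis(line)
  scanner = StringScanner.new(line)
  tokens = []
  until scanner.eos?
    if (text = scanner.scan(/<[[:word:]-]+>/))
      tokens << [:placeholder, text]          # <path>, <commit1>
    elsif (text = scanner.scan(/\.\.\.(?=\]|\z)/))
      tokens << [:grammar, text]              # repetition marker
    elsif (text = scanner.scan(/[\[\]|() ]+/))
      tokens << [:grammar, text]              # brackets, alternation, spaces
    elsif (text = scanner.scan(/\.{2,3}|[-a-zA-Z0-9:+=~@,\/_^\$]+/))
      tokens << [:keyword, text]              # options, commands, range dots
    else
      tokens << [:other, scanner.getch]       # anything else passes through
    end
  end
  tokens
end

# Render the tokens with AsciiDoc-like inline markup.
def format_synopsis(line)
  tokenize_synopsis(line).map do |type, text|
    case type
    when :keyword     then "`#{text}`"
    when :placeholder then "__#{text}__"
    else text
    end
  end.join
end

puts format_synopsis('git add [--all] [<path>...]')
puts format_synopsis('git diff <commit1>...<commit2>')
```

Each branch consumes at least one character, so the loop always terminates, and the order of the branches documents which interpretation wins when several could apply.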

jnavila · 2025-01-04T21:55:01Z

These regexes are basically the same ones that I already pushed to git/git. They are applied during the conversion phase, not live, and only on the quoted strings of text, after the initial asciidoc parsing has been performed.

Anyway, as they are cryptic, I can try to convert the code to a parser, but I doubt this will be a lot clearer.

jnavila · 2025-01-25T18:46:56Z

Writing a parser turns out to be more involved than expected, mainly because the new parser must be resilient to formatting mistakes in the translations (particularly in Chinese) and to manpages that were not converted to the new format (where the backticks are used for widely varying strings).

dscho · 2025-01-26T12:19:06Z

A parser would at least document a lot more cohesively what is being done.

As they are, I know what the regular expressions do, but only after studying them extensively, an effort that I would most likely have to repeat if I needed to fix any bugs in that code in the future. I am extremely uncomfortable with that: source code should be as obvious as possible, and this is not it. As such, I will not review this any further and not merge it. I won't oppose anybody else merging it, but I want nothing to do with the added code, myself.

This commit adds an upcoming manpage format to the AsciiDoc backend. The new format changes are:

* The synopsis is now a section with a dedicated style. This "synopsis" style allows keywords to be automatically formatted as monospaced and <placeholders> as italic.
* The backticks are now used to format synopsis-like syntax in inline elements.

The parsing of synopsis is done with a new AsciiDoc extension that makes use of the PEG parser parslet. All the asciidoc manpages sources are processed with this extension. It may upset the formatting for older manpages, making it not consistent across a page, but this will be a mild side effect, as this was not really consistent before.

Signed-off-by: Jean-Noël Avila <[email protected]>

jnavila · 2025-01-27T20:48:48Z

I've pushed an updated version with a parser. If the parser fails, we just fall back to the basic code format.

jnavila · 2025-02-07T18:50:02Z

@dscho Is this version better for you?

dscho · 2025-02-07T18:51:54Z

@jnavila I am sorry, I had not planned any involvement in this anymore, due to time constraints.

To1ne · 2025-03-12T09:09:33Z

Ah cool, I just realized you're trying to fix #1972 here.

To1ne · 2025-03-12T09:14:46Z

@jnavila I've noticed we have a synopsis processor at https://github.com/git/git/blob/master/Documentation/asciidoctor-extensions.rb.in. Why did you choose to write it yourself? I'd rather not maintain 2 versions.

I've used that extension locally (without the postprocessor) and this is how the synopsis on git-clone looks:

Or on git-diff-index:

I feel it's very much a "drop-in" extension we can/should use.

jnavila · 2025-03-12T13:03:24Z

@jnavila I've noticed we have a synopsis processor at https://github.com/git/git/blob/master/Documentation/asciidoctor-extensions.rb.in. Why did you choose to write it yourself? I'd rather not maintain 2 versions.

In fact, I wrote both of the extensions. 😆 The rewrite here is a request from @dscho which is very convincing to me. The synopsis is a language in itself, so it is more comprehensible to have a dedicated parser for this language instead of a bunch of repelling regexps.

This rewrite came after I proposed the regexp version for inclusion in git/git . Now, I'm contemplating reverting the git/git version to the parser flavor, and at the same time removing this ugly pre-processor hack by a38edab.

dscho

Wow. This sure is a lot more verbose, but vastly easier to understand than the regular expression-based solution. Thank you for putting in the work.

I have a couple of questions/suggestions below, and also the request: Please rebase to current gh-pages (resolving the merge conflict in Gemfile, too). Thank you!

dscho · 2025-03-12T14:55:06Z

script/asciidoctor-extensions.rb

+      rule(:space)      { match('[\s\t\n ]').repeat(1) }
+      rule(:space?)     { space.maybe }
+      rule(:keyword) { match('[-a-zA-Z0-9:+=~@,\./_\^\$\'"\*%!{}#]').repeat(1) }
+      rule(:placeholder) { str('<') >> match('[[:word:]]|-').repeat(1) >> str('>') }
+      rule(:opt_or_alt) { match('[\[\] |()]') >> space? }
+      rule(:ellipsis) { str('...') >> match('\]|$').present? }
+      rule(:grammar) { opt_or_alt | ellipsis }
+      rule(:ignore) { match('[\'`]') }


I have to admit that I was quite thrown by the syntax, especially the >> one. https://kschiess.github.io/parslet/parser.html to the rescue (which we might want to link to, in a code comment).

Narrator's voice: The >> indicates a "simple sequence", for example str('...') >> match('\]|$').present? means "first match three periods, then ensure that they are either followed by a closing bracket or are at the end".

dscho · 2025-03-12T14:57:07Z

script/asciidoctor-extensions.rb

+      rule(:ignore) { match('[\'`]') }
+
+      rule(:token) do
+        grammar.as(:grammar) | placeholder.as(:placeholder) | space.as(:grammar) |


Shouldn't this be space.as(:space)?

grammar and space are left unchanged, so I put them in the same bag.

You mean grammar is handled by this line and this line by leaving the original text unchanged?

It still might be clearer to introduce a corresponding rule(space: simple(:space)) { space.to_s } line (and to let SynopsisQuoteToHtml5 inherit from SynopsisQuoteToAdoc, overriding only keyword and placeholder).

dscho · 2025-03-12T14:59:24Z

script/asciidoctor-extensions.rb

+      rule(:space)      { match('[\s\t\n ]').repeat(1) }
+      rule(:space?)     { space.maybe }
+      rule(:keyword) { match('[-a-zA-Z0-9:+=~@,\./_\^\$\'"\*%!{}#]').repeat(1) }
+      rule(:placeholder) { str('&lt;') >> match('[[:word:]]|-').repeat(1) >> str('&gt;') }


If this is the only difference to AdocSynopsisQuote, why not use class EscapedSynopsisQuote <AdocSynopsisQuote and override just this rule?

Ah, I haven't tried using inheritance between my classes.

I hope it will work, otherwise I'd have caused you a lot of effort for nothing in return.
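The inheritance idea can be sketched in plain Ruby, without parslet. The two classes below are hypothetical stand-ins that only borrow the names from the comment above; the method names and the regex are assumptions. The point is that the subclass overrides just the delimiters and inherits everything else.

```ruby
# Hypothetical stand-in for the grammar that reads raw AsciiDoc text.
class AdocSynopsisQuote
  # Delimiters around a placeholder in unescaped text.
  def placeholder_open
    '<'
  end

  def placeholder_close
    '>'
  end

  # Built from the delimiters, so subclasses get it for free.
  def placeholder_pattern
    /#{Regexp.escape(placeholder_open)}[[:word:]-]+#{Regexp.escape(placeholder_close)}/
  end

  def placeholders(text)
    text.scan(placeholder_pattern)
  end
end

# Hypothetical stand-in for the grammar that reads HTML-escaped text:
# only the delimiters differ.
class EscapedSynopsisQuote < AdocSynopsisQuote
  def placeholder_open
    '&lt;'
  end

  def placeholder_close
    '&gt;'
  end
end

p AdocSynopsisQuote.new.placeholders('git add [<path>...]')
p EscapedSynopsisQuote.new.placeholders('git add [&lt;path&gt;...]')
```

Whether parslet's `rule` definitions can be overridden in a subclass in the same way is exactly what would need to be tried; this sketch only shows the shape of the refactoring.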

dscho · 2025-03-12T14:59:44Z

script/asciidoctor-extensions.rb

+      rule(:ignore) { match('[\'`]') }
+
+      rule(:token) do
+        grammar.as(:grammar) | placeholder.as(:placeholder) | space.as(:grammar) |


Again, should this be space.as(:space)?

To1ne · 2025-03-12T15:41:50Z

In fact, I wrote both of the extensions. 😆 The rewrite here is a request from @dscho which is very convincing to me. The synopsis is a language in itself, so it is more comprehensible to have a dedicated parser for this language instead of a bunch of repelling regexps.

@jnavila I'm sorry for dropping in completely ignorant!

This rewrite came after I proposed the regexp version for inclusion in git/git . Now, I'm contemplating reverting the git/git version to the parser flavor

I haven't looked at it in detail yet, but at first sight that makes sense.

and at the same time removing this ugly pre-processor hack by a38edab.

Do you mean git/git@a38edab ? If you mean the hack to fill in @GIT_VERSION@, I agree it would be nice to get rid of it. Although it's probably not directly related to the implementation details of the parser.

That said, can we deduplicate having a synopsis parser here and one in git/git.git ?

To1ne · 2025-03-13T06:32:19Z

After importing, here is the result:
...
I'm not satisfied with the styles

@jnavila Without thinking about how to achieve this, how would you like the end result to look? In a quick mockup I've ended up with the following:

Is that something you'd like?

jnavila · 2025-03-13T10:18:39Z

That would be great 😃. The parser cannot tell the difference between the "git commit" part and the options, so they would have the same style (bold red, for instance).

dscho · 2025-03-13T12:42:49Z

The parser cannot tell the difference between the "git commit" part and the options

Couldn't the parser learn that Git commands start with git and then continue with white-space followed by a command-name that matches [a-z][-0-9a-z]* (yes, git p4 contains digits)? That should make it possible to discern between commands and options quite nicely.
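A rough sketch of that idea, using only the pattern given in the comment above. The constant and helper names are made up for the example, and a real grammar rule would live in the parser rather than in a standalone function.

```ruby
# "git", white-space, then a command name matching [a-z][-0-9a-z]*.
GIT_COMMAND = /\Agit\s+[a-z][-0-9a-z]*/

# Hypothetical helper: split a synopsis line into [command, rest].
# Returns [nil, line] when the line does not start with a Git command.
def split_command(synopsis)
  match = GIT_COMMAND.match(synopsis)
  return [nil, synopsis] unless match

  [match[0], match.post_match]
end

p split_command('git commit [-a] <file>')
p split_command('git p4 sync')
p split_command('gitk [<options>]')
```

With the command isolated this way, it could be given its own style while the options keep the keyword style.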

To1ne · 2025-05-05T10:50:52Z

@jnavila I see there's more work being done in this direction: https://lore.kernel.org/git/[email protected]/. I'm sorry for my lazy question, but what is the status of this PR? What do you/we need to drive it forward?

jnavila changed the base branch from gh-pages to main November 17, 2024 21:57

dscho had a problem deploying to github-pages November 18, 2024 17:45 — with GitHub Actions Failure

dscho changed the base branch from main to gh-pages November 18, 2024 19:01

jnavila marked this pull request as draft November 18, 2024 21:36

stylesheet: remove background and border from code spans

d9070e6

jnavila force-pushed the new_manpage_format branch from 3cb6e5e to 9bd765a Compare November 30, 2024 16:16

jnavila marked this pull request as ready for review November 30, 2024 16:19

To1ne reviewed Dec 30, 2024

dscho reviewed Jan 4, 2025

jnavila force-pushed the new_manpage_format branch from 9bd765a to 87aab82 Compare January 27, 2025 14:57

jnavila mentioned this pull request Jan 27, 2025

Asciidoc: PO4A is missing a way to define how paragraphs with custom styles are managed mquinson/po4a#548

Closed

dscho mentioned this pull request Mar 12, 2025

Pages with [synopsis] not properly formatted #1972

Open

dscho linked an issue Mar 12, 2025 that may be closed by this pull request

Pages with [synopsis] not properly formatted #1972

Open

jnavila mentioned this pull request Mar 12, 2025

Build docs from .adoc sources #1973

Merged

dscho reviewed Mar 12, 2025
