Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: loadtxt skiprows and max_rows #652

Merged
merged 15 commits into from
Jun 16, 2022

Conversation

MuellerSeb
Copy link
Contributor

@MuellerSeb MuellerSeb commented Apr 19, 2022

Hey there,

I needed to skip rows with loadtxt, so I added this feature similar to the numpy routine. In addition, I also added max_rows.

Added optional arguments:

  • skiprows: skipping rows at the beginning of the file (default 0)
  • max_rows: maximum for read lines (default -1 for all lines)

Argument names are taken from numpy.loadtxt. All negative values for max_rows are interpreted as "all lines".

nrow = number_of_rows(s)
nrow = number_of_rows(s) - skiprows_

if ( nrow < 0 ) call error_stop("loadtxt: skipping more rows than present.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this really be a fatal error?

Copy link
Contributor Author

@MuellerSeb MuellerSeb Apr 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also cut off nrow at 0, which would result in an empty array.

Copy link
Contributor Author

@MuellerSeb MuellerSeb Apr 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But numpy also throws an error for skiprows > nrow.

Copy link
Contributor Author

@MuellerSeb MuellerSeb Apr 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gfortran also crashes when trying to read after EOF. So having a meaningful error message is preferable IMO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using iostat and iomsg instead of this check, and returning an appropriate message or an zero-size array when appropriate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I am not familiar with iostat and iomsg. This would only affect the following loop, right?

do i = 1, skiprows_
  read(s, *)
end do

@awvwgk awvwgk linked an issue Apr 19, 2022 that may be closed by this pull request
Copy link
Member

@ivan-pi ivan-pi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. 👍🏻 I'd invite other reviewers to voice their opinions if the skipping more rows than present should be an error or not.

@awvwgk awvwgk added reviewers needed This patch requires extra eyes topic: IO Common input/output related features labels Apr 19, 2022
@MuellerSeb MuellerSeb changed the title Feature: loadtxt skiprows Feature: loadtxt skiprows and max_rows Apr 19, 2022
@MuellerSeb
Copy link
Contributor Author

@ivan-pi thanks for the review. I just added max_rows since it would make sense to directly add it. Hope this is ok.

@MuellerSeb MuellerSeb requested a review from ivan-pi April 19, 2022 20:40
@MuellerSeb
Copy link
Contributor Author

Question from my side: Do we want to mimic the numpy behavior and signature?

@MuellerSeb MuellerSeb requested a review from jvdp1 April 27, 2022 13:39
@14NGiestas
Copy link
Member

Just a comment about error handling: there is still an open issue about it #224 (it's worth reading)

Since this function is dealing with the system (reading a file) this may fail (as both noted) and the best approach so far IMHO it is how stdlib_bitsets by wclodius2 module deals with errors: have an optional status and let the user decides if an error should terminate the application or not.

Indeed imagine a user building a GUI application where a fatal error would close the entire program (and loss of progress). Of course, we could add some boilerplate to inquire about the file's existence first (on the end-user side) but it defeats the purpose of using/having a helper function such as loadtxt in the first place.

I hope it helps.

@MuellerSeb
Copy link
Contributor Author

@14NGiestas then the suggestion with cutting of skiprows_ should be fine for this pull request, since it doesn't introduce any new error behavior. If everybody is fine with it, I would commit my suggestion for this.

@14NGiestas
Copy link
Member

@14NGiestas then the suggestion with cutting of skiprows_ should be fine for this pull request, since it doesn't introduce any new error behavior. If everybody is fine with it, I would commit my suggestion for this.

Yep, at first glance your suggestion is ok (just make sure it is mentioned in the docs and examples)
If I recall correctly returning a zero-sized array is a behavior used in stdlib_math linspace and logspace.

@14NGiestas
Copy link
Member

I believe the specs should also be updated at specs/stdlib_io

@MuellerSeb
Copy link
Contributor Author

I believe the specs should also be updated at specs/stdlib_io

Oh. Why are there two files for the same docs?!

@14NGiestas
Copy link
Member

@MuellerSeb the stdlib project has this little quirk where it mediates, in a manner, future standardized Fortran features.

The specification is a document that describes the API and the functionality, so that anyone can use it to create an implementation from scratch without looking at stdlib. The stdlib library then provides the reference implementation.

It will end up being the "same" but it serves a different purpose. This PR is a kinda undefined behavior AFAIK (since it updates a spec) and maybe we should discuss more. I'm reviewing in a good faith that is useful to have such options available but the workflow doesn't seem to be ready for this specific situation.

@jvdp1
Copy link
Member

jvdp1 commented May 8, 2022

It will end up being the "same" but it serves a different purpose. This PR is a kinda undefined behavior AFAIK (since it updates a spec) and maybe we should discuss more. I'm reviewing in a good faith that is useful to have such options available but the workflow doesn't seem to be ready for this specific situation.

I would say that it is labelled as experimental. So changing the API is ok at this stage, especially that it is also backward-compatible.

Copy link
Member

@jvdp1 jvdp1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me. Returning a zero-array is fine IMO. Thank you.


### Arguments

`filename`: Shall be a character expression containing the file name from which to load the rank-2 `array`.

`array`: Shall be an allocatable rank-2 array of type `real`, `complex` or `integer`.

`skiprows` (optional): Skip the first `skiprows` lines. If skipping more rows than present, a 0-sized array will be returned. The default is 0.

`max_rows` (optional): Read `max_rows` lines of content after `skiprows` lines. The default is -1, to read all the lines.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about with a negative number (other than 0), or a 0 value provided for max_rows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every negative number will be interpreted as: read all lines.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be mentioned in the specs (instead of only the value -1).

@MuellerSeb
Copy link
Contributor Author

MuellerSeb commented May 9, 2022

@jvdp1 another option for negative values for max_rows could be to to interpret them as wrap-around values. Meaning, -1 will be the full file, -2 will leave out the last line and so on. We would just need to find a solution for too large negative numbers. It should again be cut-off at 0 read lines.

@jvdp1
Copy link
Member

jvdp1 commented May 9, 2022

@jvdp1 another option for negative values for max_rows could be to to interpret them as wrap-around values. Meaning, -1 will be the full file, -2 will leave out the last line and so on. We would just need to find a solution for to big negative numbers. It should again be cut-off at 0 read lines.

In this case, I would suggest to define a parameter integer, e.g., read_all, which would be available to the user, and then to add a procedure for checking the provided value, e.g.

select case(max_rows_)
 case( 0 )
  ....
 case( : -1 )
   .... = real_all
 case default
  ....
end select

A parameter integer would provide the flexibility for the API (as the user doesn't need to rely on a specific number)

Copy link
Member

@milancurcic milancurcic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MuellerSeb and reviewers. I made suggestions to the spec and docstrings. Anything else needed here before we merge?

MuellerSeb and others added 2 commits June 13, 2022 14:35
Co-authored-by: Milan Curcic <[email protected]>
@MuellerSeb
Copy link
Contributor Author

Hey @milancurcic, thanks for your cleanup. There is still the comment from @jvdp1 about the -1 value for max_rows and the discussion, if other negative values should skip lines at the end.

Beside that I just noticed, that if you use skiprows to skip comments in the beginning for example, the following line will still try to get the number of columns from the first line in the file:

nrow = number_of_rows(s)

I think that is bad.

@milancurcic
Copy link
Member

Beside that I just noticed, that if you use skiprows to skip comments in the beginning for example, the following line will still try to get the number of columns from the first line in the file:

nrow = number_of_rows(s)

I think that is bad.

Do you mean number of rows from the first line in the file? And if yes, I don't understand why is that not what we want. I'm not familiar with the npy format, maybe this is something specific to that.

@MuellerSeb
Copy link
Contributor Author

MuellerSeb commented Jun 13, 2022

Beside that I just noticed, that if you use skiprows to skip comments in the beginning for example, the following line will still try to get the number of columns from the first line in the file:

nrow = number_of_rows(s)

I think that is bad.

Do you mean number of rows from the first line in the file? And if yes, I don't understand why is that not what we want. I'm not familiar with the npy format, maybe this is something specific to that.

Sorry I meant columns:

      ! determine number of columns
      ncol = number_of_columns(s)

Just copied the wrong line.

If you have a file like this:

comment
0.1 0.2
0.3 0.4

And read it with

call loadtxt("log.txt", data, skiprows=1)

It will still think, that there is only one column, since the first line has only one ("comment").

@milancurcic
Copy link
Member

I see, that makes total sense. I agree that won't work. What do you think about this approach:

  • Add an optional argument row to number_of_columns which defaults to 1 (current behavior). row indicates the line number at which we'll determine the number of columns.
  • In loadtxt, if skiprows is present, move the call to number_of_columns to go after the skip-rows logic, and call number_of_columns(s, skiprows_+1).

@MuellerSeb
Copy link
Contributor Author

I see, that makes total sense. I agree that won't work. What do you think about this approach:

* Add an optional argument `row` to `number_of_columns` which defaults to 1 (current behavior). `row` indicates the line number at which we'll determine the number of columns.

* In `loadtxt`, if `skiprows` is present, move the call to `number_of_columns` to go after the skip-rows logic, and call `number_of_columns(s, skiprows_+1)`.

Thought of almost the same. We could also just add a skiprows to number_of_columns to pass skiprows_ there directly (would also be more consistent). I would move the call of number_of_columns also to have the correct value for skiprows_. I would just add these lines from loadtxt to number_of_columns:

      do i = 1, skiprows_
        read(s, *)
      end do

@milancurcic
Copy link
Member

@MuellerSeb nice, I like that even better.

@MuellerSeb
Copy link
Contributor Author

If everyone is fine with the mechanism for max_rows set with negative numbers, I think this is ready for merging.

Copy link
Member

@jvdp1 jvdp1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice catch and solution for number_of_columns!

@milancurcic
Copy link
Member

Thanks all, will merge.

@milancurcic milancurcic merged commit 3028a36 into fortran-lang:master Jun 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reviewers needed This patch requires extra eyes topic: IO Common input/output related features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optional arguments for loadtxt
6 participants