
Fix build and populate index in parallel #86

Merged
agersant merged 2 commits into agersant:master from lnicola:parallel-index on Jul 21, 2020

Conversation

lnicola (Contributor) commented Jul 20, 2020

Fixes #85
Fixes #84

CC @eisengrau

lnicola (Contributor Author) commented Jul 20, 2020

Some rough measurements:

# baseline (before this PR)
[INFO] [src/index/update.rs:27] Library index update took 322.893 seconds

# rayon within a directory
[INFO] [src/index/update.rs:28] Library index update took 172.079 seconds

# rayon both within a directory and across directories
[INFO] [src/index/update.rs:28] Library index update took 49.357 seconds

# same, warm cache
[INFO] [src/index/update.rs:28] Library index update took 14.523 seconds

I tried to drop the caches before the third measurement, but I'm not sure it worked (ZFS on Linux). Tested on a Celeron J1900 on a library with 1 200 directories and 12 000 songs.

@lnicola force-pushed the parallel-index branch 3 times, most recently from 8f5d1e7 to 3c07046 on July 20, 2020 13:50
@@ -322,8 +341,8 @@ pub fn populate(db: &DB) -> Result<()> {
 		Regex::new(&settings.index_album_art_pattern)?
 	};

-	let (directory_sender, directory_receiver) = channel();
-	let (song_sender, song_receiver) = channel();
+	let (directory_sender, directory_receiver) = crossbeam_channel::unbounded();
lnicola (Contributor Author):

Switched to crossbeam-channel because its Sender is Sync (it's also faster and more reliable than the std one).
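A minimal sketch of why the Sync sender matters, assuming many rayon workers feed a single channel (the names below are illustrative, not taken from the diff):

use crossbeam_channel::unbounded;
use rayon::prelude::*;

fn main() {
    let (sender, receiver) = unbounded();
    // crossbeam_channel::Sender is Sync, so one sender can be shared by
    // reference across all rayon worker threads; std::sync::mpsc::Sender
    // (at the time of this PR) was Send but not Sync.
    (0..100u32).into_par_iter().for_each(|n| {
        sender.send(n).expect("receiver dropped");
    });
    drop(sender); // close the channel so the receiving iterator terminates
    let total: u32 = receiver.iter().sum();
    assert_eq!(total, 4950);
}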

let song_tags = song_files
.into_par_iter()
.filter_map(song_metadata)
.collect::<Vec<_>>();
lnicola (Contributor Author):
We could avoid the allocation here by doing a fold.
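For illustration, the fold-based alternative in rayon could take the shape sketched below (song_files and song_metadata as in the diff above; this shape is an assumption, not code from the PR):

// Each worker folds matches into its own accumulator, and the per-worker
// accumulators are then merged, instead of collect() assembling one Vec.
let song_tags = song_files
    .into_par_iter()
    .filter_map(song_metadata)
    .fold(Vec::new, |mut acc, tag| {
        acc.push(tag);
        acc
    })
    .reduce(Vec::new, |mut a, mut b| {
        a.append(&mut b);
        a
    });

The real saving would come from doing something with each batch inside the fold body (for example flushing it to the database) rather than materializing a single large Vec at the end.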

agersant (Owner), Jul 21, 2020:
Good enough for me and very readable in this form. I like this a lot better than what it was before.

Ok(())
sub_directories
.into_par_iter()
.map(|sub_directory| self.populate_directory(Some(path), &sub_directory))
agersant (Owner), Jul 21, 2020:
For clarity, could we use .for_each() here instead of map and collect? I'm also unsure how this collect() compacts the Iter of results into a single value.

lnicola (Contributor Author), Jul 21, 2020:
I think it's the same as the similar Iterator method: it propagates one of the Err results (note how you were using ? before, exiting at the first error). With my other error-printing change, we should get an error if the channel was closed. It probably doesn't matter much, but exiting like this seemed clearer than not handling the errors.

I'll add a comment here.
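For reference, the mechanism both comments refer to is FromIterator<Result<T, E>> for Result<V, E>: collecting an iterator of Results short-circuits at the first Err, and rayon's parallel collect likewise propagates one of the errors (in an unspecified order). A standalone sketch:

fn check_all(results: Vec<Result<(), String>>) -> Result<(), String> {
    // Result implements FromIterator over Results: collect stops at the
    // first Err it encounters and returns it; otherwise it yields Ok(()).
    results.into_iter().collect()
}

fn main() {
    assert!(check_all(vec![Ok(()), Ok(())]).is_ok());
    let failing = vec![Ok(()), Err("channel closed".to_string()), Ok(())];
    assert_eq!(check_all(failing), Err("channel closed".to_string()));
}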


.iter()
.par_bridge()
.map(|target| updater.populate_directory(None, target.as_path()))
.collect::<Result<()>>()?;
agersant (Owner), Jul 21, 2020:
This looks like the same collect magic! Does something implement Into<Result> for Iter<Result>?

lnicola (Contributor Author):
Same as above, it returns an error if one of the mounts failed.

agersant (Owner) commented:
Left one minor question before merging, but this is a great change! Very promising numbers, although I assume the benefit is especially large on a lower-end CPU.

lnicola (Contributor Author) commented Jul 21, 2020

I think it's beneficial on slower drives because you get more I/O operations in flight at once; even mechanical drives are fastest at around 8 concurrent I/Os.

On my CPU this still doesn't go above 50-80% overall usage (across the four cores), even with ZFS decompressing the files.

eisengrau commented:

Hello,

So yesterday I added my collection of about 200k mp3 files, at the time of opening #85. Unfortunately, after a couple of hours it 'timed out', so it indexed just a few folders. Once I restarted the container, I clicked on "scan now" again, and it shows the same CPU usage as yesterday, around 3% max. The music collection is on ZFS on Linux (LZ4 compression) with an i5-4590.

Maybe this will fix it. I'm eagerly waiting for a new build. :)

lnicola (Contributor Author) commented Jul 21, 2020

Even if you closed the browser, the scan should have kept running in the background (it runs periodically anyway).

I'm not familiar with LXC; do you have the logs? They're normally printed to the console, but they might get redirected somewhere else if you run it in the background. There are no logs for files scanned successfully, but you might see some errors. You can also strace the process to get an indication of progress.

> ZFS on Linux

😄

eisengrau commented:

It's definitely doing something:

[two attached screenshots]

I'll get back and try to find some logs once it stops. Does the polaris binary generate any, or does it just run in the background once launched?

lnicola (Contributor Author) commented Jul 21, 2020

> teddybear

😄

> I'll get back and try to find some logs once it stops. Does the polaris binary generate any, or does it just run in the background once launched?

It doesn't write to a file, only to the console. On my distro I see the logs with journalctl, and in Docker docker logs probably works, but I'm not sure what Debian and LXC do.

agersant (Owner) commented Jul 21, 2020

> There are no logs for files scanned successfully

There is a log entry at the end of each scan; you've been using it for your benchmarks, @lnicola :D

On Linux, running without the -f flag makes the output go to a polaris.log file under your $XDG_CACHE_HOME directory. Running with -f keeps the process in the foreground and makes it log to the console.

@agersant agersant merged commit 17976dc into agersant:master Jul 21, 2020
@lnicola lnicola deleted the parallel-index branch July 21, 2020 08:47
agersant (Owner) commented:
As a datapoint, @eisengrau: I run Polaris on a Raspberry Pi 3 with about 80k songs on a USB HDD, and the initial indexing (cold cache) takes about 10-15 minutes IIRC.

lnicola (Contributor Author) commented Jul 21, 2020

@agersant I'm curious what effect this PR has on that.

eisengrau commented:

OK, I ran polaris -f and could now see what stopped the scanning:

10:09:26 [ERROR] [src/index/mod.rs:90] Error while updating index: Invalid directory path

Could it be directories or file names with accented or invalid characters?

lnicola (Contributor Author) commented Jul 21, 2020

I think I fixed an issue that caused it to give up on some errors; can you try the latest version?

> Could it be directories or file names with accented or invalid characters?

Only UTF-8 works.

eisengrau commented:

Should I git clone and recompile again? Previously I just unpacked the latest archive and used that.

lnicola (Contributor Author) commented Jul 22, 2020

Clone the repository, run cargo build --release, and take the binary from target/release/polaris.
