Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using Point.Builder is memory intensive - support plain line protocol insertion #149

Closed
billevans1963 opened this issue Feb 24, 2016 · 12 comments

Comments

@billevans1963
Copy link

I have been using the maven central 2.1 version of influxdb-java and am finding that the Point.Builder mechanism to insert records is unnecessarily memory intensive: Every point allocates 2 new Maps to build a single String line protocol entry. When you are inserting hundreds of thousands of records at a time, this is not particularly efficient.

Yesterday I discovered that the latest github source contains methods to write Line Protocol directly, so I downloaded it and modified my code to write directly. It seems a much better way to go.

Two questions: Do you intend to develop the write(String) method further, by for example enabling a similar kind of batching that you use for Points? When do you plan to push the next version onto Maven Central?

@andrewdodd
Copy link
Contributor

Hi @billevans1963

I guess that it a bit inefficient now that I think about it.

Could you shed any light on how you are building the line protocol strings without going via the builder?

Additionally, if you have any input on #148, especially around the implication of #126 to batching, it would be great if you could comment.

@andrewdodd
Copy link
Contributor

Any chance you have an example test case that demonstrates this issue? I.e. a case using Point.Builder and a case without?

@billevans1963
Copy link
Author

Hi @andrewdodd,

I replaced Point.Builder and 'enableBatching' with the following very simple class:

import java.util.ArrayList;
import java.util.List;

import org.influxdb.InfluxDB;
import org.influxdb.InfluxDB.ConsistencyLevel;

class InfluxBatchInserter {
    //
    private final String database;
    private final String retentionPolicyName;
    private final ConsistencyLevel consistencyLevel = ConsistencyLevel.ANY;
    private final InfluxDB influxConx;
    private final int batchSize;
    //
    private List<String> linesToWrite = new ArrayList<String>();



    InfluxBatchInserter(InfluxDB influxConx, String database, String retentionPolicyName, int batchSize) {
        this.influxConx = influxConx;
        this.database = database;
        this.retentionPolicyName = retentionPolicyName;
        this.batchSize = batchSize;
    }   
    protected void flush() {
        if (linesToWrite.size() > 0) {
            this.influxConx.write(database, retentionPolicyName, consistencyLevel, linesToWrite);
            linesToWrite.clear();           
        }
    }
    protected void write(String lineProtocol, boolean flushRegardless) {
        this.linesToWrite.add(lineProtocol);
        if (flushRegardless || linesToWrite.size() >= batchSize) {
            this.flush();           
        }
    }   
}

Whenever I want to write data I call the write method with the data already formatted to the line protocol.

@billevans1963
Copy link
Author

What I have written is a very basic batching mechanism which I doubt is as resilient as the BatchPoints mechanism you employed. So maybe a further enhancement would be a "BatchStrings" class (or other name), that would essentially manage batches of Strings instead of Points. What do you think?

@andrewdodd
Copy link
Contributor

So do you think the memory issue is in the storage of 'Point' objects that contain the 2 Map objects?

@andrewdodd
Copy link
Contributor

The original reason for the 'bare / raw strings' write method was the performance of converting to the line protocol (heavily debated in issue 126).

So I think there are actually two problems we might be talking about here:

  1. The performance of converting from a 'Point' to a line protocol formatted string
  2. The performance of storing Point objects in memory

I still have my doubts about issue 126 and thus problem 1 above. The reason being that at some point you have convert your application domain objects to line-protocol-strings, and I don't believe that the current implementation can be that bad.

However, I do agree that the current design of storing the Point objects will have the following two effects:

  • The Point objects in memory will hold refs to 2 map objects, which is probably unecessary
  • The Point objects will only be converted to Strings at the time of sending (which will be repeated if they fail); though they could be converted earlier.

Is this your understanding?

@billevans1963
Copy link
Author

Yes, that would be my understanding.

In my opinion, by-passing the Builder.Point Object creation should save big on:

  1. Memory allocations. 100,000 Points => 200,000 short-lived HashMaps.
  2. CPU. Because GC has to do less work.

@andrewdodd
Copy link
Contributor

@billevans1963 , any chance you could have a look at this PR and see if a change like that would help you?

In the test that I added, the memory performance (unsurprisingly) depends a lot on what kind of data you have.

@billevans1963
Copy link
Author

@andrewdodd I like the simplicity of new Point2 Object. However, my preference is to avoid the use of the Builder to build the lines because it still keeps allocating 2 new TreeMaps for every Point. If you made the Builder re-use it's maps it would be better.

@wcork
Copy link

wcork commented Sep 30, 2016

I may create a PR for this. A quick and dirty fix would be to just use Trove maps instead. Not sure what Point looked like in March, but using

Point pnt = Point.Builder...... build();

Looks like it allocates another pair of maps when build() is called (from calling new Point()). Possibly adding another constructor: Point(Map<String, String> tags, Map<String, String> fields) and calling that from build() will solve the issue.

Expect a PR soon.

Edit: TreeMap is allocated only in Builder. List<Pair<String, String>> instead of TreeMap should be used due to size and has a similar toString(). Linear in traversal too (So are RB trees).

@wcork
Copy link

wcork commented Sep 30, 2016

After writing it to be List<Pair<String, String/Object> compatible, I've decided that I've made this code harder to understand. There's a reason why Pair isn't made standard in Java (I had to import javax).

Playing around, I saw very small changes in RAM with switching this data-type.

However, in doing so, I read through the tests and found that InfluxDB-Java is, in fact, capable of using the line-protocol:

testWriteMultipleStringData()

...
    this.influxDB.write(dbName, rp, InfluxDB.ConsistencyLevel.ONE, "cpu,atag=test1 idle=100,usertime=10,system=1\ncpu,atag=test2 idle=200,usertime=20,system=2\ncpu,atag=test3 idle=300,usertime=30,system=3");
...

testWriteMultipleStringDataLines()

...
    this.influxDB.write(dbName, rp, InfluxDB.ConsistencyLevel.ONE, Arrays.asList(
                "cpu,atag=test1 idle=100,usertime=10,system=1",
                "cpu,atag=test2 idle=200,usertime=20,system=2",
                "cpu,atag=test3 idle=300,usertime=30,system=3"
        ));
...

Problems solved.

@majst01
Copy link
Collaborator

majst01 commented Oct 3, 2016

duplicate of #151 closing now

@majst01 majst01 closed this as completed Oct 3, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants