diff --git a/Dockerfile b/Dockerfile
index a781ae2d66581336c7df41256cff18a5e0d1379b..c8eed48b27de9024f98e8b977a656c457905530e 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -21,7 +21,7 @@ ENV LD_LIBRARY_PATH "/usr/local/lib"
 # Install DGraph and update dependencies to right versions.
 RUN go get -v github.com/robfig/glock && \
   go get -v github.com/dgraph-io/dgraph/... && \
-  glock sync github.com/dgraph-io/dgraph && echo "v0.1"
+  glock sync github.com/dgraph-io/dgraph && echo "v0.1.0"
 
 # Run some tests, don't build an image if we're failing tests.
 RUN go test github.com/dgraph-io/dgraph/...
diff --git a/README.md b/README.md
index 8f60b62aee4f772362bdfc52c0ba0c98062c46ec..0e96975e88e3687e64c4917b2ab4c6413dc34c34 100644
--- a/README.md
+++ b/README.md
@@ -41,15 +41,27 @@ $ git clone https://github.com/dgraph-io/benchmarks.git
 ```
 
 To load the data in bulk, use the data loader binary in dgraph/server/loader.
-Loader needs 2 directories:
-- mutations, where mutation commit logs are stored and
-- postings, where posting lists are stored.
+Loader needs a postings directory, where posting lists are stored.
 
 ```
 $ cd $GOPATH/src/github.com/dgraph-io/dgraph/server/loader
 $ go build . && ./loader --rdfgzips=path_of_benchmarks_dir/data/rdf-films.gz,path_of_benchmarks_dir/data/names.gz --postings DIRPATH/p
 ```
 
+### Loading performance
+Loader is memory bound. Every mutation loads a posting list into memory, and mutations
+are applied in layers above the posting lists.
+While loader doesn't write to disk on every mutation, it does periodically
+merge all the mutations into posting lists and write them to rocksdb, which persists them.
+How often this merging happens can be fine-tuned by specifying `max_ram_mb`.
+Every time loader determines that it exceeds this memory threshold, it *stops the world* and starts the merge process.
+The more memory the loader has to work with, the less merging needs to be done, and the faster the loading.
+
+Thus, loader performance is highly dependent on merging performance, which in turn depends on how fast the underlying persistent storage is.
+So, when it comes to loading performance, *RAMFS/TMPFS > SSD > hard disk*.
+
+As a reference point, it takes 220 seconds to load 4.1M RDFs from `names.gz` (from the benchmarks repository) on my 6-core Intel Xeon Dell Precision T3500, using a 1G TMPFS for the postings directory and with the `max_ram_mb=3000` flag set.
+
 ## Querying
 Once data is loaded, point the dgraph server to the postings and mutations directory.
 ```
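
The new "Loading performance" section names the `max_ram_mb` flag and recommends TMPFS for the postings directory, but the patch never shows the two used together. Below is a minimal sketch of a tuned loading run on Linux, combining the loader invocation from the README with the reference setup it describes; the `/mnt/ramdisk` mount point, the `sudo mount` step, and the exact `--max_ram_mb=3000` flag spelling are assumptions for illustration, not part of the patch.

```
$ cd $GOPATH/src/github.com/dgraph-io/dgraph/server/loader
$ go build .
# Back the postings directory with a 1G TMPFS, per the
# RAMFS/TMPFS > SSD > hard disk ordering above (assumed mount point).
$ sudo mount -t tmpfs -o size=1g tmpfs /mnt/ramdisk
# Load with a 3000 MB merge threshold, matching the README's reference measurement.
$ ./loader --rdfgzips=path_of_benchmarks_dir/data/names.gz --postings /mnt/ramdisk/p --max_ram_mb=3000
```

A larger `max_ram_mb` lets more mutation layers accumulate before each *stop the world* merge, trading resident memory for fewer pauses, which is why the README's timing was taken with both a generous threshold and a TMPFS-backed postings directory.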