dgraph — commit 63134e90, authored 9 years ago by Manish R Jain (parent 0ca8409f)

"Initial checkin for design doc"

1 changed file: design.md (new file, 62 additions, 0 deletions)
# Dgraph Design Doc
To explain how Dgraph works, let's start with an example query we should be able
to run.
Query to answer:
Find all posts liked by friends of friends over the last year, written by a popular author X.
In a SQL / NoSQL database, this would require you to retrieve a lot of data:
Method 1:

* Find all the friends (~ 200 friends).
* Find all their friends (~ 200 * 200 = 40,000 people).
* Find all the posts liked by these people (resulting set in millions).
* Intersect these posts with posts authored by person X.
Method 2:

* Find all posts written by person X over the last year (possibly thousands).
* Find all people who liked those posts (easily millions) = result set 1.
* Find all your friends.
* Find all their friends = result set 2.
* Intersect result set 1 with result set 2.
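To make the fan-out concrete, here is a minimal sketch of the friends-of-friends expansion from Method 1, using a hypothetical in-memory adjacency map as a stand-in for the database round trips. Every hop multiplies the candidate set, which is exactly why the naive approach ships so much data.

```go
package main

import "fmt"

// friendsOfFriends expands one user's friend list by one more hop.
// With ~200 friends per person, the second hop alone touches ~40,000
// uids before any post lookups happen.
func friendsOfFriends(friends map[string][]string, user string) map[string]bool {
	result := make(map[string]bool)
	for _, f := range friends[user] {
		for _, fof := range friends[f] {
			if fof != user {
				result[fof] = true
			}
		}
	}
	return result
}

func main() {
	graph := map[string][]string{
		"me":    {"alice", "bob"},
		"alice": {"carol", "dave"},
		"bob":   {"dave", "erin"},
	}
	fof := friendsOfFriends(graph, "me")
	fmt.Println(len(fof)) // carol, dave, erin → 3
}
```

In a client/server database, each of those map lookups becomes a network call, which is the cost the rest of this doc is designed to avoid.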
Both of these approaches would result in a lot of data going back and forth
between the database and the application: lots of network calls.
Alternatively, they would require you to run a MapReduce.
This distributed graph serving system is designed to make queries like these
run in production, with production level latency.
This is how it would run:

* Node X contains the posting list for predicate `friends`.
* Seek to the caller's userid in Node X (1 RPC). Retrieve a list of friend uids.
* Do multiple seeks for each of the friend uids, to generate a list of
friends-of-friends uids. **[result set 1]**
* Node Y contains the posting list for predicate `posts_liked`.
* Ship result set 1 to Node Y (1 RPC), and do seeks to generate a list
of all posts liked by result set 1. **[result set 2]**
* Node Z contains the posting list for predicate `author`.
* Ship result set 2 to Node Z (1 RPC). Seek to author X, and generate a list of
posts authored by X. **[result set 3]**
* Intersect the two sorted lists, result set 2 and result set 3. **[result set 4]**
* Node N contains names for all uids.
* Ship result set 4 to Node N (1 RPC), and convert uids to names by doing
multiple seeks. **[result set 5]**
* Ship result set 5 back to the caller.
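The intersection step above relies on posting lists being sorted. A minimal sketch of that merge-based intersection (the names and uid values are illustrative, not from the design):

```go
package main

import "fmt"

// intersectSorted walks two ascending uid lists in a single pass,
// the way result set 2 is intersected with result set 3. Keeping
// posting lists sorted makes this O(m+n) with no hashing.
func intersectSorted(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default: // uid present in both lists
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

func main() {
	likedByFoF := []uint64{11, 29, 37, 52, 88}   // result set 2
	authoredByX := []uint64{29, 52, 63, 88, 101} // result set 3
	fmt.Println(intersectSorted(likedByFoF, authoredByX)) // [29 52 88]
}
```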
In 4-5 RPCs, we have figured out all the posts liked by friends of friends,
authored by person X.
What if a predicate's posting list is too big to store fully on one node?
It can be split and sharded to fit on multiple machines,
just like Bigtable sharding works. This would mean a few additional RPCs, but
still nowhere near as many as would be required by existing datastores.
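One way to sketch that Bigtable-style split: each shard owns a contiguous uid range of one predicate's posting list, and a router picks the serving node. The shard table and node names below are toy assumptions; a real system would keep this mapping in a metadata service.

```go
package main

import "fmt"

// shard owns the uid range [startUID, next shard's startUID) of one
// predicate's posting list.
type shard struct {
	node     string
	startUID uint64 // inclusive lower bound of this shard's uid range
}

// shardFor routes a (predicate, uid) lookup to the node whose range
// covers uid. Shards must be listed in ascending startUID order.
func shardFor(shards map[string][]shard, predicate string, uid uint64) string {
	ranges := shards[predicate]
	// Scan from the highest range down; the first start <= uid wins.
	for i := len(ranges) - 1; i >= 0; i-- {
		if uid >= ranges[i].startUID {
			return ranges[i].node
		}
	}
	return ""
}

func main() {
	shards := map[string][]shard{
		// posting list for `friends`, split across two machines
		"friends": {
			{node: "nodeX1", startUID: 0},
			{node: "nodeX2", startUID: 50000},
		},
	}
	fmt.Println(shardFor(shards, "friends", 123))   // nodeX1
	fmt.Println(shardFor(shards, "friends", 70000)) // nodeX2
}
```

The extra RPCs mentioned above come from queries whose uid sets span more than one of these ranges.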
This system would allow infinite scalability, and yet production-level latencies,
to support running complicated queries requiring deep joins.