Database format

As of 11 April 2013, the AGFK collaborators believe the following characteristics are desirable:

wiki-like data history
ability to share maps
ability to submit content via an online form or through pull requests
human readable content
expert users should be able to add content without approval
need a system that avoids name collisions

We are currently using a flat file content DB but are openly soliciting new ideas and improvements. The flat file DB has the following structure:

content
--resources.txt
--nodes
----unique_node_tag
------title.txt
------summary.txt
------topics.txt
------dependencies.txt
------questions.txt
------resources.txt
------see-also.txt
------id.txt 
--shortcuts
----unique_node_tag
------topics.txt
------dependencies.txt
------questions.txt
------resources.txt
--courses
----unique_course_tag
------title.txt
------concepts.txt

General formatting

Most of the files in the database are either plain text files or lists of field/value pairs. In the latter format, each item (e.g. a resource or a dependency) is given as an unordered list of field/value pairs. Different items are separated by one or more blank lines. See the resources list section for an example.

Any line beginning with the # symbol is a comment. For example,

# This is a comment.
some stuff      # This is not a comment.
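As a rough illustration of the field/value format and the comment rule above, here is a minimal Python sketch of a parser. The function name parse_items is hypothetical and is not part of the AGFK code base. The sketch assumes that fields are separated from values by the first colon on the line (values such as URLs contain colons themselves), that leading whitespace before # is ignored, and that a comment line does not end the current item.

```python
def parse_items(text):
    """Parse field/value text into a list of dicts, one dict per item.

    Assumptions (not confirmed by the format description): the first
    colon separates field from value, and comment lines are skipped
    without ending the current item.
    """
    items, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # whole-line comment
        if not line:
            # One or more blank lines end the current item.
            if current:
                items.append(current)
                current = {}
            continue
        field, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'field: value', got {raw!r}")
        # A trailing '#' is kept: only whole lines count as comments.
        current[field.strip()] = value.strip()
    if current:
        items.append(current)
    return items
```

Splitting on the first colon only keeps a value like http://pgm.stanford.edu/ intact.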

Resources list

The resources list in content/resources.txt contains metadata about the resources the user is referred to, such as textbooks, papers, or online lectures. We've found that we use certain resources, such as textbooks, over and over again, while others, such as individual papers, are used only once. To handle both situations, the global resources list defines default values for a given resource's fields, and the node-specific content/nodes/node_name/resources.txt overrides those defaults. By convention, we only include a resource in the global content/resources.txt if it is likely to be used multiple times.

Each resource is given as a collection of unordered field/value pairs, and resources are separated by blank lines. Every resource entry must specify a key field, which is the tag by which the resource is referenced in the node-specific resources.txt. Other fields which are typically listed in resources.txt include:

title, the label which is shown to the user (e.g. the name of a textbook)
authors, the list of authors of the resource. (Multiple authors are separated by and.)
resource_type, the general category of the resource, e.g. paper, online lectures, etc. Currently this isn't used, but we are considering having HTML templates associated with each resource type which determine how they're rendered.
free, which indicates whether the resource is freely available
url, a URL representing the resource in question, e.g. the home page for a textbook or the welcome page for a Coursera course.
extra, or equivalently note, additional instructions to the user

The required fields are key, title, and resource_type. The other fields are all optional.

There are some other fields which may be specified here, but by convention are specified in the node-specific resources.txt file. These are described in the node-specific resources section.

Here are some example entries:

key: pgm
title: Probabilistic Graphical Models: Principles and Techniques
authors: Daphne Koller and Nir Friedman
url: http://pgm.stanford.edu/
resource_type: textbook
free: 0

key: coursera_hinton
title: Coursera: Neural Networks for Machine Learning
authors: Geoffrey Hinton
url: https://www.coursera.org/course/neuralnets
resource_type: online lectures
free: 1
note: Click on "Preview" to see the videos.
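To make the required-field rule and the authors convention concrete, here is a small Python sketch that works on an entry that has already been parsed into a dict. The helper names missing_required and split_authors are hypothetical. The sketch assumes that an empty value counts as missing.

```python
REQUIRED_FIELDS = ("key", "title", "resource_type")


def missing_required(entry):
    """Return the required fields that are absent or empty in an entry."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field)]


def split_authors(authors):
    """Split an authors value on the ' and ' separator."""
    return [name.strip() for name in authors.split(" and ")]


pgm = {
    "key": "pgm",
    "title": "Probabilistic Graphical Models: Principles and Techniques",
    "authors": "Daphne Koller and Nir Friedman",
    "url": "http://pgm.stanford.edu/",
    "resource_type": "textbook",
    "free": "0",
}

print(missing_required(pgm))          # []
print(split_authors(pgm["authors"]))  # ['Daphne Koller', 'Nir Friedman']
```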

Concept nodes

Each concept node lives in a subdirectory of content/nodes. The concept has two identifiers: a human-readable tag which is used in the hand-annotated dependencies and see-also links, and a unique identifier used in the databases. The latter should stay fixed even if the human-readable tag is modified. This way, any graphs a user has saved will still be consistent even if the tag is changed.

The information about a node is stored in plain-text files inside the node's directory. These files are as follows:

id.txt, the unique identifier. This is machine generated and shouldn't be modified.
title.txt, a single line giving the title of the node which is shown to the user
summary.txt, a 2-3 sentence summary of what the concept is and what it is used for
topics.txt, a listing of the specific topics covered by the concept node. This is mostly used for maintaining the dependency structure, and is currently not processed or shown to the user.
dependencies.txt, a list of the concept nodes that the current one directly depends on
resources.txt, a list of resources the user can consult to learn about the topic
see-also.txt, a list of pointers to related concepts
questions.txt, a list of questions for the user to think about. We are currently debating what to include here, so you can ignore it for now.

The files title.txt, summary.txt, topics.txt and questions.txt are currently treated as plain text files, but we're considering using Markdown or Textile formatting. The remaining files have a particular structure described below.

Dependencies

The file content/nodes/node_name/dependencies.txt gives a list of the concepts which a particular concept depends on. Each dependency is given as a list of field/value pairs, and the dependencies are separated by blank lines. There are three fields:

tag, the human-readable tag for the required concept
reason, the reason that concept is required. This field is optional, but it generally should be given unless it is obvious from the titles that one concept is an elaboration of the other.
shortcut, which specifies whether a shortcut can be used in lieu of the full content node. See the editing guidelines for more discussion of shortcuts and the shortcuts section for the format. The default value is 0 (false), so the only meaningful value to specify is 1 (true).

Currently, the ordering of the dependencies in the file is not used, but we are considering using it to determine in what order the concepts should be presented to the user. TODO: This will be discussed in more detail in the page for content editing guidelines.

Here is an example, for the gaussian_process_regression node:

tag: gaussian-processes

tag: bayesian-linear-regression
reason: Gaussian process regression is a kernelized version of Bayesian linear regression.
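
The defaults described above could be applied roughly as follows. This is a Python sketch only: normalize_dependency is a hypothetical name, the input is assumed to be a dict of strings from an already-parsed dependencies.txt entry, and treating tag as mandatory is our reading of the field list rather than a stated rule.

```python
def normalize_dependency(dep):
    """Fill in defaults for one parsed dependencies.txt entry."""
    if "tag" not in dep:
        # Assumption: a dependency without a tag is an error.
        raise ValueError("dependency entry has no tag field")
    return {
        "tag": dep["tag"],
        "reason": dep.get("reason"),  # optional; None when omitted
        # shortcut defaults to 0 (false); only "1" means true.
        "shortcut": dep.get("shortcut", "0") == "1",
    }


deps = [
    {"tag": "gaussian-processes"},
    {
        "tag": "bayesian-linear-regression",
        "reason": "Gaussian process regression is a kernelized version "
                  "of Bayesian linear regression.",
    },
]
normalized = [normalize_dependency(d) for d in deps]
```

The list comprehension keeps the entries in file order, which would matter if the ordering is later used for presentation.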

Resources

The file content/nodes/node_name/resources.txt gives a list of resources where you can learn about a concept. The list should be interpreted as "read one of the following," rather than "read all of the following."

There are some resources (such as textbooks or online courses) which are used over and over again, and others (such as individual papers) which are only used once. The former are defined in the global resources list. These may be referred to here by specifying the source field, which will pull in the default values associated with that resource. For unique resources, simply don't specify a source field, and instead specify each of the values individually.

The resources are given as lists of field/value pairs, and the resources are separated by blank lines. The following fields are conventionally specified in the node-specific resources list:

location, the location within the resource which the user should read/watch
edition, the edition number of a textbook. Currently this isn't used, but we are planning to allow resources to be added for multiple editions of a textbook, and the user can choose which one is to be displayed.
mark, an annotation for the resource. Currently, the only mark is star, which indicates that the resource is well-written and fits nicely with the structure of the concept map. (Generally, we're expecting that the user would start with a starred resource, and maybe go to one of the other ones for additional clarification.)
dependencies, a comma-separated list of tags representing additional concepts that the resource depends on and which aren't already given by the graph structure

In addition, all of the fields listed in the global resources list may be specified here as well. This is often the case for unique resources.
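
The override behaviour of the source field can be sketched in Python as follows. The function resolve_resource is a hypothetical helper, global_resources is assumed to be a dict mapping each key to its parsed global entry, and the location value in the example is made up for illustration.

```python
def resolve_resource(node_entry, global_resources):
    """Merge a node-specific entry with its global defaults, if any.

    Node-specific values override the defaults pulled in via `source`.
    """
    entry = dict(node_entry)           # do not mutate the caller's dict
    source = entry.pop("source", None)
    if source is None:
        return entry                   # unique resource: nothing to merge
    if source not in global_resources:
        raise KeyError(f"unknown resource key: {source}")
    return {**global_resources[source], **entry}


global_resources = {
    "pgm": {
        "key": "pgm",
        "title": "Probabilistic Graphical Models: Principles and Techniques",
        "resource_type": "textbook",
        "free": "0",
    },
}

# Hypothetical node-specific entry; the location is illustrative only.
node_entry = {"source": "pgm", "location": "Chapter 3", "mark": "star"}
resolved = resolve_resource(node_entry, global_resources)
```

Because the node-specific dict is unpacked last, any field it repeats replaces the global default.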

Here is resources.txt for the gaussian_processes node:

source: bishop
edition: 1
location: Section 6.4-6.4.2, pages 303-311
dependencies: bayesian-linear-regression
mark: star

source: murphy
edition: 1
location: Section 15.1-15.2.3, pages 515-521
dependencies: bayesian-linear-regression

source: gpml
edition: 1
location: Section 2.2, up to "Prediction with noise-free observations," pages 13-15
mark: star

source: barber
edition: 1-online
location: Section 19.1, pages 383-386
dependencies: bayesian-linear-regression

(Note: the funny edition number for the Barber book is a result of the online edition being slightly different from the printed one.)

Here is an example of a unique resource:

resource_type: paper
authors: Yann LeCun and Leon Bottou and Yoshua Bengio and Patrick Haffner
title: Gradient-based learning applied to document recognition
url: http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf
free: 1
mark: star

See-also links

Finally, the file see-also.txt gives pointers to other concept nodes related to the current one. Common examples include techniques which improve on the current one, issues to watch out for, applications where the concept is used, or concepts which specialize or generalize the current one. The format will probably be Markdown or Textile; we have not decided which, and currently we're only using bulleted lists, which are the same in both formats. Additionally, each line may optionally end with the tag for the pointed-to node, in square brackets. Here is the file for gaussian_processes:

* Gaussian processes have a variety of uses in machine learning, including:
** regression [gaussian-process-regression]
** classification [gaussian-process-classification]
** black-box optimization (where we only get to evaluate the function, and doing so is expensive) [bayesian-optimization-with-gaussian-processes]
** reinforcement learning [gaussian-processes-for-reinforcement-learning]
* Techniques for constructing kernel functions [constructing-kernels]
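
Since the format is not settled, the following Python sketch is only one way the bulleted lines above might be read: it extracts the nesting depth (number of leading asterisks), the text, and the optional trailing tag in square brackets. The name parse_see_also_line and the regular expression are our own assumptions.

```python
import re

# Leading asterisks, the text, then an optional "[tag]" at the end.
LINE_RE = re.compile(r"^(\*+)\s+(.*?)(?:\s*\[([^\[\]]+)\])?\s*$")


def parse_see_also_line(line):
    """Return (depth, text, tag) for a bullet line, or None otherwise.

    tag is None when the line does not end with a bracketed tag.
    """
    match = LINE_RE.match(line)
    if match is None:
        return None
    stars, text, tag = match.groups()
    return len(stars), text, tag


print(parse_see_also_line("** regression [gaussian-process-regression]"))
# (2, 'regression', 'gaussian-process-regression')
```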

Shortcuts

Sometimes one concept only requires understanding another at a very general level. In these cases, the solution is to add a shortcut, which is based on the original concept node, but with a reduced set of dependencies and a different set of resources. The format is simple: the shortcuts directory at the top level contains a list of subdirectories, which should be human-readable tags matching those in the nodes directory. Each shortcut subdirectory contains the files dependencies.txt, resources.txt, and optionally questions.txt, each of which overrides the corresponding file from the nodes directory and has the same format. Note that the dependencies for the shortcut node are required to be a subset of the dependencies for the original concept node. The directory should also have a topics.txt, which is not used by the server, but is there to clarify for the maintainers what topics are included in the shortcut.
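
The subset requirement on shortcut dependencies lends itself to a simple consistency check. The Python sketch below assumes both dependency lists have already been parsed into dicts with a tag field. The function name is hypothetical, and the second dependency tag in the example is invented for illustration.

```python
def extra_shortcut_dependencies(node_deps, shortcut_deps):
    """Return shortcut dependency tags that the full node does not have.

    An empty result means the shortcut satisfies the subset rule.
    """
    node_tags = {dep["tag"] for dep in node_deps}
    shortcut_tags = {dep["tag"] for dep in shortcut_deps}
    return sorted(shortcut_tags - node_tags)


node_deps = [
    {"tag": "gaussian-processes"},
    {"tag": "bayesian-linear-regression"},
]
valid_shortcut = [{"tag": "gaussian-processes"}]
# "some-unrelated-concept" is a made-up tag for illustration.
invalid_shortcut = [{"tag": "some-unrelated-concept"}]

print(extra_shortcut_dependencies(node_deps, valid_shortcut))    # []
print(extra_shortcut_dependencies(node_deps, invalid_shortcut))  # ['some-unrelated-concept']
```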

Courses

A large fraction of users are likely to have already taken basic undergrad courses in subjects like linear algebra and probability theory. For subjects which are sufficiently standardized across institutions, we specify the list of concepts covered, so that those concepts can be hidden from users who specify that they've already taken the course.

Inside the courses directory is a list of subdirectories, whose names are human-readable tags analogous to the concept tags. Each of these subdirectories contains title.txt, which gives the course title which is displayed to the user, and concepts.txt, which is a listing of all the concepts covered by the course. In concepts.txt, each line is a single concept tag.
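
Hiding the concepts covered by courses a user has taken amounts to a set difference. The following Python sketch assumes concepts.txt follows the same comment rule as the other files. The function names, the course tag, and the concept tags in the example are all hypothetical.

```python
def parse_concepts(text):
    """Read concepts.txt content: one concept tag per line."""
    tags = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            tags.append(line)
    return tags


def visible_concepts(all_tags, courses, taken):
    """Drop concepts covered by any course the user has taken.

    courses maps a course tag to its list of concept tags.
    The order of all_tags is preserved.
    """
    hidden = set()
    for course_tag in taken:
        hidden.update(courses.get(course_tag, ()))
    return [tag for tag in all_tags if tag not in hidden]


# Made-up course and concept tags, for illustration only.
courses = {"linear-algebra": parse_concepts("vectors\nmatrix-inverse\n")}
all_tags = ["vectors", "gaussian-processes", "matrix-inverse"]
print(visible_concepts(all_tags, courses, ["linear-algebra"]))
# ['gaussian-processes']
```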

Discussion Points

Ordering dependencies sounds a bit precarious, i.e. we'd have to be very careful with write operations from the frontend/backend-utilities (such as changing dependency tags) -colorado
- Why is that? It doesn't seem like maintaining ordered lists should be much harder than maintaining unordered sets. I imagine changes to the graph structure will create more difficulties in terms of global consistency than in terms of ordering within a single file. -roger
  - It's certainly possible to maintain ordered lists; I'm just mentioning that we should be careful since "for in" loops in both python and javascript may change the ordering if we use a hash-like structure, i.e. dictionaries, to store the dependencies. I use such a hash-like structure in the front-end editor as the dependencies need a unique identifier so that changing the dependency info in the browser changes the correct dependency. -colorado
