#summary Tracks progress for the data seeder project.
#labels data-seeder,gsoc2010
<wiki:toc max_depth="3" />
= Introduction =
This page will track progress of the GSoC 2010 project involving a new data seeder for Melange.
= Design documents =
== Data seeder configuration sheet and fixtures ==
=== Configuration sheet ===
One of the requirements of the data seeder is to support some sort of predefined configuration so that seeding tasks are easily repeatable and customizable. A JSON file generated by the web interface seems suitable for this task. I believe JSON is better than XML here because the web interface can generate it with native JavaScript support, and because it is more readable.
The configuration sheet will include the names of the models to seed. Parameters for the seeding will be provided via "data providers" (might think of a better name) that return different kinds of values (numbers, strings, dates, references etc.). These data providers might follow different statistical distributions, depending on context.
A draft of the configuration sheet format might look like this:
{{{
{
  "GSoCStudent" : // model name
  {
    "number": // number of models to seed
    {
      "provider_name" : "FixedInteger", // points to a data provider which simply returns an integer
      "parameters" : // different data providers might have different parameters
      {
        "value" : 100 // configures the data provider to always return 100
      }
    },
    "properties" : // this section will specify how to populate the properties or fields of a model
    {
      "given_name" : // seed the given_name property. Note that this is actually inherited from the Role model
      {
        "provider_name" : "NameGenerator", // points to a data provider that returns a name
        "parameters" : {} // use default/no parameters for this data provider
      },
      "birth_date" :
      {
        "provider_name" : "DateNormalDistribution", // this data provider generates random dates using the specified parameters
        "parameters" :
        {
          "mean" : "05/13/1988",
          "variance" : "20m", // a variance of 20 months
          "max" : "01/01/1992" // force a maximum value, for example to enforce that students are at least 18 years old
        }
      }
      // ... other properties should go here
    }
  }
  // ... other models should go here
}
}}}
This example shows how to configure the seeder to populate the datastore with some random data. However, it's missing one vital element: model relationships!
There are two ways to define relations for models:
# Seed all models with a reference to a specific model (either existing or new, either fixed or random etc.), as configured by a data provider. For example, we might want to seed a number of organization administrators, all referencing the same program.
# Specify how to seed related models for a specific model (indicated either by many-to-many relations via list properties or by auto-generated back-references). For example, for an organization, we might configure the seeder to add a number of student proposals for that specific organization.
*Examples:*
* Example 1
{{{
{
  "StudentProject" : // model name
  {
    "number": { ... }, // number of models to seed
    "properties" :
    {
      "student" :
      {
        "provider_name" : "RandomModelReference", // data provider that returns a random model
        "parameters" :
        {
          "model_name" : "Student" // configure the data provider to return a random student
        }
      },
      "mentor" :
      {
        "provider_name" : "FixedModelReference", // data provider that returns a specific model
        "parameters" :
        {
          "key" : "23502935" // the key of the model to reference; it must already exist in the datastore
        }
      }
    }
  }
}
}}}
* Example 2
{{{
{
  "Student" :
  {
    "number": { ... }, // number of models to seed
    "properties" :
    {
      "student_projects" : // this is the name of the back-reference property
      {
        "provider_name" : "RelatedModelSeeder", // data provider that seeds related models
        "parameters" :
        {
          "model_name" : "StudentProject", // the name of the related model (this could be unnecessary as it might be auto-detected)
          "configuration" : // the configuration for this data provider should follow the same rules as any model
          {
            "number" : { ... },
            "properties" :
            {
              "title" : { ... },
              "abstract" : { ... },
              ...
            }
          }
        }
      }
    }
  }
}
}}}
This might prove flexible enough to accomplish most data seeding needs, but it could be a bit too hard to configure. On the other hand, configuring is a one-time job, since configuration sheets will be reusable. Also, auto-detecting relations might fail for "tricky" relations; I am not sure whether any exist. If auto-detection works, changes in models will not require any changes in the data seeder at all.
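As an illustration of how a sheet like the draft above might be interpreted, here is a minimal Python sketch. It is not the seeder's actual implementation: the provider classes, the `get_value` method, the `PROVIDERS` registry and the `seed_model` function are all hypothetical names, and the sheet is written as a Python dictionary because the commented draft is not strict JSON. The result is a list of plain dictionaries rather than datastore entities.

```python
import random


class FixedInteger:
    """Always returns the configured integer (hypothetical implementation)."""

    def __init__(self, parameters, rng):
        self.value = int(parameters["value"])

    def get_value(self):
        return self.value


class NameGenerator:
    """Returns a random name from a small built-in list (illustrative only)."""

    NAMES = ("Alice", "Bob", "Carol", "Dave")

    def __init__(self, parameters, rng):
        self.rng = rng

    def get_value(self):
        return self.rng.choice(self.NAMES)


# Maps the "provider_name" strings used in the sheet to provider classes.
PROVIDERS = {"FixedInteger": FixedInteger, "NameGenerator": NameGenerator}


def build_provider(spec, rng):
    """Instantiates the provider described by one sheet entry."""
    provider_class = PROVIDERS[spec["provider_name"]]
    return provider_class(spec.get("parameters", {}), rng)


def seed_model(model_config, rng):
    """Returns one dictionary of property values per seeded instance."""
    count = build_provider(model_config["number"], rng).get_value()
    providers = {
        name: build_provider(spec, rng)
        for name, spec in model_config["properties"].items()
    }
    return [
        {name: provider.get_value() for name, provider in providers.items()}
        for _ in range(count)
    ]


sheet = {
    "GSoCStudent": {
        "number": {"provider_name": "FixedInteger", "parameters": {"value": 3}},
        "properties": {
            "given_name": {"provider_name": "NameGenerator", "parameters": {}},
        },
    }
}
students = seed_model(sheet["GSoCStudent"], random.Random(42))
```

Passing an explicit `random.Random` instance keeps a seeding run repeatable, which matches the goal of repeatable seeding tasks stated earlier.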
=== Proposed data providers ===
* _*StringProperty*_
* *FixedStringProvider* - returns a fixed string
* *RandomWordProvider* - returns a random word from a configurable list
* *RandomNameProvider* - returns a random name
* *RandomPhraseProvider* - returns a random phrase
* _*BooleanProperty*_
* *FixedBooleanProvider* - returns a fixed boolean
* *RandomBooleanProvider* - returns a random boolean with configurable probability
* _*IntegerProperty*_
* *FixedIntegerProvider* - returns a fixed integer
* *RandomUniformDistributionIntegerProvider* - returns a random integer sampled from a discrete uniform distribution
* *RandomNormalDistributionIntegerProvider* - returns a random integer sampled from a normal distribution
* *SequenceIntegerProvider* - returns an integer from an enumerable sequence (e.g. counting objects starting from 1)
* _*FloatProperty*_
* *FixedFloatProvider* - returns a fixed float
* *RandomUniformDistributionFloatProvider* - returns a random float sampled from a continuous uniform distribution
* *RandomNormalDistributionFloatProvider* - returns a random float sampled from a normal distribution
* _*DateTimeProperty*_
* *FixedDateTimeProvider* - returns a fixed date and time
* *RandomUniformDistributionDateTimeProvider* - returns a random date and time sampled from a discrete uniform distribution
* *RandomNormalDistributionDateTimeProvider* - returns a random date and time sampled from a normal distribution
* _*DateProperty*_
* *FixedDateProvider* - returns a fixed date
* *RandomUniformDistributionDateProvider* - returns a random date sampled from a discrete uniform distribution
* *RandomNormalDistributionDateProvider* - returns a random date sampled from a normal distribution
* _*UserProperty*_
* *FixedUserProvider* - returns a fixed User
* *RandomUserProvider* - returns a random User
* _*BlobProperty*_
* *TODO:* Maybe random document formats (.doc, .pdf, .odt) or simply random binary data
* _*TextProperty*_
* Anything supported by the *StringProperty*
* *FixedTextProvider* - returns a fixed text
* *RandomParagraphProvider* - returns a random paragraph
* *RandomPlainTextDocumentProvider* - returns a random document in plain text (i.e. multiple paragraphs).
* *RandomHtmlDocumentProvider* - returns a random document in HTML
* *RandomMarkdownDocumentProvider* - returns a random document in Markdown
* _*LinkProperty*_
* *FixedLinkProvider* - returns a fixed link
* *RandomLinkProvider* - returns a random link (? should random link be valid ?)
* _*EmailProperty*_
* *FixedEmailProvider* - returns a fixed e-mail address
* *RandomEmailProvider* - returns a random e-mail address
* _*PhoneNumberProperty*_
* *FixedPhoneNumberProvider* - returns a fixed phone number
* *RandomPhoneNumberProvider* - returns a random phone number
* _*ListProperty*_
* *EmptyListProvider* - returns an empty list
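To make two of the proposed providers more concrete, here is a rough Python sketch of a boolean provider with configurable probability and a normally distributed date provider with an optional maximum. The constructor signature, the `get_value` method and the parameter names `probability` and `stddev_days` are assumptions made for this sketch; the date format follows the `MM/DD/YYYY` strings used in the configuration sheet draft.

```python
import datetime
import random


class RandomBooleanProvider:
    """Returns True with the configured probability (default 0.5)."""

    def __init__(self, parameters, rng):
        self.probability = float(parameters.get("probability", 0.5))
        self.rng = rng

    def get_value(self):
        # rng.random() is in [0, 1), so probability 1.0 always yields True.
        return self.rng.random() < self.probability


class RandomNormalDistributionDateProvider:
    """Samples dates around a mean, optionally capped at a maximum date."""

    DATE_FORMAT = "%m/%d/%Y"

    def __init__(self, parameters, rng):
        self.mean = self._parse(parameters["mean"])
        # Hypothetical parameter: standard deviation expressed in days.
        self.stddev_days = float(parameters["stddev_days"])
        self.maximum = self._parse(parameters["max"]) if "max" in parameters else None
        self.rng = rng

    def _parse(self, text):
        return datetime.datetime.strptime(text, self.DATE_FORMAT).date()

    def get_value(self):
        offset = self.rng.gauss(0, self.stddev_days)
        value = self.mean + datetime.timedelta(days=round(offset))
        # Clamp to the maximum; resampling would be an alternative design.
        if self.maximum is not None and value > self.maximum:
            value = self.maximum
        return value


birth_dates = RandomNormalDistributionDateProvider(
    {"mean": "05/13/1988", "stddev_days": 600, "max": "01/01/1992"},
    random.Random(1),
)
sample = [birth_dates.get_value() for _ in range(1000)]
```

Clamping piles up samples exactly at the maximum, which slightly distorts the distribution; drawing again until the value is in range avoids that at the cost of extra iterations.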
==== GAE properties not used by Melange ====
* ByteStringProperty
* TimeProperty
* blobstore.BlobReferenceProperty
* CategoryProperty
* GeoPtProperty
* IMProperty
* PostalAddressProperty
* RatingProperty
=== Python fixtures ===
Python fixtures are used to easily set up and tear down a database for test purposes. The fixture module supports easy addition of static predefined data to the datastore and it's very well integrated with unit tests. The data seeder's main purpose however is to load arbitrary random data.
A local script could be used to generate a fixture dataset file from a populated database, which will then be used for unit testing. This will make it easy to provide lots of generated data to tests, and at the same time ensure that the data used for tests does not change between runs because of randomness.
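A rough sketch of what such a script could do is shown below. It only renders text: given rows already read from the datastore (here a hard-coded list of dictionaries), it produces Python source for a `DataSet` class in the style used by the fixture module. The function name `render_dataset` and the `row_N` naming scheme are made up for this sketch, and it assumes that property names are valid Python identifiers and that values have a literal `repr` (strings, numbers, booleans).

```python
def render_dataset(class_name, rows):
    """Renders rows (a list of dicts) as source code for a DataSet class."""
    lines = ["from fixture import DataSet", "", "", "class %s(DataSet):" % class_name]
    if not rows:
        lines.append("    pass")
    for index, row in enumerate(rows):
        lines.append("    class row_%d:" % index)
        if not row:
            lines.append("        pass")
        # Sort keys so that the generated file is stable between runs.
        for key in sorted(row):
            lines.append("        %s = %r" % (key, row[key]))
    return "\n".join(lines) + "\n"


source = render_dataset("StudentData", [{"given_name": "Alice", "age": 20}])
```

Writing `source` to a file once and committing it gives tests a frozen copy of the seeded data, independent of later random seeding runs.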
== Remote vs local ==
Remote and local data seeding will be done in the same manner, using GAE's Task Queue API. However, a different solution must be implemented for generating Python fixtures for unit testing. Fixtures will be generated by making a dump of the datastore after it has been seeded using the web interface. This way, any configuration sheet can be turned into a fixture file by seeding and dumping, while still allowing the user to tweak the data manually before dumping, if necessary.
== Web interface workflow ==
The web interface is the starting point for any seeding operation, either local or remote. It should give the user full control over all parameters of the seeding operation, while maintaining flexibility for future updates. The end result should be a JSON configuration file as defined above, which can be either uploaded directly to the server or downloaded to be used with the offline scripts.
=== Steps for configuring the seeding operation ===
# The user is presented with a number of models which can be seeded and selects one or more models to seed.
# For any specific model, an AJAX request is made to retrieve a list of that model's properties. For each property, an AJAX request is made to retrieve the available data providers for the property type (e.g. string, number, reference, date/time etc.)
# The user selects the number of instances to seed (this will be done by actually configuring a number data provider).
# For each model property that the user wants to seed, selecting a data provider will make an AJAX request to retrieve the parameters required to fill for that specific data provider. The user then fills in the data provider parameters (some might be optional).
* an auto-complete option for some parameters would be nice or even pretty much required in some cases (e.g. searching for an existing model).
# Reference properties are treated as a special case. Depending on the data provider, the user might be provided with a new model to seed. The user will repeat step 2 for this model.
# Available back-references for the chosen model are presented to the user. Related nested models will be configured by repeating step 2. The nested model's reference to the parent model will be ignored since it will be overwritten by the back-reference.
=== Potential problems and workarounds ===
# Some models have many-to-many references implemented as ListProperty to db.Key() internally so that the model type it references cannot be extracted.
* Possible workarounds:
* Modify the internal structure to include some metadata about the model type it references.
* Let the user manually choose which model(s) to reference in the web interface.
# A lot of models are referenced via the scope mechanism rather than using direct reference properties. There should be an easy way to configure these relations in the web interface.
=== UML Use Case diagram ===
*TODO!*
=== Mockup ===
*TODO!*
== AJAX API ==
* _*/get_data*_
* Returns a list of models (including details such as properties) that can be seeded and a list of available data providers for each property type.
* Input: N/A
* Output:
{{{
{
  "models" : [
    {
      "name": "Student",
      "description": "Student details for a specific Program.",
      "parent": "models.role.Role",
      "properties": {
        "school_name": {
          "name": "School name",
          "type": "StringProperty",
          "group": "5. Education",
          "required": true,
          "description": "..."
        },
        "school_type": {
          "name" : "School Type",
          "type": "StringProperty",
          "group": "5. Education",
          "required": false,
          "choices": ["University", "High School"],
          "description": "..."
        },
        ...
      }
    },
    {
      "name": "Linkable",
      "description": "...",
      "parent": "models.base.ModelWithFieldAttributes",
      "properties" : {
        ...
      }
    },
    {
      "name": "ModelWithFieldAttributes",
      "description": "...",
      "parent": null,
      "properties" : {
        ...
      }
    },
    ...
  ],
  "providers" : {
    "StringProperty": {
      "FixedStringProvider": {
        "name": "Fixed string provider",
        "description": "Returns a fixed string.",
        "parameters" : {
          "value": {
            "name": "Value",
            "description": "The value of the string to return",
            "required": true
          }
        }
      },
      "RandomWordProvider": {
        "name": "Random word provider",
        "description": "Returns a random word.",
        "parameters" : {
          "choices": {
            "name": "Choices",
            "description": "A list of words to choose from",
            "required": false
          }
        }
      },
      ...
    },
    "IntegerProperty" : {
      ...
    },
    ...
  }
}
}}}
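Since each model in this output names its parent, a consumer has to walk the parent chain to find every property that can be seeded for a model. The Python sketch below shows one way to do that, although the real client-side code is JavaScript. It rests on two assumptions that the output above does not confirm: that each model's `properties` is a mapping from property name to details, and that the last component of a dotted `parent` path (e.g. `models.role.Role`) equals the `name` of the parent model. The `Role` entry in the sample data is invented for the example.

```python
def all_properties(model_name, models_by_name):
    """Returns own plus inherited properties; child definitions win."""
    chain = []
    name = model_name
    while name is not None:
        model = models_by_name[name]
        chain.append(model)
        parent = model.get("parent")
        # Assumption: "models.role.Role" refers to the model named "Role".
        name = parent.rsplit(".", 1)[-1] if parent else None
    merged = {}
    # Apply base models first so that subclasses override them.
    for model in reversed(chain):
        merged.update(model["properties"])
    return merged


models = [
    {
        "name": "Student",
        "parent": "models.role.Role",
        "properties": {"school_name": {"type": "StringProperty"}},
    },
    {
        "name": "Role",
        "parent": None,
        "properties": {"given_name": {"type": "StringProperty"}},
    },
]
models_by_name = {model["name"]: model for model in models}
student_properties = all_properties("Student", models_by_name)
```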
= Architecture =
== Client-side ==
There are 2 JavaScript files containing the client-side logic:
* *app/soc/content/js/melange.seeder-??????.js*
* Keeps the metadata about models and providers (which it queries the server for) and provides convenience functions for accessing the data.
* *app/soc/content/js/templates/modules/seeder/home-??????.js*
* Does everything that happens in the web interface, using the data provided by _melange.seeder-??????.js_. This includes:
* Adding models to the web interface and configuring them
* Importing existing configuration sheets
* Building configuration sheets
== Server-side ==
* Server side code is in the *soc.modules.seeder* package
* *soc.modules.seeder.logic* - contains all the logic for the data seeder
* *logic.models* - Takes care of the model introspection part and builds the JSON data sent to the client regarding models.
* *logic.providers* - Same as _logic.models_, but handles data providers instead of models.
* *logic.seeder* - Handles configuration sheet validation and actually seeding a model given a valid configuration sheet.
* *logic.mapper* - Takes care of the seeding using the MapReduce function. Defines mapper functions and input readers. Splits the full configuration sheet into smaller pieces and forwards the seeding job to _logic.seeder_.
* *soc.modules.seeder.views* - Defines the views used by the seeder. Mostly just reuses code from _soc.modules.seeder.logic_.
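The splitting step mentioned for _logic.mapper_ could look roughly like the Python sketch below. This is only a guess at the idea, not the module's actual code: the function name `split_sheet` and the `batch_size` argument are hypothetical, and the sketch assumes that every model's `number` entry uses a fixed-integer provider whose count is stored under `parameters.value`.

```python
import copy


def split_sheet(sheet, batch_size):
    """Yields single-model sheets that each seed at most batch_size entities."""
    for model_name, model_config in sheet.items():
        # Assumption: the count is a fixed integer known before seeding.
        remaining = int(model_config["number"]["parameters"]["value"])
        while remaining > 0:
            count = min(batch_size, remaining)
            piece = copy.deepcopy(model_config)
            piece["number"] = {
                "provider_name": "FixedInteger",
                "parameters": {"value": count},
            }
            yield {model_name: piece}
            remaining -= count


sheet = {
    "GSoCStudent": {
        "number": {"provider_name": "FixedInteger", "parameters": {"value": 250}},
        "properties": {},
    }
}
pieces = list(split_sheet(sheet, 100))
```

Each piece is itself a valid sheet for a single model, so the same seeding code can process a piece or a full sheet.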
= Integration with testing framework =
== Draft proposal ==
* Provide a function that will take care of seeding models of a type and automatically handle dependency in order to easily get some data to play with. This might be possible using a single function with the following prototype:
* *def seed(model_type, number=1, data=None, configuration_sheet=None)*
* Arguments description:
* *model_type*: The type of the model to seed.
* *number*: The number of models of the specified type to seed. Dependencies are not included in this number.
* *data*: A dictionary containing values to include in the model. For example _{link_id: "asdf"}_ or _{scope: reference_to_other_model}_.
* *configuration_sheet*: Allows specifying a full configuration sheet that will be used for seeding the models. This can provide a lot more flexibility and control in how the data and its dependencies are seeded.
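To show how a test might call such a function, here is a small in-memory sketch that follows the prototype above. It is a stand-in, not the proposed implementation: the `Student` class replaces a real datastore model, the generated `link_id` values are invented, dependency handling is left out, and the `configuration_sheet` path is deliberately not implemented.

```python
import itertools

# Shared counter used to generate distinct placeholder link IDs.
_sequence = itertools.count(1)


class Student:
    """Stand-in for a datastore model; accepts arbitrary properties."""

    def __init__(self, **properties):
        self.__dict__.update(properties)


def seed(model_type, number=1, data=None, configuration_sheet=None):
    """Creates `number` instances of model_type, overriding defaults with data."""
    if configuration_sheet is not None:
        raise NotImplementedError("sheet-driven seeding is outside this sketch")
    entities = []
    for _ in range(number):
        properties = {"link_id": "seeded_%d" % next(_sequence)}
        # Values passed in by the caller take precedence over generated ones.
        properties.update(data or {})
        entities.append(model_type(**properties))
    return entities


students = seed(Student, number=3, data={"given_name": "Ann"})
```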
= Proposed timeline =
== Week 1 (24 May - 30 May) ==
*TODO*:
# Set up the coding environment and directory structure for the module.
# Design the API for the AJAX interaction between the client-side web interface and the server.
*Deliverables*:
# Documentation for the AJAX API.
# Skeleton structure for the module (mostly empty files).
== Week 2 (31 May - 6 June) ==
*TODO*:
# Implement data providers (except for reference data providers).
*Deliverables*:
# Implementation and tests for data providers.
== Week 3 (7 June - 13 June) ==
*TODO*:
# Implement request handlers for providing list of models and properties.
# Implement request handlers for providing list of possible data providers and parameters for a specific property.
*Deliverables*:
# Implementation and tests for request handlers.
== Week 4 (14 June - 20 June) ==
*TODO*:
# Design basic web interface and basic interaction with server (list of models, properties and data providers, no references yet).
*Deliverables*:
# Working web interface that interacts with the server and produces a valid JSON configuration sheet.
== Week 5 (21 June - 27 June) ==
*TODO*:
# Create a prototype view for seeding the database based on a configuration sheet (without using tasks API at this time). Also, no support for references at this time.
== Week 6 (28 June - 4 July) ==
*TODO*:
# Add support for adding forward and back references in the web interface.
# Update the AJAX request handlers and data providers to include information about references (if not done yet).
*Deliverables*:
# Working web interface that also includes model reference information in the configuration sheet.
# Implementation and tests for updated request handlers and data providers.
== Week 7 (5 July - 11 July) ==
*TODO*:
# Update the prototype view to also seed model references.
== Week 8 (12 July - 18 July) ==
*TODO*:
# Switch the seeding process to use the Task Queue API (seeding basic properties only, no references yet).
*Deliverables*:
# Implementation and tests for the task API seeding system.
== Week 9 (19 July - 25 July) ==
*TODO*:
# Update the task API seeding system to also seed references.
== Week 10 (26 July - 1 August) ==
*TODO*:
# Add the ability to load an existing JSON configuration sheet in the web interface and edit it.
*Deliverables*:
# Implementation and tests for the new web interface features.
== Week 11 (2 August - 8 August) ==
*TODO*:
# Make a script to dump the datastore to a Python fixture file.
# Prepare for the pencils down date :).
*Deliverables*:
# Script for dumping the database.
== Week 12 (9 August - 15 August) ==
*TODO*:
# Add any remaining finishing touches. Hopefully, everything got implemented correctly and the system is fully functioning.
= Meetings and Agendas =
== Thursday August 19, 2010 14:00 in UTC ==
===Meeting Logs===
Not available
===Meeting Notes===
Notes taken by: Felix Kerekes
* Explained to Mario how the web interface and the backend works.
* Mario did a demo seeding from scratch after explaining all the gotchas.
* Discussed current project status and future plans.
* TODOs:
* Fill in the missing information in the wiki page.
* Make a screencast describing how the seeder works and how to use it.
* Talk to Leo about integration with the testing framework.
* Keep working! :)
== Monday August 2, 2010 14:00 in UTC ==
===Meeting Logs===
Not available
===Meeting Notes===
Notes taken by: Felix Kerekes
* Discussed plans for including a treeview navigation panel in the data seeder web interface.
== Thursday July 22, 2010 14:00 in UTC ==
===Meeting Logs===
Not available
===Meeting Notes===
Notes taken by: Felix Kerekes
* Talked about JS style and comments
== Thursday July 15, 2010 14:00 in UTC ==
===Meeting Logs===
Not available
===Meeting Notes===
Notes taken by: Felix Kerekes
* After trying a few other (some broken) templating engines, pure (http://beebole.com/pure/) is chosen as the winner :)
* Talked about midterm project status
* Talked about future project plans and how to catch up in the second half of the project
== Thursday June 24, 2010 09:00 in UTC ==
===Meeting Logs===
Not available
===Meeting Notes===
Notes taken by: Felix Kerekes
* Decided that a client-side JavaScript templating engine should be integrated
* Discussed how the information in the web interface should be presented/stored and how to build the configuration sheet from this information
== Monday June 21, 2010 19:40 in UTC ==
===Meeting Logs===
Meeting start: http://www.thousandparsec.net/~irc/logm/%23melange.2010-06-21.log.html#t2010-06-21T19:39:50
Meeting end: http://www.thousandparsec.net/~irc/logm/%23melange.2010-06-21.log.html#t2010-06-21T21:34:24
===Meeting Notes===
Location: #melange on freenode
Chaired by: Mario Ferraro
Notes taken by: Felix Kerekes
* Data JSON
* Currently includes some redundant information (i.e. base properties are repeated in all child models)
* Refactor to only include model-specific properties without the inherited properties (the latter can be found by looking at the parent). Complicates the client-side code a bit, but inheritance should be somehow handled anyway.
* Web interface
* First, read http://jsninja.com/Overview !!!
* Extend the base Melange JavaScript scripts
* Add a switch to http://code.google.com/p/soc/source/browse/app/soc/templates/soc/base.html#109
* Add a new param to http://code.google.com/p/soc/source/browse/app/soc/views/helper/params.py#54
* Use http://code.google.com/p/soc/source/browse/app/soc/content/js/melange.tooltip-100204.js as a skeleton for melange.seeder-??????.js (lines 1-43 and 102)
* Template
* See http://code.google.com/p/soc/source/browse/app/soc/templates/soc/organization/home.html#20 for defining a melangeContext attribute
* See http://code.google.com/p/soc/source/browse/app/soc/content/js/templates/soc/organization/home-091027.js for a JS file connected to the template
* Work on unminimized files. Minimize only when deploying.
== Tuesday June 1, 2010 09:00 in UTC ==
===Meeting Logs===
Meeting start: http://www.thousandparsec.net/~irc/logm/%23melange.2010-06-01.log.html#t2010-06-01T09:17:38
Meeting end: http://www.thousandparsec.net/~irc/logm/%23melange.2010-06-01.log.html#t2010-06-01T10:21:11
===Meeting Notes===
Location: #melange on freenode
Chaired by: Mario Ferraro
Notes taken by: Felix Kerekes
* AJAX API
* Making one single request at the beginning to gather all the data might be better
* Include all properties for all models, and all data providers for each property type. The UI should figure out how to link them.
* jLinq might help here: http://www.hugoware.net/Projects/jLinq
* Automatically validate each data provider's parameters as it is configured in the web UI
* The user should find it easy to notice and fix errors when using an outdated or broken JSON configuration file.
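As a rough illustration of the validation point above, the following Python sketch checks a provider's configured parameters against metadata shaped like the `providers` section of the AJAX API output. The function name `validate_parameters` and the wording of the messages are invented for this example; the actual checks in the web UI would be written in JavaScript.

```python
def validate_parameters(provider_metadata, parameters):
    """Returns a list of human-readable problems; an empty list means valid."""
    known = provider_metadata["parameters"]
    problems = []
    for name, info in known.items():
        if info.get("required") and name not in parameters:
            problems.append("missing required parameter: %s" % name)
    # Unknown names often indicate an outdated configuration sheet.
    for name in parameters:
        if name not in known:
            problems.append("unknown parameter: %s" % name)
    return problems


fixed_string = {
    "name": "Fixed string provider",
    "parameters": {
        "value": {"name": "Value", "required": True},
    },
}
problems = validate_parameters(fixed_string, {"valeu": "abc"})
```

Returning every problem at once, rather than stopping at the first, makes it easier to repair an outdated sheet in a single pass.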
== Friday May 21, 2010 14:00 in UTC ==
===Meeting Logs===
Meeting start: http://www.thousandparsec.net/~irc/logm/%23melange.2010-05-21.log.html#t2010-05-21T14:04:20
Meeting End: http://www.thousandparsec.net/~irc/logm/%23melange.2010-05-21.log.html#t2010-05-21T15:14:04
===Meeting Notes===
Location: #melange on freenode
Chaired by: Mario Ferraro
Notes taken by: Felix Kerekes
* TODOs:
* Subscribe to project feeds and melange-soc-commits group and set up some filters.
* GAE now supports local tasks. Do some research on this.
* Provide some more detail on JSON configuration and data providers.
* Things to ask on next conference call:
* Can fixtures be somehow easily exported from the datastore? Or should there be a script to generate the fixture files from scratch?
* Things to ask on the dev mailing list:
* Should the seeding be an incremental/progressive operation or one-step?
* Should some predefined JSON files be deployed to GAE as well?
== Wednesday May 12, 2010 8:00 in UTC ==
===Meeting Logs===
Meeting start: http://www.thousandparsec.net/~irc/logm/%23melange.2010-05-12.log.html#t2010-05-12T08:20:19
Meeting End: http://www.thousandparsec.net/~irc/logm/%23melange.2010-05-12.log.html#t2010-05-12T09:29:37
===Meeting Notes===
Location: #melange on freenode
Time: 8:00 UTC, Wednesday, 12th of May, 2010
Chaired by: Mario Ferraro
Notes taken by: Felix Kerekes
* TODOs:
* Edit blog post, give credits to open source projects used for generating graphs
* Integrate the diagram generator script in the code base
* Generate .dot files with date so we can keep a history of the models
* In the future, making some kind of animation from the first commit might be really cool
* Post a detailed project schedule in the wiki, with weekly milestones
* Focus on getting the design documents done until the end of the community bonding period
* Homework for next Monday: remote/local thingy and XML/YAML/Fixtures analysis
* Conference call
* Problems regarding Melange state and timeline
* How does Melange track the current phase of a program? Is it solely by date, or are there any switches in the DB to reflect that?
* Madhusudan told us that checks to the timeline are always made and showed us some code, huge thanks :)
== Tuesday May 4, 2010 13:30:00 in UTC ==
===Meeting Logs===
Meeting start: http://www.thousandparsec.net/~irc/logm/%23melange.2010-05-04.log.html#t2010-05-04T15:00:09
Meeting End: http://www.thousandparsec.net/~irc/logm/%23melange.2010-05-04.log.html#t2010-05-04T16:18:24
===Meeting Notes===
Location: #melange on freenode
Time: 15:00 UTC, Tuesday, 4th of May, 2010
Chaired by: Mario Ferraro
Notes taken by: Felix Kerekes
* Meeting schedule
* IRC meetings will be held on a weekly basis, after the conference call.
* Availability during summer
* Felix will attend a part-time internship sometime around July.
* Tasks during community bonding period
* Draw a UML schema of the current data model.
* Design documents for web interface workflow.
* Further explore the possibilities of seeding locally/remotely.
* XML/YAML/Fixtures - advantages and drawbacks
* Build some prototypes to really get the feel of each solution.