This page will track progress of the GSoC 2010 project involving a new data seeder for Melange.
One of the requirements of the data seeder is to support some sort of predefined configuration so that seeding tasks are easily repeatable and customizable. A JSON file generated by the web interface seems suitable for this task. I believe JSON is a better fit than XML because it is natively supported by the JavaScript that will generate the file in the web interface, and because it is more readable.
The configuration sheet will include the names of the models to seed. Parameters for the seeding will be provided via “data providers” (a better name might come up) that return values of different kinds (numbers, strings, dates, references, etc.). These data providers might follow different statistical distributions, depending on context.
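A minimal sketch of how data providers could look in Python, assuming a common base class and a registry keyed by the provider name used in the configuration sheet. The names `DataProvider`, `PROVIDERS`, `build_provider`, `get_value` and the `RandomInteger` provider are illustrative assumptions, not part of the actual seeder.

```python
import random


class DataProvider(object):
    """Hypothetical base class: returns one value per call to get_value()."""

    def __init__(self, **parameters):
        self.parameters = parameters

    def get_value(self):
        raise NotImplementedError


class FixedInteger(DataProvider):
    """Always returns the integer given in the 'value' parameter."""

    def get_value(self):
        return int(self.parameters["value"])


class RandomInteger(DataProvider):
    """Assumed provider: uniform integer between 'min' and 'max' (inclusive)."""

    def get_value(self):
        return random.randint(self.parameters["min"], self.parameters["max"])


# Registry mapping provider names to classes (an assumption).
PROVIDERS = {
    "FixedInteger": FixedInteger,
    "RandomInteger": RandomInteger,
}


def build_provider(spec):
    """Builds a provider from a {'provider_name': ..., 'parameters': ...} dict."""
    provider_class = PROVIDERS[spec["provider_name"]]
    return provider_class(**spec.get("parameters", {}))


number = build_provider({"provider_name": "FixedInteger",
                         "parameters": {"value": 100}})
print(number.get_value())
```

Keeping each provider behind the same `get_value()` call would let the seeder treat fixed values and random distributions the same way.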
A draft of the configuration sheet format might look like this:
{
"GSoCStudent" : // model name
{
"number": // number of models to seed
{
"provider_name" : "FixedInteger", // points to a data provider which simply returns an integer
"parameters" : // different data providers might have different parameters
{
"value" : 100 // configures the data provider to always return 100
}
},
"properties" : // this section will specify how to populate the properties or fields of a model
{
"given_name" : // seed the given_name property. Note that this is actually inherited from the Role model
{
"provider_name" : "NameGenerator", // points to a data provider that returns a name
"parameters" : {} // use default/no parameters for this data provider
},
"birth_date" :
{
"provider_name" : "DateNormalDistribution", // this data provider generates random dates using the specified parameters
"parameters" :
{
"mean" : "05/13/1988",
"variance" : "20m", // a variance of 20 months
"max" : "01/01/1992" // force a maximum value, for example to enforce that students are at least 18 years old
}
}
// ... other properties should go here
}
}
// ... other models should go here
}
This example shows how to configure the seeder to populate the datastore with some random data. However, it is missing one vital element: model relationships.
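The way a seeder might walk such a sheet can be sketched as follows. This is only an illustration: the sheet is given as a Python dict (the `//` comments in the draft are not valid JSON, so they would have to be stripped before parsing), the two providers are simplified stand-ins, and the hypothetical `seed_model` helper returns plain dicts instead of writing to the datastore.

```python
import random


# Simplified stand-ins for data providers: each takes its parameters
# and returns a value. The name list is made up for the example.
def fixed_integer(value):
    return int(value)


def name_generator(names=("Alice", "Bob", "Carol")):
    return random.choice(names)


PROVIDERS = {"FixedInteger": fixed_integer, "NameGenerator": name_generator}


def evaluate(spec):
    """Calls the provider named in spec with the parameters from spec."""
    return PROVIDERS[spec["provider_name"]](**spec.get("parameters", {}))


def seed_model(model_name, model_sheet):
    """Returns a list of property dicts, one per entity to create."""
    count = evaluate(model_sheet["number"])
    entities = []
    for _ in range(count):
        entity = {"model": model_name}
        for prop, spec in model_sheet["properties"].items():
            entity[prop] = evaluate(spec)
        entities.append(entity)
    return entities


sheet = {
    "GSoCStudent": {
        "number": {"provider_name": "FixedInteger",
                   "parameters": {"value": 3}},
        "properties": {
            "given_name": {"provider_name": "NameGenerator",
                           "parameters": {}},
        },
    },
}

students = seed_model("GSoCStudent", sheet["GSoCStudent"])
```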
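There are two ways to define relations for models: from the referencing side, with a data provider that returns an existing model (a random one or a specific one looked up by key), or from the referenced side, with a data provider that seeds the related models through the back-reference property.
Examples: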
{
"StudentProject" : // model name
{
"number": { ... }, // number of models to seed
"properties" :
{
"student" :
{
"provider_name" : "RandomModelReference", // data provider that returns a random model
"parameters" :
{
"model_name" : "Student" // configure the data provider to return a random student
}
},
"mentor" :
{
"provider_name" : "FixedModelReference", // data provider that returns a specific model
"parameters" :
{
"key" : "23502935" // the key of the model to reference; it must already exist in the datastore
}
}
}
}
}
{
"Student" :
{
"number": { ... }, // number of models to seed
"properties" :
{
"student_projects" : // this is the name of the back-reference property
{
"provider_name" : "RelatedModelSeeder", // data provider that seeds related models
"parameters" :
{
"model_name" : "StudentProject", // the name of the related model (this could be unnecessary as it might be auto-detected)
"configuration" : // the configuration for this data provider should follow the same rules as any model
{
"number" : { ... },
"properties" :
{
"title" : { ... },
"abstract" : { ... },
...
}
}
}
}
}
}
}
This might prove flexible enough to cover most data seeding needs, but it could be somewhat hard to configure. On the other hand, configuring is a one-time job, since configuration sheets are reusable. Autodetecting relations might also fail for some “tricky” relations, although it is not yet clear whether such relations exist. If autodetection works, changes in models will not require any changes in the data seeder at all.
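Both kinds of relation can be sketched with a small in-memory stand-in for the datastore. Everything here is an assumption made for illustration: the `DATASTORE` dict, the helper names and signatures, and the `student` reference property set on each seeded project. The real seeder would work with App Engine models and keys.

```python
import random

# In-memory stand-in for the datastore: model name -> list of entities.
DATASTORE = {"Student": [], "StudentProject": []}


def random_model_reference(model_name):
    """First approach: reference a random, already seeded entity."""
    return random.choice(DATASTORE[model_name])


def related_model_seeder(owner, model_name, reference_property, number,
                         properties):
    """Second approach: seed related entities that point back to owner."""
    related = []
    for index in range(number):
        entity = dict((name, make(index)) for name, make in properties.items())
        entity[reference_property] = owner
        DATASTORE[model_name].append(entity)
        related.append(entity)
    return related


# Seed two students, each with two projects created through the
# back-reference.
for given_name in ("Alice", "Bob"):
    student = {"given_name": given_name}
    DATASTORE["Student"].append(student)
    related_model_seeder(student, "StudentProject", "student", 2,
                         {"title": lambda i: "Project %d" % i})

# Seed one more project that references a random existing student.
DATASTORE["StudentProject"].append(
    {"title": "Extra project", "student": random_model_reference("Student")})
```

With the second approach every student is guaranteed to get projects, while the first approach leaves the distribution of projects over students to chance.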
Python fixtures are used to easily set up and tear down a database for test purposes. The fixture module supports easy addition of static predefined data to the datastore and it's very well integrated with unit tests. The data seeder's main purpose, however, is to load arbitrary random data.
A local script could be used to generate a fixture dataset file from a populated database, which will then be used for unit testing. This will make it easy to provide lots of generated data to tests, and at the same time ensure that the data used for tests does not change because of randomness.
Remote and local data seeding will be done in the same manner, using GAE's Task Queue API. However, a different solution must be implemented for generating Python fixtures for unit testing. Fixtures will be generated by making a dump of the datastore after it has been seeded using the web interface. This way, any configuration sheet can be turned into a fixture file by seeding and dumping, while the user can still tweak the data manually before dumping, if necessary.
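The seed-then-dump idea can be sketched like this. The dump format below (Python source with one nested class per entity) is only an assumption modelled loosely on dataset-style fixtures; the `render_fixture` helper and the sample entities are hypothetical, and the real fixture module may expect a different layout.

```python
def render_fixture(dataset_name, entities):
    """Renders seeded entities as Python source for a fixture-like dataset.

    entities maps a row name to a dict of property values. repr() is used
    so the generated file contains only literal, non-random values.
    """
    lines = ["class %s(object):" % dataset_name]
    if not entities:
        lines.append("    pass")
    for row_name in sorted(entities):
        lines.append("    class %s(object):" % row_name)
        if not entities[row_name]:
            lines.append("        pass")
        for prop in sorted(entities[row_name]):
            lines.append("        %s = %r" % (prop, entities[row_name][prop]))
    return "\n".join(lines) + "\n"


# Entities as they might look after a seeding run (made-up values).
seeded = {
    "student_0": {"given_name": "Alice", "school_type": "University"},
    "student_1": {"given_name": "Bob", "school_type": "High School"},
}

source = render_fixture("StudentData", seeded)
print(source)
```

Because the rendered file holds only literal values, tests that load it see the same data on every run, regardless of how randomly it was first generated.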
The web interface is the starting point for any seeding operation, either local or remote. It should give the user full control over all parameters of the seeding operation, while maintaining flexibility for future updates. The end result should be a JSON configuration file as defined above, which can be either uploaded directly to the server or downloaded to be used with the offline scripts.
TODO!
TODO!
{
"models" : [
{
"name": "Student",
"description": "Student details for a specific Program.",
"parent": "models.role.Role",
"properties": {
"school_name": {
"name": "School name",
"type": "StringProperty",
"group": "5. Education",
"required": true,
"description": "..."
},
"school_type": {
"name" : "School Type",
"type": "StringProperty",
"group": "5. Education",
"required": false,
"choices": ["University", "High School"],
"description": "..."
},
...
}
},
{
"name": "Linkable",
"description": "...",
"parent": "models.base.ModelWithFieldAttributes",
"properties" : {
...
}
},
{
"name": "ModelWithFieldAttributes",
"description": "...",
"parent": null,
"properties" : {
...
}
},
...
],
"providers" : {
"StringProperty": {
"FixedStringProvider": {
"name": "Fixed string provider",
"description": "Returns a fixed string.",
"parameters" : {
"value": {
"name": "Value",
"description": "The value of the string to return",
"required": true
}
}
},
"RandomWordProvider": {
"name": "Random word provider",
"description": "Returns a random word.",
"parameters" : {
"choices": {
"name": "Choices",
"description": "A list of words to choose from",
"required": false
}
}
},
...
},
"IntegerProperty" : {
...
},
...
}
}
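One use of such a description, beyond building the web form, would be to check a configuration sheet entry before seeding. The sketch below assumes the description has been parsed into a Python dict shaped like the "providers" section above; `check_provider` is a hypothetical helper, not an existing function.

```python
# A fragment of the "providers" section, parsed into a Python dict.
PROVIDERS = {
    "StringProperty": {
        "FixedStringProvider": {
            "parameters": {"value": {"name": "Value", "required": True}},
        },
        "RandomWordProvider": {
            "parameters": {"choices": {"name": "Choices", "required": False}},
        },
    },
}


def check_provider(property_type, spec):
    """Returns a list of problems found in one provider entry of a sheet."""
    available = PROVIDERS.get(property_type, {})
    provider = available.get(spec.get("provider_name"))
    if provider is None:
        return ["unknown provider for %s: %r"
                % (property_type, spec.get("provider_name"))]
    problems = []
    given = spec.get("parameters", {})
    # Every parameter marked as required must be present in the sheet.
    for name, info in sorted(provider["parameters"].items()):
        if info["required"] and name not in given:
            problems.append("missing required parameter: %s" % name)
    # Parameters the provider does not declare are reported as well.
    for name in sorted(given):
        if name not in provider["parameters"]:
            problems.append("unexpected parameter: %s" % name)
    return problems


print(check_provider("StringProperty",
                     {"provider_name": "FixedStringProvider",
                      "parameters": {}}))
```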
There are 2 JavaScript files containing the client-side logic:
Provide a function that takes care of seeding models of a given type and automatically handles their dependencies, so that it is easy to get some data to play with. This might be possible using a single function with the following prototype:
def seed(model_type, number=1, data=None, configuration_sheet=None)
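A rough sketch of how such a function could behave, using an in-memory stand-in for the datastore. The `DEPENDENCIES` table, the `DATASTORE` dict, the generated `id` values and the way `data` overrides generated values are all assumptions; `configuration_sheet` is accepted but ignored here because its handling is not yet defined.

```python
# Assumed: for each model type, the reference properties it needs and
# the model type each one points to.
DEPENDENCIES = {
    "Student": {},
    "Mentor": {},
    "StudentProject": {"student": "Student", "mentor": "Mentor"},
}

# In-memory stand-in for the datastore.
DATASTORE = dict((model_type, []) for model_type in DEPENDENCIES)


def seed(model_type, number=1, data=None, configuration_sheet=None):
    """Seeds `number` entities of model_type, creating dependencies first."""
    created = []
    for _ in range(number):
        entity = {"id": "%s-%d" % (model_type, len(DATASTORE[model_type]))}
        for prop, target in sorted(DEPENDENCIES[model_type].items()):
            # Reuse an existing entity if there is one, otherwise seed it.
            existing = DATASTORE[target]
            entity[prop] = existing[0] if existing else seed(target)[0]
        # Explicitly supplied values win over generated ones.
        entity.update(data or {})
        DATASTORE[model_type].append(entity)
        created.append(entity)
    return created


projects = seed("StudentProject", number=2, data={"title": "Demo"})
```

Here a single call creates the two projects and, as a side effect, the one student and one mentor they depend on.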
TODO:
Deliverables:
TODO:
Deliverables:
TODO:
Deliverables:
TODO:
Deliverables:
TODO:
TODO:
Deliverables:
TODO:
TODO:
Deliverables:
TODO:
TODO:
Deliverables:
TODO:
Deliverables:
TODO:
Not available
Notes taken by: Felix Kerekes
Not available
Notes taken by: Felix Kerekes
Not available
Notes taken by: Felix Kerekes
Not available
Notes taken by: Felix Kerekes
Not available
Notes taken by: Felix Kerekes
Meeting start: http://www.thousandparsec.net/~irc/logm/%23melange.2010-06-21.log.html#t2010-06-21T19:39:50
Meeting end: http://www.thousandparsec.net/~irc/logm/%23melange.2010-06-21.log.html#t2010-06-21T21:34:24
Location: #melange on freenode
Chaired by: Mario Ferraro
Notes taken by: Felix Kerekes
Meeting start: http://www.thousandparsec.net/~irc/logm/%23melange.2010-06-01.log.html#t2010-06-01T09:17:38
Meeting end: http://www.thousandparsec.net/~irc/logm/%23melange.2010-06-01.log.html#t2010-06-01T10:21:11
Location: #melange on freenode
Chaired by: Mario Ferraro
Notes taken by: Felix Kerekes
Meeting start: http://www.thousandparsec.net/~irc/logm/%23melange.2010-05-21.log.html#t2010-05-21T14:04:20
Meeting end: http://www.thousandparsec.net/~irc/logm/%23melange.2010-05-21.log.html#t2010-05-21T15:14:04
Location: #melange on freenode
Chaired by: Mario Ferraro
Notes taken by: Felix Kerekes
Meeting start: http://www.thousandparsec.net/~irc/logm/%23melange.2010-05-12.log.html#t2010-05-12T08:20:19
Meeting end: http://www.thousandparsec.net/~irc/logm/%23melange.2010-05-12.log.html#t2010-05-12T09:29:37
Location: #melange on freenode
Time: 8:00 UTC, Wednesday, 12th of May, 2010
Chaired by: Mario Ferraro
Notes taken by: Felix Kerekes
Meeting start: http://www.thousandparsec.net/~irc/logm/%23melange.2010-05-04.log.html#t2010-05-04T15:00:09
Meeting end: http://www.thousandparsec.net/~irc/logm/%23melange.2010-05-04.log.html#t2010-05-04T16:18:24
Location: #melange on freenode
Time: 15:00 UTC, Tuesday, 4th of May, 2010
Chaired by: Mario Ferraro
Notes taken by: Felix Kerekes