#summary Planning the work on statistics in Melange
#labels Importance-Details,Phase-Design,Contents-Draft
_This is a wiki page. It is intended that people edit and improve it!_
<wiki:toc max_depth="3" />
= GSoC Project Info =
|| Mario Ferraro || [http://socghop.appspot.com/student_project/show/google/gsoc2009/melange/t124022698336 Ajax Statistics module] ||
|| Daniel Hans || [http://socghop.appspot.com/student_project/show/google/gsoc2009/melange/t124022698771 Statistic Module] ||
= Rough architecture overview =
We also decided to support more flexible statistics, using [http://www.hugoware.net/ jLinq]. This will also affect the way the backend works, so we will probably need the following structure:
* *Backend*:
* _statistic.py_ will also handle JSONs which are not produced by [http://code.google.com/apis/visualization/documentation/dev/gviz_api_lib.html gviz] for immediate consumption by a Google Visualization DataTable; there will also be "raw" JSONs to tweak client side with jLinq (say, not students per age with fixed ranges, but the number of students per single age, with the range selected client side). Furthermore, it will handle JSONs tweaked by hosts. For example, if a host wants to publish some data derived from data which contains private fields, that JSON must never be sent to public clients. Hence the reason to store such JSONs on their own, with the private fields cut.
* So, _statistic.py_ will handle:
# JSONs collected by jobs and "ready-to-serve" to visualizations with gviz
# JSONs collected by jobs which are to be tweaked with jLinq client side
# JSONs saved by hosts. For this to happen, _statistic.py_ needs to be extended with a boolean that shows whether the data has been collected automatically or saved by a user. This will also act as a sort of "snapshot".
* _chart.py_ should inherit from _document.py_ (to inherit read/write access control and also be a Linkable), and will basically need the following fields:
* original_data (reference to _statistic.py_)
* user (reference to the user that chart belongs to)
* name_for_user (name of the saved statistic)
* jlinq_steps (something that handles the steps to reproduce filters on the data)
* chart_options (a JSON with a fixed field containing the type of chart, plus other fields holding the options selected for that chart: colors, etc.)
In some way all the charts should be called and created through a link_id, so they can be embedded in the dashboard as well as in documents (probably using an iframe).
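To make the "raw JSON" idea above concrete, here is a minimal sketch of the re-binning it enables. In Melange the re-binning would happen client side with jLinq; the transformation itself, shown here in Python with made-up data and function names, is the same: the backend serves counts per single age, and the range size is chosen afterwards.

```python
# Hypothetical sketch: re-bin a "students per single age" raw statistic
# into ranges chosen after the data was gathered.

def rebin_by_age(raw_rows, range_size):
    """raw_rows: list of (age, count) pairs; returns {"18-20": n, ...}."""
    binned = {}
    for age, count in raw_rows:
        low = (age // range_size) * range_size
        label = "%d-%d" % (low, low + range_size - 1)
        binned[label] = binned.get(label, 0) + count
    return binned

raw = [(18, 5), (19, 7), (20, 3), (23, 4)]
print(rebin_by_age(raw, 3))  # ages grouped into 18-20 and 21-23
```

The point is that the same raw JSON can serve any range size, so the host never needs to re-run a job just to change the bucketing.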
* *Frontend*: There will be at least six main tasks JS should handle:
# UI handling
# Transformation of raw JSON to Google Visualization Datatable JSON
# Data tweaking using jLinq
# Chart options editing based on the type of chart requested
# Backend querying for the correct data
# A sort of "stateful" object coupled with each chart that keeps track of the current state, to be saved to the backend when needed.
= Gathering Data =
It would be very inefficient (and probably impossible because of GAE restrictions) to compute all statistics on every request. Instead, all the necessary data will be gathered periodically from the Melange data model, stats will be computed from this data, and they will be cached as [http://www.json.org JSON] objects in a new stats model. When a frontend component needs to display some statistics, it will be served the corresponding JSON object from the stats model.
Additionally, JSON objects in the stats model will also be cached in Melange's memcache to improve performance.
In order to gather data from the model, Melange will use [http://code.google.com/appengine/docs/python/config/cron.html Cron Jobs], Google App Engine's support for scheduled tasks. There would ideally be one job per statistic, but an application is limited to 20 scheduled tasks. We will avoid that limit (as Lennie suggested) by using a single job which fetches another URL containing the code for a particular statistic.
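The single-job approach described above might look roughly like this in cron.yaml. This is only a hedged sketch: the URL and the schedule are assumptions for illustration, not the actual configuration.

```yaml
# Sketch only - URL and interval are placeholders, not Melange's real config.
cron:
- description: dispatch statistic-gathering jobs
  url: /tasks/statistics/dispatch
  schedule: every 6 hours
```

The handler behind the dispatch URL would then decide which per-statistic URLs to fetch, keeping us within the 20-task limit however many statistics we add.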
== Job Control Panel ==
* *Job Control Panel*: where you can define which stats you want to collect and at what interval.
There are two possible ways to execute the data-gathering jobs: they may be triggered by App Engine at certain time intervals (configured in cron.yaml) or they may be executed by an administrator. The second option seems better, as it allows the admin to start gathering data and making computations only for selected statistics, which matters because many of them do not change most of the time.
Job Control Panel will be a new view that allows an admin to pick the stats that need to be recalculated. There will be a table of all stats, with a checkbox next to each one indicating whether that particular stat should be computed. After selecting the stat kinds, the admin will start the job by pressing a button.
<Pawel>OK, but there should also be a checkbox next to each stat to set whether the stat should be automatically recalculated by jobs at the given time interval. So basically this view will provide the admin with a way of setting time intervals for stats that need them, or of recalculating manually selected stats.</Pawel>
=== Initial Goals ===
* Set a job running to gather *location stats for accepted proposals*.
* Set a job to gather *location stats for submitted proposals*.
* Results will be *raw JSON* (at this stage).
_Need to decide precise format of this JSON data, (for clarity). An example here would help enormously._
According to [http://code.google.com/apis/visualization/documentation/using_overview.html#embeddingavis what Google Visualization asks for], something roughly like:
{{{
{
"title": "Title of graph",
"columns": [
{
"type": "string",
"name": "Country"
},
{
"type": "number",
"name": "N. Students"
}
],
"rows": [
{
"rowdata": [
"Country 1",
30
]
},
{
"rowdata": [
"Country 2",
40
]
}
]
}
}}}
In this way we can iterate through it as:
{{{
var data = new google.visualization.DataTable();
for (var index in json.columns) {
data.addColumn(json.columns[index].type, json.columns[index].name);
}
data.addRows(json.rows.length);
for (var row_index in json.rows) {
for (var column_index in json.rows[row_index].rowdata) {
data.setValue(parseInt(row_index), parseInt(column_index), json.rows[row_index].rowdata[column_index]);
}
}
}}}
* Working example: cut and paste the following code into the [http://code.google.com/apis/ajax/playground/?type=visualization#image_pie_chart AJAX APIs Playground]
{{{
function drawVisualization() {
var json = {
"title": "Title of graph",
"columns": [
{
"type": "string",
"name": "Degree"
},
{
"type": "number",
"name": "N. Students"
}
],
"rows": [
{
"rowdata": [
"Undergraduate",
70
]
},
{
"rowdata": [
"Master",
60
]
},
{
"rowdata": [
"PhD",
30
]
}
]
};
var data = new google.visualization.DataTable();
for (var index in json.columns) {
data.addColumn(json.columns[index].type, json.columns[index].name);
}
data.addRows(json.rows.length);
for (var row_index in json.rows) {
for (var column_index in json.rows[row_index].rowdata) {
data.setValue(parseInt(row_index), parseInt(column_index), json.rows[row_index].rowdata[column_index]);
}
}
new google.visualization.ImagePieChart(document.getElementById('visualization')).draw(data, null);
}
}}}
==== Statistics provisional data model ====
{{{
from google.appengine.ext import db
import soc.models.linkable
class StatisticJSON(soc.models.linkable.Linkable):
"""Model used to store a JSON representation of statistics
"""
#: internal String ID to get the correct statistics
stat_id = db.StringProperty(required=True)
#: name of the Statistic
stat_name = db.StringProperty()
#: long description of the Statistics
stat_description = db.TextProperty()
#: partial JSON representation, to be filled during the gathering of data
working_json_string = db.TextProperty()
#: percentage of completion during data gathering
percentual_complete = db.IntegerProperty()
#: final JSON representation, to be filled when data gathering is completed
final_json_string = db.TextProperty()
#: date when this calculation was last updated
calculated_on = db.DateTimeProperty(auto_now=True)
}}}
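The working/final split in the model above is meant to support chunked gathering: each batch merges into _working_json_string_, and only when the percentage reaches 100 does the result become _final_json_string_. Here is a hedged sketch of that update cycle; a plain dict stands in for the StatisticJSON entity (so no datastore calls), the field names follow the model, and the batch contents and function name are made up.

```python
import json

# Sketch of one gathering step; a dict plays the StatisticJSON entity.
def process_batch(stat, batch, total_records, records_done):
    """Merge one batch of (choice, count) pairs into working_json_string."""
    partial = json.loads(stat['working_json_string'] or '{}')
    for choice, count in batch:
        partial[choice] = partial.get(choice, 0) + count
    stat['working_json_string'] = json.dumps(partial)
    stat['percentual_complete'] = 100 * records_done // total_records
    if stat['percentual_complete'] == 100:
        # gathering finished: publish the completed JSON
        stat['final_json_string'] = stat['working_json_string']
    return stat

stat = {'working_json_string': '', 'percentual_complete': 0,
        'final_json_string': ''}
process_batch(stat, [('Poland', 2), ('Italy', 1)], 4, 2)
process_batch(stat, [('Poland', 1), ('Spain', 1)], 4, 4)
```

Keeping the partial result in the entity means a job interrupted by GAE request limits can resume from where it stopped instead of starting over.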
==== Adding new stats ====
If we want to add a new statistic that takes data only from one model, the first thing we need to do is to add it to the data model.
Of course we have to create a new Statistic entity. Say we want to add a Mentors Per Country stat; the new entity has properties like:
{{{
properties = {
'id' : 'MentorsPerCountry',
'name' : 'Mentors Per Country',
'link_id' : 'mentors_per_country',
    'scope' : program_entity, # each statistic is bound to a program
'records_completed' : 0,
'scope_path' : 'google/gsoc'
}
}}}
The important field for later is id. Now, when we refresh the statistics list page, we see the new one, but clicking 'Collect' causes an exception. The module responsible for updating statistics is {{{soc.logic.models.statistic}}}. The first function invoked is {{{updateStat}}}, but it is just a dispatcher which calls the right function depending on id. Therefore, if id is {{{MentorsPerCountry}}}, we have to provide a function {{{updateMentorsPerCountry(self, scope)}}} (or update + id in general).
We may want to write a custom updating function, but if we only need data from one model it may be faster to use the {{{updateStatEntity}}} function (the name will probably change), passing it some arguments. It is easier to understand by example, so let's look again at MentorsPerCountry.
To get the {{{statistic}}} entity, we have to know two fields that determine it unambiguously, for example id and scope.
We deal with mentors, so all data is collected from the mentor model. In order to get entities from that model, we have to provide its logic, so {{{logic=soc.logic.models.mentor}}}.
The next thing is a list of {{{choices}}} for our stat. In this example it's a list of all countries.
The last but certainly not least param is {{{choice_selector}}}. It is a function that takes params as its argument; it should simply take an entity (passed via {{{params['entity']}}}) and return its choice. For example, for a mentor whose {{{res_country}}} is Poland, the choice_selector should return 'Poland'. In our example (where the choice is stored explicitly in the model) we may use {{{defaultChoiceSelector}}}, but sometimes the right choice has to be calculated (see {{{studentsPerAgeChoiceSelector}}}).
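The choice_selector contract above can be sketched as follows. This is illustrative only: the function names mirror the text ({{{defaultChoiceSelector}}}, {{{studentsPerAgeChoiceSelector}}}), but the entities here are plain dicts rather than real Melange models, and the {{{field}}}/{{{this_year}}} params are assumptions.

```python
# Hedged sketch of choice selectors; entities are plain dicts here.

def default_choice_selector(params):
    """Return the choice stored directly on the entity, e.g. res_country."""
    entity = params['entity']
    return entity[params['field']]

def students_per_age_choice_selector(params):
    """A calculated choice: derive an age from a birth year (assumed field)."""
    entity = params['entity']
    return params['this_year'] - entity['birth_year']

mentor = {'res_country': 'Poland'}
print(default_choice_selector({'entity': mentor, 'field': 'res_country'}))
# -> Poland
```

Because both selectors share the one-entity-in, one-choice-out shape, {{{updateStatEntity}}} can stay generic and just count how many entities fall under each returned choice.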
=== Later Goals for Job Control ===
* _...insert text here_
=== Stats ===
* Students per country/continent
* Students per school/continent
* Mentors per country
* Amount of mentors/students/org admins
* Server activity (e.g. big spike near deadlines).
* Organizations per program
* Application per organization / program
* Students, mentors and organization admins ages
* Connections between students and their mentors
* Dependencies between the fact that a student fails to complete his project and the distance between the student and his mentor
* Dependencies between a number of slots per organization and completion rate for this organization
* Dependencies between projects completion rate and student ages (or academic years)
* Standard statistic values (arithmetic mean, median, quartiles, etc.)
=== Final stats ===
* Students (with/without project/all) per country/continent
* Students (with/without project/all) per school/continent
* Mentors (with/without project/all) per country/continent
* Students (with/without project/all) by degree
* Students (with/without project/all) by major
* Students (with/without project/all) by expected graduation year
* Number of Org Admins that are also Mentors
* Average number of Students per Mentor
* Average number of Student Proposals per Student
* Student Proposals number per country
* Student Projects per country
* Amount of mentors/students/org admins/mentors with projects/students with projects/students with proposals
* Organizations per program
* Student Projects/Student Proposals per organization / program
* Students, mentors and organization admins ages
= Displaying Stats =
== Stats Display Pages ==
=== Initial Goals ===
* Just raw JSON
=== Later Goals for Displaying Stats ===
* _...insert text here_
== Services/Libraries Used ==
* *[http://code.google.com/apis/chart/ Google Chart API]*
* *[http://code.google.com/apis/visualization/documentation/gallery.html Google Visualization API]*
* *[http://www.clabberhead.com/googlechartgenerator.html Google Charts Generator]* (might be helpful for quickly testing Chart API output)
== Animated Displays ==
_Mario: What's the simplest change to the JSON data collected that would allow us to show one of these?_
Something like
{{{
[
  {
    "timestamp": timestamp1,
    "caption": "If we want an additional description",
    "data": JSONObject
  },
  {
    "timestamp": timestamp2,
    "caption": "Additional description 2",
    "data": JSONObject
  }
]
}}}
should suit our needs at first glance. If _timestamp_ is not present, we can treat the array of JSONObjects (where _JSONObject_ is a JSON object as defined above) as simply ordered.
* *Drag slider* to see how profile changes over time, e.g. on a map.
* *Timestamped data* may be all that we need.
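Putting the two points above together, the display code only needs one small decision: sort by timestamp when timestamps exist, otherwise keep the array order. A hedged Python sketch (the function name and data are made up; in Melange this would be JS driving the slider):

```python
# Sketch: order snapshots for an animated display / drag slider.
def ordered_snapshots(snapshots):
    if all('timestamp' in s for s in snapshots):
        return sorted(snapshots, key=lambda s: s['timestamp'])
    return snapshots  # no timestamps: the array order is the order

frames = [{'timestamp': 2, 'data': {'Poland': 4}},
          {'timestamp': 1, 'data': {'Poland': 2}}]
print([f['timestamp'] for f in ordered_snapshots(frames)])  # [1, 2]
```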
= IRC Meetings =
== [http://www.thousandparsec.net/~irc/logm/%23melange.2009-07-09.log.html#t2009-07-09T06:17:26 Setting up priorities with Pawel] 9 July 2009 ==
* *Agenda*
# Status update
# Asynchronous requests problem
# Task Queue API wiki update
# Priorities for the statistic module
* *Notes*
# Daniel worked on the dashboard page to make the widgets load in the positions saved on the server. However, he encountered some problems: widgets were loaded by first sending a request for a JSON list of all charts, then sending a request per chart for its actual data. The problem was that the requests were asynchronous, so responses came back in random order. A few solutions came up during the conversation, but we decided to go for changing the async requests to sync ones.
# Pawel suggested making some changes in [http://code.google.com/p/soc/wiki/TaskQueueAPI wiki page for Task Queue API]
# Daniel and Mario talked with Pawel about priorities for the statistic module in the second part of Google Summer of Code. Pawel wanted the developers to send emails with their priorities and plans for the upcoming week. Daniel was also asked to update the blog.
# In the upcoming days Mario is going to have less time than he expected due to his exams. He was also requested to send his time availability in the next days.
# Mario talked with Pawel about how much we should focus on creating the dashboard page. Pawel stated that it was very important, but that it is also important to have everything working stably before introducing, for example, new JavaScript plugins.
# Mario and Daniel should start sending their first patches to the main repository (for reviews and later commits) as soon as possible.
== [http://www.thousandparsec.net/~irc/logm/%23melange.2009-06-29.log.html#t2009-06-29T11:00:02 Getting Daniel into JavaScript code] 29 June 2009 ==
* *Agenda*
# Status update
# Task assignment and priorities for this week
# Goals for the midterm evaluation
# JavaScript tutorial by Mario for Daniel
* *Notes*
# Mario integrated the thickbox with the menu, and the list of statistics is now taken from a JSON response from the server - he encountered some problems at the beginning, but fortunately it works now.
# Daniel finally added the functionality that statistics can be collected using Task Queue API, so now a user does not have to collect them batch by batch. It hasn't been completely tested yet, but it seemed to work on his development instance.
# Daniel also started writing a script similar to seed_db, but intended to put large batches of data onto real instances.
# Daniel and Mario scheduled the most important tasks for the upcoming week. They also prioritized their goals for the midterm evaluation. This week they are mainly going to work on the dashboard page and integrate the Python code with the JavaScript code.
# Mario explained to Daniel how the JavaScript skeleton is designed and how to actually use it when writing new parts of the code.
== [http://www.thousandparsec.net/~irc/logm/%23melange.2009-06-28.log.html#t2009-06-28T10:01:03 Sunday's meeting] 28 June 2009 ==
* *Agenda*
# Status update
* *Notes*
# Mario is still working on his previous tasks, because on Saturday he worked on [http://code.google.com/p/soc/issues/detail?id=645&sort=-id&colspec=ID%20Type%20Priority%20DueBy%20Status%20Owner%20Milestone%20Effort%20Skills%20Summary Issue 645]. Daniel is finishing working on adding Task Queue API for statistics collection so that they will not have to be collected in small batches.
# Daniel is wondering whether seed_db or something similar can be used to put large batches of data onto a Melange instance on an actual Google App Engine server. It would be useful because the Task Queue API can be better tested on 'real' instances. Since we do not know of anything suitable, Daniel will try a script which creates a lot of data using remote_api.
== [http://www.thousandparsec.net/~irc/logm/%23melange.2009-06-27.log.html#t2009-06-27T05:45:56 Working on dashboard] 27 June 2009 ==
* *Agenda*
# Status update
# Task assignment
* *Notes*
# Mario pushed an initial version of the statistic dashboard page. Daniel got to know the new parts of the code on both the JavaScript and Python sides of the statistic module.
# Mario is going to work on connecting the jQuery side of the module with the Python side (jQuery has to retrieve data from JSON responses), and after that he is going to work on saving the dashboard page status for each user.
# Daniel is going to work on introducing the [http://code.google.com/appengine/docs/python/taskqueue/ Task Queue API] so as to automate the stat collecting process, and is going to write his first JavaScript patch, which will fix a bug on the dashboard page.
== [http://www.thousandparsec.net/~irc/logm/%23melange.2009-06-23.log.html#t2009-06-23T10:19:30 Short meeting] 23 June 2009 ==
* *Agenda*
# Status update
# Wiki update
# Plans for tomorrow
* *Notes*
# Mario updated the [http://melange-dev.blogspot.com/2009/06/gsoc09-statistics-module-3rd-weekly.html Melange blog]. He also set up his Win box in order to debug the organization home map. Daniel finished working on sending JSON responses which will be requested by the JS side and integrated by Mario.
# Daniel is updating the wiki with some new architecture descriptions by tomorrow.
# Tomorrow we are going to work on creating the dashboard page and integrating our code (for example, JSON responses). Mario is also going to introduce Daniel to the JavaScript/Ajax side of our project so that Daniel will be able to work on that side too.
== [http://www.thousandparsec.net/~irc/logm/%23melange.2009-06-22.log.html#t2009-06-22T09:51:23 Dashboard] 22 June 2009 ==
* *Agenda*
# Status update
# Dashboard
# JSON Response
# Next steps
* *Notes*
# Mario has worked on integrating YUITest and porting the former JSUnit tests to YUITest. He also worked on integrating Selenium RC and committed an example test. Daniel finished dividing stats collection into two parts (choices and actual gathering) and added support for default views of the statistic model (create, list, delete, edit). Now he is starting to learn the Task Queue API.
# A dashboard.py model is needed to save the current status of the dashboard (like which charts/statistics are shown and where the widgets are located). Mario explained to Daniel the overall vision of the architecture and how the statistic, chart and dashboard models will work together.
# Mario showed Daniel where something similar (getting parameters from JS and responding with a JSONResponse) is already done in the existing code.
# We will work on creating the dashboard (model, logic, view and JavaScript) and on getting a JSONResponse out of the statistic model.
== [http://www.thousandparsec.net/~irc/logm/%23melange.2009-06-19.log.html#t2009-06-19T09:59:34 Issue tracker] 19 June 2009 ==
* *Agenda*
# Status update
# Issue tracker and planning
# Next steps
* *Notes*
# Mario explored the jQuery inheritance plugin and found it not worthwhile for the moment. He then pushed an example of transclusion to the demo instance, and worked on integrating YUITest (to be pushed today), because jsUnit seems incapable of testing asynchronous events. This morning he set up the bitbucket issue tracker. Daniel worked on gathering data from more than one model (but hasn't put any params in the model, as we're waiting for a reply from our mentors), and is currently working on separating "choice selectors" from the actual gathering of data for that task (for example, for number of applications per student we first need to collect all students' link_ids and then gather application data, otherwise we would not include students with 0 applications).
# Now that we have an issue tracker in place, Mario has already put some issues there (and version goals), and asked Daniel to do the same ASAP so we can ask our mentors for the OK about next steps.
# Daniel will be finishing his current task over the weekend; Mario is going to work today on pushing YUITest and porting the JSUnit tests to YUITest. As soon as all tasks are in the issue tracker we will ask our mentors for their OK on next week's work. We'll discuss it this evening or on Sunday. Saturday's meeting is not going to happen as Mario won't be at home for most of the day. So, next meeting on Sunday :)
== [http://www.thousandparsec.net/~irc/logm/%23melange.2009-06-18.log.html#t2009-06-18T10:05:31 Generalising functions] 18 June 2009 ==
* *Agenda*
# Status update
# Entity Per Field generalization
# Planning meeting for next week plan
# Next steps
* *Notes*
# Mario worked on externalising JavaScript code from Django templates but found issues with the json2 library; he ignored them and worked around them with eval() for the moment, with example code pushed to the repository. Daniel worked on the skeleton to gather data from multiple models, but found some issues. We decided to release early and just create an "applications per student" statistic, then work on something similar and find common patterns afterwards.
# About that topic, we're waiting for Sverre or someone else to join the discussion in the dev list
# We're going to decide tomorrow when to discuss next tasks (no later than Monday morning), priorities and task assignments, so we can begin working on each other's code (Mario doing some work on Python, Daniel doing some work on JavaScript). The meeting will produce, as a "deliverable", a list of tasks/priorities/assignments to be corrected by our mentors if something is wrong with it.
# Daniel is going to implement the applications per student statistic; Mario is going to work on integrating the jQuery inheritance plugin (if it is worth it) and on fixing (if possible) the asynchronous testing issues with jsUnit.
== [http://www.thousandparsec.net/~irc/logm/%23melange.2009-06-17.log.html#t2009-06-17T09:21:12 Getting Mario into Python code] 17 June 2009 ==
* *Agenda*
# Mario couldn't get his statistics local instance to work
# Abstracting updateEntityPerField functions: now there is a function for every type of gathering data.
# Next steps
* *Notes*
# Daniel explained to Mario how to seed the db with a script Daniel had put into the scripts/ directory. Mario thought that the script should be put inside seed_db. Mario worked on it (his first Python code :)) and got it done successfully.
# Mario thinks they could be abstracted into a single function, putting some options inside the data model to make it parametric. Daniel thought to ask a committer first; Lennie joined the discussion, was doubtful, and requested that we ask Sverre about it. Mario needs to send an email to discuss it.
# Mario will work on integration between his JavaScript and the Python code (done and pushed to the demo instance); Daniel is working on stats that gather data from more than one model. Mario is also going to work on abstracting his JavaScript code to change the visualization dynamically, and on letting jsUnit work with asynchronous tests.
== [http://www.thousandparsec.net/~irc/logm/%23melange.2009-06-16.log.html#t2009-06-16T05:07:45 Sync meeting] 16 June 2009 ==
* *Agenda*
# New [StatisticsModule#Rough_architecture_overview python architecture] as discussed between Mario and James Crook.
# How to keep in sync from now on
# Setting up the demo instance
# Next steps
* *Notes*
# Mario explained the architecture to Daniel (why and how)
# Planned daily IRC meeting: every day at 14 UTC (15 in Mario's timezone, 16 in Daniel's timezone)
# Daniel set up the demo instance so it can also be used by normal users, then explained to Mario how to deploy code to the demo instance
# Mario will be working on abstracting the current "getDataTable + statistic.id" functions
== [http://www.thousandparsec.net/~irc/logm/%23melange.2009-05-28.log.html#t2009-05-28T12:15:06 Spin off meeting part 2] 28 May 2009 ==
* *Agenda*
# Memcache issue
# JS layer architecture
* *Notes*
# Daniel already fixed it
# Mario proposed to use [http://code.google.com/apis/visualization/documentation/dev/gviz_api_lib.html gviz library]
== [http://www.thousandparsec.net/~irc/logm/%23melange.2009-05-27.log.html#t2009-05-27T10:03:52 Spin off meeting part 1] 27 May 2009 ==
* *Agenda*
# What Mario and Daniel are doing and what we're going to do
# What can be done to join code, without sending and applying patches by mail
# Getting Mario start to join Daniel's code
* *Notes*
# Mario: studying JS architecture. Daniel: going to fix memcache issue
# Mail sent to Pawel and Sverre to ask what to do (we will work on Daniel's repository)
# Daniel helped Mario to understand his code
== [http://www.thousandparsec.com/~irc/logm/%23melange.2009-05-07.log.html Stats Meeting on IRC] 7 May 2009 ==
* *Key points:* Data will be collected by jobs. This is the first step. Accordingly, the [http://code.google.com/p/soc/issues/list?can=1&q=Issue+357&sort=id&colspec=ID+Summary 1000 record issue] is not a problem. Data will be presented as raw JSON (at this stage). It will be stored in the GAE model as a JSON string AND again as a memcache object, so when you request the stats view you should get it right away. Loading from jobs is very light. Although not ideal, we could afford to collect stats all the time, even when there are no changes, though we intend to do better than that right from the start. Animated graphs may need strings collected at different times and/or timestamps in the data. Daniel and Mario will use this wiki to progress the stats projects.
= Background Links =
* *[http://spreadsheets.google.com/ccc?key=p6DuoA2lJToKmUzoSq6raZQ Spreadsheet]* - With several pages of reports/figures showing changes from year to year.
* *[http://google-opensource.blogspot.com/2008/04/two-top-10s-for-google-summer-of-code.html GSoC by location]* - A classic use of GSoC statistics. Shows that mixing stats and text on same page will be important.
* *[http://google-opensource.blogspot.com/2008/05/this-weeks-top-10s-universities-for.html GSoC by University]* - A Google Open Source blog post of a similar nature.
* *[http://opentouch.info/tmp/ghop/ghop-stats.html GHOP Stats]* - Google Open Source page with GHOP stats.
= Who Does What? =
* LH has warned about student *projects that depend on work by other students*. It's important that neither student is ever held up 'waiting' for work by the other student. In the extreme case, if one student drops out, we don't want to end up with neither project succeeding.
* *End-to-end knowledge:* We're starting with the backend as being essential to getting anything. For project success, both students will need to know both front and backend. Both will be extending the range of tools and language features that they know.
* A conflicting goal: We also want to *avoid duplication of work*, with both students doing the same things at the same time - a particular risk near the start.
* The solution is *communication*. It's for Mario and Daniel to work out exactly how this will work, exactly how the work will be divided. To avoid duplications, they need to be teaching each other what they learn in developing their areas of Melange as they go along. Probably three way code reviews (one mentor involved), probably sharing of status reports, probably some other things too... This all needs to be worked out between Mario and Daniel with mentors looking on. Wiki should help as key information can be put here.
== Spin Offs ==
* With two strong students on Stats we have more manpower than we strictly need for stats at this stage of Melange's development. Therefore, once the projects are started and on-track, we should be looking at spin offs that are motivated by the stats needs, that are for stats, but that happen to progress other important aspects of Melange too. Examples are in the *job scheduling*, we will be using the jobs for other tasks too, and refinements in jobs will benefit these. In presentation the *javascript* for configuring graphs will likely do double duty and be used in other places too. Lots of opportunities for creative flexible design. Details will need discussing with mentor(s) as projects progress.
== Generic Features ==
* This is a table of new generic features that have wider use, each from the point of view of stats.
|| *Feature* || *Description* || *Uses* || *Python-Model* || *Python-Logic* || *HTML Interface* || *Javascript* ||
|| *Access control* || For stats we're interested in per entity access control similar to what happens in documents. Both chart and collected statistics have per entity rights. || Code from document.py || Mixin. Stores entity rights. || Yes || Choice || No ||
|| *Taggability* || Ability to tag data in other tables. For stats we will need to query the tags relating to a table. || Taggable.py || Mixin. Stores the tags. || Yes || Delicio.us like || Yes ||
|| *jLinq Filtering* || AJAXY flexible filtering and reduction of tabular data. Done client side for responsiveness. || jLinq || Yes. Stores filter || No || Complex || Integration of jLinq ||
|| *Jobs* || General facility to do some background activity at some time. Jobs are broken up into smaller chunks. Has a control panel to control what is done and when and show current status or progress of jobs. || Existing foundations || Yes || Yes || Tabular || No ||
|| *Expando* || Ability to gather stats on an 'expando' table. We need to do this to get stats from a survey || Work done for Surveys || Yes || Yes || As for normal stats || As for normal stats ||
|| *Transclusion* || Ability to include another displayable object within a document. We'll use this to embed a chart in a document just like including an image. || A regexp || No || regexp to add iframe || {{{<iframe>}}} || Minimal (delayed load frame) ||
* As Melange progresses we are likely to identify more features which are generic. The idea of a dashboard may be one such. For now we are coding dashboards on an as-needed basis.
= Other Details =
== Transclusion And Graphing Syntax ==
* We'd like to be able to embed charts and graphs in a page of explanation and commentary. We'd have a URL for the raw 'graph', and we'd provide parameters to that on that page. This will have permissions just like documents do. Then in a document we transclude the named chart using something like *$Pie(!ChartName)*. We can have more than one chart on a page.
_...More details_
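The *$Pie(!ChartName)* transclusion described above could be implemented with a single regexp pass over the document, as the Generic Features table suggests ("regexp to add iframe"). A hedged sketch follows; the marker syntax comes from this page, but the iframe URL scheme and function name are assumptions, not implemented Melange code.

```python
import re

# Rewrite $Pie(!ChartName) markers into <iframe> embeds.
CHART_RE = re.compile(r'\$Pie\(!(\w+)\)')

def transclude_charts(document_html):
    """Replace each chart marker with an iframe pointing at the chart view."""
    return CHART_RE.sub(
        r'<iframe src="/chart/show/\1" frameborder="0"></iframe>',
        document_html)

print(transclude_charts('Stats: $Pie(!StudentsPerCountry)'))
```

Because the substitution is stateless, the same pass can handle several charts on one page, and access control stays with the chart view the iframe loads rather than with the document.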
== Test ==
* To test stats requires upgrades to seed_db.py. We need random variation in the values. We'll need bias - some categories more likely than others - so that not all sectors in our pie charts will be the same size. We'll need correlations - so that when we look for differences in success rates based on how many years the mentoring organisation has participated we'll find them.
* For creating correlated stats the suggested method is to generate far fewer random numbers than the number of fields to fill, and to then combine these numbers to generate the random field values. The correlations don't have to be particularly real-world realistic. It's enough that some correlations exist so that we can test the stats methods.
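The suggested method above (few random numbers, many derived fields) can be sketched as follows. Everything here is hypothetical seed data: the field names and the particular weights are made up, and the only claim is the one in the text, that the derived fields end up correlated.

```python
import random

# One latent "quality" draw drives several fields, so the fields correlate.
def seeded_org_record(rng):
    quality = rng.random()
    years_participated = 1 + int(quality * 5)
    success_rate = min(1.0, max(0.0, 0.3 + 0.6 * quality + rng.gauss(0, 0.05)))
    return {'years_participated': years_participated,
            'success_rate': success_rate}

rng = random.Random(42)  # seeded so test data is reproducible
orgs = [seeded_org_record(rng) for _ in range(100)]
```

As the text says, the correlation need not be realistic; it just has to exist, so that a "success rate by years participated" stat computed over this seed data shows a visible trend.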
== Tagging ==
_Discussion of *[TagsInMelange tagging]* and how it fits into statistics. E.g will want pie charts based on tags._
<Pawel>yes this is something we want for GHOP for sure.</Pawel>
= Questions =
*Q:* This year many orgs are by choice mentoring fewer students and adding tougher qualification requirements. We're interested in whether mentor:student ratios have increased and what other changes may go with this. Are there questions we should ask *LH to ask for mid term survey* that would help here?
*A:*
*Q:* Many stats are possible. What are the *'big win' stats*, the ones that are most likely to tell us how to improve GSoC?
*A:*
*Q:* *GHOP stats:* What stats are specific to GHOP and not part of GSoC? _Answers will end up on *[GHOP]* page (to coordinate work with that project)._
*A:* Just have a look at my last year *[http://opentouch.info/tmp/ghop/ghop-stats.html GHOP Statistics page]* and you should get an idea what kind of stats are GHOP specific.