Learning Generalized Linear Models Over Normalized Data
Citations
Heterogeneity-aware Distributed Parameter Servers
Learning Linear Regression Models over Factorized Joins
Model Selection Management Systems: The Next Frontier of Advanced Analytics
Materialization optimizations for feature selection workloads
References
M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
J. Nocedal and S. J. Wright. Numerical Optimization.
T. M. Mitchell. Machine Learning.
Frequently Asked Questions (12)
Q2. What future works have the authors mentioned in the paper "Learning generalized linear models over normalized data" ?
As future work, the authors are working on extending factorized learning to other popular algorithms for solving GLMs, such as stochastic gradient descent and coordinate descent methods. Since the data access behavior of those techniques can differ from that of GLMs with BGD, it is not clear whether their ideas extend to them in a straightforward way.
Q3. Why is FL faster than M at higher values of both ratios?
Since the CPU cost of BGD increases with the dimension ratios, FL, which reduces the computations for BGD, is faster than M at higher values of both ratios.
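As a rough illustration (hypothetical numbers and simplified operation counts, not the paper's calibrated cost model), the sketch below counts only the multiply-adds needed for the inner products w·x in one BGD pass: Materialize (M) touches every feature of the denormalized table, while FL computes the partial inner products over R once and reuses them across the S tuples that reference the same R tuple.

```python
# Rough per-iteration counts of the multiply-adds for the inner products w.x.
# Hypothetical illustration only; not the paper's calibrated cost model.

def ops_materialize(nS, dS, nR, dR):
    # M scans the denormalized table: nS tuples, each with dS + dR features.
    return nS * (dS + dR)

def ops_factorized(nS, dS, nR, dR):
    # FL computes partial inner products over R once (nR * dR), then scans S,
    # computing each tuple's own partial inner product (nS * dS) plus one
    # lookup-and-add per S tuple (nS) via the foreign key.
    return nR * dR + nS * dS + nS

# Higher tuple ratio (nS/nR) and feature ratio (dR/dS) favor FL.
for nS, dS, nR, dR in [(10**6, 10, 10**5, 10), (10**6, 10, 10**3, 200)]:
    m, fl = ops_materialize(nS, dS, nR, dR), ops_factorized(nS, dS, nR, dR)
    print(f"nS={nS} dS={dS} nR={nR} dR={dR}: M={m:.2e} FL={fl:.2e} ratio={m / fl:.1f}x")
```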
Q4. How do the authors handle datasets that may not fit in memory?
By harnessing prior work from the database literature, and avoiding explicit encoding of redundancy information, the authors handle datasets that may not fit in memory.
Q5. Why does she join the two tables on the EmployerID?
She joins the two tables on the EmployerID as part of her “feature engineering” because she thinks the features of the employer might be helpful in predicting how likely a customer is to churn.
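As a minimal sketch of that feature-engineering join (pandas; all table, column, and value names here are hypothetical, mirroring the paper's customers-and-employers example):

```python
import pandas as pd

# Entity table S: one row per customer, with the foreign key EmployerID
# and the target label (churn).
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "EmployerID": [10, 10, 20, 10],
    "Age":        [34, 29, 41, 52],
    "Churn":      [0, 1, 0, 1],
})

# Attribute table R: one row per employer, with employer features.
employers = pd.DataFrame({
    "EmployerID":   [10, 20],
    "Revenue":      [5.2e6, 1.1e6],
    "NumEmployees": [300, 45],
})

# Key-foreign-key join on EmployerID, producing the denormalized input.
denormalized = customers.merge(employers, on="EmployerID", how="inner")
print(denormalized)
```

Note that the employer features (Revenue, NumEmployees) are repeated for every customer with the same EmployerID, which is exactly the redundancy the paper's approaches try to avoid materializing.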
Q6. What are some of the systems that combine ML with data management platforms?
These include systems that combine linear algebra-based languages with data management platforms [4, 15, 34], systems for Bayesian inference [9], systems for graph-based ML [23], and systems that combine dataflow-based languages for ML with data management platforms [21, 22, 33].
Q7. What is the main problem of learning after joins?
The main problem is the redundancy present in the denormalized relation used for learning after joins; from a technical perspective, the issues that such redundancy raises are well known in the context of traditional relational data management [27].
Q8. What is the way to improve the accuracy of their cost model?
The authors think it is interesting future work to improve the absolute accuracy of their cost model, say, by making their models more fine-grained, and by performing a more careful calibration.
Q9. How is the stepsize parameter tuned?
The stepsize parameter (α) is typically tuned using a line search method that potentially computes the loss many times (similar to step 4) [26].
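The paper does not prescribe a specific line-search routine; the sketch below is a minimal illustration of one standard scheme (backtracking with the Armijo sufficient-decrease condition), using squared loss purely for concreteness. Every trial stepsize requires another loss evaluation, which is why tuning α can be costly.

```python
import numpy as np

def loss(w, X, y):
    # Squared loss for illustration; other GLMs plug in their own loss.
    r = X @ w - y
    return 0.5 * float(r @ r)

def gradient(w, X, y):
    return X.T @ (X @ w - y)

def backtracking_stepsize(w, X, y, g, alpha0=1.0, beta=0.5, c=1e-4):
    """Shrink alpha until the Armijo sufficient-decrease condition holds.
    Each trial stepsize recomputes the loss, which makes line search costly."""
    alpha, f0, g_norm2 = alpha0, loss(w, X, y), float(g @ g)
    while loss(w - alpha * g, X, y) > f0 - c * alpha * g_norm2:
        alpha *= beta
    return alpha

# One BGD step with a tuned stepsize (random data for illustration).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = np.zeros(5)
g = gradient(w, X, y)
alpha = backtracking_stepsize(w, X, y, g)
w -= alpha * g
```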
Q10. What is the I/O tradeoff between Materialize and Stream?
The I/O and storage tradeoffs between Materialize and Stream (Figure 1(B)) arise because it is likely that many tuples of S join with a single tuple of R (e.g., many customers might have the same employer).
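A back-of-the-envelope sketch of that storage side of the tradeoff (all sizes hypothetical): if many S tuples share one R tuple, Materialize repeats R's features in the join output, while keeping the tables normalized stores them only once.

```python
# Hypothetical back-of-the-envelope storage comparison (8-byte features).
nS, dS = 10**7, 10      # customer tuples and features in S
nR, dR = 10**4, 100     # employer tuples and features in R
bytes_per_feature = 8

materialized = nS * (dS + dR) * bytes_per_feature        # denormalized join output
normalized   = (nS * dS + nR * dR) * bytes_per_feature   # S and R kept separate
print(f"Materialize: {materialized / 1e9:.1f} GB, normalized: {normalized / 1e9:.1f} GB")
```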
Q11. What is the significance of learning over joins?
An important challenge is whether it is possible to devise approaches that learn over joins and avoid introducing such redundancy, without sacrificing model quality, learning efficiency, or scalability compared to the currently standard approach of learning after joins.
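The paper's factorized learning (FL) answers this affirmatively for GLMs trained with BGD. A minimal sketch of the key algebraic fact it exploits (NumPy, with illustrative variable names): the inner product over a joined feature vector decomposes as w·x = wS·xS + wR·xR, so the partial inner products over R can be computed once per iteration and looked up through the foreign key, without materializing the join.

```python
import numpy as np

rng = np.random.default_rng(1)
nR, dR, nS, dS = 5, 4, 12, 3            # small illustrative sizes
XR = rng.normal(size=(nR, dR))          # attribute table R features
XS = rng.normal(size=(nS, dS))          # entity table S features
fk = rng.integers(0, nR, size=nS)       # foreign key: which R tuple each S tuple joins to
w_S, w_R = rng.normal(size=dS), rng.normal(size=dR)

# Materialized: build the full joined feature matrix, then take inner products.
X_join = np.hstack([XS, XR[fk]])
ip_materialized = X_join @ np.concatenate([w_S, w_R])

# Factorized: compute partial inner products over R once, then look them up via the key.
pip_R = XR @ w_R                        # one value per R tuple
ip_factorized = XS @ w_S + pip_R[fk]

assert np.allclose(ip_materialized, ip_factorized)
```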
Q12. Why did the authors use the cost model to predict the runtime trends of each approach?
Their main goal for the analytical cost models was to understand the fine-grained behavior of each approach and to quickly explore the relative performance trends of all the approaches under different parameter settings.