Solve the optimization problem of tree, should we make each rectangle contains exactly one training data...












2












$begingroup$


I was reading Trevor Hastie and Robert Tibshirani's book "An Introduction to Statistical Learning with Applications in R". In page 306, when talking about the objective function of tree model, the book says:



"The goal is to find boxes $R_1,...,R_J$" that minimize the RSS, given by"
$$sum_{j=1}^Jsum_{iin R_j}(y_i-hat{y}_{R_j})^2,$$
where $hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box. Unfortunately, it is computationally infeasible to consider every
possible partition of the feature space into $J$ boxes."



My question is: isn't the optimal solution to this RSS very obvious? We just partition the whole feature into $N$ rectangles such that each rectangle only contains one data point, then we achieve zero RSS. Let's put the test performance aside. For now, if we just want to find the $J$ and ${R_j}_{j=1}^J$ that minimizes the above RSS, then shouldn't we just make partitions of the feature space such that each rectangle only contains one training data point?










share|cite|improve this question









$endgroup$












  • $begingroup$
    The first and second authors of this book are Gareth James and Daniela Witten. Hastie and Tibshirani are third and fourth authors.
    $endgroup$
    – Nick Cox
    24 mins ago
















2












$begingroup$


I was reading Trevor Hastie and Robert Tibshirani's book "An Introduction to Statistical Learning with Applications in R". In page 306, when talking about the objective function of tree model, the book says:



"The goal is to find boxes $R_1,...,R_J$" that minimize the RSS, given by"
$$sum_{j=1}^Jsum_{iin R_j}(y_i-hat{y}_{R_j})^2,$$
where $hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box. Unfortunately, it is computationally infeasible to consider every
possible partition of the feature space into $J$ boxes."



My question is: isn't the optimal solution to this RSS very obvious? We just partition the whole feature into $N$ rectangles such that each rectangle only contains one data point, then we achieve zero RSS. Let's put the test performance aside. For now, if we just want to find the $J$ and ${R_j}_{j=1}^J$ that minimizes the above RSS, then shouldn't we just make partitions of the feature space such that each rectangle only contains one training data point?










share|cite|improve this question









$endgroup$












  • $begingroup$
    The first and second authors of this book are Gareth James and Daniela Witten. Hastie and Tibshirani are third and fourth authors.
    $endgroup$
    – Nick Cox
    24 mins ago














2












2








2





$begingroup$


I was reading Trevor Hastie and Robert Tibshirani's book "An Introduction to Statistical Learning with Applications in R". In page 306, when talking about the objective function of tree model, the book says:



"The goal is to find boxes $R_1,...,R_J$" that minimize the RSS, given by"
$$sum_{j=1}^Jsum_{iin R_j}(y_i-hat{y}_{R_j})^2,$$
where $hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box. Unfortunately, it is computationally infeasible to consider every
possible partition of the feature space into $J$ boxes."



My question is: isn't the optimal solution to this RSS very obvious? We just partition the whole feature into $N$ rectangles such that each rectangle only contains one data point, then we achieve zero RSS. Let's put the test performance aside. For now, if we just want to find the $J$ and ${R_j}_{j=1}^J$ that minimizes the above RSS, then shouldn't we just make partitions of the feature space such that each rectangle only contains one training data point?










share|cite|improve this question









$endgroup$




I was reading Trevor Hastie and Robert Tibshirani's book "An Introduction to Statistical Learning with Applications in R". In page 306, when talking about the objective function of tree model, the book says:



"The goal is to find boxes $R_1,...,R_J$" that minimize the RSS, given by"
$$sum_{j=1}^Jsum_{iin R_j}(y_i-hat{y}_{R_j})^2,$$
where $hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box. Unfortunately, it is computationally infeasible to consider every
possible partition of the feature space into $J$ boxes."



My question is: isn't the optimal solution to this RSS very obvious? We just partition the whole feature into $N$ rectangles such that each rectangle only contains one data point, then we achieve zero RSS. Let's put the test performance aside. For now, if we just want to find the $J$ and ${R_j}_{j=1}^J$ that minimizes the above RSS, then shouldn't we just make partitions of the feature space such that each rectangle only contains one training data point?







machine-learning predictive-models optimization cart






share|cite|improve this question













share|cite|improve this question











share|cite|improve this question




share|cite|improve this question










asked 5 hours ago









ftxxftxx

1233




1233












  • $begingroup$
    The first and second authors of this book are Gareth James and Daniela Witten. Hastie and Tibshirani are third and fourth authors.
    $endgroup$
    – Nick Cox
    24 mins ago


















  • $begingroup$
    The first and second authors of this book are Gareth James and Daniela Witten. Hastie and Tibshirani are third and fourth authors.
    $endgroup$
    – Nick Cox
    24 mins ago
















$begingroup$
The first and second authors of this book are Gareth James and Daniela Witten. Hastie and Tibshirani are third and fourth authors.
$endgroup$
– Nick Cox
24 mins ago




$begingroup$
The first and second authors of this book are Gareth James and Daniela Witten. Hastie and Tibshirani are third and fourth authors.
$endgroup$
– Nick Cox
24 mins ago










1 Answer
1






active

oldest

votes


















2












$begingroup$

You're correct that partitioning with a single training point per 'box' would achieve zero error on the training set. But, in the optimization problem Hastie and Tibshirani described, the number of boxes $J$ isn't a free parameter to solve for. Rather, it's a hyperparameter--we can choose its value initially, but must consider it fixed when solving for parameters that define the boxes. If $J$ is set less than the number of data points, then using one box per data point is not a possible solution.



This is a good thing because we typically wouldn't want to end up with one box per data point. The problem with this solution is overfitting--if the data is noisy, perfect training set performance simply indicates that we have fit the noise, and the model may not generalize well to unseen/future data. The number of leaf nodes (i.e. boxes)--and related hyperparameters governing tree size--control the tradeoff between over- and underfitting. They're typically tuned to optimize some measure of expected generalization performance. But, minimizing training set error isn't a valid way to do this.






share|cite|improve this answer









$endgroup$













    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "65"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f393018%2fsolve-the-optimization-problem-of-tree-should-we-make-each-rectangle-contains-e%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    2












    $begingroup$

    You're correct that partitioning with a single training point per 'box' would achieve zero error on the training set. But, in the optimization problem Hastie and Tibshirani described, the number of boxes $J$ isn't a free parameter to solve for. Rather, it's a hyperparameter--we can choose its value initially, but must consider it fixed when solving for parameters that define the boxes. If $J$ is set less than the number of data points, then using one box per data point is not a possible solution.



    This is a good thing because we typically wouldn't want to end up with one box per data point. The problem with this solution is overfitting--if the data is noisy, perfect training set performance simply indicates that we have fit the noise, and the model may not generalize well to unseen/future data. The number of leaf nodes (i.e. boxes)--and related hyperparameters governing tree size--control the tradeoff between over- and underfitting. They're typically tuned to optimize some measure of expected generalization performance. But, minimizing training set error isn't a valid way to do this.






    share|cite|improve this answer









    $endgroup$


















      2












      $begingroup$

      You're correct that partitioning with a single training point per 'box' would achieve zero error on the training set. But, in the optimization problem Hastie and Tibshirani described, the number of boxes $J$ isn't a free parameter to solve for. Rather, it's a hyperparameter--we can choose its value initially, but must consider it fixed when solving for parameters that define the boxes. If $J$ is set less than the number of data points, then using one box per data point is not a possible solution.



      This is a good thing because we typically wouldn't want to end up with one box per data point. The problem with this solution is overfitting--if the data is noisy, perfect training set performance simply indicates that we have fit the noise, and the model may not generalize well to unseen/future data. The number of leaf nodes (i.e. boxes)--and related hyperparameters governing tree size--control the tradeoff between over- and underfitting. They're typically tuned to optimize some measure of expected generalization performance. But, minimizing training set error isn't a valid way to do this.






      share|cite|improve this answer









      $endgroup$
















        2












        2








        2





        $begingroup$

        You're correct that partitioning with a single training point per 'box' would achieve zero error on the training set. But, in the optimization problem Hastie and Tibshirani described, the number of boxes $J$ isn't a free parameter to solve for. Rather, it's a hyperparameter--we can choose its value initially, but must consider it fixed when solving for parameters that define the boxes. If $J$ is set less than the number of data points, then using one box per data point is not a possible solution.



        This is a good thing because we typically wouldn't want to end up with one box per data point. The problem with this solution is overfitting--if the data is noisy, perfect training set performance simply indicates that we have fit the noise, and the model may not generalize well to unseen/future data. The number of leaf nodes (i.e. boxes)--and related hyperparameters governing tree size--control the tradeoff between over- and underfitting. They're typically tuned to optimize some measure of expected generalization performance. But, minimizing training set error isn't a valid way to do this.






        share|cite|improve this answer









        $endgroup$



        You're correct that partitioning with a single training point per 'box' would achieve zero error on the training set. But, in the optimization problem Hastie and Tibshirani described, the number of boxes $J$ isn't a free parameter to solve for. Rather, it's a hyperparameter--we can choose its value initially, but must consider it fixed when solving for parameters that define the boxes. If $J$ is set less than the number of data points, then using one box per data point is not a possible solution.



        This is a good thing because we typically wouldn't want to end up with one box per data point. The problem with this solution is overfitting--if the data is noisy, perfect training set performance simply indicates that we have fit the noise, and the model may not generalize well to unseen/future data. The number of leaf nodes (i.e. boxes)--and related hyperparameters governing tree size--control the tradeoff between over- and underfitting. They're typically tuned to optimize some measure of expected generalization performance. But, minimizing training set error isn't a valid way to do this.







        share|cite|improve this answer












        share|cite|improve this answer



        share|cite|improve this answer










        answered 2 hours ago









        user20160user20160

        16.8k12758




        16.8k12758






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Cross Validated!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f393018%2fsolve-the-optimization-problem-of-tree-should-we-make-each-rectangle-contains-e%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Olav Thon

            Waikiki

            Tårekanal