Why is computing ridge regression with a Cholesky decomposition much quicker than using SVD?

By my understanding, for a matrix with n samples and p features:

Ridge regression using Cholesky decomposition takes O(p^3) time

Ridge regression using SVD takes O(p^3) time

Computing SVD when only the diagonal matrix is needed (and not u and v) takes O(np^2) time

I tested this out in scipy on both random and real-world data with p > n (p = 43624, n = 1750) and found ridge regression with a Cholesky decomposition to be much quicker than computing it using SVD and much quicker than just computing the diagonal matrix of the SVD. Why is this?

import numpy as np

from sklearn import linear_model

import scipy

import time



x = np.random.rand(1750, 43264)

y = np.random.rand(1750)



old_time = time.time()

clf = linear_model.Ridge(alpha=1.0, solver='cholesky')

clf.fit(x, y)

print("Time taken to solve Ridge with cholesky: ", time.time() - old_time)



old_time = time.time()

clf = linear_model.Ridge(alpha=1.0, solver='svd')

clf.fit(x, y)

print("Time taken to solve Ridge with SVD: ", time.time() - old_time)



old_time = time.time()

scipy.linalg.svd(x, full_matrices=False)

print("Time taken for SVD", time.time() - old_time)



old_time = time.time()

scipy.linalg.svd(x, full_matrices=False, compute_uv=False)

print("Time taken for SVD, just s", time.time() - old_time)

Output:

Time taken to solve Ridge with cholesky:  3.2336008548736572

Time taken to solve Ridge with SVD:  118.47378492355347

Time taken for SVD 92.01217150688171

Time taken for SVD, just s 44.7129647731781

asked 4 hours ago

pighead10

1264

New contributor

$begingroup$
Cholesky decomposes a square symmetric matrix (such as correlations) and is much faster itself than SVD which, in general, decomposes a rectangular matrix (such as dataset) into 3 matrices.
$endgroup$
– ttnphns
3 hours ago

add a comment |

By my understanding, for a matrix with n samples and p features:

Ridge regression using Cholesky decomposition takes O(p^3) time

Ridge regression using SVD takes O(p^3) time

Computing SVD when only the diagonal matrix is needed (and not u and v) takes O(np^2) time

import numpy as np

from sklearn import linear_model

import scipy

import time



x = np.random.rand(1750, 43264)

y = np.random.rand(1750)



old_time = time.time()

clf = linear_model.Ridge(alpha=1.0, solver='cholesky')

clf.fit(x, y)

print("Time taken to solve Ridge with cholesky: ", time.time() - old_time)



old_time = time.time()

clf = linear_model.Ridge(alpha=1.0, solver='svd')

clf.fit(x, y)

print("Time taken to solve Ridge with SVD: ", time.time() - old_time)



old_time = time.time()

scipy.linalg.svd(x, full_matrices=False)

print("Time taken for SVD", time.time() - old_time)



old_time = time.time()

scipy.linalg.svd(x, full_matrices=False, compute_uv=False)

print("Time taken for SVD, just s", time.time() - old_time)

Output:

Time taken to solve Ridge with cholesky:  3.2336008548736572

Time taken to solve Ridge with SVD:  118.47378492355347

Time taken for SVD 92.01217150688171

Time taken for SVD, just s 44.7129647731781

asked 4 hours ago

pighead10

1264

New contributor

$begingroup$
Cholesky decomposes a square symmetric matrix (such as correlations) and is much faster itself than SVD which, in general, decomposes a rectangular matrix (such as dataset) into 3 matrices.
$endgroup$
– ttnphns
3 hours ago

add a comment |

By my understanding, for a matrix with n samples and p features:

Ridge regression using Cholesky decomposition takes O(p^3) time

Ridge regression using SVD takes O(p^3) time

Computing SVD when only the diagonal matrix is needed (and not u and v) takes O(np^2) time

import numpy as np

from sklearn import linear_model

import scipy

import time



x = np.random.rand(1750, 43264)

y = np.random.rand(1750)



old_time = time.time()

clf = linear_model.Ridge(alpha=1.0, solver='cholesky')

clf.fit(x, y)

print("Time taken to solve Ridge with cholesky: ", time.time() - old_time)



old_time = time.time()

clf = linear_model.Ridge(alpha=1.0, solver='svd')

clf.fit(x, y)

print("Time taken to solve Ridge with SVD: ", time.time() - old_time)



old_time = time.time()

scipy.linalg.svd(x, full_matrices=False)

print("Time taken for SVD", time.time() - old_time)



old_time = time.time()

scipy.linalg.svd(x, full_matrices=False, compute_uv=False)

print("Time taken for SVD, just s", time.time() - old_time)

Output:

Time taken to solve Ridge with cholesky:  3.2336008548736572

Time taken to solve Ridge with SVD:  118.47378492355347

Time taken for SVD 92.01217150688171

Time taken for SVD, just s 44.7129647731781

asked 4 hours ago

pighead10

1264

New contributor

By my understanding, for a matrix with n samples and p features:

Ridge regression using Cholesky decomposition takes O(p^3) time

Ridge regression using SVD takes O(p^3) time

Computing SVD when only the diagonal matrix is needed (and not u and v) takes O(np^2) time

import numpy as np

from sklearn import linear_model

import scipy

import time



x = np.random.rand(1750, 43264)

y = np.random.rand(1750)



old_time = time.time()

clf = linear_model.Ridge(alpha=1.0, solver='cholesky')

clf.fit(x, y)

print("Time taken to solve Ridge with cholesky: ", time.time() - old_time)



old_time = time.time()

clf = linear_model.Ridge(alpha=1.0, solver='svd')

clf.fit(x, y)

print("Time taken to solve Ridge with SVD: ", time.time() - old_time)



old_time = time.time()

scipy.linalg.svd(x, full_matrices=False)

print("Time taken for SVD", time.time() - old_time)



old_time = time.time()

scipy.linalg.svd(x, full_matrices=False, compute_uv=False)

print("Time taken for SVD, just s", time.time() - old_time)

Output:

Time taken to solve Ridge with cholesky:  3.2336008548736572

Time taken to solve Ridge with SVD:  118.47378492355347

Time taken for SVD 92.01217150688171

Time taken for SVD, just s 44.7129647731781

regression ridge-regression svd scipy cholesky

asked 4 hours ago

pighead10

1264

New contributor

asked 4 hours ago

pighead10

1264

New contributor

asked 4 hours ago

pighead10

1264

New contributor

asked 4 hours ago

pighead10

1264

asked 4 hours ago

pighead10

1264

New contributor

pighead10 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

$begingroup$
Cholesky decomposes a square symmetric matrix (such as correlations) and is much faster itself than SVD which, in general, decomposes a rectangular matrix (such as dataset) into 3 matrices.
$endgroup$
– ttnphns
3 hours ago

add a comment |

$begingroup$
Cholesky decomposes a square symmetric matrix (such as correlations) and is much faster itself than SVD which, in general, decomposes a rectangular matrix (such as dataset) into 3 matrices.
$endgroup$
– ttnphns
3 hours ago

Cholesky decomposes a square symmetric matrix (such as correlations) and is much faster itself than SVD which, in general, decomposes a rectangular matrix (such as dataset) into 3 matrices.

– ttnphns
3 hours ago

add a comment |

1 Answer
1

active

oldest

votes

First, we note that ridge regression can be transformed into an OLS problem via the data augmentation trick.. This adds $p$ rows to the design matrix $mathbf{X}$, so the new matrix $mathbf{X}_lambda$ has dimensions $ (n + p) times p $.

We can then look up the run time for various OLS algorithms, for example in the book Least Squares Data Fitting with Applications or in Golub's own book on Matrix Computations. (Golub was one of inventors of currently state-of-the-art Golub-Reinsch SVD algorithm.) Since linking to Google Books is not very reliable, these costs can also be found, for example, here.

We see that relevant costs for the non-ridge case, assuming $mathbf{X}$ is $n times p$, are:

$$ C_{LU} = 2 n p^2 - frac{2}{3} p^3 tag{1} $$

$$ C_{SVD} = 4 n p^2 + 8 p^3 tag{2} $$

To obtain equations for ridge regression, we need to substitute $n rightarrow n + p $ to correctly account for data augmentation:

$$ C_{LU} = 2 n p^2 + frac{4}{3} p^3 tag{3} $$

$$ C_{SVD} = 4 n p^2 + 12 p^3 tag{4} $$

Note that unlike your example, $n > p$ for most typical regression problems. If we say that $n approx p$, we can obtained simplified costs in only one variable:

$$ C_{LU} = frac{10}{3} p^3 tag{5} $$
$$ C_{SVD} = 16 p^3 tag{6} $$

This is probably where those rough estimates you reference came from, but as you yourself discovered, this simplification hides important detail and makes inappropriate assumptions that don't fit your case. Note, for example, that the constant for SVD is much worse than for Cholesky, an important fact hidden by the big $mathord{O}$ notation.

What you are doing is exactly the opposite - you have $n ll p$. That means the second term is much more important! By comparing the coefficients of the $p^3$ terms of equations (3) and (4) we can argue the SVD approach will require roughly 10 times as many floating point operations as the Cholesky approach when $n ll p$.

Beyond that, it's hard to know exactly what scipy is doing under the hood - is it calculating both $mathbf{U}$ and $mathbf{V}$ when doing SVD? Is it using the randomized algorithm? Is it using data augmentation for ridge regression, or some other approach? And even once those details are known, it's hard to translate these theoretical operation counts to real-world run time because libraries like LAPACK (scipy is almost surely using LAPACK or ATLAS under the hood) are so highly vectorized (taking advantage of CPU SIMD instructions) and often multithreaded that's it's hard to estimate accurately from purely theoretical arguments. But it's safe to say that it's reasonable to expect Cholesky factorization to be much faster than SVD for a given problem, especially when $n ll p$.

answered 3 hours ago

olooney

1,326614

$begingroup$
Explicit data augmentation would be a very slow way to compute ridge solution when $nll p$... Everything that scales as $p^3$ can instead be rewritten to scale as $n^3$.
$endgroup$
– amoeba
2 hours ago

$begingroup$
Great answer. This problem arose because a paper I was following suggested, since n ≪ p, computing the SVD in order to grid-search for the best alpha, since the resulting computation for each alpha only takes O(pn^2). However, my results/your answer shows that this would only take less time if there are greater than ~10 different values of alpha to try, due to the initial overheard of the SVD being huge.
$endgroup$
– pighead10
2 hours ago

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "65"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

pighead10 is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f396914%2fwhy-is-computing-ridge-regression-with-a-cholesky-decomposition-much-quicker-tha%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

We see that relevant costs for the non-ridge case, assuming $mathbf{X}$ is $n times p$, are:

$$ C_{LU} = 2 n p^2 - frac{2}{3} p^3 tag{1} $$

$$ C_{SVD} = 4 n p^2 + 8 p^3 tag{2} $$

To obtain equations for ridge regression, we need to substitute $n rightarrow n + p $ to correctly account for data augmentation:

$$ C_{LU} = 2 n p^2 + frac{4}{3} p^3 tag{3} $$

$$ C_{SVD} = 4 n p^2 + 12 p^3 tag{4} $$

Note that unlike your example, $n > p$ for most typical regression problems. If we say that $n approx p$, we can obtained simplified costs in only one variable:

$$ C_{LU} = frac{10}{3} p^3 tag{5} $$
$$ C_{SVD} = 16 p^3 tag{6} $$

answered 3 hours ago

olooney

1,326614

$begingroup$
Explicit data augmentation would be a very slow way to compute ridge solution when $nll p$... Everything that scales as $p^3$ can instead be rewritten to scale as $n^3$.
$endgroup$
– amoeba
2 hours ago

$begingroup$
Great answer. This problem arose because a paper I was following suggested, since n ≪ p, computing the SVD in order to grid-search for the best alpha, since the resulting computation for each alpha only takes O(pn^2). However, my results/your answer shows that this would only take less time if there are greater than ~10 different values of alpha to try, due to the initial overheard of the SVD being huge.
$endgroup$
– pighead10
2 hours ago

add a comment |

We see that relevant costs for the non-ridge case, assuming $mathbf{X}$ is $n times p$, are:

$$ C_{LU} = 2 n p^2 - frac{2}{3} p^3 tag{1} $$

$$ C_{SVD} = 4 n p^2 + 8 p^3 tag{2} $$

To obtain equations for ridge regression, we need to substitute $n rightarrow n + p $ to correctly account for data augmentation:

$$ C_{LU} = 2 n p^2 + frac{4}{3} p^3 tag{3} $$

$$ C_{SVD} = 4 n p^2 + 12 p^3 tag{4} $$

Note that unlike your example, $n > p$ for most typical regression problems. If we say that $n approx p$, we can obtained simplified costs in only one variable:

$$ C_{LU} = frac{10}{3} p^3 tag{5} $$
$$ C_{SVD} = 16 p^3 tag{6} $$

answered 3 hours ago

olooney

1,326614

$begingroup$
Explicit data augmentation would be a very slow way to compute ridge solution when $nll p$... Everything that scales as $p^3$ can instead be rewritten to scale as $n^3$.
$endgroup$
– amoeba
2 hours ago

$begingroup$
Great answer. This problem arose because a paper I was following suggested, since n ≪ p, computing the SVD in order to grid-search for the best alpha, since the resulting computation for each alpha only takes O(pn^2). However, my results/your answer shows that this would only take less time if there are greater than ~10 different values of alpha to try, due to the initial overheard of the SVD being huge.
$endgroup$
– pighead10
2 hours ago

add a comment |

We see that relevant costs for the non-ridge case, assuming $mathbf{X}$ is $n times p$, are:

$$ C_{LU} = 2 n p^2 - frac{2}{3} p^3 tag{1} $$

$$ C_{SVD} = 4 n p^2 + 8 p^3 tag{2} $$

To obtain equations for ridge regression, we need to substitute $n rightarrow n + p $ to correctly account for data augmentation:

$$ C_{LU} = 2 n p^2 + frac{4}{3} p^3 tag{3} $$

$$ C_{SVD} = 4 n p^2 + 12 p^3 tag{4} $$

Note that unlike your example, $n > p$ for most typical regression problems. If we say that $n approx p$, we can obtained simplified costs in only one variable:

$$ C_{LU} = frac{10}{3} p^3 tag{5} $$
$$ C_{SVD} = 16 p^3 tag{6} $$

answered 3 hours ago

olooney

1,326614

We see that relevant costs for the non-ridge case, assuming $mathbf{X}$ is $n times p$, are:

$$ C_{LU} = 2 n p^2 - frac{2}{3} p^3 tag{1} $$

$$ C_{SVD} = 4 n p^2 + 8 p^3 tag{2} $$

To obtain equations for ridge regression, we need to substitute $n rightarrow n + p $ to correctly account for data augmentation:

$$ C_{LU} = 2 n p^2 + frac{4}{3} p^3 tag{3} $$

$$ C_{SVD} = 4 n p^2 + 12 p^3 tag{4} $$

Note that unlike your example, $n > p$ for most typical regression problems. If we say that $n approx p$, we can obtained simplified costs in only one variable:

$$ C_{LU} = frac{10}{3} p^3 tag{5} $$
$$ C_{SVD} = 16 p^3 tag{6} $$

answered 3 hours ago

olooney

1,326614

answered 3 hours ago

olooney

1,326614

answered 3 hours ago

olooney

1,326614

answered 3 hours ago

olooney

1,326614

$begingroup$
Explicit data augmentation would be a very slow way to compute ridge solution when $nll p$... Everything that scales as $p^3$ can instead be rewritten to scale as $n^3$.
$endgroup$
– amoeba
2 hours ago

$begingroup$
Great answer. This problem arose because a paper I was following suggested, since n ≪ p, computing the SVD in order to grid-search for the best alpha, since the resulting computation for each alpha only takes O(pn^2). However, my results/your answer shows that this would only take less time if there are greater than ~10 different values of alpha to try, due to the initial overheard of the SVD being huge.
$endgroup$
– pighead10
2 hours ago

add a comment |

$begingroup$
Explicit data augmentation would be a very slow way to compute ridge solution when $nll p$... Everything that scales as $p^3$ can instead be rewritten to scale as $n^3$.
$endgroup$
– amoeba
2 hours ago

$begingroup$
Great answer. This problem arose because a paper I was following suggested, since n ≪ p, computing the SVD in order to grid-search for the best alpha, since the resulting computation for each alpha only takes O(pn^2). However, my results/your answer shows that this would only take less time if there are greater than ~10 different values of alpha to try, due to the initial overheard of the SVD being huge.
$endgroup$
– pighead10
2 hours ago

Explicit data augmentation would be a very slow way to compute ridge solution when $nll p$... Everything that scales as $p^3$ can instead be rewritten to scale as $n^3$.

– amoeba
2 hours ago

Great answer. This problem arose because a paper I was following suggested, since n ≪ p, computing the SVD in order to grid-search for the best alpha, since the resulting computation for each alpha only takes O(pn^2). However, my results/your answer shows that this would only take less time if there are greater than ~10 different values of alpha to try, due to the initial overheard of the SVD being huge.

– pighead10
2 hours ago

add a comment |

pighead10 is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

pighead10 is a new contributor. Be nice, and check out our Code of Conduct.

Thanks for contributing an answer to Cross Validated!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Fdzfgy