"Long" but piecemeal. Which is a result of how they approached the problem.
They defined every possible curve in infinite dimensions based on 5 characteristics, essentially reducing the problem from an infinite space to an abstracted 5-dimensional space. Think: all black holes can be described, so far as we understand them today, in an abstracted 3-dimensional space, that is [mass, electric charge, angular momentum](i)
Then they started proving the theorem for subsets of that 5-d space. Think: proving a theorem for infinitely many integers by first proving it for 0, then for all even integers, then for all odd integers, then separately for the small handful of odd integers that your odd-integer proof fails on, then proving that the negatives can be reduced to the positives you have already proved.
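To make that analogy concrete, here is a toy Python sketch of the same proof shape (entirely invented, nothing to do with the paper's actual arguments); each branch is one "case" of the proof, and the sporadic set is the handful of values checked by hand:

    SPORADIC_ODDS = {3, 7}  # hypothetical odd values the general odd argument misses

    def property_holds(n: int) -> bool:
        if n == 0:
            return True                    # base case, proved directly
        if n < 0:
            return property_holds(-n)      # reduce negatives to the positive case
        if n % 2 == 0:
            return property_holds(n // 2)  # "even" argument: reduce to a smaller case
        if n in SPORADIC_ODDS:
            return True                    # sporadic cases verified one by one
        return property_holds(n - 1)       # generic "odd" argument: reduce to the even case

    assert all(property_holds(n) for n in range(-20, 21))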
It's really great seeing this practical approach laid bare.
Also, at the bottom, the paper includes code for "Section 11: Most of the Sporadic Cases", so if you're unfamiliar with the math but familiar with standard Python, you can gain further understanding both by following the code and by comparing it to the corresponding propositions from the proofs, which are highlighted as comments.
For example, from the paper(ii):
    def can_induct(d, g, r, l, m):
        # Proposition 8.4
        if d >= g + 2 * r - 1 and good(d - (r - 1), g, r, l, m):
            return True
        # ... (the paper's function continues with further propositions)
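If you want to run something with that shape outside the paper, here's a minimal self-contained sketch; good() below is a made-up placeholder rather than the paper's real predicate, and the sweep just mirrors the way their script enumerates concrete (d, g, r, l, m) cases:

    # Hypothetical stand-in for the paper's good() predicate, only so the
    # snippet runs; the real one encodes the actual geometric criterion.
    def good(d, g, r, l, m):
        return d >= g + r  # placeholder inequality, NOT the paper's condition

    def can_induct(d, g, r, l, m):
        # Proposition 8.4 (the paper checks several more propositions here)
        if d >= g + 2 * r - 1 and good(d - (r - 1), g, r, l, m):
            return True
        return False

    # Sweep a small box of parameter tuples and report where the inductive
    # step applies, mirroring the case-enumeration style of Section 11.
    for d in range(1, 6):
        for g in range(0, 3):
            for r in range(1, 4):
                if can_induct(d, g, r, 0, 0):
                    print(f"induction applies: d={d}, g={g}, r={r}")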
I'm wondering if this type of research can be applied to deep learning. SGD is a very slow and inefficient way to train models; can we find another way? A set of datapoints maps to some unknown curve (the loss surface) for a particular model architecture, where each point on it is a set of specific weight values. Can we come up with a better way to find the minima (the points with the lowest loss) given a few random (or specifically chosen) points? Can we interpolate the curve?
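As a toy version of "interpolate the curve": sample a 1-D slice of the loss at a few points, fit a parabola through them, and jump straight to its vertex. Everything below is a made-up illustration (the loss function, the sample points), and real loss surfaces are high-dimensional and nothing like this clean, but it shows the one-shot idea in miniature:

    import numpy as np

    # Toy 1-D "loss curve" standing in for a slice of a real loss surface.
    def loss(w):
        return (w - 2.0) ** 2 + 0.5

    # Sample the loss at three points and fit an exact quadratic through them.
    ws = np.array([-1.0, 0.0, 3.0])
    a, b, c = np.polyfit(ws, loss(ws), deg=2)

    # The minimum of a*w^2 + b*w + c sits at w = -b / (2a): one shot, no SGD.
    w_star = -b / (2 * a)
    print(w_star)  # ~2.0 for this toy curve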
If we consider a specific example, say GPT-3 training, and look at a dataset as a collection of paths on a curve (in a word-embedding space; more likely a graph that could be interpolated into a curve), then training tries to map a bunch of these paths (word sequences) to a single downward path somewhere on the loss surface. Currently this process is random - we randomly sample paths on the dataset graph, trying to build a path of small steps on the loss surface. But what if we could analyze the dataset as a whole and extract some properties that would constrain the search in parameter space? Or at least choose the training sequences more carefully to improve learning efficiency.
You're talking about second-order optimization. The full-fat version is too slow (the Hessian, the matrix of second derivatives, is huge for DL models), and the optimized approximations don't seem to provide a meaningful benefit over first-order methods like SGD (or, more commonly, Adam).
A more promising path is to regularize the loss function so that SGD can optimize it quickly, rather than switching to a second-order method.
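For a sense of why the full-fat version is off the table, here is a minimal Newton-step sketch (toy NumPy code, not any real training loop): with n parameters the Hessian is n x n, so at GPT-3 scale that would be a 175-billion-squared matrix before you even try to solve the system:

    import numpy as np

    def newton_step(w, grad_fn, hess_fn):
        # Solve H @ delta = grad instead of stepping along the raw gradient.
        # For n parameters, H is n x n: O(n^2) memory and O(n^3) to solve,
        # which is what rules out the exact method for large models.
        g = grad_fn(w)
        H = hess_fn(w)
        return w - np.linalg.solve(H, g)

    # Tiny quadratic objective where one Newton step lands exactly on the minimum.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    grad_fn = lambda w: A @ w  # gradient of 0.5 * w.T @ A @ w
    hess_fn = lambda w: A      # constant Hessian
    print(newton_step(np.array([5.0, -4.0]), grad_fn, hess_fn))  # -> [0. 0.]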
No. I'm asking for a new kind of math for ML, inspired by the paper in this post. I would like to treat a dataset as a curve, which maps to another curve (loss landscape), under the model architecture constraints, and then find the minima on that landscape in one shot. There might be some kind of interpolation possible which has nothing to do with SGD optimization (of any order).
When I learn something new, for example a difficult topic like functional programming or signal processing, the process feels nothing like SGD. At first I gather some background information, motivation, goals, etc. Then I unpack the concept/formula and try to understand how it works, how the pieces fit together. It could be bottom-up or top-down. The information accumulates until it reaches critical mass, and then boom - I get it. That stage happens quickly; it's an "aha" moment.
This learning process involves both statistical pattern matching and some other mechanism, perhaps consulting a rule-based system like a decision tree, or accessing facts in my long-term memory database. I have no idea how a brain does it, but clearly I don't need a thousand examples to get the Fourier transform, or monads, or whatever it is I'm learning about. Often just one example is sufficient to form new patterns and rules in my mind. I'm guessing this is because we are not randomly probing the loss landscape in the dark. We might be interpolating it, filling in the blanks.
I do, however, need a thousand examples/repetitions to learn a new motor skill, like playing tennis or piano, or learning a new language - training my ear for many hours until I start recognizing the right sound patterns - so there are types of learning I do that might be similar to SGD. But most mental concepts are not learned like that.
Well, it is enumerative geometry: counting things. One would expect (a priori, though not necessarily) that any method that gives an answer to the problem can be used to count. That is, it is programmable.
Not trying to argue, just to specify the type of result.
> At Quanta Magazine, scientific accuracy is every bit as important as telling a good story. Since Quanta is a nonprofit foundation-funded publication, all of its resources go toward producing responsible, freely accessible journalism that is meticulously researched, reported, edited, copy-edited and fact-checked. And our editorial independence ensures the impartiality of our science coverage — our articles do not reflect or represent the views of the Simons Foundation. All editorial decisions, including which research or researchers to cover, are made by Quanta’s staff reporting to the editor in chief; editorial content is not reviewed by anyone outside of the news team prior to publication; Quanta has no involvement in any of the Simons Foundation’s grant-giving or research efforts; and researchers who receive funding from the foundation do not receive preferential treatment. The decision to cover a particular researcher or research result is made solely on editorial grounds in service of our readers.
But there's a risk that their kids won't care about math or won't amount to anything notable in STEM, whereas these two are proven to be effective as mathematicians.
You know the debate about 10x engineers - and how many people weigh in saying they are real, that 10x may even understate it? Here you have two 10x (minimum) mathematicians. You are suggesting we give that up, that they sacrifice the quiet time mathematicians need to do great work, and devote themselves to raising kids, on the off chance that (some of) their children may contribute more than they will.
Imagine Steve Jobs after his first success (incorporating Apple, selling a few units), or Bill Gates after his (selling some units of BASIC and random software, which is how Microsoft started), then quitting to raise a bunch of kids. Is that a good trade? What are their kids up to today, and could we expect much better even if they had ten times as many? Would you really expect those kids to accomplish more than Jobs and Gates did?
When put this way, it obviously sounds like a bad trade.
True, but government can still try to nudge behaviour.
Singapore famously established the social development unit (SDU) (nicknamed "Single, Desperate and Ugly") to encourage graduates to have more children, by organising speed dating events, BBQs, cruises etc. They also have a "Baby Bonus Scheme" [0].
"SDN’s long term goal has always been to help singles realise their marriage aspirations through equipping singles with relationship skills and creating a vibrant dating scene for singles to interact." [1]
Governments nudge because if there is no continuous thread of culture and people to carry it over a timespan longer than a lifetime, then its values and long-term designs are lost. Immigrants are important, but they need to be integrated into something the natives prepare and preserve; otherwise those immigrants would not have had something better to move to.
The United States has successfully integrated immigrants of many peoples into a framework of life that is still recognisable as European Enlightenment. The natives are now not necessarily European, but they have been living and reproducing long enough to be part of the fabric that keeps the original European Enlightenment tradition in place.
I didn't read the (flagged to hell) GP as implying an obligation. "I strongly encourage so-and-so to blah and blah" is about as mild as it gets for opinions about others' lives?
I wish that were the typical "get in other people's business" tone.
More and more I notice that people have a huge cognitive dissonance in accepting that all people were once children, and that these people they know and esteem were once someone's burden. I also find it somewhat naive to believe we are somehow not bound by our biological husk. We enjoy a good and full life on the backs of our parents' toil, and because we don't remember it, or aren't in contact with our families, we can feel that life is great and immortal and always was.
I like to think of the hard part of raising children as paying the debt we owe to our parents and ancestors. If not for their survival, I would not be able to say I am very glad and grateful to be alive and to experience the wonders of the universe, wonders like math.
Kids are not some kind of startup scaling problem. Kids require a lot of resources and attention; no wonder parents opt for a smaller number, which is still a challenge, but a manageable one.
The cheapest way to scale mental problems is to use the state to replace the father's role in a family. I know plenty of large families with perfectly well adjusted kids. I can't say I've met any well adjusted kids raised by a single mom.
I think this is some much-needed optimism about higher ed and the educational system. True, these people are exceptional, but I think the media only looks at the things that are wrong with the educational system.
Most top STEM departments are very immigrant-heavy (sometimes more than 50%). It can be true both that US universities have top departments and that Americans are disadvantaged at getting in (by having weaker foundations, particularly in math).
A STEM professor once told me he'd needed a Masters in math before they would let him in: they doubted an American with just an undergrad degree could handle the technical challenge. This was almost two decades ago; since then the gap may have widened a bit.
Most places that recruit globally are likely to have a greater proportion of international members, especially if language and social factors are less important for success. So being immigrant-heavy doesn't necessarily indicate a relative failing of the US educational system.
While PhDs without a masters are common in Europe, universities may have doctoral training centers that add an extra year to act like a masters. My wife went straight from undergrad at Oxford, and her PhD was four years instead of the usual three, with the first year consisting of short courses and projects.
Huh, interesting. Most of my friends who have PhDs don't have masters, and that's Europe-wide! The UK definitely is different, though. I got onto a Masters in CS without a Bachelors and discussed carrying on into a PhD, which would have been pretty interesting to explain on my CV.
Here's a link to the pre-print: https://arxiv.org/abs/2201.09445