🚨 The PRE 2025 Workshop recordings are now live! 🚨 Whether you're applying to a pre-doctoral program or looking to succeed once you're in, the panels are packed with invaluable insights from top economists, researchers, and PhD students. Plus, don't miss hands-on walk-throughs of common data tasks in R and Stata. Watch now!
Applying to a Pre-Doctoral Program
Martha Fiehn (PhD at UCSD):
That was a super helpful and comprehensive overview of pre-docs. Now we're going to talk about applying to a pre-doc. With us we have Vaishali Garga from the Boston Fed, Ashley Pandya from Chicago Booth, Alvise Scarabosio at Stanford, and Eric Zwick, also at Booth. In reverse order, maybe we can introduce ourselves. Eric, why don't we start with you?
Eric Zwick (Professor, Chicago Booth):
Hi everybody. I am a professor in economics and finance at the business school. I've been at Booth since 2014 and have worked with, I should have counted, but probably a couple dozen pre-docs over the time I've been here, and I'm really looking forward to the session. Thanks very much.
Martha Fiehn (PhD at UCSD):
Great. Alvise.
Alvise Scarabosio (Research Fellow at SIEPR):
Hi everybody. I'm Alvise. I am currently a pre-doc at Stanford working for Professor Grant Miller at SIEPR. Before this I did my undergrad in the Netherlands at Erasmus University, and then I did a master's in development economics at Yale. I started working for Professor Grant Miller about a year ago and I'm entering my second year working with him.
Martha Fiehn (PhD at UCSD):
Awesome. Ashley.
Ashley Pandya (Research Professional, Booth):
Hey everybody, my name is Ashley. I am a first year, so I only started a few months ago at Chicago Booth, and I recently completed my undergrad at UC Berkeley in econ.
Martha Fiehn (PhD at UCSD):
And Vaishali.
Vaishali Garga (Principal Economist, Boston Fed):
Hi everyone, I'm Vaishali Garga. I'm a principal economist in the research department at the Boston Fed. I started here in 2019 and I've worked with multiple RAs as well, not as many as Eric, and I can tell you a bit more: we used to have a one-on-one system with RAs, and now we've moved to a pool system. So that increases the range of RAs that I get to work with.
Martha Fiehn (PhD at UCSD):
Great. And then I'll introduce myself as well. My name is Martha Fiehn. I'm a PhD student at UC San Diego. I did my undergrad in economics at UC Berkeley and a master's in international economic policy in Paris. Then I worked for Stephanie Che at Harvard for two years and William Nordhaus at Yale for a year. The way this is going to work: I have some questions for the PIs, some questions for the pre-docs, and some that work for both. Whoever feels inclined or inspired to answer can answer first; if not, I'll call on somebody, or we can see how the flow works best. The first question is for the PIs, and we're curious: what are the principles, skills, qualities, and prior experience that you're looking for in applicants?
Vaishali Garga (Principal Economist, Boston Fed):
I could start
Eric Zwick (Professor, Chicago Booth):
Sorry, I guess we didn't talk about the order. Go ahead.
Vaishali Garga (Principal Economist, Boston Fed):
So at the Boston Fed, I can break it down into a few categories. First and foremost, we look for RAs with passion and interest, who show some initiative. Examples of this are: you read econ papers, you're thinking of ideas, or you worked on a thesis project in the past. All of those show some interest. Then we look at technical skills, and these include skills with econometric tools or methods and with handling data sets, large data sets, small data sets, public data sets, whether you've cleaned or merged these, or whether you've worked on research projects, again as part of your thesis, with some professors, or at some institute as an RA. Those are all very useful. And thirdly, we look at the ability to work collaboratively, because that's very important for us here at the Fed: you will always work with an economist, so you at least need those one-on-one collaboration skills, and often projects are joint with other economists who also have RAs. So you work in a team, and we need people who have shown evidence of that or are inclined to engage in that kind of interaction.
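To make the data-handling skills described above concrete, here is a minimal sketch of the kind of cleaning-and-merging step a typical pre-doc data task might involve. It uses Python with pandas; the file names, columns, and cleaning rules are hypothetical and chosen purely for illustration, not drawn from any actual Fed or lab task.

```python
# Illustrative only: file names, columns, and cleaning rules are hypothetical.
import pandas as pd

# Load two (hypothetical) public county-level datasets.
county_income = pd.read_csv("county_income.csv")        # columns: fips, year, median_income
county_unemp = pd.read_csv("county_unemployment.csv")   # columns: fips, year, unemp_rate

# Basic cleaning: standardize identifiers and drop rows with missing keys.
for df in (county_income, county_unemp):
    df["fips"] = df["fips"].astype(str).str.zfill(5)     # pad FIPS codes to 5 digits
    df.dropna(subset=["fips", "year"], inplace=True)

# Merge on county-year, keeping track of unmatched rows as a sanity check.
merged = county_income.merge(
    county_unemp, on=["fips", "year"], how="outer", indicator=True
)
print(merged["_merge"].value_counts())                    # how many rows matched?

# Keep the matched panel for analysis.
panel = merged[merged["_merge"] == "both"].drop(columns="_merge")
panel.to_csv("county_panel.csv", index=False)
```

The merge indicator and the printed match counts are the kind of sanity check interviewers often probe for when they ask about past data work.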
Eric Zwick (Professor, Chicago Booth):
On my end, there's a lot of overlap, a lot of correlation there, but technical background is definitely something we focus on: lots of math, taking hard classes and doing well, although you don't have to do perfectly. We tend to prefer a B+ in a really hard math class to an A+ in some less relevant class, because it demonstrates that you're used to working on hard problems and getting better at doing that. We also look for some CS and programming, because our lab uses a lot of large data, so some background there is helpful but not necessary. And then on the softer side, a little bit of evidence of passion and leadership, some markers of hard work, and the kinds of activities or clubs or other things you've done aside from just taking classes, as a way of thinking about independent work habits, discipline, and so on. Those things are hard to evaluate, but they're other things that we look for.
Martha Fiehn (PhD at UCSD):
Great, thank you. Both of you mentioned passion, and that leads into a question for all panelists: how important is fit in research interests between the PI and the pre-doc? For instance, I knew I wanted to work on tax policy, and I continue to want to work on tax policy, so I was very targeted in my approach, but not everybody is in that position. So how do you think about the passion? How do you think about the fit? Why don't we start with Alvise?
Alvise Scarabosio (Research Fellow at SIEPR):
My impression, seen from the applicant side, is that fit is really important, because compared to, for instance, PhD applications, where whoever is doing admissions selects a multitude of people, the PI is choosing just one person. So they will want someone who is a very good fit, because they'll work one-on-one for two years. In general, when I was applying, I was looking for people I could see myself working with one-on-one for two years and whose research appealed to me: I looked at their recent papers or the kind of research they had done in the past, and I was trying to find projects where I thought, okay, these look very cool to me, I could be working on these for two years. I wouldn't recommend just sending applications to any available spot, because most of these positions are very competitive, and the PI will probably be looking for someone who's aligned with their research interests. If you've never even thought about the problems they're researching, I don't think you'd have a good chance, and you're better off focusing on the positions that are very aligned with your interests.
Martha Fiehn (PhD at UCSD):
Great, thank you. Ashley, do you want to talk about your experience?
Ashley Pandya (Research Professional, Booth):
Yeah, definitely. I can speak to it a little bit too, because this is fresh in my mind: it was only about a year ago that I was doing my pre-doc applications. I came to it from the position that the RA work I had done in undergrad was one form of economic research, the thesis I had done was entirely different, and by the time it came to applying to pre-docs a few months later, it wasn't explicitly clear to me which specific subfield of econ I was most interested in. I just knew that I wanted to do econ research. So the pre-docs I applied to ended up spanning disciplines of all sorts, but, as Alvise was mentioning, there had to be something about the methodology they were using or about the research questions they generally touch on that was interesting to me.

I think that passion came across in the interviews and in the cover letters that I wrote, things like that. So I definitely think that having passion for specifically what the researcher is doing does come across, especially if you make an effort to make it come across explicitly. But I will say, and I'm speaking just as somebody who was an applicant, not a PI: if there is a position you're interested in applying to and you're thinking, oh no, I don't actually have that much experience doing research in, say, tax policy, I would not self-select out. I definitely don't think that's a non-starter. If it does interest you and you can generally tie it to your past research experience, I personally say definitely go for it.
Martha Fiehn (PhD at UCSD):
Great. And Eric, do you want to talk about how you evaluate candidates who, for instance, don't have experience in a given subfield, and how you think about whether they're still a good fit?
Eric Zwick (Professor, Chicago Booth):
Yeah, we can take turns; we can alternate who goes first. So definitely, I work on pretty nerdy tax stuff a lot of the time, and if I restricted the people I worked with to people who had experience or interest in that prior to the pre-doc, there would be very few people I'd be looking at. So I definitely don't focus on a really tight fit based on background. It's more a general openness to applied micro, using those types of tools, doing some data work, maybe a bit of theory, the broad types of things I expect the person to be doing; I want them to be looking for a job that provides those. So not a pure theorist, not a macro calibration type of person. But a background that demonstrates some skills and some interests that are broadly adjacent to public policy and economics research is enough, and those people have worked really well. In our interview process we have a period where they read one of our papers and we ask them some questions about it. For folks who seem to enjoy that, it comes through in the interview, and that helps a little bit even if they've had no prior experience.
Vaishali Garga (Principal Economist, Boston Fed):
My thoughts on this resonate strongly with Eric's. I work in monetary economics, and some of my work is super theoretical, so no undergrad is going to have that kind of deep passion for or background in monetary economics. So again, general passion for the field and having some skills or tools that overlap across fields, I think that's very useful. One other thing at the Fed, as I was mentioning in the introduction, is that we now do a pool system for RAs. When you apply, you apply broadly to the research department, which is broken down into different groups, and then you're placed within a group; each group has five or six economists, and each of us works on different topics. So we don't really need someone who's particularly tied to what I do, for example, because my colleague might work on something different. In that sense, the Fed may be a good opportunity for someone who's less sure. Martha, you were saying you were super sure about tax policy, and then it makes total sense that you match with a professor who works on that. But if someone's less sure, at the Fed you can cast a wider net, and it can also help give you clarity on what you want to eventually do after the pre-doc position.
Martha Fiehn (PhD at UCSD):
That makes sense. Thank you. And I do think there are different mechanisms at different institutions. I know at Yale they also have more of a pooling system and are very organized; at Harvard there's less of one. So as applicants, that's something you can inform yourselves about a little bit, and try to reach out to see if there are pre-doc coordinators, like Steven, a super helpful resource for anything at UChicago. But generally, it's just something to think about. Now we have a question for the pre-docs, and we'll start with Ashley: what motivated you to look for pre-doc positions?
Ashley Pandya (Research Professional, Booth):
Yeah, definitely. To give a quick outline of my background, I graduated from undergrad last spring, and during that last year of undergrad I hadn't thought for a single second about applying for pre-docs. My entire focus was on applying to industry positions, corporate, maybe even public policy, but primarily industry. The idea of doing a PhD was, in my head, something that maybe could happen in the future, but I hadn't put that much focus on it. So if that's you somewhere in the audience, I resonate with that. It just so happened that I had gotten a position during that last year of undergrad, and that position fell through right as I was graduating. So I was stuck with a few months during the summer trying to figure out: okay, do I continue to apply for other industry positions?

It was kind of a bad time for it; nobody around me was really getting anything, and it's also just a pain, you feel like you're shouting into the void. Or do I want to pivot to something else? Funnily enough, I was complaining to one of my former GSIs at Cal, she's lovely, just literally complaining: oh my God, I'm applying to everything and getting nothing, I'm applying to things that I don't even really want to do and I'm still not getting them, and that's super frustrating. And she said, well, Ashley, I know that you had expressed interest in doing a PhD at some point; did you consider maybe using this cycle to apply to a pre-doc? At that time I knew that pre-docs existed, but I didn't realize that they had taken on this sort of newfound position of being a necessary, though not necessarily sufficient, condition for getting into a PhD.

So I took that August timeframe basically to start applying. And I noticed almost instantly that there was a step change in my mood in applying to the pre-docs compared to applying to any of the industry positions. I was sincerely, and in a simple, nerdy way, enjoying getting these data tasks and sitting down to do my homework. I was now going through interviews with people who seemed genuinely interested in talking to me and about my research interests. And it was that difference, oh, I can feel myself really enjoying this long application process more than I ever had before, that told me: okay, this is probably what I'm going to want to do.
Martha Fiehn (PhD at UCSD):
So Ashley, it sounds like you have more of a domestic US experience. Alvise, maybe you can talk about how it is from the international lens, especially since you were at Yale doing your master's as well.
Alvise Scarabosio (Research Fellow at SIEPR):
Yeah, exactly. I started my master's at Yale and went into it not sure whether I wanted to do a PhD or not; I was still considering, oh, maybe I'll want to work in policy, things like that. Then, about one month into being at Yale, I thought, oh, I really like research. I had been going to seminars almost daily and felt I could see myself doing this in the future, but I didn't quite feel ready to apply for PhDs: I had just started my master's, PhD applications were due two months later, and I had none of my materials ready. So I pivoted and said, okay, let me look for something I can do to bridge from my master's to my PhD. Around September or October I started looking at pre-doc positions and thought that could be a good opportunity to actually figure things out, and to change my mind in case I didn't end up liking it. I thought it would be a good fit to go from the master's and get all my materials ready to apply for a PhD later on. So yeah, just like Ashley said, I wasn't really sure at the beginning, but I really enjoyed the process, and going through professors' pages I found so many people doing cool work and thought, yeah, I could see myself working with a lot of them. It ended up being a pretty fun process.
Martha Fiehn (PhD at UCSD):
Great, thank you. So let's get back to the details. There are a few different components of an application. For instance, with a PhD application, an advisor told me that the statement of purpose is basically there to see if you're literate: they don't really care what you say in there. They might use it to match you with the professors you mentioned on visit day, but it's not something you should focus a whole lot of energy on for PhD applications. That's at least the advice I received. So I'd like to hear from Vaishali and Eric. Eric already mentioned that you're looking for really strong math classes, but how much weight do you put on each part of the application, the cover letter, resume, coding task, writing sample, transcript, references, interview? Which do you think applicants should prioritize the most? I think we'll start with Vaishali first this time.
Vaishali Garga (Principal Economist, Boston Fed):
So we do the process in two stages. In the first stage we just look at the written material, and in the second stage there's an interview; there is no coding test at the Boston Fed anymore. In the first stage, the materials we ask for are the resume, the cover letter, the transcript, a writing sample, and letters of recommendation, at least two. It's difficult for me to quantify each of these, but I would say that the resume, the letters, and the writing sample are probably the most important, given the skills I was highlighting in the first question you asked, Martha, because all of those tell us about your passion, your research interests, your technical skills, your background. With transcripts it's a little bit difficult to answer this, because different people read them differently, but I would say on average we don't let someone fall through the cracks just because their transcript isn't as strong.

So let's say you didn't go to a top school, or you didn't have a very good grade in one semester in one math course. Math courses are important for us too, but I think we don't weigh that too heavily. And this is because for us at the Fed, given our public service mission, mentorship is also very important, so we are also looking to mentor candidates who don't come from the strongest background. One thing I will say: if there is some sort of gap in your CV, or you had a bad grade you want to explain to us, then the cover letter is a useful place to convey that. If the cover letter is just going to be a summary of the resume, it's less important, because we are looking at your resume anyway. So I would spend the least time on the cover letter, unless you're trying to convey information that's not in the resume.
Eric Zwick (Professor, Chicago Booth):
Yeah, I would agree with everything Vaishali said. When I was ordering things, I put the resume and transcript first, then the interview, then the letters and writing sample, though those are sort of in a third bucket together. The coding task comes last: a bottom-quartile coding task is informative, it shows that somebody doesn't really care, but beyond that it becomes less informative. And we'll talk about the AI stuff later, I guess. On the writing sample: this is supposed to be the best work you've done in college or your master's, and if it's coherent, interesting, creative, that really is distinctive. If it's underwhelming or unfinished, or has lots of typos or something, I think that's a bad sign.

The interview we'll talk about a little more in a bit, when we get to mistakes, but I think presentation and preparation for the interview are quite informative, and the live interview is just really valuable. You can't put too much weight on it, it's only 30 minutes, or maybe two rounds of 30 minutes, but a lot of people who look great on paper just have trouble answering questions, and that tends to indicate something. And then on the transcript: it doesn't have to be perfect. The people with really perfect transcripts often just go straight into grad school, and the pre-doc is kind of there to supplement the transcript or where they're coming from. I mean, I was a pre-doc coming from a liberal arts college, not a top research university, so I used the pre-doc as a way to get more research exposure prior to grad school. We also encourage folks to take classes, and we help cover the cost of some of that if there are gaps in the transcript, so the pre-doc is an opportunity to fill those gaps. But if there's no math in the transcript at all, it can be hard to overcome that in a two-year pre-doc by the time it comes to apply for grad school. So there are limits to how big those gaps can be.
Vaishali Garga (Principal Economist, Boston Fed):
Sorry, I just want to add to Eric's comment about the interview. For us, one thing that's different is that your written material is sometimes only looked at by a couple of people on the committee, and then in the interview stage you're interviewed by everyone on the committee. So in that sense the interview is actually quite crucial: you have your foot in the door because a couple of people looked at your written application materials, but now, with the interview, you really need to come through for everyone on the committee. And like Eric said, some people look great on paper but then struggle a bit in the interview, and we will come to that, but that can be quite bad in terms of making it beyond that point. And yes, the Fed also has a very good program for sponsoring courses at institutions in Boston. But again, as Eric said, unless you have some fundamental knowledge on the topic, it's difficult to do advanced courses and have them be useful.
Martha Fiehn (PhD at UCSD):
Yeah, that makes sense. Since we're already on the topic of interviews, let's hear from Alvise and Ashley about how you prepared for the interviews and what kinds of questions you were asked, and maybe Eric and Vaishali can then expand a little bit on common mistakes you've seen applicants make during any part of the process. I think Alvise first this time.
Alvise Scarabosio (Research Fellow at SIEPR):
Sure. I speak from a small sample of interviews, maybe four or five, but even in this small sample I felt there was a lot of heterogeneity in the type of questions I got. So it's hard to just say, oh, you'll be asked these kinds of things. Some professors asked me about econometrics and tried to test my understanding of it, while others, especially the big labs, felt more like they trusted the process that had brought me to the interview: they thought, okay, this guy is probably qualified, let me see if he's a good fit for me and whether we'd get along working together for two years. So there's variety in that sense. I've also heard of people being asked coding questions, or doing a sort of live coding task where they're asked, can you do this and this?

That was not my experience. In general, I would really make sure you can talk about yourself and can describe everything you wrote in your cover letter, because in all my interviews they asked things like: you wrote that you've done this, you worked as a research assistant here, how was your experience, what did you do? Or they asked about my interests and passions and what I was doing in my master's. But I would also recommend being prepared in econometrics especially: in almost every interview I got a question about some method or how I would approach a problem, things like that.
Ashley Pandya (Research Professional, Booth):
Yes, sorry. I will co-sign that. The approach I noticed my interviewers taking as I went through the interviews was the same mentality I ended up adopting: if I had gotten to that point, then I was qualified enough to have gotten to that point. Most of the questions I was asked were fairly standard, asking me about my research interests, my research experience, some quantitative experience questions, and a little bit about parts of my background that didn't have to do with research but maybe had to do with coding or data analysis. There were no gotcha questions. I never went into an interview feeling like I wasn't prepared or didn't know what they were going to ask, nor did I ever leave feeling, oh my God, that wasn't what I expected at all, that was actually quite scary. Nothing like that at all.

Everybody there is just trying to get to know you, at least in my experience. Maybe some of the most technical questions I got had to do with econometrics, on the order of the difficulty of what you would already have done in the data task. So again, by the time you get to the interview, just focus on realizing that you are there because they want you to be there; you're qualified enough to be there, and they know that. Know what your research experience is, know why you're there specifically: are you interested in their research, their methodologies? Why is this a one-to-one fit, rather than you just applying to pre-docs in general? I think that helps build the connection better, and it also helps you understand: why am I applying to this one, or am I just applying to pre-docs in general?
Martha Fiehn (PhD at UCSD):
Yeah, that makes sense. And I do think the application process can be really informative in helping you figure out what you're interested in. I know for me, at one point there was a data task and I was already bored by it, and I thought, wow, if I'm already bored by the data task, I probably shouldn't continue to apply for this position. So I withdrew my application, and I think that was a good choice. So let's talk about some of the common mistakes you've seen applicants make, Eric and Vaishali, and then maybe Alvise and Ashley can talk about anything you regret or anything you wish you'd done differently. After that we'll talk about the use of AI during the application process, but you can tie those in if you like. Eric, go ahead.
Eric Zwick (Professor, Chicago Booth):
Sure. I'll focus mostly on the interview, but first, on the submitted materials: if it looks like they haven't been edited or even looked at by human eyes before being submitted, just pasted in and messy, typos, basic stuff, that's not a good sign of being careful and attentive to the work product you're submitting. In the coding task, for instance, we'll look at the writeup a bit, and for some folks it just looks sloppy and poorly organized. In the interview, you can err on the side of a lack of enthusiasm, going through the motions, like, hum, this is just another interview, why am I even here? You want to demonstrate that you're actually enthusiastic or excited to have this job. But you can also be a little too excited and not stop talking.

We only have 20 or 30 minutes to go through three, four, or five questions, and multiple people want to ask questions. If you talk uninterrupted for 10 minutes, that's a mistake, because we can always ask you to expand, but it's hard to interrupt, especially on Zoom, and say, please contract, please stop talking. So be organized; shorter answers with follow-ups are better than longer ones, and maybe practice answers a bit beforehand on some of these questions. I like to use the writing sample here, or whatever the person was working on before, and ask them to go through the economics a little: the why, the weaknesses, what was the key question you were asking, how did you try to answer it, what were the limitations of what you did? I try to get at their understanding, and only if they have nothing like that in their background will I start to think about some basic Econ 101-type questions I can ask. Even if somebody doesn't get the answer immediately, we'll start to reason through it, and I'll see whether they get really stressed by that or can comfortably reason through it even without getting the answer. Not getting it is not the kiss of death at all.
Vaishali Garga (Principal Economist, Boston Fed):
Great. I had a bunch of those points on my list too, so I'll just focus on the ones that don't overlap. Like Alvise was saying, we also give candidates a lot of room to guide the conversation. The interview will often open with: tell us about yourself, tell us about some experience with a dataset in the past, or some research in the past. And from there we go deeper. Here I think candidates sometimes make the mistake of picking research projects that they think are trendier topic-wise but that they haven't contributed to deeply, or something that they think reflects technical skills, but then when we delve deeper it becomes clear that they weren't really involved in the technical part, for example. So really use the open-ended questions as an opportunity to guide the interviewer toward a project that you're genuinely passionate about, that you've contributed to meaningfully, and that you can answer questions about. This can very well be your writing sample; it can be something from your resume. You don't have to deviate from what you've already given us.

The other thing, in terms of enthusiasm: sometimes we ask a question and the interviewee doesn't know the answer. It's okay to say, I don't know, but I'm happy to try to reason through this with you, and start from there. Sometimes they'll give a wrong answer, or a very long answer that just seems evasive, hoping we get lost and don't catch on. But there are many people on the interview committee; we will likely catch on. Some people use jargon to try to impress, and then when you delve deeper they truly don't know, and maybe the jargon wasn't necessary anyway. So I would really emphasize that we are looking for someone with good economic intuition.

You may be the most technically gifted person, but ultimately half of research is communication, and especially so in policy work. If you go to our website and look at our Current Policy Perspectives issues, these are our policy briefs, you'll see how much they cater to an average person without an econ PhD. We really want you to approach the interview that way: be intuitive and be simple if you can. Then two additional points where applicants can come across as disinterested: taking too long to respond to interview requests, or not showing flexibility in when you're available to interview. We understand some of these fall around exam time, but some candidates will say, oh, I can't meet the entire week. We also know the exam schedule, so if you can't meet the entire week, it suggests you're maybe not as flexible or not as interested, and someone who shows more interest will probably get selected before you, because this is on a rolling basis. The final thing I'll say, and this is not specific to the Fed interview, is that being humble in interviews is quite important. It's okay to not know things; it's okay to know the limits of what you know. Approach it with the mentality that you're trying to find someone you would want to work with for the next couple of years, and someone who would want to work with you for the next couple of years. Those are my additional points.
Martha Fiehn (PhD at UCSD):
Great, thank you. That actually happened to me at an interview right after undergrad, at a workforce development nonprofit called Jewish Vocational Service. I thought I had communicated clearly that I wanted to take the job, but I didn't respond, they got nervous that I didn't want it, and we had to go through the whole process again. So it definitely makes sense to be as responsive and clear as possible: yes, I definitely want this position, I definitely want this interview. That reflects my experience in industry as well. Ashley and Alvise, do you want to add something about regrets or mistakes? They don't have to be your own; maybe they're from other applicants or from your friends, common pitfalls, anything along those lines. Starting with Ashley this time.
Ashley Pandya (Research Professional, Booth):
Sure. Since we're talking about interviews, I would absolutely co-sign what Vaishali was saying toward the end there, especially that the framing for these interviews, and maybe most job interviews in general, is that the interviewer is coming from the position of: can I work with this person, and do I want to work with this person? So when technical questions get thrown at you, or even questions about your background that maybe you're not super comfortable with, are you able to work through them and reason through them with the interviewer? If there is a technical question they ask and you genuinely are a little confused, start asking questions about it: oh, do you mean in this instance, under these assumptions, or do you mean that? I think that maybe shows even more about you, that even when you're confused about something or exploring uncharted territory, you can work through it.

Can you troubleshoot with them? Ultimately, the job as a pre-doc is going to be troubleshooting with the PIs to some degree, rather than knowing everything, because at some point that well is going to run dry. So I think working through things like that makes for very helpful answers. I don't know that I have any strict regrets about my pre-doc application process. If anything, I could have prepared myself a little better had I known earlier that I was going to do a pre-doc: shoring up my research interests, for example. I don't have the most solid quantitative background, and if I'd known earlier that I wanted to do research, I would have focused on that. But obviously that's in the past; those are things I wouldn't have been able to change anyway.

Something I had begun to do in the application process, and stopped partway through, was what we spoke about earlier: I was just applying to every pre-doc to a certain degree, because at some point it's like, okay, I know how to write a cover letter, my resume is what it is, my writing sample is what it is, these are all very similar in terms of the initial push, so I'm just going to apply to whatever. Then I actually started getting a lot of data tasks and interviews, and that not only took a lot of time, but I realized in some of those interviews, ooh, I'm actually having a difficult time justifying why I'm in this interview specifically, because I don't even know why I'm in this interview specifically. I'm not super familiar with their research, not super passionate about it, and I could feel it coming through.

At the same time, toward the beginning I also self-selected out of positions that I didn't think I could get, thinking, why waste my time on those? And when I later started applying to some of them, some of those positions that seemed further out of reach ended up coming through. So I would say: take a little more time than you would think intuitively to apply to positions that you think would be a good fit, or to figure out what a good fit would even look like for you. If anything, not doing that earlier might have been one of my regrets.
Alvise Scarabosio (Research Fellow at SIEPR):
Yeah, I'll keep it short, but when Vaishali was describing some mistakes, I remembered myself writing my cover letter about a research assistant job I had, where I wrote, oh, I've done this very technical analysis, and so on. Then in my first interview they asked about it, and what I had actually done was really just running the code for it, and I had no clue. I think that reflected very poorly on me, and I remember finishing that interview and deleting two sentences from my cover letter. In other interviews, where they asked me about the thesis I had done and things like that, which I knew very well, I think that came across: I actually knew what I was talking about and could show that I understood the econometrics and the methodology and had it clear in my mind. So don't fill your cover letter with things that maybe you've only seen once.
Martha Fiehn (PhD at UCSD):
There we go. Great. So, for the sake of time, the question I've actually been most curious about: how are you thinking about AI in the application process and, if you'd like, even during the pre-doc? Then we'll move on to the Q&A. Okay, let's see; I think Eric is maybe first, but in any case all panelists are welcome to answer.
Eric Zwick (Professor, Chicago Booth):
So I didn't ask AI what to say, just to be clear; I wrote some bullet points myself, with my fingers. We put a lot of weight on live interviews before AI or ChatGPT, and I think that weight will just increase a bit, that plus the transcript, some of the hard stuff. But that's not to say that people shouldn't be using these tools; it's okay to use them. I think we need to figure out how to use them well to produce work product that we're proud of, and I think we as a society are still figuring that out. It's not just about saving time; it's about whether you can actually produce something with quality using these tools. I find that they're more useful as better search engines than the previous search engines: instead of going into Stack Exchange or whatever, they're pretty good at answering the questions I would previously have used Stack Exchange to answer. But when it's "solve this coding task," they're going to spit out some code that doesn't run or doesn't do the right thing.

And I think that's still kind of the case, not always, but often. So we'll probably test some of the code to make sure it runs, and we'll probably do some correlation across applications to see if people are just feeding the task into a chatbot and spitting back out the same answer. If we get a bunch of applications with the same answer as everybody else, you have to think about the equilibrium where everybody is doing this: you'll get caught by that. And it's evidence that you're not really spending much time on the job application or taking it very seriously in the first place. So when you're asked to do something that the AI tools can't do, which is going to be most of the job, it's probably not a good sign if you just want to default to them. I don't know if we know for sure where we'll settle, but that's what we're thinking about.
Vaishali Garga (Principal Economist, Boston Fed):
I will also say that using them as complements is fine; we are trying to use them in research, we are trying to use them in our lives as complements, but don't use them as substitutes, because currently I think you can tell from reading whether something was written a hundred percent by AI or not. Also, given that we don't do a coding test, the use of AI in the application is quite limited at this point for us. Maybe you're using it in your writing sample, but we are going to ask you about that in the interview anyway, so the gaps will come through there. And Ashley mentioned this point: the incremental cost of each application is actually quite low, you prepare the same material and send it out.

The place where you personalize, I guess, is the cover letter, and there you really do need to do your research. So I would say using them as complements, to check for typos, improve the flow of what you're giving us, shorten bullets in your resume, all of that is fine. We're not formally thinking about this yet; our recruiting committee meetings for the next cycle have just started, and the question of AI in applications has not really come up. So I don't have any insights to offer from there, but I'm actually curious to take it back to my colleagues and see if any of them are thinking more deeply about this yet. One other thing I'll say: because of the sensitivity of the work you do at the Fed, especially when you're doing policy work, we are not allowed to use ChatGPT on our work computers; it's banned. So it's good if you come in knowing how to code and do the things you would typically rely on ChatGPT for. Otherwise you will probably be quite miserable after you come in.
Martha Fiehn (PhD at UCSD):
That makes sense. Alvise or Ashley, is there anything you'd like to add? If not, we'll move on to the Q&A. There are a few questions here about international applicants, and Vaishali, it would be interesting to hear from you in particular. Someone says that the Chicago Fed pre-doc is only open to US citizens and permanent residents; is that true for Fed pre-docs? There's also a question about whether, all else equal, there is a preference for US-native applicants. And in general, especially given the difficulties international applicants might currently be facing, do you have any thoughts from this last application cycle or in general...
Vaishali Garga (Principal Economist, Boston Fed):
So there is unfortunately a restriction, I think across the Fed, and I can speak for the Boston Fed: we're only allowed to hire US citizens. I don't even think we can hire permanent residents, but I'm not a hundred percent sure about that one; it should say on the application page. I believe the reason for this is that every time you hire an international person at the Fed, you have to make a case for why a US citizen could not be hired for the same job, and this case is harder to make at the undergrad econ level, given how many American citizens show interest in these positions. So that's just a binding federal rule. I was an international student myself, so I sympathize with whoever is unable to apply to us because of this, but it's something we have no control over.
Martha Fiehn (PhD at UCSD):
Eric, do you want to talk about it? And maybe also, I know some positions require it when there's sensitive administrative data, for instance.
Eric Zwick (Professor, Chicago Booth):
Yeah, so other than when folks are explicitly working with data that has these kinds of requirements, and some of my work that uses admin tax data does, in which case there's a process people have to go through to submit the relevant information to be eligible for that job, most of the pre-docs at Booth don't have those rules. I work with pre-docs in some of the research centers at Booth that I also supervise who are not working on that admin data stuff, and yes, I think they come from all over. But the policy is definitely above our pay grade to decide, so whatever the rules are today, I'm not sure; definitely read the job postings carefully to make sure you're eligible, given how the rules have been moving around.
Martha Fiehn (PhD at UCSD):
Great. There's a question here about cold emailing: if there's no official posting, is it still valuable to reach out to professors at a research university asking if they have anything? I'll answer this, and then you can all say what you think. Yes, absolutely. I did a bunch of cold applications and got a bunch of affirmative responses, and some non-responses, but I had a really high success rate with it. I would say you do want to be careful about match: if you can demonstrate already in the email, which should not be too long, why you're a good candidate, then they might take a look. I know for one of my PIs, I was the one who screened the incoming emails and forwarded them to my PI if I thought, hey, this is someone maybe worth looking at. Alvise and Ashley, and then Eric and Vaishali, in that order: do you want to talk about cold emailing?
Alvise Scarabosio (Research Fellow at SIEPR):
I don't think I've ever done it, so maybe I'll just leave the floor to everybody else.
Ashley Pandya (Research Professional, Booth):
I also don't think that I cold emailed for pre-doc applications specifically, primarily because there were already a lot of open positions readily available on predoc.org, so it didn't feel like I was missing out on anything. I can speak to this, though: I ended up cold reaching out to other pre-docs to ask about their experience. I did that through LinkedIn and some network connections, and I would say it is extremely helpful. Folks are actually willing to give a little more of a candid response about working with their PI, working in that environment, or working as a pre-doc in general than you might expect. So if there is somebody working at an institution or in a field that you are interested in joining, I definitely say reach out to them. Folks are generally quite warm and willing to share their experience.
Eric Zwick (Professor, Chicago Booth):
I probably get, for better or worse, a couple of these a week, so maybe a hundred or more a year, from undergrads, international students, or potential pre-docs. It's pretty hard to process all of that. The fraction I tend to actually look at are the ones from the Chicago community, to see if there's any part-time work I can give them if there's a really good fit, and I'll basically just look at the resume. A really long email that says, I read your paper, it was really interesting, is low value, or maybe even a negative signal, because usually there's no content in it; it's just, I really liked your paper titled X. And it's just hard; that's why we have these job postings, so we have a system for processing this information, because it's overwhelming to receive that much incoming email. So for cold emails, I don't know, probably less than a 1% success rate, but for somebody who already has some connection, within the academic community you're already in, I think that increases the success rate a lot. That of course creates an insider-outsider problem, but that's why we have this application process, to try to sift through the larger pool of potential people and find good matches.
Martha Fiehn (PhD at UCSD):
I imagine you don't get as many cold applications.
Vaishali Garga (Principal Economist, Boston Fed):
Yeah, no, I wish a hundred people were telling me, I read your paper titled X, I love it.
Eric Zwick (Professor, Chicago Booth):
No doubt.
Vaishali Garga (Principal Economist, Boston Fed):
I don't get that. I do get a few of these, because we do outreach talks from the Fed: we'll go to universities and campuses and talk about the work at the Fed, and then people will ask, can we come and work with you, are there other opportunities? For the RA position, there's not much point in cold emailing: everything goes through the official process, and I have no ability to hire anyone independently outside of that process. But for people here who are interested in a shorter-term position, there is also an internship program. Those happen in the summer or in the winter, 12-week programs. It's a bit different from the RA program, but there you work one-on-one with an economist, and for that, maybe cold emailing could help. It's a different, short-term position, though: you work on one task, one project.
Martha Fiehn (PhD at UCSD):
Thank you. Okay, so we'll do one or two more questions. There are a couple in the Q&A asking, basically: why am I not hearing back from any positions? One says, I feel like my cover letter summarizes my skills and research properly, I use a generic cover letter for all applications with some minimal changes, but I've applied to 60 to 70 pre-docs so far and I couldn't even crack the first round. So, for applicants who have applied to many positions, and Ashley, you mentioned you applied to lots of industry positions and it was getting frustrating, what would you say? Maybe starting with Ashley.
Ashley Pandya (Research Professional, Booth):
Yeah, I sympathize. Can you actually repeat exactly what the question was? Is it what do I do?
Martha Fiehn (PhD at UCSD):
Yeah, sort of. Sorry, I was trying to synthesize a couple at once, but basically: why am I getting rejected, what am I doing wrong? We don't see this individual's profile per se, but perhaps some words of encouragement or general advice.
Ashley Pandya (Research Professional, Booth):
Yeah, I would definitely say that it is tough, and it is definitely not just you. Speaking with people who are my current coworkers, who were successful and are working as pre-docs: lots of us applied to a lot of positions, and lots of us didn't hear back from a lot of them. It takes a certain amount of grit, especially since right now the pool is pretty oversaturated. So it is not just you. To get a more individualized response, again, like I mentioned, reach out to people if you don't feel like you have a super robust advising program in your undergrad or other people you can go to who would have solid answers. Predoc.org is a really good resource. Ask any of us; feel free to reach out to me if you have any specific questions, and maybe I can answer them. I would also say, if you are applying this cycle, it is still very early: I think I started applying right around now, and I don't think I got a final offer until the end of November or something. So it is a decently long application process. And I'm remembering this now quite fondly: the first five responses I got, maybe a month after I started my initial application process, were rejection, rejection, rejection, including two positions that I really wanted. So it'll get better at some point, I promise.
Alvise Scarabosio (Research Fellow at SIEPR):
If I can add something to that: this is a pretty long game, in a way. A lot of the big labs have applications coming out right about now or in the fall, but some positions will open in May or so; some PIs find funding very late. In my experience, some of my master's classmates got a job very, very late. So don't worry too much: even if at the end of fall you don't have an offer, plenty of people are in the same position, and there are plenty of opportunities that keep coming out through the spring. In my experience here at SIEPR, I think we got something like 650 applications last year, and about 20 people came in this year, so it is highly competitive. In terms of what you can do about your materials: as Eric mentioned multiple times, make sure they're not sloppy, that you don't have typos and things like that, and make sure you're a good fit. Sometimes part of the application review is done by the pre-docs, and what I've seen is a lot of PIs telling them, if someone doesn't have the strongest transcript, make sure they really like what I do or show some interest in my field or my research. So personalizing your application to fit whoever you're applying to could be a good thing as well.
Martha Fiehn (PhD at UCSD):
Well, great. Thank you so much. We are out of time, but I really appreciate all the panelists coming in today and hope that it was a useful experience. So yeah, thank you so much. I'll hand it back over to Steven.
How to Succeed as a Pre-Doc
Alex (PhD at UC Berkeley):
All right, let's get started. To introduce myself quickly, I'm Alex, and I'm a second-year PhD student in the economics department at UC Berkeley. The goal of this second panel is to give some general tips and advice for how to succeed as a pre-doc, and to give you more information on what the day-to-day responsibilities of a pre-doc are and what that routine looks like, so you can have a better idea of what you're getting yourself into as you start to apply for these positions, hopefully with the great advice that Martha and the other panelists just gave you. So, hopefully we have all of our new panelists spotlighted here. Can you each go through and give us your name, title, institution, and any other information you want to share?
Samuel (PhD student, NYU):
I can go first. I'm Sam; you can call me Samuel if you want. I'm an incoming student at NYU Econ and currently a pre-doc at UChicago.
Adi (PhD student, Brown):
I'm Adi. I'm a second year at Brown. I did my pre-doc at Wharton, and I'm interested in development and behavioral economics.
Rodrigo (PhD, UC Berkeley):
Hi, I can go next. I'm Rodrigo. I'm a first-year PhD student, starting now, at Berkeley. I'm originally from Latin America, and I did my pre-doc in Switzerland under the supervision of David Donaldson.
Chris (Assistant Professor, Booth):
Hi everyone, I'm Chris Campos. I'm an assistant professor of economics at the Booth School of Business at the University of Chicago. I'm a labor economist and focus primarily on the economics of education.
Alex (PhD at UC Berkeley):
Awesome. Thank you, guys. I think one good place to start would be with the former pre-docs on the panel. Given that I think all of you came straight from a bachelor's or master's degree, what did it feel like to start a pre-doc coming out of school? Was it a big adjustment to go from being a student all day to having a job, one with more responsibilities than being a student, but still very much in the academic sphere, unlike the private sector? How did it feel to adjust to being a pre-doc coming right out of school?
Samuel (PhD student, NYU):
I can speak for myself: coming straight out of school, it felt really odd not to have a grade on my work; you have to be the one who decides how well you're doing at your job, which is annoying at first. But the good news is that not everything needs to be done to perfection; it would be unkind to yourself and inefficient for your project to be a hardcore perfectionist about every single thing. If you mess something up, the damage is not permanent; you don't get an F for the rest of your pre-doc, you can redo it, and mistakes happen all the time in research. And finally, you can meet with your PI and discuss how you're doing and how you can improve. Not everyone is going to provide constructive criticism right off the bat, so you want to make sure to demonstrate that you're coachable and humble. In summary, it was weird not to have grades at first, but it was important to grow in that way, and there are definitely things you can do to understand how well you're doing: talk to your PI.
Adi (PhD student, Brown):
Yeah, I agree with Sam. It was an interesting experience not to have as much feedback day to day. So figuring out how to keep yourself honest about how well you're doing and how much you're doing is a new challenge, as is keeping a nine-to-five schedule; it's a big adjustment, just as it is for every new job out of school, I'm sure. But figuring out how to keep that schedule, and communicating with your PIs, is probably the biggest adjustment: keeping them up to date on what you're doing, and making sure the project is moving forward at a rate that's not killing you but that still makes you an asset to the PI, is a challenge that is new to the job.
Rodrigo (PhD, UC Berkeley):
So from my experience, I didn't do the pre-doc right after college, but I can say that I would treat it pretty much as any other job. For me, there are two things that are super important when you enter the job market and start working, coming from school. First, you need to get out of the mindset that you're being graded all the time and that you're having to take exams all the time. So for example, if your PI or supervisor asks you, okay, what are we estimating in this situation and why this estimator, they're not grading you; they're not trying to give you an A or a B or a C. They just want to understand what you did, and you will also want to understand what you did as well.
I think that's the key difference. You have to stop feeling that you're being graded all the time and answer truthfully and to the best of your knowledge, build on that answer, and try to seek the truth of what you're doing. And as a second point, I would say you should always seek feedback. Since you're not being graded all the time, you don't know exactly how you're doing on a day-to-day basis, so it's always good to seek feedback. It doesn't have to be from your PI; it can be from your colleagues, from other pre-docs, from friends who know roughly what you're doing. Always try to get feedback on your work and your decisions. I think that's always super useful and it will help you the most. I would say the feedback that you get from peers and your supervisor will help you grow and understand things in a better way.
Alex (PhD at UC Berkeley):
Awesome. Thank you for all of that. So in terms of what being on the job is like and the daily routine, as we discussed in the last panel, one of the major things that PIs who supervise pre-docs are looking for is a variety of coding, econometrics, and technical skills. Do each of the current pre-docs want to talk a bit about what your daily workflow is like and what programming languages or tools you're using most often? And Chris, tell us if there are any particular technical skills that you really value in a pre-doc, and a bit about how you typically work with your pre-docs.
Chris (Assistant Professor, Booth):
I can go ahead and start in terms of skills. I mean I think most of the applicant pool that I come across, they have some pretty decent background in Python or R, so I think that's pretty standard now. Less so Stata. So I think that's a bit more varied. Economists still use Stata a lot, so if you do have some background with Stata, that's obviously great, but Python and R are probably the main things that I particularly look out for. And having taken some CS classes is usually helpful. But with that said, I don't entirely just screen on that. I think there's a lot of learning by doing on the job and not everyone has the opportunity to take CS courses in undergrad. They may have found the path to economics a bit late, so there's still a lot of scope to contribute even if you don't come in with a strong coding background.
So yeah, those are the skills. But besides just technical skills, I think the thing that stands out the most to me in interviews is when students or applicants can convey a strong eagerness to learn and to keep learning. So you have to be passionate about wanting to learn and solve problems. It's okay if you don't really have the coding skills; if you have that passion to learn and continuously improve your skillset, then you're going to be a great pre-doc. But if you don't have that passion, it's always going to feel like work: having to go learn this new skill becomes a tedious, maybe not-so-fun job. If you have a passion for figuring out problems and finding solutions, this is never going to feel like a job, and you're naturally going to be a good fit for this type of work.
Samuel (PhD student, NYU):
I can definitely echo what Chris is saying. Obviously have Python, Stata, or whatever the job description calls for ready to go, but more importantly, know how to solve problems in general. Be a good problem solver: learn how to navigate Stack Overflow, coding subreddits, or other online coding forums. ChatGPT is helpful, but really you need to be able to patiently read through forums, independently understand people's comments and advice, and then take the recommended solutions and apply them to your situation. This is a skill which many people, including me, need to work on. In my opinion, research is a cycle of getting stuck and then trying to get unstuck. So if you have that passion for learning and for wrestling with problems, you'll get familiar with that cycle, and you can make up for any kind of coding deficiency just by being a very passionate problem solver.
Rodrigo (PhD, UC Berkeley):
Yeah, I'm building on what Samuel said.
I would add that learning a language or a piece of software is going to be software specific. So whether you learn Python or Stata, that's going to help you in a general sense, but it's going to be specific to that program. Learning how to be a problem solver, on the other hand, is going to help you across all the software you might use; that's a generalized skill that goes beyond the specific software. And most of the time, the language you're going to be required to code in is going to depend on the PI who posted the job. In our case, or at least in my case, everything was built in Stata, so we used Stata a lot, and then we went to Python for web scraping or to R to run some simulations, but everything was mainly done in Stata. I would also add that a good skill to have is knowing how to use GitHub or GitLab, because nowadays research has to be reproducible, and the way you can do that is by storing all your code and files on GitHub or GitLab, which allows you to go back, review the history of your code, and track down possible errors and so on.
So I think that's also a good add-on to have. And at least in our recruiting process, we didn't make much of a focus on the language per se or on the coding skills per se. Of course there's a data task and a coding test, but what the test is really looking for is for the applicant to have super attention to detail and also to be aware of what they don't know. I think those two things, having super attention to detail and being aware of what you don't know, are things that are going to get you far in the pre-doc job, especially the second one, because of course PIs don't expect you to know everything they're going to ask you, but they would like to know where you stand. For example, if you're comfortable with something, then that's okay, but if you're not and you're unsure, you should always feel free to ask questions, and you should be aware of what you don't know and what you're being asked to do.
Alex (PhD at UC Berkeley):
Great. I totally agree with, sorry. Yeah, go ahead.
Adi (PhD student, Brown):
Sorry. Yeah, I agree with everything being said. I think when it comes to languages, it just depends on your PI. With one of my PIs, everything was in Stata, and with the other one, everything was in R and Python. So it just depends on what you're being asked for. Another skill that I think is very important, and which I underestimated at first, is making sure everything is organized and clean and that your file structure is always as clean as possible, because you have to remember this is one of eight or maybe more projects that your PI is working on. A little confusion will take a lot of time out of their day that they don't want to have to deal with. So making sure everything's super organized and the file structure is clear is very good. And then also just making sure that you understand the economics behind what you're doing. If you know how to run regressions but you don't actually know what they're doing, that's not very helpful. So make sure that you're also being a good economist and not just a good coder.
Alex (PhD at UC Berkeley):
Yeah, I think that's all great advice. I particularly agree with everything that was said about how being a pre-doc involves a lot of learning by doing. I know in my two years as a pre-doc at Chicago Booth, over the first few months, and honestly the entire time, I was having to solve new problems and figure out how to code up things that I'd never tried before and didn't really understand until I had to start working with them very hands-on. And I think being willing to admit when you don't know what you're doing, and to learn and go search for new answers like Samuel mentioned, is really important. So now that we've talked a bit about hard skills, I think it's also worth covering what kinds of soft skills, communication, time management, et cetera, are important to having a good experience as a pre-doc. So Chris, maybe you can tell us a bit about your management style and how you interact with your pre-docs, and for everyone else, it would be helpful to hear what the other important elements of being a successful pre-doc are that you think are worth mentioning. Do you have any advice for managing up and building that relationship with your PI?
Chris (Assistant Professor, Booth):
Sure, I can go ahead and start. So one thing I make sure to tell all the pre-docs who have worked with me is that they don't teach you how to manage people when you get a PhD. So it's not something that we're all naturally good at, and it's definitely something that I've been evolving on. In terms of my management style, I think I've converged on something that has been working. I communicate with my pre-docs at a minimum three times a week, because we have three meetings. The first meeting is on Mondays, where the purpose is for each pre-doc to establish, on their own, what they think their goals for the week should be. And the reason why I put this on them is that a lot of students, when they start the pre-doc, have always been told what to do, here's your homework assignment, you have to do this, and they're not too good at working independently or guiding the direction of their work.
And it's something that you definitely have to be good at when you go to a PhD program and start working on your own research, so I thought it's a good idea for them to start practicing that. Obviously in the first few weeks or months they don't really know what their goals should be, so there's definitely a lot of direction I have to give them. But over time, as they adapt and are working on several projects, they know how to prioritize the different tasks that are forthcoming. That's usually what the Monday check-in is. And then we meet Tuesdays and Fridays, where we'll have a lengthier check-in where everyone will, for 20 minutes, talk about what they've worked on since we last checked in, clear out any doubts, and stress any issues that may have come up.
And then in between, we communicate via Slack. So I feel like I am actively communicating with them; I definitely Slack them at least once a day about something. So there is constant communication, and I do try to encourage them to feel comfortable reaching out and asking me questions, especially early on when we're getting to know each other and how each other works. And so I think that model is working. It's been working for a while, so I'll probably be keeping it. We also use Asana, which is where they maintain all of their tasks and all of the goals that they're striving to achieve in a given week, but also over the longer run. They also summarize what they've done at the end of the week. And I think the end-of-week summary is very helpful, just because in the PhD program there'll be periods of time where you're just working on your own research and maybe a month or two will pass, but you won't really have much to show for it.
What was very helpful to me in the PhD program was maintaining a research journal, where even though months may have passed and I don't really have much to show for them, the journal demonstrates that I've actually done a lot. Maybe there aren't any particular results that are encouraging or that I'm happy about, but there is a lot of human capital I've obtained along the way, and that research journal was very helpful to me. And so the analog to this is the end-of-week summary, where everyone reflects on what they worked on, what they achieved, what gets punted into next week, and things like that. And so over time the pre-docs learn to start directing how their attention should be devoted to different tasks. Obviously, if I need to intervene and say, actually, prioritize this, I'll do so, but that's the general flow that I've been following for the past year and a half or so.
Samuel (PhD student, NYU):
I can just say that the strength of a pre-doc's relationship with their PI is very much proportional to the communication. Provide clear updates on what's happened and what's coming next. Be honest about what's not working, because they can help you, and they can help you a lot more than you think. It doesn't matter if they don't look at the data as much as you do or code in the same language as you; you'd be surprised by how much they can help.
Adi (PhD student, Brown):
Yeah, I agree. Communicating is incredibly important. And I think another thing that is important about building that relationship with your PI is making sure your organization works, that you're organizing tasks in a way that makes sense. One skill I had to learn when I first started, because in undergrad I pretty much kept everything in my brain, I didn't really write things down, I didn't have a planner, is that that doesn't really work when you're a pre-doc. So, like Chris was saying, they're not professionals at managing people, so trying to communicate with them to build a system that works for both of you is a very important step. Try to figure it out early, and then keep communicating about problems that come up, or even just about your research and your ideas. I think PIs are looking for people who can contribute intellectually and improve papers, not just get them done, but improve them in some way. So working to do that, and having the confidence to do that, are both very important for building the relationship.
Alex (PhD at UC Berkeley):
Cool. Rodrigo, anything to add or should we move on?
Rodrigo (PhD, UC Berkeley):
No, I mean, I didn't want to be repetitive, but maybe one thing that I remember Dave told me on my first day at the job: you know how in industry there's this saying, fake it till you make it? I would say in academia and in this job, what he said is, don't fake it and you'll make it. Just never fake it, ever. Be open, as I said before, about things that aren't going well, and be open to being authentic and communicating that things aren't going well, so there's space to improve and work on it. For example, if you find a mistake in your code that you made a month ago, don't try to cover it up. Know beforehand that your PIs are going to be glad that you found the mistake and that you can fix it now, instead of you being scared about how you made that mistake and not saying anything about it. I think that's important. And then, this might be cliché advice, but just be honest and be yourself. I think people pick up on the authenticity of who you show you are, and that's going to help build your relationship with your PI and with your peers more and more.
Alex (PhD at UC Berkeley):
Yeah, I definitely had a lot of experiences in meetings with my PIs as a pre-doc where I had to admit, I set that thing up to run overnight last night but it didn't work, or, I found a bug in something I wrote two weeks ago. They were generally very receptive to that and said, look, we're glad you found it, tell us these things upfront, we just want to make sure that the final results end up being right one way or another. And I think being honest about what's going well and not going well from the start is very important to establishing good communication with your PI. So now that we've covered a few of the most important things for pre-docs and their day-to-day routine, is there anything we've missed? Pre-docs here, any other on-the-job skills that you think are really critical? And Chris, anything else that you're looking for that we haven't covered yet? What are the hallmarks of a pre-doc that make you really want to write that all-important recommendation letter for them?
Chris (Assistant Professor, Booth):
I guess I can go ahead and start. Yeah, it's very similar to what I started with, in that we'll quickly be able to pick up on how passionate you are about learning things and how curious you are about whatever it is you're working on. Really having that curiosity and that passion to learn is going to make you successful. The pre-docs who are passionate about what they're doing and don't view this as a job, as some transactional forty-hour-a-week thing, are usually the pre-docs we're going to be happiest to write a letter for, just because they naturally enjoy doing this type of work.
Adi (PhD student, Brown):
I think one thing that took a while for me to figure out is: don't be too nervous to show that you don't know something or don't understand something. You have to be willing to maybe look a little dumb sometimes for the sake of learning. And if you don't show this passion for learning, like Chris was saying, you'll never make a great impression and your letter won't be that great. I think what makes a good letter is showing that you are extremely passionate and curious about this and have the beginnings of interesting ideas. So you have to be willing to be wrong occasionally, or look maybe a little silly occasionally, so that you can learn from those mistakes, but also show that you have that passion for learning.
Alex (PhD at UC Berkeley):
Cool, that's super helpful. So for the current pre-docs, if there's anything you want to add on what surprised you about the pre-doc experience or what you struggled with the most, that would be helpful. If not, we can move on to a couple of other things about during the pre-doc and afterwards. Is there anything you would add on what challenges you faced or what obstacles you see pre-docs running into frequently?
Samuel (PhD student, NYU):
If I were to talk to the version of myself from two years ago, when I was just starting my pre-doc, I would just tell that new pre-doc to do your best to be confident in your abilities and not to be too hard on yourself when you make a mistake. You will mess something up, and that will feel embarrassing, but just do your best to be confident and let it go, and that will help you a lot.
Rodrigo (PhD, UC Berkeley):
So for me, I think the most challenging thing that I encountered during my pre-doc was this: we had a similar but slightly different setup. I was working on two projects, and on these two projects, every week there was a PI call in which I was responsible for presenting what I had done during the week, showing results, explaining what the results looked like, et cetera. Every time I had one of those calls, I was super nervous, because there were three or four super experienced, well-known PIs and professors on the line.
It was also different from, say, presenting a thesis, where you're the person who knows everything and knows the most about your project. This, of course, is not necessarily your project, so they know it better than you, but you have to be confident in what you're presenting, be confident that what you did was the right approach, justify why you chose that and not another thing, and be open to that conversation and to receiving feedback. For me, those kinds of presentations were the most challenging. At the beginning I was always super afraid, but that fear grew less and less over time, and at the end I became comfortable with it. But by the time I was super comfortable, my pre-doc was basically over, so I couldn't exploit that during the pre-doc, but I hope I can exploit it during grad school.
And finally, I would say that I had a great time during my pre-doc, and I'd recommend it to anyone who's looking to do a PhD in econ or other social sciences. It's a great experience, and the way I would describe it is that you get to play at being a researcher before you actually are one, because you get to work on a lot of cool things but you don't have the responsibility for those cool things. You mostly don't have to manage people, and you're not responsible for mistakes; you get to play first before you actually become a researcher. So I think that's cool. And at the end of the day, if you're deciding on, or are set on, doing grad school, the pre-doc was always kind of a middle step towards that.
But I would make sure that you enjoy that year or two years of your pre-doc, because it's two years of your youth. And don't stress too much. I would say if you do the job right from nine to six, or whatever your working hours are, it's probably going to be fine, so make sure to keep enjoying life outside of those working hours as well. And if you're in a program that has a lot of other pre-docs, take the chance to socialize with them and become friends with them, because those connections, especially once you finish the pre-doc and start diverging into other schools all around the world, are going to be important. You're going to have a lot of people within the profession all over the world. So that's kind of cool, and you start building connections super early on. I think that's an invaluable, usually hidden asset that pre-docs have.
Chris (Assistant Professor, Booth):
I'll add one thing on the presentations. This is a bit in the weeds, but when you do have to present in front of your PIs, or your PIs' co-authors, try to make some slides or something that can kind of guide the discussion. But most importantly, even if you can't create slides, look at the results before the meeting, and don't present anything that you made right before the meeting. Some of the most frustrating moments can be when someone sends you some stuff that they worked on, and you open up the document and the first number you look at is clearly wrong, which means there's some mistake. That usually happens when someone sent you something in a rush and didn't have a chance to look at the results, and it just ends up being a waste of time for everyone. So be comfortable saying, look, I wasn't able to get this in before the meeting. That's fine, and it's much better than discussing results that are probably going to be nonsense. That's my two cents on that.
Alex (PhD at UC Berkeley):
That was definitely something that took me some time to learn, to figure out how much information I should put in the memos or other meeting agendas that I was sending to my PIs to realize how much detail do I need to give you so that you can make an informed decision without overwhelming you with the information that's not super useful to you. And I think as a pre-doc coming straight out of school, that's something that took me a while to really calibrate for myself. So now that we've heard a lot about what the day-to-day workflow is like of being a pre-doc, I think it's also worth mentioning that there are a lot of side benefits that come with being a pre-doc, such as being able to take classes at your institution, go to seminars, get advice on grad school applications, et cetera. So would each of you want to say a bit about what those sort of side benefits and opportunities are like at your institution and how you might've taken advantage of them during your time as a pre-doc?
Adi (PhD student, Brown):
Yeah, I would say go to every seminar that you can, even if it's not what you think is your main interest. I think the main thing you get is learning what a good presentation looks like and what a bad presentation looks like, and learning that skill early is incredibly important for grad school. Already through my first year I've noticed how much it's helped, but it's also a great way to start thinking about where the frontier is and how you can build on it. So I definitely recommend going to as many seminars as you can fit. I mean, don't go to so many that you let your work slip, but three or four a week during lunch, something like that, you can probably do without losing too much. And then classes: I really valued taking, and even auditing, a couple of field courses, or filling gaps if there are a couple of things on your transcript that you should improve. I needed to take a linear algebra class, so I did that. But I think taking field courses is also extremely helpful, so that when you get to grad school you hit the ground running.
So definitely try to take advantage of those things if you can.
Samuel (PhD student, NYU):
I was at UChicago, like I mentioned earlier. Here we can take classes and get financial support for them, almost always a hundred percent paid for. So that's an option, and it should be an option at other universities too. I chose not to: I did a lot of school in college and I wanted a break from the classroom, and that really helped me focus on the responsibilities of the pre-doc itself. I think the best resource available here at UChicago was the seminars. You get to explore different research topics, and I recommend going to a wide variety of seminars: find out if you're interested in econ for real, or whether you might be interested in something that's related to econ but not classically defined as econ, such as accounting, marketing, or operations research. These are all seminars that are offered at Chicago Booth, the business school, which is where I'm at, and I imagine they're offered at other business schools and universities with pre-docs. The hardest part of the PhD application is figuring out what you want to do in your PhD and how to phrase that in your statement of purpose, so I highly, highly recommend taking advantage of the seminars and exploring your research interests through them.
Adi (PhD student, Brown):
And I can add to that,
Rodrigo (PhD, UC Berkeley):
I mean, this might be obvious, but I think the biggest advantage of a pre-doc is getting familiar with what research looks like at an earlier stage, before you commit five to six years to your PhD. I think that's the most advantageous thing about a pre-doc. For example, in my case, when I started my pre-doc I started working on a paper that was born ten years earlier. We submitted that paper and it got rejected, then we submitted it to another journal, got an R&R, then got a second R&R, and then I left that paper, but I worked on it for just over two years. Getting to know that aspect of research, that it's slow in the sense that you work on the same paper for two, three, four years, is super different from, for example, working in consulting or in the private sector, where the environment is much more fast-paced and you churn through things that then disappear, and then you work on another thing and another thing. In research, a lot of the time the work is more repetitive and deeper, and you go deeper and deeper into the same paper.
So I think getting exposure to that academic environment and how academia works in that aspect I think is super useful and it will help you decide whether you want to do that as a profession or not. So for example, I didn't have a problem with that and I like research, so I decided to pursue a PhD. But for example, one of my friends didn't like that pace and she decided not to do a PhD. So I think it's a good environment for you to get to know what research looks like and decide whether you want to pursue a PhD or not and a research career or not.
Adi (PhD student, Brown):
One other thing I would recommend, depending on where you are: I was lucky enough to have a pretty good relationship with the PhD students at Wharton while I was there, and I would definitely recommend trying to do that if you can, and even joining some reading groups. I'm sure that for the most part they'll be happy to have you, unless you're in a huge pre-doc group and there are hundreds of you, in which case maybe not so much. But I'm sure a lot of them would love to meet you and chat, and sometimes they can give you different advice than your PIs. If your PIs are more senior, they might not really have recent experience with some things as they are now; the market's changed quite a bit. So
I do recommend doing that if you have the opportunity.
Alex (PhD at UC Berkeley):
Thanks, guys. So I think Rodrigo gave me the perfect segue into one of our final questions, which is about coming into the pre-doc unsure whether a PhD is the right path for you, and in general about paths after a pre-doc. As Stephen mentioned in the introduction earlier, probably the most likely outcome for most pre-docs at most institutions is that they will go on to PhD programs in economics or a related social science. But there are certainly a good number of people who go through the pre-doc to test the waters of whether they'd like research, and then figure out, you know what, actually I can do something else after the pre-doc and not stay in academia, which is a totally valid outcome. So what advice would you give to people who are not yet sure if grad school is the right path for them? Do you think a pre-doc is useful in that case, and can it give you helpful information about whether research is the right fit for you?
Chris (Assistant Professor, Booth):
I mean, yeah, I think for sure, if you're unsure, doing the pre-doc is really going to give you a clearer signal about whether this is for you or not, because you're going to get into the weeds of what it's actually like doing research at the frontier of economics. So yeah, if you have even slight interest on the margin, definitely do it. It could turn out that you're not interested, and that's totally fine. No one looks at you any differently if you decide to do something else. If anything, we're happy for you and just want what's best for you. So yeah, that's my take.
Rodrigo (PhD, UC Berkeley):
Yeah, definitely. And I would add that even if you decide not to do a PhD, during the pre-doc you pick up a lot of invaluable skills in the process as well. For example, from my pre-doc I got a lot of skills that are translatable to the private sector or to any other sector: project management, managing hourly RAs, presenting skills. These kinds of things are really generalizable to any type of job. So yes, you're going to spend one or two years trying to figure it out, but even if you decide not to do a PhD, I think you'll gain a lot of valuable insights and skills that will translate to whatever you move on to in life.
Adi (PhD student, Brown):
Yeah, I think one thing I would also say is
It's perhaps hard, but try not to infer too much from your pre-doc experience about what grad school will be like. I think if you're enjoying your pre-doc, you'll probably enjoy grad school, but sometimes you just don't click with your PIs, and sometimes the projects you thought you'd love turn out not to be your favorite topics. So I think seminars and meeting with the current students can be useful for that. I was lucky enough to really like development, so I stuck with it, but if I hadn't liked development, talking to other PhD students and going to other seminars would have been really helpful for finding what my interests actually are. But yeah, sometimes you just don't click with your PIs; I kind of clicked with one and with the other one not so much. So just because you don't like your pre-doc doesn't mean you're not going to like grad school. But if you don't like sitting down and coding and things like that, then you probably won't like grad school. It is a lot of that.
Samuel (PhD student, NYU):
I have nothing to add here. I second everything that was said.
Alex (PhD at UC Berkeley):
Awesome. So that was in large part my experience as well. I went into the pre-doc trying to be very self-aware, saying, I think I'm on the fence about grad school right now, and I will use this time to figure out whether I actually like doing research, and try to immerse myself in academia as much as possible to decide whether this is actually for me. I came out of it saying, you know what, I do kind of like staring at my Python window all day trying to debug why these results don't look the way I'm expecting them to look. I enjoyed that, and I enjoyed starting to be a more active contributor to the project I was working on. I think for most people it will give you a relatively good signal of whether this career path is for you, and at least whether you're willing to commit to the five or six years of a PhD. So I think that's pretty much all of the questions I had prepared before we turn to any questions from the Q&A, of which I can see there are only one or two right now. Is there anything else we've missed? Any other advice you would give to someone who's starting a pre-doc right now?
If not, that's fine by me. We've
Covered quite a lot of good suggestions already. So I see just a couple of things in the Q&A; I'll try to cover those. We have one for Rodrigo: what was it like for you to apply as a Latino, with the barrier of not attending a top-10 university and a language barrier, and how did you overcome that in your pre-doc applications and interviews? Anything specific you can share about being an international applicant to pre-docs would be interesting.
Rodrigo (PhD, UC Berkeley):
Yeah, so I think my experience was particular in the sense that I know some people apply to dozens or, I don't know, hundreds of pre-docs, but I only applied to a few, and I think I was kind of lucky in the sense that I was chosen for the pre-doc by my supervisor. In general, I think that experience helped me a lot in the application process for the PhD; without it, I wouldn't have gotten where I am right now. So I think it worked wonderfully for me. As for language barriers, I learned English from when I was young, so understanding, reading, and writing English wasn't a big issue, but the pre-doc also helped me a lot with speaking English, because even though it was in Switzerland, everything was in English and we always spoke in English. I think that helped me a lot, not only with communicating in English and conveying things in a way that's more natural now, but also with the applications in general for the PhD.
I don't know if that answered the question, but that's all I can say.
Alex (PhD at UC Berkeley):
I think that was very helpful. So there's also one question from the previous panel that is worth bringing back up now. As Eric and Vaishali mentioned in the panel on applying to pre-docs, there are some institutions where you are hired to work with a particular PI, and some institutions, like a lot of the Fed branches, that tend to allocate pre-docs or RAs as a pool, working with several economists during their one or two years. So do you guys have any thoughts contrasting what it's like to work in a pool environment, or in a lab with several PIs and maybe PhD students or postdocs you're interacting with, versus a more traditional RAship where you are working very specifically for one PI? Are there any significant differences, and how do you think pre-docs can succeed in each of those scenarios?
Adi (PhD student, Brown):
So yeah, mine was pretty locked in. I had two PIs and I just worked with the two of them, so I don't really know precisely what the counterfactual was like, but I thought it was quite nice to build that relationship over time. One thing I'd recommend for people in the same scenario is to not be afraid to reach out to other professors in the department whom you aren't working with. I'm sure for the most part they'll be happy to chat with you about whatever interests you have that are aligned with theirs.
Samuel (PhD student, NYU):
I didn't work in a lab; I worked for two PIs pretty consistently. But the lab environments here at UChicago seemed really cohesive, and it didn't seem like they were overly competitive, which was one thing I had worried might be bad. The lab that I knew really enjoyed each other and supported each other. So if you do end up in a lab, I know people mention some concerns about working in one, but I think it really is what you make of it.
Martha (PhD, UCSD):
Alex, if you want, I can answer that question, if that's okay.
Alex (PhD at UC Berkeley):
Yeah, please go ahead.
Martha (PhD, UCSD):
Okay, so this is something that I really like thinking about. I did my pre-doc, like I mentioned earlier, both at Harvard and at Yale, and the experience was extremely different because of the way the programs were structured. Yale has an official program with a program director and program support, whereas at Harvard, I didn't even know that HKS had other pre-docs, right? We knew that there were pre-docs at HBS, the business school, because we were roommates with some of the econ pre-docs. So it was very dispersed, and that means you don't quite have the same sense of community necessarily. It also means that you potentially have to organize yourselves better. At Yale, for instance, there was a pre-doc seminar every week, with food, and the coordinator would find a professor to give a talk or a PhD student to practice their job market talk.
At Harvard we kind of made our own seminar, but in general there was significantly less support, and that also meant that if you had an issue with your PI, it wasn't necessarily clear who to speak to, whereas if there's a program director, that can be a lot easier. You should also pay attention, in case this hasn't been mentioned already: if you have multiple offers, you should look at that structure and look at the precise benefits. For instance, if you are an NBER pre-doc, you do not benefit from the same tuition reduction as you would if you were at that institution. So again, that's the kind of structure that can make a really big difference in whether you're paying 500 or 5,000 dollars to take real analysis.
Martha (PhD, UCSD):
Thanks.
Alex (PhD at UC Berkeley):
I think we've given everyone a lot of good suggestions about how to position themselves to apply and how to think about succeeding in a pre-doc and adjusting to it after probably spending most of your professional life in school before starting a pre-doc or RA position. I'm happy to wait a few seconds longer if there are any other chat questions about the pre-doc experience. Otherwise, Stephen, feel free to take over at any point.
Stephen (Booth):
Okay, thanks Alex. Let me stop the recording here.
2025 Data Task Review in R
Watch Martha Fiehn, Economics PhD student at the University of California, San Diego, solve a data task in R. Find the problem set here and the solutions here.
Martha:
Great. So we're going to go over the data task in R. I expect that many of you will not have had the chance to look at it, but please pull up the PDF of the task and I'm going to share my screen of just the solutions. Basically, if I make a mistake, it is my mistake. If it's clear and you're sure, please correct me live. If you think that there's a better way to do what I'm doing, send me an email afterward or I don't know if everybody can see the chat, but maybe put it in the chat. And with that, let's see. We can get started and I'm going to switch which screen I'm sharing a little bit.
Okay, so you should be able to see a PDF that says Pre-Workshop Data Task, and R best practices, because what I wanted to start with first is not the actual coding but how to think about coding. Now, some of you are going to be super proficient, you're already experts, you're much better than I am, and this will be super basic, but some of you might not have that same familiarity, so I just thought I'd go over this a little bit. One of the most important things is that your code should be well annotated. That means other people who are looking at it should be able to see what you're doing and why. And actually this can be super useful for yourself too, because you might think that you're going to remember what you're doing and why, and then you come back to the code later and have no idea: what am I doing here? The advice I'm giving applies in general, also once you have your pre-doc position, and maybe less so to the data task itself. Sometimes I write entire sentences; I might specify the goal: here, I faced this issue with the merge and that's why I'm doing Y rather than X. It can be really helpful to keep track of that, and it might even be something to include in the data task, if you think it's relevant.
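As an illustration, a minimal sketch of the kind of annotated, clearly named code being described might look like the following. The wage example, the data, and the variable names here are hypothetical and not part of the actual task.

```r
library(dplyr)

# Hypothetical toy data, just to illustrate annotation and naming.
wages_raw <- data.frame(id = 1:3, wage = c(12, 0, 30))

# Goal: construct log wages for later analysis.
# Note: a wage of zero would give -Inf, so drop non-positive wages first.
wages_clean <- wages_raw %>%
  filter(wage > 0) %>%            # keep strictly positive wages
  mutate(log_wage = log(wage))    # the transformation is reflected in the name
```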
Like I said, it can be a chance to show the PIs your thinking. As they mentioned, they are looking for passion and interest and these kinds of things. I would say that, especially with ChatGPT, the relative differences between applicants will be smaller, but how you think and how creative you are is a way you can stand out. Variable names and file names should be as clear as possible. Like I said, you want someone to be able to look at your code and know exactly what's going on, to the extent possible. So call it education or educ or years_education or something else, but not variable1. If you have a transformation, for instance you're taking the log wage, call it log_wage, and then whatever transformation I'm doing, I'll keep adding prefixes or suffixes until they get too long.
There is a tradeoff between length and accuracy, but I also find that this kind of helps keep track of what you're doing and can make it easier to read. Then similarly for file names: graph_old and graph_new don't really say much, but if you call them wage_scatter, then, okay, it's a scatter plot, so we know what kind of figure it is and what's being graphed. You can also do version control, so rather than putting old and new, et cetera, put version one, version two, or something like that. This is just advice; you can do whatever you want. Okay, so folder organization. Hold on, there's someone with their mic on.
We'll see if they need to mute themselves. Let me try and find them. Okay, I can't see, so we're just going to have to... there we go, muted. This is also a little bit less relevant to the data task and more relevant to everyday work, but one way that you want to structure your folders in general is by having a folder for your code, a folder for your data, and a folder for your output. And there you can label each folder: okay, this is the code where I'm building data sets, so you can also call this clean or whatever, and this is the folder where I'm going to analyze the data set. Similarly for the data, you can have the raw data, which is where your CSVs go, and your clean data that you'll be using for your analysis. Output is more self-explanatory.
So you can have figures or graphs, and then tables. When you download raw data from the internet, which might be required, probably not, but might be a part of some of your coding tasks, just in terms of data provenance, I think it's really good to keep track of where you got that data. I did this for my pre-doc at Yale: we were looking for government data from different countries and different states and counties, et cetera, and I was looking at so many different data sets that you don't remember where you got each one. So you want to include, for instance, a link: this is where I got that data. What also helps is a screenshot. That's how I kept track of it. Again, a little bit less relevant to the data task, but I think important, or potentially helpful.
Then similarly, if you include a README file that explains what you're doing, that's a really good practice, and like I said, I'm not saying you necessarily need to do everything like this for the purposes of a data task, but just in general. Then, how you write this up: the data tasks are generally going to have a written portion, and LaTeX is what we use in economics. If you're not familiar, it's a document typesetting system. It looks like this, you can manipulate it, and it can be kind of a pain sometimes. You will almost certainly, well, perhaps not definitely, but if you do your PhD you will be using this, and probably in your pre-doc you will be using this too. So it can be really helpful to familiarize yourself with it in advance. It works especially well with math, so the more theory work you're doing, the more likely it is that you'll be using LaTeX. And here, I mean, we're going to do this in a second in the data task, but you can see you can import the tables and graphs, et cetera. Okay, do we have any questions about anything that I've said so far? Can you use the raise-hand function, perhaps? If not, I'm going to move on.
All right, then now you should be seeing my R code. Here at the beginning it's purple because I defined it as purple, in case you are curious; I spent an entire day just deciding which colors to use. Okay, so the first thing I do here at the top, or like to do, is specify what project this is. This is going to be the same for any kind of file: who wrote it, what the purpose is, the date created, the last edit. Again, this isn't something you necessarily have to do, but it can be helpful. Then you load your libraries, and I've put a little explanation of what each of these does. You are probably already familiar with many of these if you've used R before, and you add to them as you go on. Okay, then you do your setup.
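A sketch of a header and library section along these lines is shown below; the exact packages in the solution script aren't listed in the transcript, so this particular set is an assumption.

```r
# -----------------------------------------------------------------------------
# Project : Pre-workshop data task (R)
# Author  : <your name>
# Purpose : Merge the ozone and PM2.5 files and produce summary tables
# Created : <date>        Last edited: <date>
# -----------------------------------------------------------------------------

library(dplyr)         # data manipulation: filter, mutate, joins
library(readr)         # importing CSV files
library(modelsummary)  # datasummary() for summary-statistics tables
library(kableExtra)    # exporting tables to LaTeX
```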
I like to usually number things, but for now I'm not going to, and here I'm going to set my working directory. Okay, this can be a helpful step because now I'm basically telling it, all right, everything that we're doing is going to be here in this folder, and then you can specify the file paths relative to that afterward. There are a few ways that you can also set this up in Stata, et cetera, and it can be really helpful to have the rest as relative paths, because then anybody else using those folders can just specify their own directory and go from there, and won't have to change where exactly the files are, for instance if you're working collaboratively on a project. Also, in general, you don't want to have spaces like the ones I have here; just as a hint, it's better not to have those. If you are on a PC, you're going to want the slashes to be either one forward slash or two backward slashes. All right, with that, let's import the data. There are two data sets for the purposes of this data task: one is PM2.5 and the other is ozone. If I'm loading a data set, I often like to call it the exact file name that it has, and I also like to keep it unadulterated, because, and this is more important for really large files, say you make a mistake, there's some transformation you make, you accidentally collapse the entire data set: then you don't have to go all the way back and reload everything to undo your mistake.
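A minimal sketch of that setup and import step might look like the following; the working directory, folder layout, and file names here are hypothetical placeholders, not the actual task files.

```r
# Set the working directory once; everything below it is relative,
# so a collaborator only has to change this one line.
setwd("C:/Users/yourname/predoc_task")   # forward slashes also work on Windows

# Keep an untouched copy of each raw file so a bad transformation
# never forces you to re-import everything from scratch.
ozone_raw <- readr::read_csv("data/raw/ozone.csv")
pm25_raw  <- readr::read_csv("data/raw/pm25.csv")

ozone <- ozone_raw   # work on copies, never on the *_raw objects
pm25  <- pm25_raw
```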
You can also use the load function here; last night it wasn't working super well for whatever reason. If you are importing a DTA file instead, you can use the haven package for that, and there are packages for CSV and Excel, et cetera. All right, so let's run this. Great. Now, the first thing I like to do when I open a new dataset is to look at it. So let's take a look here, and we see this looks pretty good, this looks pretty clean, right? Okay, and let's look at the other one. Okay. Now does anyone, I assume not, so don't worry, but does anyone notice something about the structure, the relative structure, of these two data sets? They have the
Blagoja:
First dataset, oh sorry, go for it.
Martha:
Blagoja. Do you want to go first and then whoever was also talking, you can go second.
Blagoja:
Sure. So I think the first data set, the one with the name ozone, has more quantitative variables when it comes to pollution, such as the AQI and the ozone pollution, and the second one only has them for PM2.5. Also, while I was looking at the data before the workshop session, I saw that the dates are not in the same format: they're year, month, day in the first one, and I think it's month, day, year in the second. So when it comes to merging or just further analysis, I think we should put them in the same format.
Martha:
Excellent.
Drumond:
Okay, so thank you. So please, can you go to, I would like to see your dataset, please. Okay, so the first thing that I can see when I'm looking at your dataset is that the variables are labeled. I think this is a good thing that can help people who are working with your dataset know exactly what each variable is.
Martha:
Great, great. Thank you both. You got it exactly right: you noticed that the dates have a different format, and yes, the labeling is also super important; that's actually one of the reasons why I like Stata, that you can label the variables really well. But yeah, for the purposes of this task, let's look at the instructions, right? For both data sets, the unit of observation is a county-monitor-day. Merge the two data sets to create a single file; pre-process the data as necessary to facilitate the merge. Okay, so suppose we were just going to try it: we'll call this ozone_pm25, and let me use this one, I do a left join. Suppose we were just going to try this. Well, then it says object 'ozone' not found; that's because it's defined later, so I'm just going to tell you what happens here. If we were to just try and merge this, we would get the error that the date doesn't match, and that's exactly for the reason that Blagoja, I'm probably pronouncing that incorrectly, pointed out.
Blagoja:
Yeah, yeah, no worries, no worries.
Martha:
Fine.
Blagoja:
Exactly.
Martha:
Thank you. As you pointed out, they are not in the same format, so now let's harmonize them, and this is in general something you'll want to take a look at. Okay, so here I define this function, and I'm saying, okay, the county code, which is another variable we're merging on, is basically going to be made the same in both. Because if we look here, the county code, what is it? There it is, here it's just 1, right? And we can see that this is a county FIPS code; if you're working with US regional data, you'll get familiar with FIPS pretty quickly. Now looking at the other one, we see that this is 001, okay? You want these to be the same before you merge, and these are not numeric data, right? 001 doesn't mean there's one unit of this county; it means that this is the label we give it. That's why we're also specifying here: pad with leading zeros so there are three digits in total, so put two zeros in front if there's just one digit. And then we're saying I think the site ID is the same, but just in case, we're saying that this is going to be a character. For the date, we're specifying the format here for both. All right, and then we're going to apply this function to both of the data sets that we have.
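A sketch of a harmonization function like the one described could look like this; the column names (county_code, site_id, date) and the two date conventions are assumptions, since the real files aren't shown.

```r
library(dplyr)

# Pad the county FIPS code to three digits, treat the site ID as a character,
# and parse the date using whichever format that particular file uses.
harmonize_keys <- function(df, date_format) {
  df %>%
    mutate(
      county_code = sprintf("%03d", as.integer(county_code)),  # "1" -> "001"
      site_id     = as.character(site_id),
      date        = as.Date(date, format = date_format)
    )
}

# Assumed formats: year-month-day in one file, month/day/year in the other.
ozone <- harmonize_keys(ozone, date_format = "%Y-%m-%d")
pm25  <- harmonize_keys(pm25,  date_format = "%m/%d/%Y")
```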
Okay, great, that worked, and now we should be able to merge them without issue. Excellent. If we had tried that before, we would have gotten an error. And in general there are some really basic checks that you might want to do. For instance, here we can see that the number of observations has stayed the same; that's a good check to make sure you didn't lose any observations. In general, you'll want to be a little more careful to make sure that nothing got lost or changed in the merging process. All right. Now, usually at this point, if I'm working on an actual project, I will save this as a separate data set and then create a new file to do my analysis. The reason is that, okay, I've done this merging, and suppose I had to do a lot more cleaning: you don't want to have to redo that every time you create or change a table, and it's just cleaner to have that segmented. But we're not going to do that here, because it's just a data task and not an actual project. All right, so let's move on to question two.
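For reference, a minimal sketch of the merge and the row-count check just described, using the same assumed key names as above:

```r
library(dplyr)

# Keep every ozone observation and attach the matching PM2.5 values.
ozone_pm25 <- left_join(ozone, pm25,
                        by = c("county_code", "site_id", "date"))

# Basic check: a left join should not change the number of rows
# unless the right-hand file has duplicate key combinations.
stopifnot(nrow(ozone_pm25) == nrow(ozone))
```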
We're making some summary statistics. So a really basic thing you can do is just say, all right, summarize this for me, and we get our means and first quartiles, medians, et cetera. If you wanted, you could just put this all into Excel, make a table, and put that into your writeup, but I would not recommend that approach. This is one of the reasons why we like to use LaTeX, so instead we're going to do that. Okay, it's a little bit more involved, and this is one way to do it. So here I'm making a new data frame with just the variables I'm interested in and renaming them. This is one reason why, like I said, I like Stata: you can use the labels more easily. And like I said, there are many ways to do this; this is one way of doing it, it's not the only way.
I'm not even saying it's the best way. All right, and then we can make a table using this datasummary function, and here we can see this message telling us which packages we need in our LaTeX document. If this feels a little overwhelming, because I actually didn't cover very well what LaTeX is, don't worry too much about it. You can export this table in whatever format you want; you can also use a CSV if that's what you're comfortable with, but you will want to familiarize yourself with LaTeX at some point, so you may as well do it now, or before you're applying. And this is the library that we used for that. So now we have a table, and I guess I'll stop sharing my screen, or you can just believe me that this got put into the file directory that I specified, which is output, tables, and then summary statistics. I'm not going to go over what the values are, because I think you should do that yourself and get it correct, and if you want, you can send it to me and I will see if it matches the answers. I believe the answers will also be posted.
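One possible sketch of this step, using datasummary() from the modelsummary package, is below; the pollution variable names and the output path are assumptions, since the real column names aren't shown on screen.

```r
library(dplyr)
library(modelsummary)

# Pick the variables of interest and rename them to readable labels.
sumstats_df <- ozone_pm25 %>%
  select(`Ozone (ppm)`   = ozone_value,
         `Ozone AQI`     = ozone_aqi,
         `PM2.5 (ug/m3)` = pm25_value)

# Build the table and write it straight to a .tex file that the write-up
# can input, so re-running the script refreshes the table automatically.
datasummary(All(sumstats_df) ~ Mean + SD + Min + Median + Max,
            data   = sumstats_df,
            output = "output/tables/summary_statistics.tex")
```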
Okay. Now question three says to produce a table with the same statistics for just ozone, but to split the sample by the source variable, AQS versus AirNow. In other words, your table should report the mean, min, et cetera for a portion, blah, blah, blah. Okay, so now I'm using a different way to create this table, which looks as follows: here I'm saying question three, summarizing by source. And like I mentioned, I'm just showing you different ways that you can do this; there are other ways. I'm again making a data frame, I'm summarizing the values, and I'm also adding this row of totals, just to match the Stata code as well, right? Because there's also a simultaneous Stata session, and I wanted the answers to look more or less the same; they're not identical. So we run this code, great, no errors, that's always helpful, and now we want to save it as a LaTeX table.
So we do this, and here what I'm demonstrating is that I am using these two packages, because like I said, there are different ways to do it. Now I'm trying to make a judgment call for those of you who are familiar. Okay, so one thing about LaTeX is that it can be a little bit finicky. So for instance, here I'm specifying that in my LaTeX document I want there to be some vertical space between the table and the title. And I'm specifying this here because in general you want things to be as replicable as possible. Especially once you're a pre-doc, you will have to run the same code over and over; the PI is going to be like, okay, you have this regression, now add this, you have that regression, okay, change this, change that, and if every time you manually have to update the table, it just takes a lot longer.
So just in general, as much as you can automate that — if it's the same amount of work — do that. Of course there is sort of a trade-off. It doesn't need to be perfect every time, it doesn't need to be publication grade every single time; that's probably not a good use of your time. Alright? And you can see once again here, that's where the table went. Actually, I am going to share with you what that looks like, just so that it's less confusing. Alright, so we're going into the output folder, then we're going into tables. And here you can see I was doing things and I forgot — what did we call this one? Sum stats source... sum stats source, that's it. I wonder if you're still going to be able to see this? Oh, you might not be able to see this. Okay, sharing a new table in a second. Yes.
Here you can see what was just opened. This is roughly what that table is going to look like, and now you could manually edit it to do whatever you want. I'm a little bit worried that this is too much information for you guys if you're not familiar with it, so I apologize. But basically you can integrate this into some sort of document and it'll automatically update. So for instance, if I run my code again and just say 0.4 centimeters or whatever instead, it'll immediately update. So if you have a draft that you're working on, you can just run the code and it'll change in that draft. So it's extremely useful, and that's why I'm saying to familiarize yourselves. I still use Overleaf — Overleaf is an online LaTeX editor, I guess, that you can use to collaborate on LaTeX documents — and I'll show you that at the end because I don't have it open right now.
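As a rough illustration of the workflow being described — a LaTeX document that pulls in the exported table, so re-running the R code updates the draft automatically — a bare-bones skeleton might look like this (the file path and the 0.3 cm of space are made up for the example):

\documentclass{article}
\usepackage{booktabs}   % nicer rules for exported tables

\begin{document}

\begin{table}[h]
  \centering
  \caption{Summary statistics}
  \vspace{0.3cm}  % the vertical space mentioned above; change it and recompile to update the draft
  \input{output/tables/summary_stats.tex}  % the file the R code writes; re-running R refreshes this
\end{table}

\end{document}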
So again, this is a little bit of information overload for you guys, but I thought it'd be more valuable to show you how to approach these tasks and the kinds of skills and thought processes rather than the actual code. Because like I said, the actual code you can get from Stack Exchange, you can get from ChatGPT. I do recommend, by the way, that you learn how to do it properly; it will help you in the long run and medium run and also short run. So definitely don't just rely on ChatGPT — it does make mistakes. Oops, those are the answers. Let me go back to the original one. Okay, so now we are at question four. Question four reads: the federal government does not use AirNow data for rulemaking.
Martha:
All right, let me mute that. I don't know if that mute worked. This is a question that you do not need to write code for, so I will leave that open for you guys to solve. So let's move on to question five: create a county-day data set to proceed with analysis. All subsequent questions will use this new data set. The data set should contain the three pollution values, et cetera, et cetera. That's the one we're working on. So question five, county-day data set. Now, has anybody looked at this and thought about how to approach creating this dataset? Luca, maybe
Speaker 5:
I had a question regarding the purpose of the README file. Should we be answering questions there, or should that just be a place to kind of explain where to execute the code or how to execute the code, or both at the same time?
Martha:
That's a super good question. So a README file in general is just how to run the code, how to execute the code. You can include where you got the different things; in general, especially as a pre-doc, the more detailed the better. You do not need a README file for a data task. In general, you do need a writeup for almost every data task, and that can be sort of similar. How much detail you want to include is kind of up to you. And again, I'm sort of including information that I think is going to be useful once you're actually a pre-doc rather than just for the task specifically. But if that gets a little bit confusing, I'm happy to clarify. Does that make sense, Luca?
Speaker 5:
That does, thank you.
Martha:
Okay, great. So has anybody looked at question five and has a thought on how to approach it? No worries if not. Yeah,
Blagoja:
So we don't need a data set that has all of the columns — we just have to specifically select the columns that we are asked for so we can proceed to the data analysis, correct?
Martha:
Yeah. Is there any other transformation that you think might need to happen?
Blagoja:
I think we can use — because I assume you're working with the tidyverse package — we can use the select function, so we can select all of those columns that are relevant for our analysis.
Martha:
Yeah. Okay. So I'll show you what we have in mind. Okay? Basically you want to collapse it by county-day. You do this by taking the mean across monitors within the county-date, and that's because we have multiple observations for a county — there are multiple sites. So you need to synthesize, or collapse, that so that you can proceed with the rest of the analysis. And I am keeping all of them here named, right? Technically these are one-to-one mappings: there's one county code per county name, per CBSA code, per CBSA name. That's something you should check in general, because like I said, anytime you do a merge, anytime you do a data manipulation, you want to see if any information is getting lost. Alright? And then, like I said, we are just summarizing here, taking the means of these variables to make a new data set. Great.
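A minimal sketch of that county-day collapse in R, assuming the tidyverse and hypothetical variable names (county_code, cbsa_code, date, and so on):

# Sketch only: df and these column names are assumptions.
library(dplyr)

county_day <- df %>%
  group_by(county_code, county_name, cbsa_code, cbsa_name, date) %>%  # one group per county-day
  summarise(across(c(ozone_value, pm25_value, aqi, mortality),
                   ~ mean(.x, na.rm = TRUE)),                         # mean across monitors
            .groups = "drop")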
Then question six: how many counties are missing days? So here there are again a few different ways to answer this question, but that's how I did it. We're looking at the distinct dates, and you want to make sure that you are including leap years and everything like that. And we get our answer right here in the console: 31 counties are missing days. Now, this next thing was just something kind of fun I ran into, but in general, spend some time looking at the dataset, because it could be the kind of thing that — while PIs aren't trying to trick you — they are trying to see how deeply you think about the data. And so here we are in California, and this for instance is Alameda County with FIPS code 001, and we can count how many unique counties there are.
But something to notice eventually when you do this is that there are fewer distinct county names in this dataset than there are counties in California. The way that I noticed this is that Yuba County is missing, and we are at the end of the alphabet here, okay? That's not something that's asked on this data task; it's just the kind of thing, in general, you might be looking out for, right? Because if, for instance, there's a question that asks you to create the average for the entire state, you might at some point want to note that not the entire state is covered, because there are counties missing. So just like here you're looking at how many days are missing, that might be something to be on the lookout for, and that's why it's important to really have a good sense of the data. So spend some time looking at it.
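The missing-days count and the distinct-county check might look something like this in R (again, the names are assumed):

# Sketch only.
library(dplyr)

# how many distinct county names show up at all?
n_distinct(county_day$county_name)

# how many counties have fewer than 366 distinct dates? (2024 is a leap year)
county_day %>%
  group_by(county_code) %>%
  summarise(n_days = n_distinct(date)) %>%
  filter(n_days < 366) %>%
  nrow()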
All right, moving on. Now we have section two. Another thing that's nice in general is creating explicit sections in your code like this. And I'm labeling the questions both for the purposes of the session, but also because I think it can be helpful for whoever is reviewing it to know where you are. So question one reads: produce a plot showing the distributions of ozone and PM 2.5. The distributions should be separate lines, sets of dots, bars, et cetera, but on the same set of axes. Choose an appropriate type of graph to complete the task and make your graph easy to digest. All right, has anybody thought about how to make this graph, what kind of graph we should use, and things to look out for?
Okay, maybe not. So what we want to watch out for here is scale, because PM 2.5 and ozone do not necessarily have the same scale. In that case, what we're going to do is standardize. Alright, here we're standardizing the two variables of interest. Then we are going to plot the density, we're going to do it like this, and we're going to look at the plot that we made. I'm just going to type it in here. There it is. Alright, you can specify whatever line colors you want, and then you could also add a title to this graph. Whether you want to do that depends on what you're going to do with this graph. If you're going to put it in LaTeX, then you might have a figure name specified within the document. If you're going to put it into Word, maybe adding a figure name isn't quite as straightforward, so you can have the title here. Or say a person is just going to be looking at this graph on its own — then of course you want to have a title, a little bit of extra info there.
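A sketch of the standardized density plot with ggplot2, assuming the county-day data frame and variable names from above:

# Sketch only: names are assumptions.
library(dplyr)
library(ggplot2)

plot_df <- county_day %>%
  mutate(ozone_std = as.numeric(scale(ozone_value)),   # z-score: subtract mean, divide by SD
         pm25_std  = as.numeric(scale(pm25_value)))

ggplot(plot_df) +
  geom_density(aes(x = ozone_std, colour = "Ozone")) +
  geom_density(aes(x = pm25_std,  colour = "PM 2.5")) +
  labs(x = "Standardized value", y = "Density", colour = NULL)

ggsave("output/figures/densities.png", width = 6, height = 4)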
Then we are going to export it and save it. Sometimes I have issues with ggsave; there are other options. Great. Now question two: produce a time series plot of ozone for Los Angeles County, county code 037, in the month of February. Do you suspect autocorrelation? How might you test for it? You don't need to test for it, but how would you do it if you were to? You can think about that now; I'm going to show you the code in the meantime. Alright, so again here we're using ggplot. ggplot2 is the most standard way to create graphs in R. I would say it's one of the biggest advantages of R over Stata, because it's really easy to use and much easier to customize. You can specify different line types, different colors, there are all sorts of packages, et cetera, and it's just a little bit trickier in Stata.
And so within the general academic chatter in economics, there are always these questions like, oh, R versus Stata. My opinion is just that they're better for different tasks. So for instance, in R I like that you can have multiple data sets open at one time, whereas in Stata you can only have one data set open at a time, and I think that's just way nicer in R. I think graphing is nicer in R as well. Regression analysis I actually prefer to do in Stata. Of course, in a data task, don't mix which software you're using, but those are just some thoughts in terms of how you want to approach a task. And especially if you are working and your PI gives you some liberty in terms of what to do, do what works best for you, as long as everybody can understand what you're doing. So I'm just going to go through some of what I'm doing here. We're specifying the line width and the color, we're going to add some points, we're going to add a scale for the dates and labels. We're going to set the legend position to none, and we're going to center the title of this plot, and I'm going to save it here so that we can open it. So let's run this.
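For the LA time series, a sketch along these lines would work, assuming county_code is stored as a string and date is a Date (the "037" code and names are assumptions):

# Sketch only.
library(dplyr)
library(ggplot2)

la_feb <- county_day %>%
  filter(county_code == "037", format(date, "%m") == "02")

ggplot(la_feb, aes(x = date, y = ozone_value)) +
  geom_line(linewidth = 0.7, colour = "steelblue") +
  geom_point(size = 1) +
  scale_x_date(date_labels = "%b %d") +
  labs(title = "Ozone in Los Angeles County, February", x = NULL, y = "Ozone") +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none")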
Great. So here we can see LA County's ozone in February. Now we have a title, and like I said, there are all different sorts of customizations that you can do in ggplot2 that are super useful. All right, nearing the end, almost. Does anybody want to answer, or attempt to answer, the autocorrelation question that was a part of this? Yeah,
Blagoja:
I think from econometrics knowledge, for autocorrelation, and because we are using time series data, do we need to use the Durbin-Watson test to see whether the variable is autocorrelated?
Martha:
You can certainly test for it and do that test explicitly. Sometimes you want to use your intuition as well. So like Eric talked about in the panel — or I think Vaishali mentioned it as well — do you have any intuitions about what might be happening here? Okay, we'll leave that for you to answer and think about. But yeah, definitely, you can propose a range of formal or informal tests, for instance an AR(p) regression; there are lots of different things you can do here. That's one of the nice things about economics: you have plenty of range. Okay, now let's move on to part three, pollution and mortality. Now we're looking at the relationship between pollution and mortality. The air quality index, AQI, is an index designed to measure the aggregate effect of different pollutants on air quality. Alright, so, abbreviated, question one: estimate the association between pollution and mortality by running the regression that they specify.
Okay, so let's do a baseline regression. Again, there are a few different — oh, and we also want to label that now we are in part three. Well, there are different packages that you can use; in this case I am using this one for fixed effects regression. Alright, we'll run the regression, where we're clustering by county code. And now we're told that some observations were removed because of NA values. In general, that's something that you're really going to want to think about: if anything is getting removed, is that okay? Is it a concern? And then, when relevant, also report on that. Now we'll get the coefficient. Here we can see, oh okay, 0.528. Excellent. And now I just exported this baseline coefficient to a .tex table, so if I want to check it, I can open it. I would also consider exporting a full regression table here — if I was doing this task, that's what I would do.
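The baseline fixed-effects regression could be run with the fixest package (that's an assumption about which package is on screen; lfe or plm would also work), clustering by county:

# Sketch only: fixest is assumed, as are the variable names.
library(fixest)

baseline <- feols(mortality ~ aqi | county_code + date,   # county and day fixed effects
                  data = county_day,
                  cluster = ~ county_code)
summary(baseline)

# exporting a full regression table rather than just the coefficient
etable(baseline, file = "output/tables/baseline_reg.tex", replace = TRUE)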
I did just the coefficient because I was matching the Stata code. So like I said, there's often more than just one correct answer. Now moving on to two: I am concerned that yesterday's pollution affects mortality too. Include one lag of AQI and rerun the regression. Report and interpret the coefficients on AQI and yesterday's AQI. Alright, so we're just going to do the same, but we're going to add a lag like this, and you can see the lag there. And oh, maybe we want to be consistent, so we don't call it m, we call it reg instead, and now we're saying that it's a lag. But in general, is this good practice? No, this is not good practice, because reg is super unspecific, right? A regression could be anything, so you're going to want to be more specific when you're doing this task, kind of like I mentioned at the beginning. And so, I don't know, call this like reg mort AQI or something, and then we call this one lag, just as a thought. Again — ah, object not found. Yeah, of course, I didn't change the name everywhere else as well.
If you have to do that, another useful thing is the find function, right? So I could have used find on reg and replaced the individual instances. Sometimes it's also nice, if you have a regression, to give whatever you're outputting the same name. So for instance you might want to call the table the same thing, and that way it's really easy to trace what you're doing, where you are, where you got it, et cetera. Because the more varied names you have floating around in the universe, the harder it is to keep track. So let's see if we caught everything or not — looks like it ran. But again, we have these removed observations, which when you do this task you should check to see if that matters or not.
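Building the lag within each county (ordered by date) and rerunning the regression might look like this, still assuming fixest and the hypothetical names:

# Sketch only.
library(dplyr)
library(fixest)

county_day <- county_day %>%
  arrange(county_code, date) %>%
  group_by(county_code) %>%
  mutate(aqi_lag1 = dplyr::lag(aqi, 1)) %>%   # yesterday's AQI within the same county
  ungroup()

reg_mort_aqi_lag <- feols(mortality ~ aqi + aqi_lag1 | county_code + date,
                          data = county_day, cluster = ~ county_code)
summary(reg_mort_aqi_lag)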
Okay, so we've done question two. Next up, question three: we're curious about the relationship and whether it differs by whether a county is in a CBSA. We run the original regression from section three, question one, but add an interaction term between an indicator for whether a county is in a CBSA and AQI, and interpret the coefficients. So the first thing that we'll want to do is create an indicator variable — that's what we're doing here. Then we will want to run a regression with the interaction, like this. Here is where that interaction is. In Stata, a lot of times you'll see it written with two pound signs — for the young people, that's a pound sign, not a hashtag.
That was a very boomer thing to say, but there you go. And again, we might want to call this something a little bit more specific, with AQI in the name. Now we have an interaction, and we can see that that's the main difference, right? In this one we have the interaction term; here we don't have the interaction term. We have a lag here; we don't have a lag there. The PI, or whoever's reviewing the data task, probably won't look that closely, but I think it can be really helpful, like I said, to keep track for yourself and for others of what you're doing. So we can run this. Again, we have some variables removed because of collinearity. Okay, something for you to think about.
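And the interaction specification might be sketched like this; the indicator construction assumes counties outside a CBSA have a missing cbsa_code, which is just one plausible encoding:

# Sketch only: adjust in_cbsa to however the data actually encodes CBSA membership.
library(dplyr)
library(fixest)

county_day <- county_day %>%
  mutate(in_cbsa = as.integer(!is.na(cbsa_code)))

reg_mort_aqi_cbsa <- feols(mortality ~ aqi + aqi:in_cbsa | county_code + date,
                           data = county_day, cluster = ~ county_code)
summary(reg_mort_aqi_cbsa)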
Now, the rest. Question four: I did not ask you to include a separate variable for whether a county is in a CBSA, just the interaction. If you were to include it, it would be omitted as collinear. So did we include it? We didn't. What is it collinear with? If anybody is tracking that, if you have a thought on what it could be collinear with, you can raise your hand or even put it in the chat. If not, that's for when you solve this task. I think now you have a pretty good idea of what these might look like, the different kinds of challenges you might face. So for instance, we started at the beginning here with needing to harmonize some variables. Then we had, okay, you need to collapse it to get the correct means, for which you needed to understand the structure of the dataset a bit better. You'll probably be asked to do some analysis, and you'll be asked to interpret your regression output as well. So that's kind of what they were saying in the other session as well: it's usually not enough to just be technically very skilled.
I'm sure you could write much more beautiful code than this, right? You also want to be able to say what you're actually doing. Okay, so I'm skipping that part for you to do, and I'm going to show you just super quickly what LaTeX looks like. Again, just so that you familiarize yourself with it — I highly recommend it. It's not always necessary or expected, or it will be specified: for my first pre-doc, it was definitely expected and specified that I should know how to use LaTeX. And I'm going to share my screen. You guys can already think about whether you have any questions. This is one way that LaTeX can look — and they actually just changed the appearance, so this is in beta and it can look a little bit different. But basically you have these different files. So here I'm in main. In your preamble you specify all these different packages; you want this to look a lot cleaner than what I have here — this is kind of handed down, I add things, I remove things. You can also specify your own commands.
So for instance, if I want to create a list, a bullet point list, then I can add this — can you guys see this, by the way? Can someone nod? Okay, good. If I'm going to create a bullet point list here, I'm specifying a shortcut. So there are all sorts of different simplifications. And here you can see, for instance, there's a bullet point that I specified. Like I said, it really shines in math mode. Then you can import and export the tables. So here's one of the tables that we made earlier, and here you can see I have a little bit of space — this is something I mentioned before. Now suppose we take this vertical space out. What's the problem? Why did I do that? That's why. And so in LaTeX you'll run into those kinds of issues, and it can take a second to get used to, but like I said, super duper useful. That was not part of the official data task review; that's just my personal view on something that I think is important. Do you guys have any questions? Okay. Oh yeah, go ahead.
Blagoja:
Quick question. First of all, thank you so much for introducing us to the main steps for these tasks. My question is related to question three under section three. It says: I'm curious if the relationship differs by whether a county is in a CBSA, and we need — as I understood it — to make a dummy variable that's one if the county is in a CBSA and zero otherwise. My question is, when we are making that dummy variable, like for the interaction term, how do we know whether the observation is in a CBSA area? I was a little bit confused understanding that question. Thank you.
Martha:
Okay, so a lot of times an indicator variable is a binary zero-one. So if the county is in a CBSA, the variable will be one, and if it's not, it'll be zero. — Janeta?
Speaker 4:
It's pronounced Janeta.
Martha:
Sorry,
Speaker 4:
I was wondering in what circumstances, if any, people would use R Markdown as opposed to just a plain R script?
Martha:
Oh, absolutely. You can use R Markdown, a hundred percent.
Speaker 4:
For a generic data task.
Martha:
I would say so, yeah — you want your output to look clean, right? You want to impress as much as you can. I considered it, but yeah, definitely, and a great question to ask. You can definitely use that. Like I said, there are many ways that you can approach these tasks; one isn't necessarily going to be more right or wrong than another. You just want your code to be clean, well annotated, clear — and correct. That helps. And like I said, if you guys are doing this task later on and you find either a mistake or just something that you think could be better, feel free to email me. I'm actually going to put my email into the chat right now; I don't know if that goes to everybody, but maybe. And now, if you have any other questions, I'm actually going to stop the recording.
2025 Data Task Review in STATA
See current pre-doc at the Becker Friedman Institute for Economics, Scott Blatte, solve a data task in STATA. Find the problem set here and the solutions here.
Scott:
So welcome everyone. This is the Stata workshop room; if you want to be in the R room, just switch to the other breakout room. How this is going to work is, most of you hopefully got an email from Steven or from the predoc.org email this morning with the data task. It's okay if you didn't; it's okay if you haven't looked at it. The idea here is we're going to walk through it together, and then you can look at it after — it'll be posted online. We'll spend a couple minutes just going over data tasks in general and giving a background about myself, and then we're going to dive right into it. There are only 20 of us, so if you have a question or want a clarification on anything, I think you can feel free to just unmute yourself and interrupt, and that should be okay. And there will be various points where I'll probably ask if anyone wants to give or suggest an answer.
And there, same thing — I think you can just unmute and start talking, and if people talk over each other, we'll figure that out then. But a quick background about myself: for the last two years I was a pre-doc at the University of Chicago at the Becker Friedman Institute. I worked for Professor Josh Gottlieb there; I was mostly doing health economics. Before that, I graduated from Tufts University outside of Boston, and starting in a couple of weeks, I'll be a first year PhD student at Northwestern. So I'm going to share my screen and we'll go ahead and get started.
So I'm going to give just maybe a two minute overview of how data tasks work. And this data task, like Steven mentioned, is a bit longer than normal — I wrote five hours. To be honest, I have no idea if that's an accurate timing. It's certainly more questions than I've seen, with the idea being that data tasks take on a variety of different forms. You'll be asked to do a lot of different things, and I think it makes sense to try to cover as much as possible, at least to give you more material to look at. So not all positions have data tasks, but certainly those that are going to be data focused — usually anything with micro, or even things in macro, maybe a little less so in metrics — these positions are probably going to ask you to do a data task. And how these work is you're usually going to get some amount of time, anywhere from a couple of days up to a week, to do the task. Within that time, they're going to ask you to spend a fixed amount of time on it. Now, most of the time it's do it at your own pace and be as honest as possible. If it says don't spend more than five hours, try not to, but rarely are they going to be timing you and saying, okay, if you actually hit five hours of work time, you have to send it. That's not normally how this is going to work.
So what you should hopefully see on the screen is the data task, and I wrote this one, so if you see any errors, if you see any issues, please let me know so we can get them fixed before they go up on the website. But the data task here is going to have four sections. The first section is just going to be some very basic data cleaning: merging the files, getting things prepped for analysis, a couple of summary stats. The second section is going to focus on plotting and visualization. The third section will be some pretty simple regressions. And then the fourth section is maybe a little different than a typical data task, and that's going to really ask you to think a little more — there's not actually going to be any numbers or any code you need to write for that.
It's going to be more open-ended questions, and these are questions that might actually come up more so in an interview, as a follow-up to a data task or a follow-up to something you had written as a writing sample and submitted in advance. But I think it's worth spending some time on right now. So I'm going to try to go a little faster through the earlier sections, just because those are more basic, more cut and dried — here, do this, do that — and we'll try to get to the more open-ended things, which might be a little harder to do. Again, feel free to interrupt with any questions, but we will go ahead and get started. Let me know if, for whatever reason, you can't see anything on my screen. I will try to keep an eye on the chat as well, but I don't think you have access to that, so it'd be easier if you just interrupt me as I'm going.
Great, so I've already prepared the entire script; I figured it would be a better use of our time to not watch me make typos as I type. This is, I think, a decently commented Stata script. No matter what language you end up doing a data task in, you should certainly make sure you're being very clear with your comments and organization. That's actually a surprisingly large part of how we evaluate data tasks. With data tasks, a lot of people are going to get them mostly right — usually they aren't asking you to do anything immensely complicated — and so what can distinguish you from other people is: are you commenting your code? Are you organizing it as cleanly as possible? Are you really spending time to make sure it's readable, so that someone else — me, if I was grading it, or you, looking at it as someone who hadn't written it — can understand this code?
And so you see, I am trying to label and put sections into my code. I have individual comments that are going to correspond to the questions we're going to see, and generally it's not everything crammed together — we have spacing; it's not one large stream-of-consciousness set of lines. Simple things like that can actually go quite a long way, and so I would certainly make sure you do that before you submit any data task. Even if it's 20 extra minutes, it's absolutely worth those 20 extra minutes. Okay, so with that said, let's get into it. What have I asked you to do here? I've given you two files; they're going to be both data sets with pollution data in them. One, as you can tell from the names, is going to be ozone — that's a type of pollutant in the atmosphere. The other is going to be PM 2.5, or PM 25 in the name here, and that's just a different type of pollutant.
At the end of this data task, I've put a data dictionary defining all the variables. You'll usually see one of these if someone's provided you data without the proper labels, or even with the labels they'll probably put this separately, because labels don't always appear depending on the language you're using. But what we're going to do is just go chronologically through this task. Again, I'm going to try to skip the first few questions, or at least go pretty fast. So the first question is just telling you what the data is. It's going to tell you that this is a county-monitor-day level of observation. What does that mean? It means that for each county, on each day, at each monitor — so there could be multiple monitors collecting pollution data within each county — there will never be more than one observation for every county-monitor-day combination.
Both files are that way. And all I've asked you to do here is merge the data. So the first thing I'm going to do is — well, first I'm going to set my working directory to a directory I know is valid — and then I'm just going to read in the first file and describe it. So I've asked you to merge in the data, and there are three unique identifiers, right? There's going to be the county, there's going to be the date, and there's going to be the monitor, which, if you read the data dictionary at the end, you'd see is the site ID variable.
One thing we want to keep in mind is: are these variables of the same type in the two separate files? Because if they're not, they're going to cause issues, and — spoiler alert — I've intentionally made it so they're not, so that there would be some cleaning you have to do in advance to make the data merge cleanly. So here, notice in this first file that the county code is an integer, so a numeric type, the date is a string, and then the site ID is numeric as well. Now I've already pre-written this code, so you're not going to actually see the exploratory step, but if I were you, what I would then do is look at the second dataset, do the exact same thing, and see what the type is for those variables. And you'll see very clearly, well, the county code is a string here.
Site ID is numeric — that's okay — but the date is a float and it's displayed as a date. So essentially we have both the date and the county code that are misaligned, and we're going to need to fix those before we do the merge. I've written that in the code here. I am going to do it in the first file, but there's nothing saying you had to do it this way: you could have changed the types to match the first file, or you could have changed the first file's types to match the second file. I just happen to be doing it in the first one. So what am I going to do? I'm going to use the built-in Stata date function to convert this to a date and then have it display in the same way it's displayed in the second file, as a date.
So I'm going to do it the same way here. Let's do that, and that's going to make this a date. And then I've listed one observation just to confirm that it looks like a valid date. The second thing I'm going to do is turn the county code into a string to match the second file, and that's again just using the Stata command tostring. I'm going to have this format in here because, if you had looked at the county codes when they are strings, sometimes they have leading zeros. But whenever a numeric variable is created, it always drops the leading zeros, because they don't convey any additional information, right? A number with a leading zero is the same as a number without the leading zero. But when we're converting it back into a string, we need to have that leading zero in there; if not, the merge would be incomplete, because you wouldn't match the observations with the leading zeros. Okay, we're going to do that. Great, and now we're just going to go ahead and do the merge. And if all has gone well, as it has, you should see a perfect match.
One Stata idiosyncrasy that I think is good coding practice — I am not sure how much other PIs or teams value it, but it's something we have always done — is you should make use of assert statements. You should do this in any language, right? Assert statements tell you that something you expect to be true is actually true. So suppose I had a zip code; well, I expect it to be, say, five digits long. You could write an assert statement that every observation has a five-digit-long zip code. Likewise here, I expect this to be a perfect match, and that's partly because I've designed the files such that they're a perfect match. And if they didn't match — if there was an observation in one file that wasn't present in the other — this would throw an error, it would break my script, and it would inform me that something's gone wrong. Otherwise I might not know that and might continue as if everything went well.
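For reference, a rough sketch of that cleaning-and-merge step in Stata — the file names, variable names, and the date mask are assumptions, not the exact ones in the task files:

* Sketch only: names and the "YMD" date mask are assumptions.
use "ozone_data.dta", clear

* convert the string date to a Stata date and display it as a date
gen date_clean = date(date, "YMD")
format date_clean %td
drop date
rename date_clean date

* turn the numeric county code into a string, keeping the leading zeros
tostring county_code, replace format(%03.0f)

* merge on the three identifiers and insist on a perfect match
merge 1:1 county_code site_id date using "pm25_data.dta"
assert _merge == 3
drop _merge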
So that's the first question. That's pretty simple, right? Just keep an eye out for things like that, whether it's intentional or not. Sometimes people are going to try to trick you — as Steven mentioned in the main session, sometimes PIs are going to throw in outliers, they're going to throw in NAs or clearly invalid numbers, things where they want to see how closely you're paying attention, and this is just one of those examples. The next few questions are going to be pretty straightforward. They're going to ask you for just some summary statistics: I want the mean, median, minimum, maximum and standard deviation for the following three variables, for every observation in the dataset.
So how am I going to do that? Well, I'm just going to produce that table using the Stata command tabstat — or estpost tabstat — and then output it using esttab. So all I'm really doing here is labeling the variables, and that's just going to make my table look nicer when I output it. Then I'm specifying: here are the five statistics that I want to be produced in the table, corresponding with the question. I'm going to add labels, and then I'm just going to output it. So in Stata, this is what it looks like. You see the output has been written to this .tex file, and I can show you the output. Now, this would be an example answer of a table that looks pretty decent. There's this extra floating row here — I don't really care about that, maybe you ought to get rid of it — but otherwise the table is pretty straightforward. It's answering the question; it's providing the information that's necessary.
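A minimal sketch of that table in Stata, assuming the estout package (estpost/esttab) is installed and using assumed variable names; p50 is how tabstat stores the median:

* Sketch only: variable names are assumptions.
label variable ozone_value "Ozone"
label variable pm25_value  "PM 2.5"
label variable aqi         "AQI"

estpost tabstat ozone_value pm25_value aqi, ///
    statistics(mean p50 min max sd) columns(statistics)

esttab using "output/summary_stats.tex", replace ///
    cells("mean p50 min max sd") label nonumber nomtitle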
We're not having variable names in these row labels. That's a common thing I've seen people do — try not to do that. That's not something you would put in a memo or a writeup or a paper, so try not to put it here. Now, one thing that could be improved: do we really need five digits past the decimal point for the mean of AQI? Probably not. So we could always have rounded that, but that's pretty second order. Now, the next thing we're going to do, if we look at the question, is produce the same table, just for ozone — so just that second variable — but now I want to split the sample by the source variable. The source variable, again, if you read down to the data dictionary, tells you how the monitor reads in the data.
Basically, you can think of it as: the AQS data is more rigorous — it's gone through some more sanity checks on the data collection side — whereas AirNow is kind of the instantaneous reading of the monitor, without checking if there's been a technical issue or if it was a really anomalous reading. And so I'm just interested to see: do these moments look different depending on where they came from? I'm going to run this, and all I've done here is the same command — if you'll notice, I'm just splitting it by ozone source, that's the source variable I just indicated for ozone, and I've only used the ozone value here because the question only asks for ozone.
And what would the output of this look like? It's going to be pretty similar. Here, because it's just ozone, I've replaced what was previously the variable name with the source name — you probably want to indicate that somewhere in your answer, in a title. And I've also added the number of observations in each group and in the total, and I believe the question — yes, the question also said to include the number of observations. That's as simple as adding the count into this command. Okay, it's easier to read if we go back here. So, next question. The federal government actually does not use this AirNow data for rulemaking — essentially, they don't use any data that's only collected from the AirNow source, because it's not validated, it's not certified, and it can't be used to support any policy. So if I just looked at these two tables — especially that second table looking at the ozone output — the question here is: is the federal government right? Can we tell if the AirNow data is better or worse than the AQS data?
I'll leave it up to you. Does anyone want to take a stab at that question — yes or no, and if so, why or why not? No worries if not; I'll give you guys like 10 seconds to decide whether you want to answer, and if not, I'll just go ahead and answer it myself. Okay, no worries, we've all been there — no one wants to speak in front of a big group of strangers. So this is a bit of a trick question, in that I look at this table and I see, okay, the AQS has a slightly higher mean, a slightly higher median, and the standard deviation is also a little higher for the AQS. But the reality is this is just not enough information. Simply knowing the mean or the minimum or the maximum of these two different distributions is not going to tell me anything about how high quality the data is, right?
You could have low quality data that creates this exact same pattern, right? Let me give you this example. Suppose that I took the same data that was under the AQS source and said, okay, this is also the AirNow data, except I completely randomized where the AirNow data was taken from. So it's the same numbers, but they're just completely randomly assigned to observations. This would actually generate basically the same mean for both AirNow and AQS, right? It's the same data, it's the same number of observations, so it's going to be the same mean, it's going to be the same median — but I've just told you that I completely randomized the data, meaning of course it's low quality; it has nothing to do with what was actually read. So this table actually doesn't tell us anything, and we don't know. Maybe the federal government is still wrong, but we have not seen enough information to say that. If there aren't any questions, I'll pause for a couple of seconds and then we'll keep going.
Okay, great. So the next question says, okay, I want to create a county-day dataset to proceed with the analysis. I want it to have the three pollution variables — that would be AQI, PM, and ozone — I want it to have mortality, the county name and code, the date, and the CBSA name and code. Then it asks, basically: how do you create this dataset? How did you reconcile different pollution readings for different monitors in the same county on the same day? I want a county-by-day dataset, but within each county on each day there could be five monitors with five different readings. In Stata, this is actually very easy: it's a collapse command. Collapse essentially says, I have the data at, say, the X-by-Y level, and I want to collapse it, or aggregate it, up to the X level or the Y level from something that was finer.
You'll notice I've used something called gcollapse. All that is, it's the same collapse command, except it's built by the people who have contributed the gtools package. And if you download that package and you put a g in front of some of the commands, like collapse or reshape, they run a lot faster. Run times in Stata are generally pretty slow because it's a very high-level language, and so realizing speed improvements when you have very large data sets can actually be very nice. It doesn't really matter for the data task, but it's something I just always do. So you'll notice that for all the key variables I specified — the three pollution variables plus the mortality variable — I'm taking the mean. The reason I'm going to take the mean here is because the AQI, the ozone, the PM value, these are stocks, right?
This is the amount of pollution in an area at once. So if I'm trying to understand what the pollution is for a county, I don't want the sum of pollution, right? Because that would be saying the reading at one monitor plus the reading at the other monitor is the pollution that the county experiences, when in reality it's not. The county is a spatial object — it's a piece of land with different readings — and the average is the average across all the units of land. So our best attempt at that is to take the average across the different monitors. Now, we don't know anything about how close the monitors are to each other, or whether they're equally covering the entire county; that's something we're not going to answer in this data task. And then what I'm going to collapse by is all the information that I want to keep. County code and date are the two main things, but I also want the county name, and I also want the CBSA code, and I put it in the by group because the CBSA is always the same for every county, so it's not going to create any new observations: if I'm collapsing to the county and date level, then that county is always going to have the same CBSA.
There's not going to be county-date-CBSA 1 and county-date-CBSA 2; it would always be one county-date-CBSA.
So I've done that very simply. Then I ask how many counties are missing days. I can answer that very simply by counting the number of distinct dates within each county code and then counting how many county codes have fewer than 366 — this was 2024, so it's technically a leap year. If you hadn't noticed that and had used 365, it wouldn't have been a big issue; I forgot it at first, then noticed that there are some counties with 366 and confused myself. So we'll see that 31 counties have at least one missing day — they have fewer than 366 unique days, which means there's at least one day missing. And that would finish the first section.
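Sketched in Stata, the collapse and the missing-days count might look like this (gcollapse needs the gtools package; plain collapse behaves the same way, just slower; names are assumed):

* Sketch only.
gcollapse (mean) aqi ozone_value pm25_value mortality, ///
    by(county_code county_name cbsa_code cbsa_name date)

* count distinct dates per county, then count counties short of 366 (2024 is a leap year)
preserve
    bysort county_code: gen n_days = _N        // after the collapse, one row per county-day
    bysort county_code: keep if _n == 1
    count if n_days < 366
restore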
I'll pause here — does anyone have any questions about this section? Great, so we're going to move on to section two. Section two is visualizing the data, and I should say, you probably won't see a data task organized into sections like this. I've organized it like this so that you can follow along if you're doing it on your own or watching this recording, and so you can see groups of questions that you can knock out together. You can almost think of it as: an average data task might have two or three of the questions from each section, and that would form a single data task with no sections. Okay. So the first question here asks you to produce a plot showing the distribution of ozone and PM 2.5. The distributions should be separate lines, sets of dots, or bars, or something — but on the same axes, i.e., the same graph. Choose an appropriate type of graph to complete this task.
Make your graph essentially presentable, professional looking, easy to digest. And then there's a hint here that says: pay attention to the scale of the variables — can we preserve the distribution while standardizing the scales? I say this because it goes to that last point about something easy to digest. If you recall from our first section, we saw that ozone and PM 2.5 are on very different scales: the max ozone value is quite a bit smaller than even the median PM 2.5 reading. So you can imagine if I took these and tried to plot them in any way, shape or form, it's essentially going to be all the ozone values very clustered around zero on the graph, and then the PM values taking up all the rest of the space, going all the way from negative two to 154. That's not going to be very informative, because the ozone values are going to look like a single clump with no visually identifiable pattern. So that's why I present this hint that says something to the effect of: pay attention to the scale.
What we can do here is scale the distributions — scale the variables to standardize them. What I suggest doing, although you don't have to do it this way, is converting these into something with mean zero and standard deviation one: essentially a z-score. Take every variable, subtract off the mean, and divide by the standard deviation. That'll allow us to plot these two distributions on the same mean-zero, standard-deviation-one graph, and that's going to be a lot more interpretable. I think the easiest way to do that is probably a kernel density (kdensity) plot or a histogram. You've probably heard of a histogram; you might've also heard of a kdensity plot. A histogram is just a bar chart, but for a continuous variable with ranges — so it'd be like one to 10 is one bar, 10.1 to 20 is the second bar, and so on and so forth.
A kdensity plot uses what's called a kernel to essentially plot a smooth distribution. So if you've ever seen what looks like the normal bell curve, that was probably produced using something like a kdensity plot. Kdensity plots are nice because they produce very nice looking plots; they can be finicky because there's actually no standard way to choose the kernel estimator — how you actually determine what the shape of that line looks like to smooth it out — or the bandwidth, which also tells you something about smoothing. But those are questions you don't really need to worry about here. Okay, so what am I going to do? We've already seen this, right? If I plot the distribution of these two variables, well, the means are very different, and that tells me plotting them as is is going to cause me problems. So here I'm going to use a Stata function that converts these values into standardized ones by essentially subtracting the mean and dividing by the standard deviation. I'm going to convert both of those variables that way, and then I'm going to make a plot.
And like I said, I'm going to do a kdensity plot here. So this is what my output looks like. We can see that the peak of PM 2.5 is a little higher. And so what does that mean? It means there's a little more density around what appears to be the area with the most mass — around zero, give or take, maybe a little negative — more mass than there is with the ozone. But we also see that the PM 2.5 tail goes out quite a bit, and it looks indistinguishable from zero. But if it is there and it's that far out, that means there is data somewhere around there. And we saw a value that was 154 where the mean was seven, so that means there are more extreme PM outliers in the data.
It may only be a couple. Some other things to keep in mind just looking at this plot: I've labeled my axes, they're pretty presentable; I have a legend that is at the bottom — it's not taking up too much space and it corresponds with the plot. This is something that would totally satisfy me if I was reviewing a data task, or most people if they're looking at a data task. I'm sure it could be improved presentation-wise — maybe get rid of this grid, for example, that type of thing — but this is what a good plot would look like. And as you can see, in Stata it really doesn't take all that much effort: I've built a line break in here, but it's really just two lines of code, and that gives me that plot to answer that question. You could also have done a histogram, and this is one option for a histogram.
Now, you'll see histograms where you have two different distributions that we want to plot on top of each other; those can be a bit difficult to parse because they really cover each other. This is kind of the unedited version, but there's nothing stopping you from improving it. So this is what an improved version might look like, where I've tried to use colors that don't completely overlap, I've chopped off some of the outliers so the bars appear bigger, and that type of thing. I still find this a little harder to parse, and so that's why I would've gone with the kdensity plot, but I think this would've been totally acceptable as well.
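A sketch of the standardized kernel density plot in Stata (egen's std() does the z-scoring; variable names are assumed):

* Sketch only.
egen ozone_std = std(ozone_value)
egen pm25_std  = std(pm25_value)

twoway (kdensity ozone_std) (kdensity pm25_std), ///
    xtitle("Standardized value") ytitle("Density") ///
    legend(order(1 "Ozone" 2 "PM 2.5") position(6))
graph export "output/densities.png", replace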
So we're going to do one more visualization question, and that's going to be: produce a time series plot of ozone for Los Angeles County — that's county code 037 — in the month of February. Do you suspect autocorrelation, and how might you test for it? But don't test for it, at least not yet. So okay, let's answer the first part of that question. You'll see that the county code, as we saw earlier, was converted to a string; I'm going to destring it. I don't think you need to do that — I think I just did that for my own sake. And then what I'm going to do here is use a command in Stata where you set the panel variables, which means you're telling Stata that I have a panel data set where the unit — the geography, or person, or whatever the panel might be — is here the county, and the date variable indexes time. That's nice because Stata also has a separate command that'll allow you, within twoway, to plot a time series graph, and it'll automatically do some smoothing and connect the lines into what looks like an appropriate time series plot. So I'm going to set this and then I'm going to make that plot, and that's going to generate this.
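Something along these lines would produce that plot; the county code 037 and the variable names are assumptions:

* Sketch only.
preserve
    destring county_code, gen(county_code_num)
    keep if county_code_num == 37 & month(date) == 2   // LA County (037), February only

    tsset county_code_num date        // declare the panel: county as unit, date as time
    tsline ozone_value, xtitle("Date") ytitle("Ozone")
restore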
So we're going to have ozone — the value of ozone — on the y axis, and we have the date on the x axis. You'll notice, if you look at the code, that this is not what the original date axis looks like. If I were to comment this out and run this line, you'll see that it actually looks like this. And you'll see this in plenty of places in papers: people leave the x axis looking like this when they're using Stata time series. I personally just find it a little unprofessional; I think you can do a little better with simply one line to generate it in a more human-readable format. So now that we have our plot, you'll recall the second part of the question asks: do you suspect autocorrelation? And if you scroll down to the footnote, I define autocorrelation — auto meaning self. So autocorrelation is when something is correlated with itself over time. Here it's the equivalent of saying today's pollution is correlated with yesterday's pollution — it's not completely random every day. If you had high pollution yesterday, you're more likely to have high pollution today, and vice versa. And it could very well be that if you had high pollution two days ago, you're still more likely to have high pollution today, even after considering yesterday, right? It might be that two or three days ago matters for determining today's pollution.
So the question I seek to answer is, well, is that the case here? I'm not looking for a very precise answer, but I would say yes. And the reason I would say yes is you look at this plot — look at these clumps of days. For example, high pollution here leads to high pollution here; okay, we see a fall off here, but then it continues to fall the next day and then it reverses. Stays high up here, stays high up here, falls a bit. Of course it's not perfect, right? But it doesn't look like it's completely random. If there's an upswing in pollution, it's usually over the course of three or four days, and then it turns down again — it hits its peak. And if you wanted a simple test, you could create a variable of yesterday's pollution and just correlate today's pollution with yesterday's pollution; you'll see it's quite large. There's one other option, and I do ask you here: how might you test for it?
You could test what's called an AR, or autoregressive, model — and we'll actually get to that later, as you can kind of see in the section four questions. The way you write it is AR and a number in parentheses, as you'll see down in section four, and that number is essentially telling me how many previous days I think help predict today's pollution. The way to estimate that is you simply create new variables that are the lags — the previous-day values — of pollution, and you regress today's pollution on all those values. And if you see positive, or negative, but significant coefficients on those values, then that gives you some evidence that there's autocorrelation present, and then an AR model — AR(1), (2), (3), et cetera — is going to help you predict pollution, and that might in fact be the process that generates pollution. It might be that it's just yesterday's value plus some randomness.
So what would be an acceptable answer for something like this? It's more open-ended. 'Do you suspect autocorrelation?' might be as simple as: yes or no, I've looked at the plot, and it looked like tomorrow's pollution seems to be closer to today's pollution than if it had been completely random. That's a perfectly acceptable answer — all it's really looking for is looking at the plot and making some observations off of it. And then for 'how might you test for it,' anything that I just said — a correlation, an autoregressive model, anything of that flavor — would do a good job here.
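Two quick ways to check, sketched with assumed names (the data has to be tsset already for the lag operator L. to work):

* Sketch only.
gen ozone_lag1 = L.ozone_value
correlate ozone_value ozone_lag1           // simple correlation with yesterday's value

regress ozone_value L(1/3).ozone_value     // an AR(3): today's ozone on the last three days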
Okay, I'll pause here and again ask: does anyone have any questions? If you do, please unmute. I'll give you guys 10 or 15 seconds and then we'll move on to section three. Oh, hi. Sorry — just one question: for commands like describe, do you keep them in your code when you submit? It's a great question. I'm not opposed to it, especially here where I'm using it to inform something — so up here with describe, I'm saying, okay, I can tell the date is a string because I've described it. And you're right that all the describe command does is print something visually; it's almost never strictly necessary for the code. But when I'm specifically using it to inform later code — and sometimes that's an ex post thing, I only know I need the describe command because I tried to do the merge, saw it failed, and had to go back — in that case, I think it's totally fine to keep it in. In other cases, if you're just doing it casually, maybe you don't need to keep it in; all it's going to do is create extra output. But in general, I don't think it's that big of a deal, because it never takes very long to run and it can be informative, especially when you're reading in a new data set. Okay, thank you. Of course. Any other questions on this stuff so far?
Great. Okay, we're going to go to section three now. Section three is actually running regressions: finally, we're investigating the relationship between pollution and mortality. I've given you a variable, the air quality index, and it's designed to measure the aggregate effect of different pollutants on air quality. So far we've worked with two pollutants, PM 2.5 and ozone, but there's also nitrogen oxide, there's also PM 10 — there are all these different pollutants — so how do we get a single measure of air quality? AQI is one that's commonly used. And so I want to figure out, okay, does AQI cause death? Of course AQI is just a number, but what it's representing is pollution, right? So it's essentially asking: does pollution cause mortality? I'll note here that all the other data I've given you is absolutely real — I downloaded it straight from the EPA. I made up the mortality data, and that's mostly because county mortality data is very hard to get publicly, especially for smaller counties. And so I figured it would be easier to just generate it with a process I knew would work well for the data task. Most of the time in a data task you'll be getting real data, but here I figured this was the easiest way to do it.
First question: I just simply want to estimate the association between pollution and mortality. I want to regress mortality on pollution — pollution being AQI — and then I also want to include fixed effects for county i and time t. And then I want you to report and interpret beta one, and I want to ask: why do we include fixed effects? Okay, so how do we do this in the code? Well, this is actually pretty straightforward in Stata. The normal Stata command is regress, abbreviated reg. When you have fixed effects, generally people have started to use reghdfe, which essentially stands for regression with high-dimensional fixed effects. What does that mean? It means if I have a lot of fixed effects, that's a lot of different variables and coefficients for regress to handle. Stata has native functionality to help with this — areg, or I believe xtreg, can also do this — but reghdfe is simply faster, and what it does is you put your fixed effects in this absorb option and it'll essentially get rid of them before running the regression, which realizes speed gains.
So anytime you see something like this with an absorb() option, you can basically think of that as a fixed effect. So I'm going to run this regression. Okay, I get this coefficient. What I'm doing in these next two lines is simply extracting the coefficient and printing it to a .tex file. This is not something you'd typically have to do for a data task; it is something you might have to do once you get the job. It's something I do almost every day: we do not allow any hand-typed numbers. You've probably read papers before where they say, okay, the effect of this is some number; at least on my team, that number is always produced by the actual code and saved off as a file, so there's no ability for a human to mess it up, and it will update any time we make changes to the code. So that's all that's happening here: I'm extracting the coefficient and saving it. You don't need to worry too much about that. What we want to focus on is this coefficient of 0.527. So I ask you here to report and interpret beta one, and then I ask why we include fixed effects. The report-and-interpret part should be pretty straightforward.
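For anyone curious what that extraction step can look like, here is a minimal sketch; the variable name aqi and the output path are assumptions, not taken from the task's code:

```stata
* A minimal sketch of extracting a coefficient and writing it to a .tex file.
local b1 : display %5.3f _b[aqi]                  // pull beta_1 from the last regression
file open fh using "output/beta1_aqi.tex", write replace
file write fh "`b1'"                              // the .tex file contains only the number
file close fh
```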
This is saying that for every one-unit increase in AQI, mortality increases by 0.527. Don't worry too much about the units; maybe I'll clarify that in the task. AQI is an index, so it's for every increase of one in the index, and mortality is completely made up, right? I made that up. So if the units aren't specified, you don't have to worry too much about them. I think the more interesting question here is why we include fixed effects. I'll pose that to the group. I have a feeling we might not get an answer, and that's okay, but that would be my question to you: why do we include fixed effects? Does anyone want to take a stab at that?
That's okay, no worries. So one reason we include fixed effects is that we want to think about the ideal comparison we'd like to make here. What would be the perfect comparison? Well, the perfect comparison would be if we could observe the same county over time and isolate just a marginal increase in pollution: everything in that county is exactly the same, save for the fact that there's been a small increase in pollution. That would very clearly tell us what the effect of pollution on mortality is. That goal, making within-group comparisons, is achievable when we use fixed effects. What fixed effects essentially do is take out anything that is invariant within the unit. So if it's a county fixed effect and I include it in the regression, what it's doing is taking all the things, all the variables, all the factors that are unique to a given county but do not change over time, and essentially removing them from the regression.
You can think of it this way. If we have an observation in county one, suppose it's made up of two components: things determined by the pollution-mortality relationship we care about, and things that are based on a set of fixed county characteristics. Maybe this county is higher in the mountains, so PM2.5 sinks to the ground below and the county isn't actually affected by the smog, something to that effect. Likewise with time: suppose pollution was very high in one year, but there was also a pandemic, so people were inside. That's something common to all counties in that year, and we don't want it to contaminate our regression. We want to take away these average things and focus on the within-group comparisons, right? We don't want to compare county to county; we want to compare within-county changes in pollution, and the way we achieve that is with fixed effects.
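One rough way to write down the model being described, in my own notation rather than the task's:

```latex
% Two-way fixed effects specification (notation is mine, not the task's):
\text{mortality}_{it} = \beta_1\,\text{AQI}_{it} + \alpha_i + \gamma_t + \varepsilon_{it}
% \alpha_i absorbs everything fixed about county i (e.g., sitting up in the mountains);
% \gamma_t absorbs anything common to all counties in period t (e.g., a pandemic year).
```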
It's the same logic as controlling for omitted variables to avoid omitted variable bias, right? We include them because we're worried that if we don't, we're biasing our estimate. It's the same with fixed effects. Any questions on that? In the interest of time, we're going to move to question two and skip question three for now so we can get to section four. So question two: I say I'm concerned that yesterday's pollution affects mortality too. Include one lag of AQI, rerun the regression, and report and interpret the coefficients. In Stata, you can either create a variable that's the lag or do it within the regression, because I've already told Stata what my panel identifiers are: a county and a date. Stata knows that when I specify L1. for the first lag of AQI, it should take the value from the previous date for that county. So I don't need to create any new variables; I can just run the regression as is, and it yields these results. So was my concern that yesterday's pollution matters valid? Well, yes, absolutely. We see major movements in the coefficient, right? What was once 0.527 above is now down to around 0.3, and the coefficient on the lag is actually higher, about 0.37.
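Here is a minimal sketch of that lagged specification; the xtset call and the variable names are my assumptions about how this particular panel is set up:

```stata
* A minimal sketch of the lagged specification (names are assumptions).
xtset county date                        // declare the panel so the L1. operator works
reghdfe mortality aqi L1.aqi, absorb(county date)
```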
Now, that actually might make a lot of sense, right? If pollution killed people the same day, that would essentially mean you breathe in the pollution and die, say, six hours later, and that's certainly possible. It could also be the case that you breathe in the pollution and die 30 hours later. If that's simply more common, then we could easily get this pattern, where there's an effect of today's pollution on people who die today, but there's also a larger effect of yesterday's pollution on deaths today. All that's saying is that pollution takes time to kill you, and I think that's pretty plausible; it's not going to be instantaneous.
We're going to skip question three. All it asks you to do is include an interaction between AQI and an indicator for whether you're in a CBSA. The question is whether there are different, heterogeneous treatment effects depending on whether a county is in a CBSA or not. The code simply creates an indicator for whether the CBSA code is missing and then runs this regression. The short answer is no, because I created the fake mortality data and I didn't make it differ by CBSA. Of course, in real life that might not be the case; if you're dealing with real data, we don't know the answer in advance.
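For reference, here is a minimal sketch of what that check can look like; cbsa_code and the other variable names are assumptions about this data set:

```stata
* A minimal sketch of the heterogeneity check (variable names are assumptions).
gen byte in_cbsa = !missing(cbsa_code)            // 1 if the county belongs to a CBSA
reghdfe mortality c.aqi##i.in_cbsa, absorb(county date)
* Note: the in_cbsa main effect is collinear with the county fixed effects and
* will be reported as omitted -- which is exactly what question four asks about.
```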
And we'll also end up skipping question four, though maybe we'll spend 20 seconds on it. When I run this regression, sorry, let me generate the indicator first, I get 'omitted' next to the CBSA variable: there's no separate coefficient for the indicator of whether you're in a CBSA, yes or no. The question is why that's the case. The answer is that I've included county fixed effects, right? I've now taken out anything that doesn't change within a county. Well, I already told you earlier in this task, when I defined the CBSAs, that a county is always either in a CBSA or not. So CBSA status is fixed at the county level, and that's it. When I take out county fixed effects, I'm taking out anything that is CBSA versus not-CBSA specific, so these are going to be completely collinear. I've taken out all that variation, and Stata won't be able to estimate anything on it.
Okay, let's get to section four, because I think this is maybe the most open-ended section, the one that makes you think the most. This section does not ask you to do any coding whatsoever. What it asks you to do is essentially think: take our results and think a little bit critically about what they mean. So let's look at the first question. Suppose I know pollution follows an AR(3) process. All that essentially says is that today's pollution can be predicted by yesterday's pollution, the day before, and the day before that, and they each contribute something unique. So even knowing yesterday's pollution, the pollution from two days ago tells me something else about, sorry, about today's pollution, and the same goes for the third lag. That's what defines an AR(3) process, and it also tells me that the partial autocorrelations are greater than zero, 'partial' meaning that even if I controlled for pollution yesterday and the day before, I'd still get a positive relationship between today's pollution and pollution from three days ago.
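One rough way to write down the process being described, in my own notation rather than the task's:

```latex
% A sketch of an AR(3) process with positive partial autocorrelations (notation is mine):
AQI_t = \rho_1\,AQI_{t-1} + \rho_2\,AQI_{t-2} + \rho_3\,AQI_{t-3} + u_t,
\qquad \rho_1,\ \rho_2,\ \rho_3 > 0
% The third lag still predicts today's pollution even after controlling for the first two lags.
```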
So then the question is: do I know how many lags of pollution I should include in the regression to avoid omitted variable bias? Does it matter how these lags associate with my outcome? Does anyone have any thoughts on this? Again, in the interest of time, and recognizing everyone's a little timid today, I'll go ahead and answer it. The answer is that it absolutely depends on how the lags associate with my outcome. The regression we've been running is mortality on pollution. If I tell you that pollution is an AR(3) process, that means pollution is predicted by the previous three days of pollution. So an instinct might be, okay, I want to include all three of those lags in the regression. But suppose I told you that mortality does not depend on anything besides today's pollution. Even if pollution is predicted by the previous three days, mortality itself only depends on today: people who breathe in today's air die today if they're going to die from pollution, and no one dies after the fact.
In that case, I would not need to include any of the lags. Recall the omitted variable bias formula; we're not going to go through it in detail right now, but essentially it requires the omitted variable to have a relationship both with the X variable whose coefficient you're worried is biased and with the outcome of interest. If it doesn't, then it's not an omitted variable. So this question is essentially asking you to think about threats to identification and how to determine whether there's omitted variable bias. Maybe you won't be thinking too much about lags or AR(3) processes in practice, but trying to identify a treatment effect is essentially the hallmark of modern applied microeconomics, so this type of thinking is super important, and it wouldn't shock me at all to see a data task or an interview try to gauge your understanding of it.
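For reference, the textbook version of the formula being alluded to, in standard notation rather than anything from the task itself:

```latex
% Omitted variable bias (textbook version). If the true model is
y = \beta_0 + \beta_1 x + \beta_2 z + u
% but z is omitted, the short regression of y on x gives
\operatorname{plim}\hat{\beta}_1 = \beta_1 + \beta_2\,\frac{\operatorname{Cov}(x, z)}{\operatorname{Var}(x)}
% The bias term vanishes if \beta_2 = 0 (z does not affect the outcome) or
% \operatorname{Cov}(x, z) = 0 (z is unrelated to the included regressor).
```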
In the interest of time, I'll keep going to questions two and three, then try to save a couple of minutes for any remaining questions before we head back to the main session. Question two: we're seeking to answer whether exposure to pollution increases mortality. Suppose we had found that there is no effect of lags of pollution on mortality, and we estimated the coefficient in section three, question one, that estimate of 0.5-something. Can we interpret this result causally? Why or why not? The short answer is no. The fact that I've told you the lags don't matter doesn't tell me anything about other omitted variables or about selection. What do I mean by that?
Suppose I tell you that weather really matters here: pollution when it's cold doesn't matter much, but pollution when it's hot matters quite a bit. But hot weather can also kill people from heat stroke. Now we have a classic omitted variable, something that predicts the outcome and also correlates with the independent variable. If I don't include it, I very well might have omitted variable bias, and so I don't have a causal estimate. Let me give a more complicated example. Suppose the people most likely to be affected by pollution are those in poor health. I'm a health economist, and we know that health correlates with income. That might mean the poorest people also have the poorest health and are at the most risk from pollution. Maybe they have to live in areas that are less desirable, and one reason those areas are less desirable is that they're near high-pollution power plants.
You might be thinking that seems like a stretch. It's definitely not. There's real evidence that areas near large power plants are undesirable and that they tend, through no fault of the residents, to have more poor people present, because that's the only place they can afford to live. If that's the case, then there's selection. I'm trying to understand: does pollution kill people? My question was not: taking as given how people sort, does pollution kill people? If it had been, then maybe we'd be closer to answering it. But if I want to know the true effect of pollution on mortality, sorting based on pollution is a problem, and it's going to prevent me from causally identifying my estimate. This question is very open-ended; those are two answers, and there's probably an infinite number of answers you could give. All we'd be asking for is something along those lines, something that shows you're thinking about threats to identification and that you understand that a regression with one control, or even no controls, and two fixed effects is often not enough.
The final question, really quickly, is an interesting hypothetical. Suppose instead that pollution's main effect is that it kills people who were going to die tomorrow anyway; it only kills the most vulnerable people. The question is essentially: what happens if I regress the sum of mortality today and tomorrow on today's pollution? Well, that would actually give a coefficient of zero, because the question is whether pollution is killing any additional people over this time period, and the answer is no. The people who would have died by the end of tomorrow have still died by the end of tomorrow; the people who are alive are still alive. So pollution does not actually raise mortality over the two-day period. But if I had regressed today's mortality on pollution, I would see a very positive coefficient, because to the people who were going to die today anyway, we've added the people who were going to die tomorrow but whose deaths we've moved up.
And so that's attributable to pollution. But by the same logic, tomorrow's coefficient, if I regressed tomorrow's mortality on today's pollution, would be equally negative, because fewer people are dying tomorrow, or at least it appears that way. What this question is trying to get at is that even when you identify a causal estimate, it's not wrong to say that pollution killed people today; if that was your question, you're entirely right. But the answer, when you're trying to get an actionable estimate, something you can tell policymakers or write a paper about, is sometimes much more complicated. Here we'd have to say something about the cost of killing this person a day in advance, and there's all this research on the value of statistical life years and things like that which gets at it.
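A rough way to write down that displacement logic under the hypothetical, in my own notation:

```latex
% If pollution p_t only moves forward by one day deaths m that would have
% happened tomorrow anyway (notation is mine), then in the hypothetical:
\text{regress } m_t \text{ on } p_t:\ +\delta, \qquad
\text{regress } m_{t+1} \text{ on } p_t:\ -\delta, \qquad
\text{regress } (m_t + m_{t+1}) \text{ on } p_t:\ 0
```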
So I'm sorry we went a little fast through this last section; we're quickly running out of time. Obviously this will all be posted online with the recording, along with some written answers that will be about the same as what I've verbalized here. Does anyone have any questions about this, or about data tasks in general, in the next 20 seconds? If not, it looks like the breakout rooms are about to close, so everyone can head back to the main session for the panel. Thanks so much, guys. Thank you. Thank you. Thank you. Okay, I'm thinking one of the co-hosts may still have a recording going locally; we can go ahead and close that. And let me take.