
Rotem Dror

Researcher at Technion – Israel Institute of Technology

Publications: 18
Citations: 581

Rotem Dror is an academic researcher from the Technion – Israel Institute of Technology. The author has contributed to research in topics: computer science and statistical hypothesis testing. The author has an h-index of 5 and has co-authored 13 publications receiving 330 citations. Previous affiliations of Rotem Dror include the University of Pennsylvania and IBM.

Papers
Proceedings ArticleDOI

The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing

TL;DR: This opinion/theoretical paper proposes a simple practical protocol for selecting a statistical significance test in NLP setups, and accompanies this protocol with a brief survey of the most relevant tests.
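One of the standard significance tests such a survey covers is the paired bootstrap, which compares two systems evaluated on the same test instances. The sketch below is illustrative and not taken from the paper; the function name and the per-instance 0/1 score representation are assumptions.

```python
import random

def paired_bootstrap_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap significance test (sketch).

    scores_a / scores_b: per-instance scores (e.g. 1 = correct, 0 = wrong)
    for two systems on the SAME test set. Returns an approximate one-sided
    p-value: the fraction of bootstrap resamples in which system A's
    advantage over system B disappears or reverses.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    count = 0
    for _ in range(n_resamples):
        # Resample test instances with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if delta <= 0:
            count += 1
    return count / n_resamples
```

If the returned value is below the chosen significance level (e.g. 0.05), the difference between the two systems is unlikely to be an artifact of the particular test-set sample.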
Proceedings ArticleDOI

Deep Dominance - How to Properly Compare Deep Neural Models

TL;DR: The criteria for a high-quality comparison method between DNNs are defined, and it is shown that the proposed test meets all of these criteria while previously proposed methods fail to do so.
Journal ArticleDOI

Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets

TL;DR: This paper proposes a Replicability Analysis framework for a statistically sound analysis of multiple comparisons between algorithms for NLP tasks, and demonstrates its empirical value across four applications: multi-domain dependency parsing, multilingual POS tagging, cross-domain sentiment classification and word similarity prediction.
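When one algorithm is compared against another across several datasets, each comparison yields its own p-value, and the family-wise error rate must be controlled. A classical building block for this kind of multiple-comparison analysis is the Holm–Bonferroni step-down procedure, sketched below; this is an illustrative baseline, not the paper's specific partial-conjunction framework, and the function name is an assumption.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down correction (sketch).

    p_values: one p-value per dataset/comparison.
    Returns a list of booleans, one per input p-value, indicating
    whether the corresponding null hypothesis is rejected while
    controlling the family-wise error rate at level alpha.
    """
    m = len(p_values)
    # Test p-values from smallest to largest against shrinking thresholds
    # alpha/m, alpha/(m-1), ..., alpha/1.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject
```

For example, with p-values 0.01, 0.04, 0.03, and 0.005 across four datasets, only the comparisons with p = 0.005 and p = 0.01 survive the correction at alpha = 0.05.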
Book

Statistical Significance Testing for Natural Language Processing

TL;DR: Data-driven experimental analysis has become the main evaluation tool of Natural Language Processing (NLP) algorithms; in the last decade, it has become rare to see an NLP paper that does not include such an analysis.