One paper published in 2021 focuses on using Large Language Models (LLMs) for long-form question answering with document retrieval. The authors work on the ELI5 benchmark, aiming to generate detailed answers grounded in retrieved documents. Despite achieving high scores, they uncover fundamental issues. First, they find that the quality of the retrieved documents has minimal impact on the generated answers, suggesting that model success relies primarily on pre-trained knowledge; human evaluations confirm that answers produced with and without relevant retrievals are hard to tell apart. Second, they highlight a systemic problem that extends to other benchmarks: significant overlap between training and validation sets, which lets models inadvertently memorize answers from the training data. Lastly, the authors critique the use of ROUGE-L as an evaluation metric and propose human evaluation instead, despite its challenges. The paper emphasizes the importance of scrutinizing seemingly impressive benchmark results and advocates for more robust evaluation methods.
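To make the metric critique concrete, the sketch below is a minimal, illustrative re-implementation of sentence-level ROUGE-L, which scores a candidate answer by the longest common subsequence (LCS) it shares with a reference answer. This is not the paper's evaluation code, and the example strings are invented for illustration; the point is only that a generic, fluent answer can achieve substantial lexical overlap with a reference without being grounded in any retrieved document.

```python
# Minimal sketch of sentence-level ROUGE-L (LCS-based F1).
# Illustrative only; real evaluations typically use an existing package.

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# A vague but on-topic answer still overlaps heavily with the reference
# (hypothetical strings, chosen only to illustrate the failure mode).
print(rouge_l("the answer depends on many factors such as context",
              "the precise answer depends on the specific context"))  # ~0.59
```

Because the score rewards surface word overlap rather than factual grounding, it cannot tell whether an answer actually used the retrieved evidence, which is why the authors argue for human evaluation despite its cost.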