From 8c412f04543667631a0cf2d3ea34afbc62f281dd Mon Sep 17 00:00:00 2001 From: Michael Beck Date: Sat, 13 Dec 2025 17:52:35 +0100 Subject: [PATCH] first working full version --- Makefile | 9 +- README.md | 5 +- _quarto.yml | 27 ++++- custom-reference-doc.docx | Bin 0 -> 8916 bytes deps.R | 1 + index.qmd | 201 +++++++++++++++++++++++++++++--------- 6 files changed, 193 insertions(+), 50 deletions(-) create mode 100644 custom-reference-doc.docx diff --git a/Makefile b/Makefile index 6a398ea..ce32812 100644 --- a/Makefile +++ b/Makefile @@ -1,5 +1,11 @@ QUARTO ?= quarto +# Allow overriding the list of docx outputs via the variable `docx` (make command line or environment). +# Example: `make docx docx="index Supplements"` or `make docx docx=index`. +DOCX ?= index Supplements +DOCX := $(if $(docx),$(docx),$(DOCX)) +DOCX_DOCS := $(addsuffix .docx,$(DOCX)) + .PHONY: all pdf docx clean # Build both formats for both documents @@ -7,7 +13,8 @@ all: pdf docx # Aggregate targets pdf: index.pdf Supplements.pdf -docx: index.docx Supplements.docx +docx: $(DOCX_DOCS) +docx-main: index.docx # Pattern rules for either format %.pdf: %.qmd diff --git a/README.md b/README.md index 2ef7d92..cd08c60 100644 --- a/README.md +++ b/README.md @@ -1 +1,4 @@ -This repository contains the quarto project for the article "Mining Transparency: Assessing Open Science Practices in Crime Research Over Time Using Machine Learning". \ No newline at end of file +This repository contains the quarto project for the article "Mining Transparency: Assessing Open Science Practices in Crime Research Over Time Using Machine Learning". + +Extensions: +- [kapsner/authors-block](https://github.com/kapsner/authors-block): adds an author-related header block when rendering docx documents with Quarto. 
\ No newline at end of file diff --git a/_quarto.yml b/_quarto.yml index 1a476e6..2c148ea 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -2,6 +2,27 @@ project: type: default output-dir: _output +lang: en-US + +authors: + - name: Michael Beck + affiliations: + - ref: die + corresponding: true + email: michaeljbeck@proton.me + orcid: 0009-0005-4622-4717 + +affiliations: + - id: die + name: German Institute for Adult Education - Leibniz Centre for Lifelong Learning (DIE), Bonn, Germany +abstract: | + This pilot study addresses the current lack of systematic, large-scale evidence on Open Science Practices (OSPs) adoption in criminology and legal psychology. A scalable, machine-learning-based text classification pipeline is introduced to map the prevalence of Open Access (OA), Open Data (OD), Open Materials (OM), and Preregistration (PR). The analysis is based on publication metadata and a year-stratified sample of full texts from the top 100 journals in Criminology & Penology, Law, and Psychology (2013-2023). After identifying articles containing statistical inference (SI) via a high-performing classifier, I utilized GPT-assisted coding and supervised learning to train specific classifiers for OD, OM, and PR. OA was classified using publicly available metadata. Among 1,763 SI articles with usable full text, design-based estimates reveal a significant disparity in OSP adoption. OA is relatively common (40.9%, 95% CI: 38.8-43.1) and has steadily increased from approximately 20% in 2013 to 50% in 2023. By sharp contrast, trends for OD, OM, and PR cannot be reliably quantified. + | Extreme class imbalance and the minimal number of positive cases indicate a very low underlying true prevalence for these practices in the assessed field. Methodologically, the study confirms that GPT-assisted coding supports accurate SI detection, but robust prevalence estimation for extremely low-frequency OSPs remains challenging for downstream classifiers. 
Overall, this project establishes a transparent and reproducible pipeline and provides critical baseline estimates for future, larger-scale assessments of research transparency in crime-related fields. +keywords: + - Metaphysics + - String Theory +date: 2025-12-14 + format: aog-article-pdf: papersize: a4 @@ -23,10 +44,14 @@ format: lof: false docx: prefer-html: true - toc: true + toc: false toc-depth: 3 lot: false lof: false + reference-doc: custom-reference-doc.docx + +filters: + - authors-block always_allow_html: true diff --git a/custom-reference-doc.docx b/custom-reference-doc.docx new file mode 100644 index 0000000000000000000000000000000000000000..e0d0f77c7dc439aad6e4e598950d81665dfecdda GIT binary patch literal 8916 zcmaKR1yoyW(={H7Tan^k+}+*XU4y&3Yk=bJ?(P)#Vx>^rp_Bpzid%sn?knx>|Gjsz za@NU8_DP=EPxj22ttbNtg#iW&3k%i~9HRsFTfje`8#n=NoEhj}j@2`L(gT7hFq_`F z#_8UTmNX;@MOai96ljEoOW9;mOW8oTiqoS|m69<74Svu#L*HP|IT z+?`Hv*+P55_Vhi#Y`*+m4Gez=V%ZhT)>DV8Ak$RDrxfgkNGjr}5rhi1)xyu4qx>wz zKJpkU*g^&X+b5)gnsdzD(Hscz_OJa?jNC5cWQj8g7Va|fe8Bu4LX#2-0MV{L$ z=R)6`-e)mns1=*%F8P2f$^c+@ovuXppVvPP0S2b{KQn~-{0UQg6GbO`2WJKodnX{h zhpmmTvR*7YF^2E8hO9@fQJzkv>6p0Q^#?gwyuwq&tz(Q`8o}0CfBU?nHtRPAu-91w z?bl!fM>cZjP)wF5EfFq!R=rZSO_?p<4m~!yh2+Nisf;><>9~c3TYt_GK_>F|?|p+D ztRiM>WTR)oX|YIkHpHfEhon*7gKS&MkRYrcEkAec7aH9#hLF8~WNV2FMbp|eUJM&T z&8AD%r5)hUu*xHYfyC66Pji8Knlp=F*W*)6Hu*RRr>$P+Bw*bD5cZM)tB^ntdg*`wriB6J=(~bc)@u-{f21;p0fa z6PE#kf;Zvqd*Ah2O;&hS3uq~E{D!M{8fVaXWGwx1-b~;c)9b!f3DDHYR-$9b`uh@y z2M_VcH*jw0<2m(DFolO`(N*FNMi+GN+dVPA8!Pfic+BhqoRl-^f$7{w{*Msp3h^nz zQYo(Ny(10k$Eq5^sRoHpfARK62?g>52Lt>1kG!G$&YO{g!wYTM%H#3_j3}Mwuj7Z3 zIUS|m#94<3W}&E(@Y>DIud|AuinIuAohX%A3?#2iO?{t`e*e>oRpcs?*ved^XaYVV zKUv=Ly-xhz)o~TEmaO4xgsE+ij}e*1c1UgTId;NuT8vqpqu5>2Px4}=D+{8pQ-%1G z)9r>tL#v_qu6m{9>)f|S(-dQ%f)Y-OCb+qD0qlgXon7<0+HxnSWrpr10|HXiPO%+; zm5)cm588>S<=mgb!2VfnJJKW7vjxLqPSKv$1RC@>GiT-bY^SG z`sckR+4|GIu0q+SvQ{~*MsYotikiIDey(8hhUyAz`uFc#=ai*OIJ3>=`WWsDBO1n{ z3f}rb_3(9v;fpIdbuipi!c&8+`u4lGIryOBUWs5-QE_E&;apd!Jbe)|#)#S_oQvMF 
zhOL9&y#HVg;uVyS^(X57ixCV14}R5WMr8ky5%k{~F>!TvvA6w;l&N~8AVw6h>9tx5 zyYPZwrjUN)Ed1#%Dgdjcw_GjCjkk%c%QZk>f7W(A@-?~+9Naq~G(>QIUzk*1Vb5zU zlq^^5rosR?H`8y+BF+jJOu#5w_7e!2H0O8CNj7E|88RSj_7ibweRFy$xrE_QZ#omS zc*AHF)Lc`{wq&~_43PNsyT1B=2%PX9`RErrI?aBTYHRQ!5L@$L0V&Tv`tFBdoxY`2kTwa#;FHKDpk{M(~2|c-|5k6)=s)&-D?JBDSFev*A%9h$L zlA$LkE&DUHslBhmMCO~q^*GsKQaJ{gF98Nkr)Bg_xy(1&*oWN{dgc{0+ zR2GY^_;&^ZeX5_k}|wS+ymz=BV8S6LP=WIHW|4pRTa&+%pL_4HNpmqR_!k zg2!Zzk@sQ9tq1==2nhN%mm9EP4bc+uCI0K{2T2*awJP|na&>#muPBL4P>$~pioxPf z*M6!~0ACrkQb1g8n&JDzP}ZkmvlnxE4;-Msq;&pCcw@c*eo?oqzP)?z5L$JK*(6`^@ju6H)$p|}MxLq3>bPa)Gg$Yr@=#U{>bzkV| zst!u)W0p|mj>=L*t(&IfLWG)7L>d%(M{FKMeg@INYbX^&T)|*x+b6=2*Z_O?uv|6* zCyK=-h5a!QgKv1Yo2k9p_wu37aCI*NEE4B64t`F#Qwg>aK#hNf5Ct}B)W zwXgpsVvFP?vtm_#u|U{WU}Hm_#jVikZh$~aXhIhlnnapm`{(=`NzVa7lV)HuA;5UeZNWu3>%t$0b!mVdv@8CX`cgA}^w7`*IGR!qk8zAc%m^n$ssX!|PgdF3^ z1g&;gY*uvlbj%-lR@T5>up*0s6Tz(okp|wNJJsUPiDrc1%dy4Ed5alw+-l)`K;N=2 zcR}1LNy6EoL%%4$TF(Hesl%8TUlCnbEdmYogiEGuxKGd_okup#@h<7wV5d_47g5BfF^W zJ8kGj%JvzyvhHze{|VVpT_Z-SEmhb^oN=20Xk1=7np=j+Muv$y45RQt8G%QN>6mt? 
zjBe-#TRHk#hUtK8NX?vT*p{m19r4hdAJuSnQJPE2b_Lzg7i?wOD;dW*DG*0?aavQc zhcg1z`u9|0r!)-rH27OEPD!q}4d|RSR>JZFWeb-RH#tM&T-sUv^%_c!7^yk4KV3?2 zHu*+8aa!PR#0C9H2*4wqftuI*1jF;f_jllrQ!KoVz(CHvi%(l#X{Yy#?(TpepEB@< zyDirbZila}w;hhL6O*Kx*b3a3S}6<^gE$Km1>kB{$yerd!druf4}Fo8S)oIvPqX{e z$PxJGgIjhcGAqd@_eBf#Ij4Vk5@X}U-zkfT72zjDiiv;8R4ETRuwb12nChhUHvH<< zJ%dFxtn>jhPP}&)iN+;Md3_{SUilMLN>!4aS`gM-JL=MixR1SxCPU+#V}<3HFp1^| zq!i@i1}%G>lWt%aI;JFNk@XU#aA?iC{zhXCa}B8HkR$bX=}veGI7|d#JC$CutJFDF zLYgv`c?UINtV`br?d$H^J~q||LuJ@|U)#ovvphg-un>^avk%VFs|6(y0ZF?TMh3mU zS;EAj4vVyws5)y(cy|zU7Gq8tavU94UT$M#R#esEJG_3U#RgP$5_bvQ&X)(Mq2z}+ zVsxZt5K5A5-+dhlv`X#a5ieGxfH21E;0QGp7Bb~ft=b?Z`V!M?SPcuBW;2Y9-~hKm zOj2*@t<7Z_x-MU&Clzd507mHhWZQOFfk^OnrlEftyES%`2(^(rA9B@=VZ&BI-$#hQmLZdi`Xmu0W$!TzYy-MoS)?lEYw)O4e!&w z>*KhMLqFjS%v+r}&fwAh>fYY8!z7Nf7m(&*k?t>4@R&MEYX@e=VtU5N9smgGRT&=Vf62VGPB9c00)fLYv8qZ~h^o9By_eeCA41lD54`tl;XZ zq`zoAPSKLjn7PX>NB)n(JG*$=0G(fo<&mzfb0H_%qhC+Z_I>mQ{tDZK`gRqKO7i$b42yRzbW;pF zkz+!{&I6r}tAN?lV)9^AUjSm7Bie>~g_&$+sLlg#Q<3_PLQkhh%@snli8Mmjb$z9^ zAgz^21N#MGY!ouO-VOvRZhKNdzJ1_fkI&;s>11?NHN39(scyeHu84cc5A$rKR=gi5 zPm;bw3_0%$CPae#i)_}G6wU6VwR^cJtUn%IykB!15lQ=YYU?$y8pCVDG`278(^R&Q zU-`O4aIIqYpWeV^znf*6msVTfC5ueK=WYu{Z=8nf$fk-5a;gxDnzP43*y;Ebv@ z!c0WyN89+7k_3HaaT@i$t}_LnCtiX3KIiPZ_A?fs`(wcQq2ToU>Ju=j?HZ z6T8GH0nt%VDD?>G?deN(9i{W=i!y;w*j^fMtOpK8Qtc5`!iL-^b%MfA>TDsU_E7;V zKkNac$VgZ6kF}vBDr9vQf+6|WwZk0^RJI{+@_rWCni(Gk=if)pzqG3#+MCJWF4t3(})S?mJ2v^#}r8BW4B=B)b(T&6$P@IL| zG``eF?$qMXIfXeg3G=D`5_>ThLv}9_^M;jnceYNCA8`Xcm%@@8QD&c&lQP@k<|M0C zxCY&lg}QjMOm8`wK8}6h8Y5JmY^*Km?T%X{v!8+}PD!NlSmLJ@I0guP96hjxK9+fB0f}yp&sS)!X1v!?=C`CxoyE1~*#ss6Ob&(M?VqVu?*oh1X(4>Abb~<1E2%f;Vh{6nNvOL4uXqS%&m_sOc=8c<93u476wNlZ}(#bxGn z8C#)0q!~$I7f5{MV3*yzQ_;qnAB(9pL?)@Q1LhM%Ts~zJnW{`QRlJGqP zaM&pDP9~aSZ{EnLP8lt(%>yWqh0i|mS(bmD9GxZ|KVh6~O1rSLQ((H+FeY@P41qJh zGBZ`DQo3Hr2r2bSOk4g8S)(tT4apb|5wFac-vG#*=f22g*P5b~%ssCKy{$n1>+ z@ym$LGy_Yfr1W=*Ex7x=C(a91`bAL5T(Py~Lq+P?*xcXwNM&aYsgnsnzEL9p`1GUr 
zLzW05_H|l7CpO{ZT75DsNG<&W?%MV?4mb6jq^&pl1T*xtI#b1fou_VKDx0y9jKU|~ zZd{8)fUwt zPe@KR&k0N>eTj7UE1QCCPxdsfJ0q*l*^o|F&jIN-zZad$;7oGSTbio7Y@$ z-=XyZRbvnw`}tyeAlVIEO~jn$p|{2ZGA?DgxDY8Z+84`m`49J$2-W;E27ACTKEG^h zcff?SUw_F}g2l8$1C1PJOo`XNrQ|I#p)3O(hbe^&)0eXwtwp+Hl^L^UpWssqrfSV` zO4A=3G?k;1jS#GU;vdz&3kqFb*on`3^K^9-d;JKd8@UyOz- zFD@U-!!65ELtT#DS2nBNB64e)^p6ipzCP=widHR6o~ZLL#oc8t#(p;r^3vo;TI&7$ zF_AnhRH81eN~CnGR`QU~{GHW+exyFs$y5^S@+sxxt^Z38q>%ECas(C(%wF-IdLXp_ z^gw3z_AYkzF8^wSiW5INtq`FMpZLOsy{G$-=)(P`P?vfCV!nJ9(Cb0Y0-XYG`kmhw zp!Jl+U7e^a7__$0x&d(*Ua0;BY)XbT8go1gjS<3qhBf-p$4}cQe-L2jrkf8guCTc; za%qV_{)Pz~Ve^9$&5n#|lWWO^F>=OzXi9(n2sbU%0otK9+&9W)b58MV;zj21yCQOJ zMu}=GmkdCU=z{59k_1LI40?lFj{g+iTLQ8-J*o4^O#?j_$l8dGr*#2ny38h}HI*jH5O}H^wt@(~4yI1b^19U#j>BQX`uGa_gf50_$P4 z7JZSWByFo=VQ@EYC(wY2>1IdSN73L*`;DpX?StnpUx4!Y!snTw59#$ame{|!~DM| zOuuauI~P?WW1GKb&|L|AcKwV9VJF_8x7YFgMj0s3aw0r(cre~zD!Nuw^_8537e|!Z zaTLRWauGVFJWuBL*QPDV>)#ovU+uYZj#dfLn~HQ!9_<|*oi4qR>&5-ipSFTIH&ue1 zXr>EC$#%b~H6mwMnRHO^tgVVI2v8rkRbB!64!!H(8+qV5-mjqOx2dn2B z8?-hpqAl-j$1nirx+Fp6BSV`9SvAj^xR>CtKusKhAS$#BP58;c+@Zywl};5KA#+z~ z8*Af>JUF3Ucy%r<7M-pon>vUBb z-e9J4dXD9s0U@Kup_$eSZ)9@PM{}xl{P?63mHXCtH%xNY1q>L=IY!BuwYt`tw81@* z;-}Pmy76-Pt;~z1s5U8jpf=WI;2_&i2uYYcM6$DgG&%dpXN37 zt;mKmH*<_DI-M^GZ9lqvsrhIGMvu&QQR6%lM~#it6q!ujenp*w1>7TOS}zKWAiZd{ z7pyBYp|eeelDEQYUAN(G*6y-R>gLcDFCa^_gonX1NQ{wp)Iin&tLiF8D+1c@3s=Fb z*thOdV1%EF#U#-mLrCbKiu-Q!AsL=1l;;M2&_96e%$sg|D$Mb+dG*&CoYWAxpenIg zbBsYw4^!Yx6=q|nYgCMP6QSNq9>Zbx{9bMNChV(T>&Np*1Z+UC5YM%Z)#|UzxO`o1 zjqtkh4CLIEd*3ibn!My3pK!tjUU(T~5GK%i=0*P>r^v7Wqna)jKwBWgUq_~w`^TYF zTl-8#l#z`S6-~b$V@Y8m>g*ENPx>|)O>Tsn5CD%$$bS75pZixL5Kzl%Rvm#!iz6SO z-k%S|>-u>l1+ALcwyD3mWWXjbQ_gl&BfHo;y7Co`tj}Xc(KZskw)hra>7H6wPcBUs ztNgYxFeGJO9rq)PWhB}}XpJJDbo`-q5F1tpeP>k|pazbrGoXGtvG?`S@B#x)?Ri$I zNA>ImTC%N}6^3EEddZjMe9^$c5I=XKc3X%djc`Q)c$?w$;W*G;iDO71>buF~pI^s6 z4;?3#ec+1-VRlzKIhf&r=$CI_)x%!^U(#co8z9oYli&lu^6WBN9T(<663Hn|-dt%A zM+LpZ7qoR2=h459ViK4(kiFb-z@nAt3|{`~lwm?(((>jM$ttt*6*BGOP!#G(UJH)= 
z4S7Fz@AIj$gDXezfGb8QzOSegUH8PcBX)!9{1)FXqMJOA{!-~xe278Ex)pG3J7^cQ z=o`valr|lkJQt?xtnzA6MLBQ=27L@-!#tW!%b_zhNp-bk7n2O@K7?6%2F)s%J>L${ z2V<1CnxX7$d?pW(p!-1H;s08VFoGjXk)z9LaZ7l0v8%Us7}~aJb14OWiyMiD-ayR{ zjm;G_LTPKySyq;BpDEwMSTs}e;Tw+LV#LN-=2S=G7GVXX4Xs8?bJ$_GX>{=0iRn*c zh<@!XEg6n2`kfQ$Q5SRTBu`xS>xS&m8c-P=AFP(VC@7!fn8B*O7R#UF+Nj$Fy1{MSg{dRaufG zm^`f;P)wGrmZDe)CoPCUonLO-SZ!re+m~c(cC6bPFJ)Ix5i}mczE6OTS>e6|u)b3G zIrv@f{8Zi?^&C&-KBV}>kFDi(y04etVrG<$hRsL)8G`~w?&I*YD>r*kVB^qgel&9U zS^3$;oir;faRy=_A8J3&26c%$7C+7XJn);#3R`mdQ}&kJBP|}ShW~R30l@OFezSml zp03j%{*$`BcriL6_I57MPKAN0rvuPg@8v(#mCR(n^6XVy|AY&)CjNA+h!OY&!IGzi zV71+FG;t!SlXSGP^5=a}f+19(F^@x5DCEidJ;4dgynC1X5P@++NQ|aByb_odGv<5o zpX)#0_JH2c$#r?c#evp&WMdYY!*!?<*Ml_53&0k=7nCnRDq{7CLi;@xkhh-oHzaJt zkWL?#M5+$#rz^5|w!J?FkhWgSV_cgUh_eZYZdR$=yaLvYY3V_wzJtBC&@ZAXHC2P+ zxG1Vd+@C!R30Q*v-4Z?ml;0xT@Dq>n| z!zDe_ZPaPuWcm+(jLD{@V4;&W2^&vAfpGncgYTFmmXaTg_w-!LJb?NXlz!MBRdpbt% + flextable() %>% + set_table_properties(width = 1, layout = "autofit") %>% + theme_booktabs(bold_header = TRUE) %>% + align(align = "center", part = "all") %>% + fontsize(size = 11, part = "header") %>% + fontsize(size = 10, part = "body") %>% + set_header_labels( + step_id = "Step #", + step_label = "Step", + n_before = "Before", + n_after = "After", + n_dropped = "Dropped" + ) } else { - + } + if (isTRUE(debug_mode)) { debug_info[[knitr::opts_current$get("label")]] <- if (knitr::is_html_output()) "HTML" else "LaTeX" @@ -232,6 +253,7 @@ The final analytical sample is made up of 4265 publications. 
The OS prevalence c #| fig-cap: "Frequencies: Publications by Year in Population and Sample" #| label: fig-freq-pubs-comp #| fig-height: 6 +#| fig-width: 8 #| fig-pos: H meta_final <- qs_read(file_meta_final) @@ -329,7 +351,14 @@ p4 <- sample_B_by_year %>% labels = function(x) ifelse(x %% 2 == 1, x, "") # use modulo ) -print((p1|p2) / (p3|p4)) +if(output_format == "pdf/tex") { + print((p1|p2) / (p3|p4)) +} else if(output_format == "docx") { + print((p1|p2) / (p3|p4)) +} else { + +} + if (isTRUE(debug_mode)) { debug_info[[knitr::opts_current$get("label")]] <- if (knitr::is_html_output()) "HTML" else "LaTeX" @@ -367,15 +396,32 @@ tbl_cases2 <- tbl2 %>% select(step_id, step_label, n_before, n_after, n_dropped) if(output_format == "pdf/tex") { -tbl_cases2 %>% - kable( - format = "latex", # force LaTeX output (not markdown) - booktabs = TRUE, - longtable = FALSE, # avoid longtable entirely - col.names = c("Step #", "Step", "Before", "After", "Dropped")) + tbl_cases2 %>% + kable( + format = "latex", # force LaTeX output (not markdown) + booktabs = TRUE, + longtable = FALSE, # avoid longtable entirely + col.names = c("Step #", "Step", "Before", "After", "Dropped") + ) +} else if(output_format == "docx") { + tbl_cases2 %>% + flextable() %>% + set_table_properties(width = 1, layout = "autofit") %>% + theme_booktabs(bold_header = TRUE) %>% + align(align = "center", part = "all") %>% + fontsize(size = 11, part = "header") %>% + fontsize(size = 10, part = "body") %>% + set_header_labels( + step_id = "Step #", + step_label = "Step", + n_before = "Before", + n_after = "After", + n_dropped = "Dropped" + ) } else { - + } + if (isTRUE(debug_mode)) { debug_info[[knitr::opts_current$get("label")]] <- if (knitr::is_html_output()) "HTML" else "LaTeX" @@ -406,9 +452,7 @@ Data is reported per year. 
As per year data given the very low prevalences is ex Two research questions were formulated: $RQ_1$ on the prevalence of OD and OM among statistical-inference (SI) publications, and $RQ_2$ on the prevalence of preregistration. After extensive model development, validation, calibration, thresholding, and misclassification adjustment, prevalences for OD, OM, and Preregistration were too low for the ML classifiers to yield interpretable, adjusted estimates. -The ML classifiers trained on GPT labels inherit GPT's strengths and the data's sparsity.For the relatively small 20% validation set coded by GPT, the open-science practice classifiers are less precise and less reliable than the Statistical-Inference classifier. - -In contrast, a question that was not originally foregrounded proved answerable: the prevalence and trajectory of OA among SI publications, measured from metadata with high reliability, show clear increases over time. +The ML classifiers trained on GPT labels inherit GPT's strengths and the data's sparsity. For the relatively small 20% validation set coded by GPT, the open-science practice classifiers are less precise and less reliable than the Statistical-Inference classifier. In contrast, a question that was not originally foregrounded proved answerable: the prevalence and trajectory of OA among SI publications, measured from metadata with high reliability, show clear increases over time. Before misclassification adjustment, design-based prevalences were estimated among SI papers with 95% CIs. For outcomes identified by the ML classifiers (OD, OM, Preregistration), these reflect survey-design uncertainty only. @fig-osp-adoption shows a steady rise in OA from \~20% in 2013 to \~50% in 2023, while the other practices suffer from extremely low counts; for some years (e.g., 2013 OD; 2016 Preregistration) estimates were not possible. 
@tbl-osp-prev-overall confirms low prevalences across the full period: OA $40.9\%$ (38.8-43.1), OM $4.3\%$ (3.4-5.3), Preregistration $3.6\%$ (2.8-4.5), and OD $2.2\%$ (1.6-2.9). @@ -467,16 +511,26 @@ tbl_sample_desc <- df %>% mutate( footnote = "A: Psychology, Multidisciplinary; B: Law; C: Criminology & Penology", columns = "label", rows = variable == "journal_category" - ) %>% as_gt() %>% - tab_options( - table.font.size = gt::px(12), - latex.use_longtable = TRUE - ) + ) if(output_format == "pdf/tex") { - tbl_sample_desc + tbl_sample_desc %>% + as_gt() %>% + tab_options( + table.font.size = gt::px(12), + latex.use_longtable = TRUE + ) +} else if(output_format == "docx") { + tbl_sample_desc %>% + as_flex_table() %>% + set_table_properties(width = 1, layout = "autofit") %>% + theme_booktabs(bold_header = TRUE) %>% + align(align = "center", part = "all") %>% + fontsize(size = 11, part = "header") %>% + fontsize(size = 10, part = "body") %>% + width(5, 1) %>% + height_all(height = .2) } else { - #tbl_sample_desc %>% as_kable() } if (isTRUE(debug_mode)) { @@ -485,10 +539,13 @@ if (isTRUE(debug_mode)) { } ``` +In parallel, @tbl-sample-char suggests systematic differences between SI and non-SI papers: distributions of text sources differ (likely reflecting publisher effects or text-quality variation), abstracts-only are more common among non-SI items, word counts are higher for SI papers, journal impact is higher, and OA appears more common. Several contrasts are statistically significant (many $p < .001$), but these should be treated as descriptive given unmodeled multilevel variance and field composition. 
+ ```{r} #| fig-cap: OSP Adoption Over Time, among statistical inference papers (design-weighted) #| label: fig-osp-adoption #| fig-pos: H +#| fig-width: 7 # ensure that types match df <- df %>% mutate(published_year = as.integer(published_year)) @@ -635,9 +692,7 @@ if (isTRUE(debug_mode)) { } ``` -In parallel, @tbl-sample-char suggests systematic differences between SI and non-SI papers: distributions of text sources differ (likely reflecting publisher effects or text-quality variation), abstracts-only are more common among non-SI items, word counts are higher for SI papers, journal impact is higher, and OA appears more common. Several contrasts are statistically significant (many $p < .001$), but these should be treated as descriptive given unmodeled multilevel variance and field composition. - -Because design-based estimates do not account for classifier error, Rogan-Gladen adjustments were applied using sensitivity and specificity from the ML-validation analysis (@tbl-osp-prev-overall) [@liuQuantitativeBiasAnalysis2023]. Under extreme rarity, adjustments become unstable: intervals widen dramatically (approaching $[0,1]$) or yield boundary/negative estimates when specificity is insufficient relative to prevalence. For OD, the false-positive rate ($1-\text{Sp} \approx 12.7\%$) exceeds the observed prevalence ($2.2\%$), pushing adjusted points below zero. For OM, low sensitivity ($\text{Se} = 0.20$) and tiny validation counts produce near-uninformative intervals. Given these constraints, the adjusted values can be interpreted as sensitivity ranges rather than confirmatory estimates. Any substantive claims should thereby rather be based on design-based estimates and on OA (measured from metadata). +Because design-based estimates do not account for classifier error, Rogan-Gladen adjustments were applied using sensitivity and specificity from the ML-validation analysis in @tbl-osp-prev-overall [@liuQuantitativeBiasAnalysis2023]. 
Under extreme rarity, adjustments become unstable: intervals widen dramatically (approaching $[0,1]$) or yield boundary/negative estimates when specificity is insufficient relative to prevalence. For OD, the false-positive rate ($1-\text{Sp} \approx 12.7\%$) exceeds the observed prevalence ($2.2\%$), pushing adjusted points below zero. For OM, low sensitivity ($\text{Se} = 0.20$) and tiny validation counts produce near-uninformative intervals. Given these constraints, the adjusted values can be interpreted as sensitivity ranges rather than confirmatory estimates. Any substantive claims should therefore be based on the design-based estimates and on OA (measured from metadata). ```{r} #| tbl-cap: Overall Prevalence of Open Science Practices among Statistical Inference Papers (Design-Weighted to Frame-by-Year Totals) @@ -680,27 +735,38 @@ overall_osp_si <- overall_osp_si %>% ) %>% arrange(desc(`Prevalence`)) +tbl_overall_osp_si <- overall_osp_si %>% + kbl( + format = 'latex', + longtable = TRUE, + booktabs = TRUE, + escape = T, + ) %>% # add footnote + column_spec(1, width = '3cm')%>% + kable_styling( + position = "center", + latex_options = "hold_position", + full_width = FALSE) %>% + kableExtra::footnote( + general = "Prevalence estimates in statistical inference publications using design-weights per year (95% CI)", + general_title = "Note:", + footnote_as_chunk = T, + threeparttable = T + ) + + if(output_format == "pdf/tex") { + print(tbl_overall_osp_si) +} else if(output_format == "docx") { overall_osp_si %>% - kbl( - format = 'latex', - longtable = TRUE, - booktabs = TRUE, - escape = T, - ) %>% # add footnote - column_spec(1, width = '3cm')%>% - kable_styling( - position = "center", - latex_options = "hold_position", - full_width = FALSE) %>% - footnote( - general = "Prevalence estimates in statistical inference publications using design-weights per year (95% CI)", - general_title = "Note:", - footnote_as_chunk = T, - threeparttable = T - ) + flextable() %>% + 
set_table_properties(width = 1, layout = "autofit") %>% + theme_booktabs(bold_header = TRUE) %>% + align(align = "center", part = "all") %>% + fontsize(size = 11, part = "header") %>% + fontsize(size = 10, part = "body") %>% + set_caption(caption = "Note: Prevalence estimates in statistical inference publications using design-weights per year (95% CI)") } else { - } if (isTRUE(debug_mode)) { @@ -709,6 +775,8 @@ if (isTRUE(debug_mode)) { } ``` +Earlier differences in text sources suggest heterogeneity by journal, thereby implicating also publisher variance [@scogginsMeasuringTransparencySocial2024]. @fig-osp-time-by-publisher visualizes OA shares over time for the 12 most prolific publishers in the sample (listed in the caption). Leveraging larger $n$, the author fit simple OLS trends to annual OA proportions. The four most prolific publishers show clear increases. Four publishers do not: Oxford University Press, Emerald, ASCE, and MDPI. MDPI remains at 100% OA, Emerald at 0% in this sample; ASCE shows an apparent decline consistent with limited observations; Oxford University Press is relatively stable. All observed increases are highly statistically significant. Future work should use models designed for proportions (e.g., binomial GLMs) and, ideally, hierarchical pooling across publishers and years. + ```{r} #| tbl-cap: Observed and Adjusted Prevalence of Open Science Practices among Statistical Inference Papers #| label: tbl-osp-prev @@ -818,7 +886,7 @@ if(output_format == "pdf/tex") { kbl(format = "latex", booktabs = TRUE, escape = FALSE, align = c("l","l","l","r","r","r","r"), longtable = TRUE) %>% kable_styling(latex_options = "hold_position") %>% - footnote( + kableExtra::footnote( number = c( "Sensitivity", "Specificity", @@ -827,16 +895,53 @@ if(output_format == "pdf/tex") { ), escape = FALSE ) +} else if(output_format == "docx") { + colnames(osp_table_pretty) <- c( + "OSP", "Obs. (95% CI)", "Adj. 
(95% CI)", + "Se", "Sp", "Pos", "Neg" + ) + osp_table_pretty %>% + mutate( + across(where(is.character), + ~ stringr::str_replace_all(.x, "\\\\([%&_{}#])", "\\1")) + ) %>% + flextable() %>% + set_table_properties(width = 1, layout = "autofit") %>% + theme_booktabs(bold_header = TRUE) %>% + align(align = "center", part = "all") %>% + fontsize(size = 11, part = "header") %>% + fontsize(size = 10, part = "body") %>% + footnote( + i = 1, + j = 4:7, + value = as_paragraph( + c( + "Sensitivity", + "Specificity", + "Number of positive cases in validation set", + "Number of negative cases in validation set" + ) + ), + part = "header", + ref_symbols = c("a", "b", "c", "d") + ) %>% + fontsize(size = 9, part = "footer") %>% + width(j = 1:3, 1.85) %>% + colformat_double( + big.mark = ",", digits = 2, na_str = "N/A" + ) + + } else { - + } + if (isTRUE(debug_mode)) { debug_info[[knitr::opts_current$get("label")]] <- if (knitr::is_html_output()) "HTML" else "LaTeX" } ``` -Earlier differences in text sources suggest heterogeneity by journal, thereby implicating also publisher variance [@scogginsMeasuringTransparencySocial2024]. @fig-osp-time-by-publisher visualizes OA shares over time for the 12 most prolific publishers in the sample (listed in the caption). Leveraging larger $n$, the author fit simple OLS trends to annual OA proportions. The four most prolific publishers show clear increases. Four publishers do not: Oxford University Press, Emerald, ASCE, and MDPI. MDPI remains at 100% OA, Emerald at 0% in this sample; ASCE shows an apparent decline consistent with limited observations; Oxford University Press is relatively stable. All observed increases are highly statistically significant. Future work should use models designed for proportions (e.g., binomial GLMs) and, ideally, hierarchical pooling across publishers and years. This study was deliberately scoped as a pilot, which constrained coverage, precision, and tooling. 
The population assessed was limited to SI papers from the top 100 JCR journals in criminology and legal psychology and to Crossref metadata, so venue and index biases remain. The 2013-2023 window omits the most recent changes. Keyword screening did not fully exclude non-target items, and a Quarto "freeze" configuration led to using print over online dates in some cases. Full-text retrieval was partial and legally bounded to TDM-permitted publishers; short documents (<1,000 words) were treated as missing full text, risking misclassification. @@ -854,8 +959,8 @@ Despite of all the limitations, there are main substantive implications: OSP pre ```{r} #| label: fig-osp-time-by-publisher -#| fig-width: 10 -#| fig-height: 10 +#| fig-width: 9 +#| fig-height: 9 #| fig-cap: Open Access by Publisher over Time. #| fig-pos: H @@ -965,7 +1070,7 @@ grid_publishers <- ggplot( ) + labs( x = "", - y = "% of articles Open Access", + y = "% of Open Access articles", title = "", caption = paste0( "Top 12 publishers by sample n.\nWithin-year proportions from stratified-by-year sample.\n", @@ -1056,5 +1161,7 @@ The authors declare no conflicts of interest. if (isTRUE(debug_mode)) { print("# Debug Info") print(debug_info) + + print(paste0("Output Format set to **", output_format, "**")) } ``` \ No newline at end of file
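
Reviewer note on the Rogan-Gladen paragraph touched by this patch: the claim that the OD adjustment falls below zero can be checked by hand from the figures quoted in the text (observed prevalence 2.2%, false-positive rate of about 12.7%). The sketch below uses the standard Rogan-Gladen estimator; the symbols $\hat{p}_{\text{obs}}$ and $\hat{\pi}_{\text{RG}}$ are notation assumed here for illustration, and Se for OD is left symbolic because the text does not report it.

```latex
% Standard Rogan-Gladen estimator (notation assumed for this note)
\hat{\pi}_{\text{RG}}
  = \frac{\hat{p}_{\text{obs}} + \text{Sp} - 1}{\text{Se} + \text{Sp} - 1}

% OD, using the values quoted in the text: p_obs = 0.022, 1 - Sp = 0.127
\hat{\pi}_{\text{RG}}^{\text{OD}}
  \approx \frac{0.022 - 0.127}{\text{Se} - 0.127}
  = \frac{-0.105}{\text{Se} - 0.127}
  < 0 \quad \text{whenever } \text{Se} > 0.127
```

The numerator is negative regardless of Se, so any classifier that is better than chance (Se + Sp > 1) yields a negative adjusted point estimate here, which is consistent with the text's recommendation to rely on the design-based estimates.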